1 © Hortonworks Inc. 2011–2018. All rights reserved
DISCOVER with Data Steward Studio
Understanding and unlocking the value of data
in hybrid enterprise data lake environments
Srikanth Venkat, Senior Director, Product Management, Hortonworks Inc.
Hemanth Yamijala, Principal Engineer, Hortonworks Inc.
Presenters
Hemanth Yamijala
Principal Engineer, Hortonworks
Hortonworks DataPlane Services, Data Steward Studio, Apache Atlas
@yhemanth https://www.linkedin.com/in/yhemanth/
Srikanth Venkat
Senior Director of Product Management, Hortonworks Inc.
Security & Governance portfolio products & services:
Apache Ranger, Apache Atlas, Apache Knox, Platform Security, & Hortonworks DataPlane Service – Data Steward Studio (DSS)
@srikvenk https://www.linkedin.com/in/srikanthvenkat/
Next Generation Data Problems
• My data is spread across multiple clusters and data sources.
• I store and analyze data from ERP/CRM systems, IoT and mobile devices, social media, geolocation, etc.
• Some of my data is on-premises, some is in the cloud. I move my data from cloud to on-premises and vice versa, and between different clouds.
Forrester Calls It Data Fabric
“Bringing together disparate big data sources automatically, intelligently,
and securely and processing them in a big data platform technology, using
data lakes, Hadoop, and Apache Spark to deliver a unified, trusted, and
comprehensive view of customer and business data.”
Data Steward Studio
Global Intelligent Data Catalog & Business Glossary
Curate & Organize Assets, Asset 360 Single View, Data & Metadata Security
[Diagram: Data Steward Studio spanning clusters in the cloud and on premises (Cluster 1 Dublin, Cluster 2 Las Vegas, Cluster 3 Bangkok); each cluster holds structured and unstructured data sources secured by Apache Ranger]
DataPlane: Bring All Data Under Management
From the edge, through movement, to rest
Hortonworks DataPlane Service is a foundational platform for the delivery of data solutions that will:
• Support enterprise hybrid deployment strategy and adoption of cloud
• Provide common metadata, security and governance across all deployments
• Simplify enterprise data asset management
• Extend to new services: a services enablement layer for rapidly bringing new solutions to market
[Diagram: Hortonworks DataPlane Service manages, secures and governs multiple clusters and sources (multi, hybrid), covering data at rest (Hortonworks Data Platform) and data in motion (Hortonworks Data Flow)]
Data Governance: It’s a team sport!
• Data Curator: Implements business data requirements
• Data Steward: Manages business requirements for data sharing
• Sponsor: Champions data governance across the enterprise
• Data Owner: Accountable for all data generated by an agency
• Business Data SME: Supports the Data Steward in data-related activities
• Data Council: Coordinates cross-agency data management activities
Goals
Data Steward Studio Demo
Hortonworks DataPlane Service (DPS)
Organize Your Data Assets as Collections
• Data asset collections: an organizational construct that groups heterogeneous data assets by business definition
• Create asset collections and attach metadata
  • Contextual attributes: Name, Description, Owner, Datalake
  • System attributes: Created-on, Modified-on, Modified-by, Created-by, Version
• Search for assets using attribute facets or free text
• View a personalized dashboard of asset collections
• Delete/update data asset collections
• Asset 360 view for assets in a collection
[Figure: Asset Collections]
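To make the attribute lists above concrete, here is a minimal sketch of how an asset collection record might be modeled. The attribute names come from the slide; the class, the `add_asset` and `search` helpers, and the string asset identifiers are hypothetical and are not taken from the DSS code base.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now():
    return datetime.now(timezone.utc)


@dataclass
class AssetCollection:
    # Contextual attributes named on the slide.
    name: str
    description: str
    owner: str
    datalake: str
    # Hypothetical: assets are referenced by opaque string identifiers.
    asset_ids: list = field(default_factory=list)
    # System attributes named on the slide.
    created_by: str = ""
    modified_by: str = ""
    created_on: datetime = field(default_factory=_now)
    modified_on: datetime = field(default_factory=_now)
    version: int = 1

    def __post_init__(self):
        # Assumption: the owner is the creator unless stated otherwise.
        self.created_by = self.created_by or self.owner
        self.modified_by = self.modified_by or self.created_by

    def add_asset(self, asset_id, user):
        """Attach an asset and bump the system attributes; duplicates are ignored."""
        if asset_id in self.asset_ids:
            return
        self.asset_ids.append(asset_id)
        self.modified_by = user
        self.modified_on = _now()
        self.version += 1


def search(collections, text):
    """Case-insensitive free-text search over name and description."""
    needle = text.lower()
    return [c for c in collections
            if needle in c.name.lower() or needle in c.description.lower()]
```

In this sketch every change to the collection increments `Version` and refreshes `Modified-by` and `Modified-on`, which is one plausible reading of the system attributes listed above.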
Discover and Fingerprint your Data Assets
• Computes a profile for data assets as they are ingested or created within the platform, automatically determining column types from data values
• Generates key metrics for the data in each column; the metrics can be viewed with various visualizations (box plots, histograms, pie charts)
• Persists profile information in the cluster
• As more data is added, profilers can be scheduled to run and update the profile metadata for the asset
[Figures: Data Profiler; Column Statistical Profiler]
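As an illustration of what a column profile could contain, the sketch below infers a column type from raw string values and computes the kind of summary metrics a box plot or histogram would need. It is plain Python for readability; the actual profilers run as Spark jobs, and the type rules and metric names here are assumptions.

```python
import statistics


def infer_type(values):
    """Guess a column type from its string values (hypothetical rules)."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    for type_name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return type_name
        except ValueError:
            continue
    return "string"


def profile_column(values):
    """Return basic metrics for one column; empty strings count as nulls."""
    col_type = infer_type(values)
    non_empty = [v for v in values if v != ""]
    profile = {
        "type": col_type,
        "count": len(values),
        "nulls": len(values) - len(non_empty),
        "distinct": len(set(non_empty)),
    }
    if col_type in ("int", "float"):
        nums = sorted(float(v) for v in non_empty)
        profile["min"] = nums[0]
        profile["max"] = nums[-1]
        profile["mean"] = statistics.fmean(nums)
        if len(nums) >= 2:
            # Quartiles are what a box plot is drawn from.
            profile["quartiles"] = tuple(statistics.quantiles(nums, n=4))
    return profile
```

Re-running `profile_column` on a schedule as new rows arrive corresponds to the last bullet above: the stored profile is simply replaced by the newer one.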
Know your Sensitive Data
• Automatically detect and profile sensitive & personal data
• Attach classification annotations for sensitivity
• Manual approval and curation of sensitive data classifications
• Leverage classification-based data protection
• Sensitive data dashboard on Asset 360
[Figure: Sensitive Data Profiling]
Track your Sensitive Data
• IBAN (27 EU Countries)
• Credit Card Numbers
• Email
• Telephone (AMER, EU)
• IP Address
• URL
• Passport (12 EU Countries)
• National ID (19 EU Countries)
• Australian Driver’s License
• Australian Passport
• Australian National ID
Sensitive Data Types
Track Your Data Asset – Lineage and Impact
• Consolidated upstream lineage and downstream impact
• Detailed click-through to asset properties
[Figure: Data Lineage and Impact]
View Security Policies for your Data Assets
• View security policies on data assets
• View classification-based policies on assets
[Figure: Security Policies]
Monitor Usage of your Data Assets
• Dashboard for access patterns and trends for each asset
• Examples:
  • Count of access events
  • Top N users over time
  • Most recent trail of access audit events
[Figure: Audit and Monitoring]
Data Steward Studio Technical Overview
The DPS Ecosystem
[Diagram: the DPS Platform hosts Data Lifecycle Manager, Data Steward Studio*, Data Analytics Studio* and Streams Messaging Manager on top of DataPlane Services (authentication, role-based access, service lifecycle management, cluster registration, cluster service discovery and access). On the HDP/HDF cluster side: DLM Engine, Profiler Service, Profile Manager, Data Analytics Studio Service]
DSS Fit into DPS / HDP Ecosystem
[Diagram: Data Steward Studio* runs on the DPS Platform over DataPlane Services. On the HDP/HDF cluster, the Profiler Service sits alongside Hive Metastore, Atlas, Ranger, Spark/Livy and HDFS]
DSS Architecture
[Diagram components: Data Steward Studio (Play web application, Angular UI) with a Postgres DB; Knox; Profiler Service with a Postgres / MySQL DB; Livy Batch running the profilers as Spark jobs over Hive / HDFS data; Atlas; Ranger; summary files on HDFS; Livy Interactive]
Profiler Agent
● A generic framework supporting the registration, scheduling and management of
data processing jobs that help with data discovery, classification and analysis
● APIs exposed for:
○ Registration and management of profilers and profiler instances
○ Configuring and scheduling profilers
○ Monitoring status
○ Managing serving of interactive metrics
● Goals for profiler agent
○ Extensibility (support new profilers, new asset types, etc.)
○ Performance and scalability
Profiler Agent Design
[Diagram components: Asset Source, Asset Filters, Asset Selector, Priority rules, Profile Queue, Queue stats, Job Scheduler, Profiler Job, Profiler Metrics]
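The design diagram names the moving parts but not how they interact. One possible reading, sketched below under that assumption, is that assets from a source pass through filters, are ranked by priority rules, wait in a profile queue, and are handed to the job scheduler in batches. All class and method names are hypothetical.

```python
import heapq
import itertools


class ProfileQueue:
    """Priority queue of assets waiting to be profiled (illustrative only)."""

    def __init__(self, filters, priority_rules):
        self._filters = filters          # callables: asset -> bool
        self._rules = priority_rules     # callables: asset -> number
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def offer(self, asset):
        """Queue an asset if every filter accepts it; return whether it was queued."""
        if not all(f(asset) for f in self._filters):
            return False
        priority = sum(rule(asset) for rule in self._rules)
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), asset))
        return True

    def next_batch(self, size):
        """Hand the scheduler up to `size` assets, highest priority first."""
        batch = []
        while self._heap and len(batch) < size:
            batch.append(heapq.heappop(self._heap)[2])
        return batch

    def stats(self):
        return {"queued": len(self._heap)}
```

Keeping filters and priority rules as plain callables is one way to meet the extensibility goal: supporting a new asset type or ranking policy means passing in another function rather than changing the queue.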
DSS - Basic profilers
Profiler and purpose:
Sensitivity Profiler
● Identifies sensitive information in Hive tables based on regexes & column headers
● Aimed at GDPR-like use cases
Table Stats Profiler
● Calculates statistical summaries on Hive table columns for understanding the ‘shape’ of data
Audit Profiler
● Aggregates usage statistics of tables by different facets like user, asset name, access status, time, etc.
Hive Metastore Profiler
● Provides metadata about all tables
Sensitive Data Profiling
● 70+ regexes covering national ID numbers and PII (email, credit card, etc.) are shipped and loaded into HDFS
● The sensitive data profiler runs on a sample of Hive tables against these regexes
● 85% weight is given to the data, 15% to the name of the column
● Tag workflow
○ Matched columns are tagged in Atlas with an attribute state ‘suggested’
○ In the Asset 360 view, a data steward can ‘accept’ or ‘reject’ tags; the state is written back to Atlas
○ Ranger policies can be defined against these tags (with attribute state = ‘accepted’) for governance of these columns
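The 85/15 weighting and the tag states can be sketched as follows. The weights and the ‘suggested’ and ‘accepted’ states come from the slide; the linear blend, the 0.7 threshold, the ‘rejected’ state name and the simplified email detector are assumptions made for illustration, not the shipped regexes or scoring logic.

```python
import re

DATA_WEIGHT = 0.85  # weight given to matching data values (from the slide)
NAME_WEIGHT = 0.15  # weight given to the column name (from the slide)

# Hypothetical, simplified detector; the shipped regexes are not shown in the talk.
EMAIL = {
    "tag": "email",
    "value_regex": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "name_regex": re.compile(r"e[-_]?mail", re.IGNORECASE),
}


def score_column(column_name, sample_values, detector):
    """Blend the fraction of matching sample values with a column-name match."""
    if sample_values:
        matched = sum(1 for v in sample_values if detector["value_regex"].match(v))
        data_score = matched / len(sample_values)
    else:
        data_score = 0.0
    name_score = 1.0 if detector["name_regex"].search(column_name) else 0.0
    return DATA_WEIGHT * data_score + NAME_WEIGHT * name_score


def suggest_tag(column_name, sample_values, detector, threshold=0.7):
    """Return a tag in state 'suggested' when the score clears the assumed threshold."""
    if score_column(column_name, sample_values, detector) >= threshold:
        return {"column": column_name, "tag": detector["tag"], "state": "suggested"}
    return None


def review(tag, accept):
    """Model the steward's decision; only 'accepted' tags would drive Ranger policies."""
    return {**tag, "state": "accepted" if accept else "rejected"}
```

With a sample in which three of four values look like email addresses, a column named `email_addr` scores 0.85 × 0.75 + 0.15 = 0.7875, while the same data under a column named `notes` scores 0.6375 and falls below the assumed threshold.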
Asset collections
● Metadata search over Atlas attributes, using the Atlas DSL API for all queries, e.g.:
○ All Hive tables with name ‘customer’
○ All Hive tables created before a given date
○ All Hive tables created before a given date and owned by ‘admin’
● Search -> Gather -> Save workflow
● Collaboration features
○ Comments, Ratings, Favorites
● All data stored in DataPlane DB
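The three example searches above might translate into Atlas DSL strings along these lines. This is a sketch, not the queries DSS actually issues: it assumes the `hive_table` type with `name`, `createTime` and `owner` attributes and the Atlas v2 DSL search endpoint, and it only builds the query text and URL without contacting a server. The host name in the usage note is a placeholder.

```python
from urllib.parse import urlencode

# Assumption: Atlas v2 REST path for DSL search.
DSL_SEARCH_PATH = "/api/atlas/v2/search/dsl"


def hive_tables_query(name=None, created_before=None, owner=None):
    """Build a DSL query over hive_table; values are not escaped in this sketch."""
    conditions = []
    if name is not None:
        conditions.append(f'name = "{name}"')
    if created_before is not None:
        conditions.append(f'createTime < "{created_before}"')
    if owner is not None:
        conditions.append(f'owner = "{owner}"')
    query = "hive_table"
    if conditions:
        query += " where " + " and ".join(conditions)
    return query


def dsl_search_url(base_url, query, limit=25):
    """Return the GET URL for a DSL search with the query URL-encoded."""
    params = urlencode({"query": query, "limit": limit})
    return f"{base_url}{DSL_SEARCH_PATH}?{params}"
```

For example, `hive_tables_query(created_before="2018-06-01", owner="admin")` yields `hive_table where createTime < "2018-06-01" and owner = "admin"`, and `dsl_search_url("http://atlas.example.com:21000", ...)` wraps it in a request URL.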
Datalake & Asset collection Dashboards
● Three profilers contribute information:
○ Metastore profiler: captures the list of Hive tables created on a daily basis
○ Sensitivity profiler: summarizes tables with sensitive information for generating aggregates
○ Audit profiler: summarizes Ranger audit logs stored in HDFS
● All profiler information is stored as serialized files in HDFS
● Queries for dashboard widgets are served by Livy interactive sessions reading this information
● Livy session management is done by the profiler agent
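To show roughly what serving a widget query through a Livy interactive session involves, the sketch below builds the two REST requests Livy expects: one to create a session and one to submit a statement to it. It does not send anything. The summary file format (Parquet), the HDFS path and the `created_date` column are assumptions, since the talk only says the profiler output is stored as serialized files.

```python
import json


def create_session_request(kind="pyspark"):
    """Request that asks Livy to start an interactive session."""
    return {"method": "POST", "path": "/sessions",
            "body": json.dumps({"kind": kind})}


def statement_request(session_id, code):
    """Request that runs a code snippet inside an existing session."""
    return {"method": "POST",
            "path": f"/sessions/{session_id}/statements",
            "body": json.dumps({"code": code})}


def tables_per_day_code(summary_path):
    """PySpark snippet for a 'tables created per day' widget (hypothetical schema)."""
    return (
        f"df = spark.read.parquet({summary_path!r})\n"
        "df.groupBy('created_date').count().orderBy('created_date').show()"
    )
```

Because starting a Spark session is slow, it makes sense for the profiler agent to keep sessions alive and reuse them across widget queries, which matches the session management role described above.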
Asset 360
● Like a ‘Facebook profile’ page for a Hive table
○ Metadata (number of rows/columns, sensitivity distribution)
○ Lineage
○ Access summaries
○ Schema & data distributions (shape of data)
○ Applicable Ranger policies
○ Ranger audit logs
Ranger Audit Profiler
● Prerequisites:
○ Ranger audit logging to HDFS must be enabled for Hive
○ A Ranger policy must allow the dpprofiler user access to the Ranger audit directory
■ The product adds this policy automatically at MPack install
● Two jobs are run:
○ The first analyzes the previous day’s audit logs
○ The second analyzes the current day’s audit logs at a more frequent interval
● Provides info about:
○ Top users
○ Top assets
○ Authorized vs. unauthorized accesses
○ The above is joined with other information to get:
■ Top asset collections
■ Top sensitive assets being used, ...
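The aggregations listed above can be illustrated with a small in-memory version. The real profiler runs as Spark jobs over audit files in HDFS; this sketch works on a list of dictionaries, and the field names `reqUser`, `resource` and `result` (1 meaning allowed) are assumptions about the audit record layout.

```python
from collections import Counter


def summarize_audits(events, top_n=3):
    """Top users, top assets, and authorized vs. unauthorized access counts."""
    users = Counter(e["reqUser"] for e in events)
    assets = Counter(e["resource"] for e in events)
    authorized = sum(1 for e in events if e["result"] == 1)
    return {
        "top_users": users.most_common(top_n),
        "top_assets": assets.most_common(top_n),
        "authorized": authorized,
        "unauthorized": len(events) - authorized,
    }


def top_sensitive_assets(events, sensitive_assets, top_n=3):
    """Join access counts with the set of assets flagged as sensitive."""
    counts = Counter(e["resource"] for e in events
                     if e["resource"] in sensitive_assets)
    return counts.most_common(top_n)
```

The second function stands in for the "joined with other information" step: the set of sensitive assets would come from the sensitivity profiler, and a similar join against collection membership would give the top asset collections.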
Data Steward Studio recap
Data Steward Studio (DSS) Capabilities
• Profile data to understand its shape and structure
• Organize and curate data, e.g. by the domains it belongs to or by data usage
• Identify sensitive data
• Collaborate with broader teams on how data needs to be used and by whom, and provide community ratings for crowdsourcing knowledge
• Monitor ongoing usage, visualize chain of custody and trustworthiness for longer-term use, and understand data protection
DSS provides the “tooling” part of the People, Processes, and Technology required for Hybrid Data Lake Governance
[Figures: Assets & Collections; Audit and Monitoring Dashboard; Security Policies; Data Lineage and Impact; Data Profiler]
Check Out These Sessions:
What Is New In Apache Atlas 1.0?
When: Wednesday June 20, 11:00 AM - 11:40 AM
Where: Grand Ballroom 220B
Overview of New Features in Apache Ranger
When: Wednesday June 20, 2:00 PM - 2:40 PM
Where: Executive Ballroom 210B/F
GDPR Crash Course
When: Wednesday June 20, 3:00 PM - 6:00 PM
Where: Meeting Room 212C/D
Birds of a Feather: Security & Governance
When: Wednesday June 20, 5:40 PM - 6:50 PM
Where: Executive Ballroom 210B/F
GDPR-Focused Partner Community Showcase for Apache Ranger and Apache Atlas
When: Thursday June 21, 9:30 AM - 10:10 AM
Where: Meeting Room 230A
Thank you!

More Related Content

More from DataWorks Summit

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...DataWorks Summit
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsDataWorks Summit
 
Open Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesOpen Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesDataWorks Summit
 
Data Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentData Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentDataWorks Summit
 
Big Data Technologies in Support of a Medical School Data Science Institute
Big Data Technologies in Support of a Medical School Data Science InstituteBig Data Technologies in Support of a Medical School Data Science Institute
Big Data Technologies in Support of a Medical School Data Science InstituteDataWorks Summit
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraDataWorks Summit
 
Free Servers to Build Big Data System on: Bing’s Approach
Free Servers to Build Big Data System on: Bing’s ApproachFree Servers to Build Big Data System on: Bing’s Approach
Free Servers to Build Big Data System on: Bing’s ApproachDataWorks Summit
 
IoFMT – Internet of Fleet Management Things
IoFMT – Internet of Fleet Management ThingsIoFMT – Internet of Fleet Management Things
IoFMT – Internet of Fleet Management ThingsDataWorks Summit
 

More from DataWorks Summit (20)

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real Problems
 
Open Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesOpen Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart Cities
 
Data Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake EnvironmentData Protection in Hybrid Enterprise Data Lake Environment
Data Protection in Hybrid Enterprise Data Lake Environment
 
Big Data Technologies in Support of a Medical School Data Science Institute
Big Data Technologies in Support of a Medical School Data Science InstituteBig Data Technologies in Support of a Medical School Data Science Institute
Big Data Technologies in Support of a Medical School Data Science Institute
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native Era
 
Free Servers to Build Big Data System on: Bing’s Approach
Free Servers to Build Big Data System on: Bing’s ApproachFree Servers to Build Big Data System on: Bing’s Approach
Free Servers to Build Big Data System on: Bing’s Approach
 
IoFMT – Internet of Fleet Management Things
IoFMT – Internet of Fleet Management ThingsIoFMT – Internet of Fleet Management Things
IoFMT – Internet of Fleet Management Things
 

Recently uploaded

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 

Recently uploaded (20)

GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 

DISCOVER with Data Steward Studio: Understanding and unlocking the value of data in hybrid enterprise data lake environments

  • 1. DISCOVER with Data Steward Studio: Understanding and unlocking the value of data in hybrid enterprise data lake environments. Srikanth Venkat, Senior Director, Product Management, Hortonworks Inc.; Hemanth Yamijala, Principal Engineer, Hortonworks Inc. © Hortonworks Inc. 2011–2018. All rights reserved.
  • 2. Presenters
      ○ Hemanth Yamijala (@yhemanth, https://www.linkedin.com/in/yhemanth/): Principal Engineer, Hortonworks. Hortonworks DataPlane Services, Data Steward Studio, Apache Atlas.
      ○ Srikanth Venkat (@srikvenk, https://www.linkedin.com/in/srikanthvenkat/): Senior Director of Product Management, Hortonworks Inc. Security & Governance portfolio products and services: Apache Ranger, Apache Atlas, Apache Knox, Platform Security, and Hortonworks DataPlane Service – Data Steward Studio (DSS).
  • 3. Next Generation Data Problems
      ○ My data is spread across multiple clusters and data sources.
      ○ I store and analyze data from ERP/CRM systems, IoT/mobile devices, social media, geolocation, etc.
      ○ Some of my data is on-premise, some is in the cloud. I move my data from cloud to on-premise and vice versa, and between different clouds.
  • 4. Forrester Calls It Data Fabric: “Bringing together disparate big data sources automatically, intelligently, and securely and processing them in a big data platform technology, using data lakes, Hadoop, and Apache Spark to deliver a unified, trusted, and comprehensive view of customer and business data.”
  • 5. Data Steward Studio [Diagram: Data Steward Studio as a global intelligent data catalog and business glossary (curate and organize assets, Asset 360 single view, data and metadata security) spanning cloud and on-premises clusters: Cluster 1 Dublin, Cluster 2 Las Vegas, Cluster 3 Bangkok. The clusters hold structured and unstructured data, with Apache Ranger shown alongside the data sources.]
  • 6. Dataplane: Bring All Data Under Management. From the edge, through movement, to rest. Hortonworks DataPlane Service is a foundational platform for the delivery of data solutions that will:
      ○ Support enterprise hybrid deployment strategy and adoption of cloud
      ○ Provide common metadata, security, and governance across all deployments
      ○ Simplify enterprise data asset management
      ○ Be extensible to new services: a services enablement layer for rapidly bringing new solutions to market
    [Diagram: Hortonworks DataPlane Service (manage, secure, govern) across multiple clusters and sources in hybrid deployments. Data at rest: Hortonworks Data Platform. Data in motion: Hortonworks Data Flow.]
  • 7. Data Governance: It’s a team sport!
      ○ Sponsor: champions data governance across the enterprise
      ○ Data Owner: accountable for all data generated by an agency
      ○ Data Steward: manages business requirements for data sharing
      ○ Data Curator: implements business data requirements
      ○ Business Data SME: supports the Data Steward in data-related activities
      ○ Data Council: coordinates cross-agency data management activities
  • 8. Goals
  • 9. Data Steward Studio Demo
  • 10. Hortonworks DataPlane Service (DPS)
  • 11. Organize Your Data Assets as Collections
      ○ Data asset collections: an organizational construct for assets, based on business definition, for grouping heterogeneous data
      ○ Create asset collections and attach metadata
      ○ Contextual attributes: Name, Description, Owner, Datalake
      ○ System attributes: Created-on, Modified-on, Modified-by, Created-by, Version
      ○ Search for assets using attribute facets or free text
      ○ View a personalized dashboard of asset collections
      ○ Delete/update data asset collections
      ○ Asset 360 view for assets in a collection
    [Figure: Asset Collections]
  • 12. Discover and Fingerprint your Data Assets
      ○ Computes a profile for data assets as they are ingested or created within the platform, and automatically determines column types from the data values
      ○ Generates key metrics for the data in each column; the metrics can be viewed as box plots, histograms, or pie charts
      ○ Persists profile information in the cluster
      ○ As more data is added, profilers can be scheduled to run again and update the profile metadata for the asset
    [Figures: Data Profiler; Column Statistical Profiler]
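As a rough illustration of what a column-level profile might contain, the sketch below infers a column type from its values and computes the summary statistics a box plot needs. It is plain Python over an in-memory list, not the DSS profiler (which the deck describes as Spark jobs), and the function names `infer_type` and `profile_column` are hypothetical.

```python
from collections import Counter
from statistics import mean, quantiles


def infer_type(values):
    """Guess a column type from its non-null string values (hypothetical helper)."""
    non_null = [v for v in values if v not in (None, "")]
    if not non_null:
        return "unknown"
    # Try the narrowest type first; fall back to string if no cast fits all values.
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_null:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def profile_column(values):
    """Return basic metrics for one column of raw string values."""
    col_type = infer_type(values)
    non_null = [v for v in values if v not in (None, "")]
    profile = {
        "type": col_type,
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
    }
    if col_type in ("int", "float"):
        nums = sorted(float(v) for v in non_null)
        profile.update(min=nums[0], max=nums[-1], mean=mean(nums))
        if len(nums) >= 2:
            # Quartiles are what a box plot is drawn from.
            q1, median, q3 = quantiles(nums, n=4)
            profile.update(q1=q1, median=median, q3=q3)
    else:
        # For categorical data, the most frequent values feed a pie chart or histogram.
        profile["top_values"] = Counter(non_null).most_common(3)
    return profile
```

Calling `profile_column(["1", "2", "3", "4", ""])` treats the empty string as a null and profiles the remaining values as integers; a string column gets its most frequent values instead of quartiles.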
  • 13. Know your Sensitive Data
      ○ Automatically detect and profile sensitive and personal data
      ○ Attach classification annotations for sensitivity
      ○ Manual approval and curation of sensitive data classifications
      ○ Leverage classification-based data protection
      ○ Sensitive data dashboard on Asset 360
    [Figure: Sensitive Data Profiling]
  • 14. Track your Sensitive Data. Sensitive data types:
      ○ IBAN (27 EU countries)
      ○ Credit card numbers
      ○ Email
      ○ Telephone (AMER, EU)
      ○ IP address
      ○ URL
      ○ Passport (12 EU countries)
      ○ National ID (19 EU countries)
      ○ Australian driver's license
      ○ Australian passport
      ○ Australian national ID
  • 15. Track Your Data Asset – Lineage and Impact
      ○ Consolidated upstream lineage and downstream impact
      ○ Detailed click-through to asset properties
    [Figure: Data Lineage and Impact]
  • 16. View Security Policies for your Data Assets
      ○ View security policies on data assets
      ○ View classification-based policies on assets
    [Figure: Security Policies]
  • 17. Monitor Usage of your Data Assets
      ○ Dashboard for access patterns and trends for each asset
      ○ Examples:
          ■ Count of access events
          ■ Top N users over time
          ■ Most recent trail of access audit events
    [Figure: Audit and Monitoring]
  • 18. Data Steward Studio Technical Overview
  • 19. The DPS Ecosystem [Diagram: the DPS platform hosts Data Lifecycle Manager, Data Steward Studio*, Data Analytics Studio*, and Streams Messaging Manager on top of Data Plane Services (authentication, role-based access, service lifecycle management, cluster registration, cluster service discovery and access). On the HDP/HDF cluster: DLM Engine, Profiler Service, Profile Manager, Data Analytics Studio Service.]
  • 20. DSS Fit into DPS / HDP Ecosystem [Diagram: Data Steward Studio* on the DPS platform, over Data Plane Services (DataPlane). On the HDP/HDF cluster: Profiler Service alongside Hive Metastore, Atlas, Ranger, Spark/Livy, and HDFS.]
  • 21. DSS Architecture [Diagram components: Angular UI; Data Steward Studio Play web application; Postgres DB; Knox; Profiler Service; Postgres / MySQL DB; Livy Batch; profilers (Spark jobs); Hive / HDFS data; Atlas; Ranger; summary files on HDFS; Livy Interactive.]
  • 22. Profiler Agent
      ○ A generic framework supporting the registration, scheduling, and management of data processing jobs that help with data discovery, classification, and analysis
      ○ APIs exposed for:
          ■ Registering and managing profilers and profiler instances
          ■ Configuring and scheduling profilers
          ■ Monitoring status
          ■ Managing serving of interactive metrics
      ○ Goals for the profiler agent:
          ■ Extensibility (support new profilers, new asset types, etc.)
          ■ Performance and scalability
  • 23. Profiler Agent Design [Diagram components: Asset Source, Asset Filters, Asset Selector, Priority rules, Profile Queue, Queue stats, Job Scheduler, Profiler Job, Profiler Metrics.]
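One way to read the design slide: assets from a source pass through filters, a selector orders them by priority rules, and a scheduler drains the resulting queue in batches. The sketch below models that flow with a heap. Everything in it is an assumption for illustration, not the DSS implementation: the names `Asset`, `select_assets`, and `next_batch`, the "least recently profiled first" priority rule, and the batch size.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Asset:
    name: str
    last_profiled: int  # epoch seconds; 0 means never profiled (assumed convention)


Filter = Callable[[Asset], bool]


def select_assets(source: Iterable[Asset], filters: List[Filter]) -> List[Tuple[int, str]]:
    """Keep assets that pass every filter and build a min-heap keyed by last_profiled."""
    queue = [(a.last_profiled, a.name) for a in source if all(f(a) for f in filters)]
    heapq.heapify(queue)
    return queue


def next_batch(queue: List[Tuple[int, str]], size: int) -> List[str]:
    """Pop up to `size` asset names, least recently profiled first."""
    batch = []
    while queue and len(batch) < size:
        batch.append(heapq.heappop(queue)[1])
    return batch
```

A scheduler loop would call `next_batch` at each tick and submit one profiler job per batch; a heap keeps selection cheap even when the queue holds many assets.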
  • 24. DSS - Basic profilers (profiler: purpose)
      ○ Sensitivity: identifies sensitive information in Hive tables based on regexes and column headers; aimed at GDPR-like use cases
      ○ Table Stats Profiler: calculates statistical summaries on Hive table columns for understanding the ‘shape’ of data
      ○ Audit Profiler: aggregates usage statistics of tables by different facets such as user, asset name, access status, time, etc.
      ○ Hive Metastore: provides metadata about all tables
  • 25. Sensitive Data Profiling
      ○ 70+ regexes covering national ID numbers and PII (email, credit card, etc.) are shipped and loaded into HDFS
      ○ The sensitive data profiler runs on a sample of Hive tables against these regexes
      ○ 85% weight to the data, 15% to the name of the column
      ○ Tag workflow:
          ■ Matched columns are tagged in Atlas with an attribute state of ‘suggested’
          ■ In the Asset 360 view, a data steward can ‘accept’ or ‘reject’ tags; the state is written back to Atlas
          ■ Ranger policies can be defined against these tags (with attribute state = ‘accepted’) for governance of these columns
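The 85/15 weighting can be read as a blended score: the fraction of sampled values that match a pattern, combined with whether the column name hints at the same type. The sketch below assumes that interpretation. The two simplified regexes, the name hints, the 0.7 threshold, and the function names are illustrative assumptions; the deck does not give the shipped regexes or the exact scoring formula.

```python
import re

DATA_WEIGHT, NAME_WEIGHT = 0.85, 0.15  # weights stated on the slide
THRESHOLD = 0.7                        # assumed cut-off, not from the deck

# tag -> (value pattern, column-name hint); simplified illustrative patterns
PATTERNS = {
    "email": (
        re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
        re.compile(r"e-?mail", re.I),
    ),
    "ip_address": (
        re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),
        re.compile(r"(^|_)ip(_|$)|ip_?addr", re.I),
    ),
}


def score_column(column_name, sample, tag):
    """Blend the share of matching sample values with a column-name match."""
    value_re, name_re = PATTERNS[tag]
    values = [v for v in sample if v]
    data_score = sum(bool(value_re.match(v)) for v in values) / len(values) if values else 0.0
    name_score = 1.0 if name_re.search(column_name) else 0.0
    return DATA_WEIGHT * data_score + NAME_WEIGHT * name_score


def suggest_tags(column_name, sample):
    """Return tags whose score clears the threshold, in state 'suggested'."""
    return {
        tag: "suggested"
        for tag in PATTERNS
        if score_column(column_name, sample, tag) >= THRESHOLD
    }


def review(tags, tag, accepted):
    """Record a data steward's decision; returns a new dict with the updated state."""
    updated = dict(tags)
    updated[tag] = "accepted" if accepted else "rejected"
    return updated
```

Under these assumptions, a column where three of four sampled values look like emails scores 0.6375 on data alone and clears the threshold only if the column name also matches. Only tags moved to `accepted` by `review` would be candidates for tag-based policies.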
  • 26. Asset collections
      ○ Metadata search on Atlas attributes; all queries go through the Atlas DSL API. E.g.:
          ■ All Hive tables with name ‘customer’
          ■ All Hive tables created before a given date
          ■ All Hive tables created before a given date and owned by ‘admin’
      ○ Search -> Gather -> Save workflow
      ○ Collaboration features: comments, ratings, favorites
      ○ All data is stored in the Dataplane DB
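The three example searches map naturally onto Atlas DSL strings. The sketch below only builds the query text and a request URL; it does not call Atlas. The `where` / `and` syntax, the `hive_table` attributes `name`, `createTime`, and `owner`, and the `/api/atlas/v2/search/dsl` endpoint are taken from the Apache Atlas documentation and should be checked against the version in use, since date literal handling in particular may differ. The host name, the cutoff date, and the helper names are placeholders.

```python
from urllib.parse import urlencode


def hive_table_query(name=None, created_before=None, owner=None):
    """Build an Atlas DSL query for hive_table entities.

    Values are inserted as-is; a real client would need to escape quotes.
    """
    clauses = []
    if name:
        clauses.append(f"name = '{name}'")
    if created_before:
        clauses.append(f"createTime < '{created_before}'")
    if owner:
        clauses.append(f"owner = '{owner}'")
    query = "hive_table"
    if clauses:
        query += " where " + " and ".join(clauses)
    return query


def dsl_search_url(base_url, query, limit=25):
    """Compose a DSL search URL (endpoint path assumed from the Atlas v2 REST API)."""
    return f"{base_url}/api/atlas/v2/search/dsl?" + urlencode({"query": query, "limit": limit})


if __name__ == "__main__":
    # Placeholder host; replace with the Atlas server of the cluster being searched.
    base = "http://atlas.example.com:21000"
    for q in (
        hive_table_query(name="customer"),
        hive_table_query(created_before="2018-01-01"),
        hive_table_query(created_before="2018-01-01", owner="admin"),
    ):
        print(dsl_search_url(base, q))
```

The results of such a search are what the Search -> Gather -> Save workflow would collect into an asset collection.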
  • 27. Datalake & Asset collection Dashboards
      ○ Three profilers contribute information:
          ■ Metastore profiler: captures the list of Hive tables created on a daily basis
          ■ Sensitivity profiler: summary of tables with sensitive information, used for generating aggregates
          ■ Audit profiler: summary of Ranger audit logs stored in HDFS
      ○ All profiler information is stored as serialized files in HDFS
      ○ Queries for dashboard widgets are served by Livy interactive sessions reading this information
      ○ Livy session management is done by the profiler agent
  • 28. Asset 360: like a ‘Facebook profile’ page for a Hive table
      ○ Metadata (number of rows/columns, sensitivity distribution)
      ○ Lineage
      ○ Access summaries
      ○ Schema and data distributions (the shape of the data)
      ○ Applicable Ranger policies
      ○ Ranger audit logs
  • 29. Ranger Audit Profiler
      ○ Prerequisites:
          ■ Ranger audit logging to HDFS must be enabled for Hive
          ■ A Ranger policy allowing the dpprofiler user access to the Ranger audit directory is required; the product adds this automatically at MPack install
      ○ Two jobs are run:
          ■ The first analyzes the previous day's logs
          ■ The second analyzes the current day's logs at a faster interval
      ○ Provides information about:
          ■ Top users
          ■ Top assets
          ■ Authorized vs. unauthorized accesses
          ■ The above joined with other information to get top asset collections, top sensitive assets being used, ...
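The aggregates the audit profiler reports amount to group-and-count operations over audit events, followed by a join against collection membership. A minimal sketch, assuming each event is a dict with `user`, `resource`, and `allowed` keys (hypothetical field names; the deck does not show the Ranger audit record layout) and using plain Python rather than the Spark jobs DSS runs:

```python
from collections import Counter


def summarize_audits(events, top_n=3):
    """Count accesses per user and per asset, and split by access result."""
    users = Counter(e["user"] for e in events)
    assets = Counter(e["resource"] for e in events)
    authorized = sum(1 for e in events if e["allowed"])
    return {
        "top_users": users.most_common(top_n),
        "top_assets": assets.most_common(top_n),
        "authorized": authorized,
        "unauthorized": len(events) - authorized,
    }


def top_collections(events, collections, top_n=3):
    """Join access events with asset collection membership.

    `collections` maps a collection name to the set of asset names it contains.
    An asset in several collections counts once for each of them.
    """
    counts = Counter()
    for e in events:
        for name, members in collections.items():
            if e["resource"] in members:
                counts[name] += 1
    return counts.most_common(top_n)
```

The same join, run against the set of assets tagged as sensitive instead of collection membership, would give the "top sensitive assets being used" figure.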
  • 30. Data Steward Studio recap [Figure panels: Assets & Collections; Audit and Monitoring Dashboard; Security Policies; Data Lineage and Impact; Data Profiler]. Data Steward Studio (DSS) capabilities:
      ○ Profile data to understand its shape and structure
      ○ Organize and curate data, e.g. by the domains it belongs to or by data usage
      ○ Identify sensitive data
      ○ Collaborate with broader teams on how data needs to be used and by whom, and provide community ratings for crowdsourcing knowledge
      ○ Monitor ongoing usage, visualize chain of custody and trustworthiness for longer-term use, and understand data protection
    DSS provides the “tooling” part of the People, Processes, and Technology required for Hybrid Data Lake Governance.
  • 31. Check Out These Sessions:
      ○ What Is New In Apache Atlas 1.0? When: Wednesday June 20, 11:00 AM - 11:40 AM. Where: Grand Ballroom 220B
      ○ Overview of New Features in Apache Ranger. When: Wednesday June 20, 2:00 PM - 2:40 PM. Where: Executive Ballroom 210B/F
      ○ GDPR Crash Course. When: Wednesday June 20, 3:00 PM - 6:00 PM. Where: Meeting Room 212C/D
      ○ Birds of a Feather: Security & Governance. When: Wednesday June 20, 5:40 PM - 6:50 PM. Where: Executive Ballroom 210B/F
      ○ GDPR-Focused Partner Community Showcase for Apache Ranger and Apache Atlas. When: Thursday June 21, 9:30 AM - 10:10 AM. Where: Meeting Room 230A
  • 32. Thank you!