© Hortonworks Inc. 2011–2018. All rights reserved.
Hortonworks confidential and proprietary information
Understanding Your Crown Jewels:
Finding, Organizing, and Profiling Sensitive Data Across Data Lakes
Ashwin Rajeev, Engineering
Srikanth Venkat, Senior Director Product Management
HDF HDP
Next Generation Data Problems
My Data Is Spread Across Multiple
Clusters and Data Sources
I Store & Analyze Data From
ERP/CRM Systems, IoT/Mobile
Devices, Social Media, Geolocation, etc.
Some of my data is on-premises,
some is in the cloud. I move my data
from cloud to on-premises and vice
versa, and between different clouds
What If…
Unified Security & Governance Model
Aware of Data Sources
Enable New Services
Diagram: structured and unstructured clusters in the cloud and on premises, with data centers in Dublin, Las Vegas, and Bangkok.
Forrester Calls It Data Fabric
“Bringing together disparate big data sources automatically, intelligently, and
securely and processing them in a big data platform technology, using data
lakes, Hadoop, and Apache Spark to deliver a unified, trusted, and
comprehensive view of customer and business data.”
What is Hortonworks DataPlane Service?
Hortonworks DataPlane Service is a platform with extensible data management
services for:
• Addressing compliance and regulatory requirements for the enterprise
• Providing consistent security & governance across the data landscape
• Enabling centralized management of data assets
• Responsible data sharing and collaboration
Data Steward Studio (DSS)
Suite of capabilities that allows users to understand,
secure, and govern data across enterprise data lakes
Ensure consistent security and governance
for data assets across tiers
• Curate, discover and organize data assets based on business classifications, purpose, protections, relevance, etc.
• Govern proper usage and lineage of data assets to identify schema, classification and view lineage/data supply chain
• Understand and audit data asset security and use for anomaly detection, forensic audit/compliance & proper control mechanisms
…all across multiple types and tiers of data
Technical Preview Available
Hortonworks DataPlane Service: Extensible Services
DATA STEWARD STUDIO (DSS)
• Discover & Fingerprint Data
• Smart Enterprise Search
• Data & Metadata Security
• Data Lineage & Impact Analysis
• Enterprise Data Catalog
• Organize & Curate Data
Data Steward Studio (DSS)
Suite of capabilities that allows users to understand, secure, and govern data across enterprise data lakes
Organize and curate data globally
• Organize data based on business classifications, purpose, protections needed, etc.
• Promote responsible collaboration across enterprise data workers
Understand where relevant data is located
• Cataloging and searching to locate relevant data of interest:
– sensitive data (e.g., GDPR), highly used data, high-risk data, etc.
Understand how data is interpreted for use
• Basic descriptions: Schema, classifications (business cataloging), encodings
• Statistical models and parameters
• User annotations, wrangling scripts, view definitions, etc.
Understand how data is created and modified
• Visualize ‘upstream’ and ‘downstream’ lineage
• How does schema or data evolve?
• View and understand data supply chain (pipelines, versioning, evolution)
Understand how data access is secured/protected and audit usage
• Who can see which data and metadata (e.g. based on business classifications) under what conditions (security policies, data protection, anonymization)?
• Who has accessed what data from a forensic audit/compliance perspective?
• Visualize access patterns and identify anomalies
Goals
DSS v1.0 Capabilities
Asset Collections
• Organizational construct for grouping heterogeneous assets based on a business definition
• Create asset collections with metadata
– Contextual attributes: name, description, owner, data lake
– System attributes: Created-on, Modified-on, Modified-by, Created-by, Version
• Search for assets using attribute facets or free text
• View personalized dashboard of asset collections
• Delete data asset collections
• Asset 360 view for assets in collection
Asset Organization & Cataloging
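The asset collection attributes listed above can be sketched as a small data structure. This is a minimal illustration only, not the DSS data model: the class names `Asset` and `AssetCollection`, the `search` helper, and all sample values are hypothetical, and for brevity the search runs over collections rather than individual assets.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Asset:
    name: str
    asset_type: str                      # hypothetical type label, e.g. "hive_table"
    tags: set = field(default_factory=set)


@dataclass
class AssetCollection:
    # Contextual attributes named on the slide
    name: str
    description: str
    owner: str
    datalake: str
    # System attributes named on the slide
    created_by: str = ""
    created_on: datetime = None
    modified_by: str = ""
    modified_on: datetime = None
    version: int = 1
    assets: list = field(default_factory=list)

    def __post_init__(self):
        # Fill in system attributes that the caller did not supply.
        now = datetime.now(timezone.utc)
        self.created_by = self.created_by or self.owner
        self.modified_by = self.modified_by or self.created_by
        self.created_on = self.created_on or now
        self.modified_on = self.modified_on or self.created_on

    def add_asset(self, asset, user):
        # Every change records who made it and bumps the version.
        self.assets.append(asset)
        self.modified_by = user
        self.modified_on = datetime.now(timezone.utc)
        self.version += 1


def search(collections, text=None, **facets):
    """Return collections matching every facet exactly and, if given, the free text."""
    results = []
    for c in collections:
        if any(getattr(c, key) != value for key, value in facets.items()):
            continue
        if text and text.lower() not in f"{c.name} {c.description}".lower():
            continue
        results.append(c)
    return results
```

A call such as `search(collections, text="personal", datalake="dublin")` combines one attribute facet with free text; a real catalog would delegate this to a search index rather than scan in memory.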
DSS v1.0 Capabilities
Metadata & Structure for Hive Tables
• View details of a data asset
• Compute data profile (univariate column statistics) for columns of an asset
• Show relevant visualizations for profiled data (histogram, box plot, pie chart)
• Schedule computing data profiles for all assets
• Persist profile information in-cluster
• View data asset classifications
• Ability to show technical and contextual metadata
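To make "univariate column statistics" concrete, here is a minimal in-memory sketch of a column profile. It is an assumption-laden illustration, not the DSS profiler (which, per the slides, runs in-cluster against Hive tables); the function names `profile_column` and `profile_table` and the choice of statistics are hypothetical.

```python
import statistics
from collections import Counter


def profile_column(values):
    """Compute univariate statistics for one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    profile = {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
    }
    numeric = present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if numeric:
        # Quartiles feed a box plot; quantiles() needs at least two points.
        if len(present) > 1:
            q1, median, q3 = statistics.quantiles(present, n=4)
        else:
            q1 = median = q3 = present[0]
        profile.update(
            min=min(present), max=max(present), mean=statistics.fmean(present),
            q1=q1, median=median, q3=q3,
        )
    else:
        # Frequency counts feed a histogram or pie chart of categories.
        profile["top_values"] = Counter(present).most_common(3)
    return profile


def profile_table(rows):
    """Profile every column of a table given as a list of dicts."""
    columns = rows[0].keys() if rows else []
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}
```

Each column is profiled independently of the others, which is what makes the statistics univariate and the work easy to schedule per asset.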
Lineage & Impact
• Consolidated upstream lineage & downstream impact
• Detailed “click-through” to asset properties
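Upstream lineage and downstream impact are both reachability questions on a graph of "derived from" edges. The sketch below assumes lineage is available as plain (source, target) pairs; the function name `reachable` and the table names are hypothetical, and DSS itself reads lineage from Atlas rather than from a list like this.

```python
from collections import deque


def reachable(edges, start, direction="downstream"):
    """Return every asset reachable from `start` by following lineage edges.

    `edges` is a list of (source, target) pairs meaning "target is derived
    from source". Use direction="upstream" to walk the edges backwards.
    """
    neighbours = {}
    for source, target in edges:
        a, b = (source, target) if direction == "downstream" else (target, source)
        neighbours.setdefault(a, set()).add(b)

    seen, queue = set(), deque([start])
    while queue:  # breadth-first traversal
        for nxt in neighbours.get(queue.popleft(), ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Calling it with `direction="downstream"` answers "what breaks if this asset changes?", while `direction="upstream"` answers "where did this data come from?".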
Asset 360
DSS v1.0 Capabilities
Security policies
• View specific security policies on data assets
• View classification-based policies on assets
Audit & monitoring
• Ability to show most recent tail of access audit events
• Ability to provide a dashboard of access patterns for asset
– Timeline of access
– Top K users
– Count of access events (permitted/blocked)
– Top policies that fired
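The dashboard figures above are simple aggregations over audit events. As a rough sketch, assuming each event is a dict with hypothetical keys `user`, `allowed`, `policy`, and `hour` (this is not the Ranger audit schema), they could be computed like this:

```python
from collections import Counter


def access_summary(events, k=3):
    """Summarise audit events given as dicts with user, allowed, policy, hour keys."""
    return {
        # Top K users by number of access events
        "top_users": Counter(e["user"] for e in events).most_common(k),
        # Count of access events, split into permitted and blocked
        "permitted": sum(1 for e in events if e["allowed"]),
        "blocked": sum(1 for e in events if not e["allowed"]),
        # Policies that fired most often
        "top_policies": Counter(e["policy"] for e in events).most_common(k),
        # Timeline of access: (hour, event count) pairs in chronological order
        "timeline": sorted(Counter(e["hour"] for e in events).items()),
    }
```

An unusual spike in `blocked` for a single user or hour is the kind of pattern an anomaly view would surface.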
Asset 360
DSS v1.0 Capabilities
Collaboration
• Rate asset collections
• Favorite/personalize data assets
• Share asset collections
• Threaded discussion on asset collections
Collaboration
Deep Dive
High-Level Architecture of DP/DSS
Diagram components: Data Plane Core Services; Data Plane DB; LDAP / AD; DSS Single User Store; Cluster 1 … Cluster N, each with Knox, Ambari, HDFS, Hive, and a DSS Engine.
Data Plane core services
• Set of loosely coupled collaborative services delivered as Docker containers.
• Service APIs allow applications such as DSS to be built using RESTful APIs which hide the complexity of talking to different services in each cluster
• Core services provide cross-cutting concerns such as
– Service registration and discovery
– An API gateway with fault tolerance and various load balancers
– User identity federation and RBAC
– Access to cluster data using Apache Knox
– Audit and logging
– Monitoring and health checks
• Allows developers to pick and choose APIs to deliver business applications, e.g., DLM Engine (DLM); Atlas and Ranger (DSS)
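Service registration and discovery, health checks, and load balancing fit together roughly as follows. This is a generic in-memory sketch of the pattern, not the DataPlane implementation: the `ServiceRegistry` class, its methods, the round-robin strategy, and the URLs are all assumptions made for illustration.

```python
import itertools


class ServiceRegistry:
    """In-memory registry: services register instances, clients discover healthy ones."""

    def __init__(self):
        self._instances = {}   # service name -> list of instance URLs
        self._healthy = {}     # instance URL -> last reported health
        self._cursors = {}     # service name -> round-robin counter

    def register(self, service, url):
        self._instances.setdefault(service, []).append(url)
        self._healthy[url] = True
        self._cursors.setdefault(service, itertools.count())

    def report_health(self, url, healthy):
        # A health checker would call this after probing the instance.
        self._healthy[url] = healthy

    def discover(self, service):
        """Return the next healthy instance, round-robin; raise if none is available."""
        candidates = [u for u in self._instances.get(service, []) if self._healthy[u]]
        if not candidates:
            raise LookupError(f"no healthy instance of {service!r}")
        return candidates[next(self._cursors[service]) % len(candidates)]
```

An API gateway built on such a registry can route each request to `discover(service)` and retry on failure, which is one simple route to the fault tolerance mentioned above.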
Data Plane Applications
• Self-contained web applications which use a related set of Data Plane core APIs to deliver business value.
• Delivered as Docker containers.
• Independent and isolated from each other, with SSO between applications
• Convention-driven programming model allows applications to be programmed in any tech stack.
Diagram components: DSS, Data Plane Core Services, Knox, Atlas, Ranger, Hive
DSS Architecture
• DSS uses Atlas, Ranger and Knox as foundational services
• DSS installs a component in each cluster to enable data profiling.
Diagram components: DSS, Data Plane services, Profiler Engine, Cluster, Data profilers, Profiler compute, Hive
DEMO
Scenario – HortoniaBank
Diagram labels: On Premise Data Lakes; Cloud Data Lakes; Cluster 1: Dublin; Cluster 2: San Francisco; Cluster 3: Prague.
Data sets (structured and unstructured): Marketing, Demographics, Electronic medical records, CRM, POS, Social, Weblogs & Feeds, Transactional, Mobile, IoT, Personal Data.
Thank You!
