Metadata Management
in Big Data
Data Management Challenges
@ezzibdeh
Tariq Ezzibdeh
Aim
• Outline some perspectives on metadata management principles that
apply in the big data space and beyond
• Provide data governance foundations for the data space that will
outlast the current technologies and serve the needs of the future
• Discuss technologies and solutions currently in the market
Big Data - Overview
• Big Data's 5 V's:
• Volume
• Velocity
• Variety
• Veracity
• … which together yield the fifth V: Value
• Today's platform is a set of relatively loosely coupled components:
• Data is stored on HDFS – the file system
• The catalogue of data and its schema is maintained in a separate service – TBC!
• Query front ends – query engines chosen for different requirements
Platform Architecture – Modern Architecture
(Diagram: data flows from the data sources, through acquisition, into the platform's data systems.)
• Staging Zone – ETL and data standardization
• Pristine Archive – compressed raw data (gzip etc.)
• Data Warehouse – immutable data
• Analytics Zone – allocated data changes
• Schema Catalogue – a well-defined reference to data structures and attributes
• Data Ledger – tracks data and its access, with lineage and operations
• The big data platform feeds data marts and, through UI/APIs, the apps on top
Source: Hortonworks
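To make the Schema Catalogue and Data Ledger above concrete, here is a minimal sketch of what an entry in each might hold; the structure and field names are illustrative assumptions, not Hortonworks' format.

# Illustrative sketch only – field names are assumptions, not a product format.

# A Schema Catalogue entry: a well-defined reference to a data set's structure.
catalogue_entry = {
    "dataset": "sales.orders",
    "location": "hdfs:///warehouse/sales/orders",
    "format": "parquet",
    "attributes": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "decimal(10,2)"},
        {"name": "placed_at", "type": "timestamp"},
    ],
}

# A Data Ledger event: one tracked operation on the data set, with lineage.
ledger_event = {
    "dataset": "sales.orders",
    "operation": "ETL_STANDARDIZE",
    "inputs": ["staging.raw_orders"],  # lineage: where this data came from
    "user": "etl_service",
    "timestamp": "2015-09-01T10:15:00Z",
}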
Why do we need to manage metadata for big data platforms?
• Large volumes of data landing in Hadoop/big data
• A growing number of users working with the data
• The need for effective control and consumption of data
The implementation needs to:
• Offer good data visibility across your cluster
• Capture data lineage across source systems and in the platform
• Audit and record operations that are performed in the platform
• Enforce policies that are defined by the platform stewards
• Help reduce data redundancy on the platform
Source: Cloudera
Metadata in Action
Metadata – What is it?
Data about Data!
• Business Metadata
 Supplies the business context around data, such as a business term's
name, definition, owners or stewards, and associated reference data
• Technical Metadata
 Provides technical information about the data, such as the name of the
source table, the source column name, and the data type (e.g., string, integer)
• Operational Metadata
 Furnishes information about the use of the data, such as the date last updated,
number of times accessed, or date last accessed
Source: Informatica
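As a concrete illustration, here is a minimal sketch of how all three kinds of metadata might be recorded for a single column; the structure and field names are assumptions for illustration, not Informatica's model.

# Hypothetical record combining all three metadata types for one column.
column_metadata = {
    "business": {  # business context around the data
        "term": "Customer Lifetime Value",
        "definition": "Projected net revenue from a customer relationship",
        "steward": "finance-data-stewards",
    },
    "technical": {  # where and how the data is physically stored
        "source_table": "crm.customers",
        "column": "cust_ltv",
        "data_type": "decimal(12,2)",
    },
    "operational": {  # how the data is used
        "last_updated": "2015-08-30",
        "access_count": 1247,
        "last_accessed": "2015-09-01",
    },
}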
Why do I need all this metadata?
• A data lake will contain all types of data – log streams via Kafka,
databases via Sqoop… don't let your lake turn into a swamp!
• Consistency of definitions – to reconcile differences in terminology such as
"clients" vs. "customers" and "revenue" vs. "sales"
• Clarity of data lineage – the origins of a data set, granular enough to
describe information at the attribute level, including the operations performed on it
• To understand data usage on your cluster
• To optimize queries and views
• Compliance and regulatory needs:
• Compliance – capture, store, and move data per Sarbanes-Oxley, HIPAA,
Basel II
• Security – authorization and authentication – handling sensitive data
• Auditing – recording every access attempt
• Archive & retention – data life-cycle policies
Source: Teradata/TechTarget
Metadata System Architecture
Topologically, a metadata repository follows one of three architectural styles:
• Centralized metadata repository
 Pros: efficient access, adaptability, scalability, and high performance
 Cons: a single point of failure, and continuous synchronization is required
• Distributed metadata repository
 Pros: real-time access to the metadata in each source, so it is always up to date
 Cons: overhead in keeping up with source-system configuration changes and in ensuring HA
• Federated or hybrid metadata repository
 Central definition storage with references to the proper locations of the
authoritative definitions
Source: TechTarget
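A minimal sketch of the federated style, with a hypothetical stub standing in for a real source-system client: the central service keeps only the definition plus a reference to the system holding the authoritative metadata, and delegates lookups to it at read time.

# Hypothetical sketch of a federated metadata repository.
class FederatedMetadataRepo:
    def __init__(self):
        self.registry = {}  # dataset name -> (definition, source catalog)

    def register(self, dataset, definition, source_catalog):
        """Store the central definition and a reference to the source system."""
        self.registry[dataset] = (definition, source_catalog)

    def lookup(self, dataset):
        """Return the central definition plus fresh metadata from the source."""
        definition, source = self.registry[dataset]
        return {"definition": definition, "details": source.describe(dataset)}

class HiveCatalogStub:
    """Stand-in for a real source-system client (e.g. the Hive metastore)."""
    def describe(self, dataset):
        return {"columns": ["id", "amount"],
                "location": "hdfs:///warehouse/" + dataset}

repo = FederatedMetadataRepo()
repo.register("sales.orders", "All customer orders", HiveCatalogStub())
print(repo.lookup("sales.orders"))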
Use Cases: The Need for Metadata
Use Cases – Analytics
1. Finding the data: data scientists spend a lot of time finding the
correct columns for variable selection
• Around 80% of a data scientist's time goes into column investigation with SMEs
2. Profiling the data: reduce the time spent on data profiling
through ad-hoc queries
• ~78% of the queries run on the cluster are profiling queries
3. Tracking the transformations: data scientists would like to understand
how data sets are derived
• Not fully tracked today, except at a high level
Source: Aetna
1. Finding the data: Challenges
• Hive requires relatively manual traversal of the schema to find the
right table and columns
• HDFS likewise requires traversing the directory listing to find a file
• Any documentation external to the system becomes outdated and
is not always reliable
• There is no simple way to add business metadata
Source: Aetna
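The manual traversal looks roughly like this; a sketch assuming a PyHive connection to HiveServer2, with placeholder host and search keyword.

# Sketch of the manual hunt for a column, assuming PyHive and HiveServer2.
from pyhive import hive

conn = hive.connect(host="hiveserver.example.com", port=10000)  # placeholder
cursor = conn.cursor()

cursor.execute("SHOW TABLES")
tables = [row[0] for row in cursor.fetchall()]

# Walk every table's schema by hand, looking for a likely column name.
for table in tables:
    cursor.execute("DESCRIBE {}".format(table))
    for col_name, col_type, _comment in cursor.fetchall():
        if "customer" in col_name.lower():
            print(table, col_name, col_type)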
HDFS/Hive Architecture
(Diagram: HDFS/Hive architecture. Sources: hadoop.apache.org; Ben Lever, SlideShare)
1. Finding the data: Solutions
• Runtime capture of Hive and HDFS metadata, stored in a repository
• Provide an API to query the metadata and search across it
• Provide an API or other means to enrich the data with its business
context
(Diagram: Hive, HDFS, and ingestion via Sqoop feed technical/physical metadata into Apache Atlas, where it is enriched with business metadata.)
Source: Aetna
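As a sketch of what such a query API could look like, here is a free-text search against Apache Atlas using the v2 basic-search REST endpoint (host and credentials are placeholders; the incubator-era releases discussed in this talk expose an older API).

# Sketch: free-text search for data sets in Apache Atlas (v2 REST API).
import requests

ATLAS = "http://atlas.example.com:21000"  # placeholder host
auth = ("admin", "admin")                 # placeholder credentials

resp = requests.get(
    ATLAS + "/api/atlas/v2/search/basic",
    params={"query": "customer", "typeName": "hive_table", "limit": 10},
    auth=auth,
)
for entity in resp.json().get("entities", []):
    print(entity["typeName"], entity["attributes"]["qualifiedName"])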
2. Profile of Data: Challenges and Solutions
• Accessing the Hive metastore directly would
introduce latency in production
• The Hive metastore lacks comprehensive
profiling information
(Chart: average daily queries – 78% profiling, 18% exploratory, 4% production.)
• Provide a system where business and
technical data are cross-referenced
• Provide a framework for data scientists
to plug in additional profiling
Source: Aetna
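A sketch of the profiling job whose results such a system would cache, assuming PySpark with Hive support; computing min, max, and a coarse value distribution once, and storing the results, spares the cluster the repeated ad-hoc profiling queries.

# Sketch: compute basic profile metrics once, for a profile store to serve.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("sales.orders")  # placeholder Hive table

profile = {}
for col in df.columns:
    stats = df.select(F.min(col).alias("min"), F.max(col).alias("max")).first()
    top_values = df.groupBy(col).count().orderBy(F.desc("count")).limit(5).collect()
    profile[col] = {
        "min": stats["min"],
        "max": stats["max"],
        "distribution": {row[col]: row["count"] for row in top_values},
    }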
3. Track the transformation: Challenges and Solutions
• Documenting transformations is manual
and difficult to scale
• A mechanism for auditing data pipelines
is still lacking
• Data quality and provenance work is too
manual
• Leverage the metadata already captured to
reconstruct the transformations
• Provide an API to query transformations
• Provide a visualization of the
transformations
Source: Aetna
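For the "API to query transformations", a sketch using the Apache Atlas v2 lineage REST endpoint; the host, credentials, and entity GUID are placeholders.

# Sketch: fetch upstream/downstream lineage for a data set from Apache Atlas.
import requests

ATLAS = "http://atlas.example.com:21000"  # placeholder host
guid = "2f1a...e9"                        # placeholder entity GUID

resp = requests.get(
    ATLAS + "/api/atlas/v2/lineage/" + guid,
    params={"direction": "BOTH", "depth": 3},
    auth=("admin", "admin"),              # placeholder credentials
)
lineage = resp.json()
# Each relation is an edge between entities (e.g. a table and the process
# that produced it), which is what a lineage visualization would draw.
for edge in lineage.get("relations", []):
    print(edge["fromEntityId"], "->", edge["toEntityId"])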
What do we need?
1. A searchable platform covering business and technical metadata for all
data types
2. A data profile store with basic metrics of the data:
• Min
• Max
• Column distribution
3. Visual lineage of the data flow from the source systems through the different
components of the platform:
• ETL operations – high-level view
• Analytics queries
4. Automated, metadata-driven data ingestion, and thus management (a sketch follows below):
• The data lake concept relies on capturing a robust set of attributes for every piece of content
within the lake
• Maintaining this metadata requires a highly automated metadata extraction, capture, and
tracking facility.
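A minimal sketch of what metadata-driven ingestion (point 4) could mean in practice: the ingestion job is generic, and everything data-set-specific comes from a metadata record. All names here are illustrative assumptions.

# Sketch: a generic ingestion step driven entirely by a metadata record.
ingest_spec = {
    "source": {"type": "jdbc",
               "url": "jdbc:mysql://db.example.com/crm",  # placeholder
               "table": "customers"},
    "target": {"path": "hdfs:///lake/raw/crm/customers", "format": "parquet"},
    "schedule": "daily",
    "owner": "crm-team",
    "retention_days": 365,
}

def ingest(spec):
    """Run one ingestion described entirely by its metadata record."""
    print("Pulling {src[table]} from {src[url]}".format(src=spec["source"]))
    print("Landing as {fmt} at {path}".format(fmt=spec["target"]["format"],
                                              path=spec["target"]["path"]))
    # In a real platform this step would also register the new data in the
    # schema catalogue and record the operation in the data ledger.

ingest(ingest_spec)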
Solutions for Hadoop
Apache Atlas – deep dive
• Apache Atlas Capabilities: Overview
• Data Classification
• Import or define a taxonomy of business-oriented annotations
for data
• Define, annotate, and automate the capture of relationships
between data sets
• Export metadata to third-party systems
• Centralized Auditing
• Capture security access information
• Capture operational information for executions, steps,
and activities
• Search & Lineage (Browse)
• Text-based search locates relevant data and audit
events across the data lake quickly and accurately
• Browsable visualization of data-set lineage lets users
drill down into operational, security, and provenance
information
• Security & Policy Engine
• Rationalize compliance policies at runtime based on data
classification schemes
Source: Hortonworks
Open-source Incubator project
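As a sketch of the classification capability, here is how an entity could be tagged via the Atlas v2 REST API; the host, credentials, GUID, and tag name are placeholders, and the tag type must already be defined in Atlas.

# Sketch: attach a classification (tag) to an entity in Apache Atlas.
import requests

ATLAS = "http://atlas.example.com:21000"  # placeholder host
guid = "2f1a...e9"                        # placeholder entity GUID

resp = requests.post(
    ATLAS + "/api/atlas/v2/entity/guid/{}/classifications".format(guid),
    json=[{"typeName": "PII"}],           # tag type must already exist in Atlas
    auth=("admin", "admin"),              # placeholder credentials
)
print(resp.status_code)  # no body is expected on success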
Demo
Apache Atlas in action!
Possible solutions for other platforms
Netflix – Managing Data Platforms
Source: Netflix
Possible Solutions for other Platforms
Metacat
• Applies metadata management
at the service layer
• A federated metadata catalog for
the whole data platform
• A proxy service in front of the different
metadata sources
• Data metrics, data usage,
ownership, categorization, and
retention policy …
• A common interface for tools to
interact with metadata
Tracking Data Difference
• Applies metadata management
at the service layer
• Tracks the changes to
documents/entities
• Custom change tracking through
logs collected in MongoDB, or via
a module such as Mongoid
Netflix OSS
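A sketch of how a tool might read table metadata through Metacat's federated REST interface; the path shape follows Metacat's published API as I understand it, and the host, catalog, database, and table names are all placeholders.

# Sketch: fetch table metadata through Metacat's federated catalog API.
import requests

METACAT = "http://metacat.example.com:8080"  # placeholder host
url = "{}/mds/v1/catalog/{}/database/{}/table/{}".format(
    METACAT, "prodhive", "sales", "orders")  # placeholder names

table = requests.get(url).json()
print(table.get("name"))    # table reference in the federated namespace
print(table.get("fields"))  # column metadata from the underlying source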
Where Else?
{ "Description": "A containerized foobar",
"Usage": "docker run --rm example/foobar [args]",
"License": "GPL",
"Version": "0.0.1-beta",
"aBoolean": true,
"aNumber" : 0.01234,
"aNestedArray": ["a", "b", "c"] } <meta name=”description” content=”155
characters of message matching text
with a call to action goes here”>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.JOSA.Meta</groupId>
<artifactId>project</artifactId> <version>1.0</version>
</project>
Notes - Summary
• Consider the different types of metadata you need to manage
• Build a robust descriptive dictionary for the data
• Manage metadata as a team effort. It has many benefits, so keep the
process agile but effective.
Finally…remember that
One’s Metadata – d/dx – is someone else’s Data!
Resources
• HDP 2.3 Preview Sandbox VM (Hortonworks)
– http://hortonworks.com/hdp/whats-new/
• Apache Atlas:
– http://atlas.incubator.apache.org/
– http://incubator.apache.org/projects/atlas.html
– https://git-wip-us.apache.org/repos/asf/incubator-atlas.git
• Metadata Management (general)
– https://www.informatica.com/content/dam/informatica-com/global/amer/us/collateral/white-paper/metadata-management-data-governance_white-paper_2163.pdf
Questions..?
Contact info:
Tariq Ezzibdeh – tariqzibdeh@gmail.com
Editor's Notes

• #4 How many of you use Hadoop? And in production?
• #9 Value and governance itself; predictive powers like entity resolution; value related to cluster health; regulatory.
• #14 Find sources.
• #26 For the platform of today, or whatever comes after Hadoop, I guess.