Sharing metadata across the data lake and streams
2. © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Motivating Use Cases
3.
ETL
[Diagram: Spark and Hive on Tez reading from HDFS/S3, with HMS, Atlas, and Ranger providing metadata, governance, and security]
4.
Data Warehousing
[Diagram: Hive LLAP over HDFS/S3, with HMS, Atlas, and Ranger providing metadata, governance, and security]
5.
Streaming
[Diagram: Spark reading from Kafka, with the HWX Schema Registry providing message schemas]
6.
Issues
If you are using the Hive Metastore (HMS) with a non-Hive system, you still have to install Hive
No ability to share metadata between streaming and batch
– HMS does not know what is in Kafka
– The Schema Registry does not know what is in HDFS/S3
Admins are required to maintain two separate metadata repositories, one for batch and one for streaming
7.
Grand Vision
[Diagram: Hive LLAP, Hive on Tez, and Spark over both HDFS/S3 and Kafka, with a unified HMS + Schema Registry (SR), Atlas, and Ranger underneath]
8.
Between Us and the Grand Vision
Make HMS separable from Hive
Unify HMS and Schema Registry so batch and streaming can see each other’s data
– Also reduces the number of metadata systems admins have to install and maintain
9.
Making the Metastore Standalone
10.
Breaking out the Metastore
HMS is already widely used beyond Hive: Impala, Presto, and Spark, to name a few
– Want to make it easier for these and other systems to use HMS
In Hive 3.0 the Metastore will be released as a separate module
Can be installed and run without the rest of Hive
– A few features are missing when Hive is not present: e.g. the compactor
– These will be added in the future
Backwards compatibility maintained for clients
– A few small changes for server hook implementations
Intent is to make it a separate Apache project
– Enables better collaboration with non-Hive projects
– Still in discussion with the Hive PMC on this
11.
Is this HCatalog 2.0?
Didn’t we do this before? Wasn’t it called HCatalog? No, HCatalog is different
HCatalog focuses on making the Metastore accessible to MapReduce, Pig, and other applications
– Includes metadata access
– Also includes data access (serdes, object inspectors, and input/output formats)
The Metastore stores metadata, including which serdes etc. to use, but does not provide readers and writers
HCatalog stays with Hive in this split; it does not go with the Metastore
– Because it includes the data access
13.
Introduction to Hortonworks Schema Registry
Provides a central repository for messages’ metadata
– Works with Apache Kafka, Apache NiFi
Every schema has a name: e.g. temp_sensor_data
– Schema is generally tied to a Kafka topic
Schemas can have one or more versions
– Different messages in a topic may have different versions of the schema
– Compatibility between schema versions can be none, backwards, forwards, or both
Schemas are defined in JSON text
Java/REST API for programs, UI for humans
Apache licensed; work is underway to contribute it to the Metastore now that the Metastore is a separate module
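As an illustration of the "defined in JSON text" point, a registered schema such as temp_sensor_data might look like the following Avro-style record. The field names and types here are illustrative assumptions, not taken from the deck:

```json
{
  "type": "record",
  "name": "temp_sensor_data",
  "fields": [
    {"name": "sensor_id", "type": "long"},
    {"name": "temperature_c", "type": "double"},
    {"name": "reading_time", "type": "long", "doc": "event time, epoch millis"}
  ]
}
```

Under a backwards-compatibility setting, a version 2 of this schema could, for example, add a new field with a default value so that readers of version 2 can still consume version 1 messages in the same topic.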
15.
Warning: Slideware ahead
16.
Schema Registry Perspective
Use Case: Stream processing applications need access to Hive tables
Example:
• A stream userEvents
• An application that flags users who have called support in the last 24 hours
• Hive has a record of support calls; Kafka does not
Kafka topic userEvents
Schema:
{ "group": "kafka",
  "fields": [{
    "userid": "long",
    "eventtype": "string",
    ...
  }]
}
Hive table support_calls
userid long
calltime timestamp
summary string
Schema supportCalls:
{ "group": "hive",
  "fields": [{
    "userid": "long",
    "calltime": "timestamp",
    "summary": "string"
  }]
}
• Because HMS and SR are unified, streaming apps can view this table as an SR schema
• The app can cache this table every hour
• Do a join as events arrive to flag users who need extra attention
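The cache-and-join pattern on this slide can be sketched in a few lines of Python. Everything below is a hypothetical illustration: the inlined support_calls snapshot and the helper names (recent_callers, flag_event) are assumptions, and a real application would refresh the snapshot through the unified HMS/SR metadata and a Hive or Spark connector.

```python
from datetime import datetime, timedelta

# Hypothetical hourly snapshot of the Hive table support_calls.
# In a real app this would be read from Hive; it is inlined here for illustration.
support_calls = [
    {"userid": 42, "calltime": datetime.now() - timedelta(hours=3)},
    {"userid": 7,  "calltime": datetime.now() - timedelta(days=2)},
]

def recent_callers(calls, window=timedelta(hours=24)):
    """Return the set of userids with a support call inside the window."""
    cutoff = datetime.now() - window
    return {c["userid"] for c in calls if c["calltime"] >= cutoff}

def flag_event(event, callers):
    """Mark a stream event if its user called support recently."""
    return {**event, "needs_attention": event["userid"] in callers}

# As each Kafka event arrives, join it against the cached caller set.
callers = recent_callers(support_calls)
flagged = flag_event({"userid": 42, "eventtype": "login"}, callers)
```

The point of the slide is that once HMS and SR are unified, the streaming app can discover the support_calls schema the same way it discovers Kafka schemas, so the snapshot read needs no separate metadata system.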
17.
Hive Perspective
Use Case: Hive needs to access Kafka topics
Example:
• Hive table user_events is loaded every hour from Kafka topic userEvents
• Would like to be able to read the latest events from Kafka rather than wait until they load into Hive
Hive table user_events, partitioned by event_hour
user_id long
event_type varchar(256)
event_hour datetime
Kafka topic userEvents
Schema:
{ "group": "kafka",
  "fields": [{
    "userid": "long",
    "eventtype": "string",
    ...
  }]
}
• Because HMS and SR are unified, Hive can view the Kafka topic as a partition of its table: user_events, partition event_hour='latest'
• Hive queries can now read Kafka topic userEvents as a partition of user_events
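One way to picture the "latest partition" idea is a routing function that sends reads of event_hour='latest' to the Kafka topic and reads of any other partition to warehouse storage. This is a conceptual sketch, not a real Hive or HMS API; the function and field names are assumptions for illustration.

```python
def resolve_partition(table, event_hour):
    """Route a partition read to Kafka or to warehouse files (conceptual only)."""
    if event_hour == "latest":
        # The not-yet-loaded events are served straight from the Kafka topic.
        return {"source": "kafka", "topic": table["topic"]}
    # Historical partitions live in the warehouse as usual.
    return {"source": "hdfs", "path": f"{table['location']}/event_hour={event_hour}"}

user_events = {"topic": "userEvents", "location": "/warehouse/user_events"}
latest = resolve_partition(user_events, "latest")          # served from Kafka
historical = resolve_partition(user_events, "2018-06-01T10")  # served from HDFS/S3
```

The design point is that a single query over user_events can span both sources because the unified HMS/SR holds the table partitions and the topic schema in one place.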
18.
Notice File
Apache Atlas, Apache Hadoop, Apache Hive, Apache Impala, Apache Kafka, Apache Pig,
Apache Ranger, Apache Spark, and Apache Tez are Apache Software Foundation projects
– All are referred to herein without “Apache” for brevity
HDFS and MapReduce are components of Apache Hadoop
Editor's Notes
– Note: the picture isn't perfect, because if you are using Spark without Hive you still have to install Hive to get the Metastore.
– Note: HMS et al. replaced by Schema Registry