Sharing metadata across the data lake and streams
2. © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Motivating Use Cases
3.
ETL
[Diagram: Spark and Hive on Tez reading from HDFS/S3, with HMS, Atlas, and Ranger providing metadata, governance, and security]
4.
Data Warehousing
[Diagram: Hive LLAP over HDFS/S3, with HMS, Atlas, and Ranger providing metadata, governance, and security]
5.
Streaming
[Diagram: Spark reading from Kafka, with the HWX Schema Registry providing message schemas]
6.
Issues
If you are using the Hive Metastore (HMS) with a non-Hive system, you still have to install Hive
No ability to share metadata between streaming and batch
– HMS does not know what is in Kafka
– The Schema Registry does not know what is in HDFS/S3
Admins are required to maintain two separate metadata repositories, one for batch and one for streaming
7.
Grand Vision
[Diagram: Hive LLAP, Hive on Tez, and Spark over both HDFS/S3 and Kafka, with a unified HMS + Schema Registry (SR), Atlas, and Ranger underneath]
8.
Between Us and the Grand Vision
Make HMS separable from Hive
Unify HMS and Schema Registry so batch and streaming can see each other’s data
– Also reduces the number of metadata systems admins have to install and maintain
9.
Making the Metastore Standalone
10.
Breaking out the Metastore
HMS is already widely used beyond Hive: Impala, Presto, and Spark, to name a few
– Want to make it easier for these and other systems to use HMS
In Hive 3.0 the Metastore will be released as a separate module
Can be installed and run without the rest of Hive
– A few features are missing when Hive is not present: e.g. the compactor
– These will be added in the future
Backwards compatibility maintained for clients
– A few small changes for server hook implementations
Intent is to make it a separate Apache project
– Enables better collaboration with non-Hive projects
– Still in discussion with the Hive PMC on this
11.
Is this HCatalog 2.0?
Didn’t we do this before? Wasn’t it called HCatalog? No, HCatalog is different
HCatalog focuses on making the Metastore accessible to MapReduce, Pig, and other applications
– Includes metadata access
– Also includes data access (serdes, object inspectors, and input/output formats)
The Metastore stores metadata, including which serdes etc. to use, but does not provide readers and writers
HCatalog stays with Hive in this split; it does not go with the Metastore
– Because it includes the data access
13.
Introduction to Hortonworks Schema Registry
Provides a central repository for messages’ metadata
– Works with Apache Kafka, Apache NiFi
Every schema has a name: e.g. temp_sensor_data
– Schema is generally tied to a Kafka topic
Schemas can have one or more versions
– Different messages in a topic may have different versions of the schema
– Compatibility between schema versions can be none, backwards, forwards, or both
Schemas are defined in JSON text
Java/REST API for programs, UI for humans
Apache licensed; work is underway to contribute it to the Metastore now that the Metastore is a separate module
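As an illustration of the "defined in JSON text" point, a registered schema such as temp_sensor_data might look like the following Avro-style record. The field names and types here are illustrative assumptions, not taken from the deck:

```json
{
  "type": "record",
  "name": "temp_sensor_data",
  "fields": [
    {"name": "sensor_id", "type": "long"},
    {"name": "temperature_c", "type": "double"},
    {"name": "reading_time", "type": "long", "doc": "event time, epoch millis"}
  ]
}
```

Under a backwards-compatibility setting, a version 2 of this schema could, for example, add a new field with a default value so that readers of version 2 can still consume version 1 messages in the same topic.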
15.
Warning: Slideware ahead
16.
Schema Registry Perspective
Use Case: Stream processing applications need access to Hive tables
Example:
• A stream userEvents
• An application that flags users who have called support in the last 24 hours
• Hive has a record of support calls; Kafka does not
Kafka topic userEvents
Schema:
{ "group": "kafka",
  "fields": [{
    "userid": "long",
    "eventtype": "string",
    ...
  }]
}
Hive table support_calls
userid long
calltime timestamp
summary string
Schema supportCalls:
{ "group": "hive",
  "fields": [{
    "userid": "long",
    "calltime": "timestamp",
    "summary": "string"
  }]
}
• Because HMS and SR are unified, streaming apps can view this table as an SR schema
• The app can cache this table every hour
• Do a join as events arrive to flag users who need extra attention
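The cache-and-join pattern on this slide can be sketched in a few lines of Python. Everything below is a hypothetical illustration: the inlined support_calls snapshot and the helper names (recent_callers, flag_event) are assumptions, and a real application would refresh the snapshot through the unified HMS/SR metadata and a Hive or Spark connector.

```python
from datetime import datetime, timedelta

# Hypothetical hourly snapshot of the Hive table support_calls.
# In a real app this would be read from Hive; it is inlined here for illustration.
support_calls = [
    {"userid": 42, "calltime": datetime.now() - timedelta(hours=3)},
    {"userid": 7,  "calltime": datetime.now() - timedelta(days=2)},
]

def recent_callers(calls, window=timedelta(hours=24)):
    """Return the set of userids with a support call inside the window."""
    cutoff = datetime.now() - window
    return {c["userid"] for c in calls if c["calltime"] >= cutoff}

def flag_event(event, callers):
    """Mark a stream event if its user called support recently."""
    return {**event, "needs_attention": event["userid"] in callers}

# As each Kafka event arrives, join it against the cached caller set.
callers = recent_callers(support_calls)
flagged = flag_event({"userid": 42, "eventtype": "login"}, callers)
```

The point of the slide is that once HMS and SR are unified, the streaming app can discover the support_calls schema the same way it discovers Kafka schemas, so the snapshot read needs no separate metadata system.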
17.
Hive Perspective
Use Case: Hive needs to access Kafka topics
Example:
• Hive table user_events is loaded every hour from Kafka topic userEvents
• Would like to be able to read the latest events from Kafka rather than wait until they load into Hive
Hive table user_events, partitioned by event_hour
user_id long
event_type varchar(256)
event_hour datetime
Kafka topic userEvents
Schema:
{ "group": "kafka",
  "fields": [{
    "userid": "long",
    "eventtype": "string",
    ...
  }]
}
• Because HMS and SR are unified, Hive can view the Kafka topic as a partition of its table: user_events, partition event_hour='latest'
• Hive queries can now read Kafka topic userEvents as a partition of user_events
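One way to picture the "latest partition" idea is a routing function that sends reads of event_hour='latest' to the Kafka topic and reads of any other partition to warehouse storage. This is a conceptual sketch, not a real Hive or HMS API; the function and field names are assumptions for illustration.

```python
def resolve_partition(table, event_hour):
    """Route a partition read to Kafka or to warehouse files (conceptual only)."""
    if event_hour == "latest":
        # The not-yet-loaded events are served straight from the Kafka topic.
        return {"source": "kafka", "topic": table["topic"]}
    # Historical partitions live in the warehouse as usual.
    return {"source": "hdfs", "path": f"{table['location']}/event_hour={event_hour}"}

user_events = {"topic": "userEvents", "location": "/warehouse/user_events"}
latest = resolve_partition(user_events, "latest")          # served from Kafka
historical = resolve_partition(user_events, "2018-06-01T10")  # served from HDFS/S3
```

The design point is that a single query over user_events can span both sources because the unified HMS/SR holds the table partitions and the topic schema in one place.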
18.
Notice File
Apache Atlas, Apache Hadoop, Apache Hive, Apache Impala, Apache Kafka, Apache Pig,
Apache Ranger, Apache Spark, and Apache Tez are Apache Software Foundation projects
– All are referred to herein without “Apache” for brevity
HDFS and MapReduce are components of Apache Hadoop
Editor's Notes
– Note: the picture isn't perfect, because if you are using Spark without Hive you still have to install Hive to get the Metastore.
– Note: HMS et al. replaced by Schema Registry