Sharing Metadata Across the Data Lake and Streams
Alan F. Gates
Co-founder Hortonworks,
Member Apache Hive PMC
19 April 2018
2 © Hortonworks Inc. 2011 – 2018. All Rights Reserved
Motivating Use Cases
ETL
[Diagram: HDFS/S3, Spark, Hive on Tez, HMS, Atlas, Ranger]
Data Warehousing
[Diagram: HDFS/S3, Hive LLAP, HMS, Atlas, Ranger]
Streaming
[Diagram: Kafka, Spark, HWX Schema Registry]
Issues
• If you are using Hive Metastore (HMS) with a non-Hive system, you still have to install Hive
• No ability to share metadata between streaming and batch
– HMS does not know what is in Kafka
– Schema Registry does not know what is in HDFS/S3
• Admins are required to maintain two separate metadata repositories, one for batch and one for streaming
Grand Vision
[Diagram: HDFS/S3, Kafka, Hive LLAP, Spark, Hive on Tez, HMS, Atlas, Ranger, SR]
Between Us and the Grand Vision
• Make HMS separable from Hive
• Unify HMS and Schema Registry so batch and streaming can see each other’s data
– Also reduces the number of metadata systems admins have to install and maintain
Making the Metastore Standalone
Breaking out the Metastore
• HMS is already widely used beyond Hive: Impala, Presto, and Spark, to name a few
– Want to make it easier for these and other systems to use HMS
• In Hive 3.0 the Metastore will be released as a separate module
• Can be installed and run without the rest of Hive
– A few features are missing when Hive is not present, e.g. the compactor
– These will be added in the future
• Backwards compatibility maintained for clients
– A few small changes for server hook implementations
• Intent is to make it a separate Apache project
– Enables better collaboration with non-Hive projects
– Still in discussion with the Hive PMC on this
Is this HCatalog 2.0?
• Didn’t we do this before? Wasn’t it called HCatalog? No, HCatalog is different
• HCatalog focuses on making the Metastore accessible by MapReduce, Pig, and other applications
– Includes metadata access
– Also includes data access (serdes, object inspectors, and input/output formats)
• Metastore stores metadata, including which serdes etc. to use, but does not provide readers and writers
• HCatalog stays with Hive in this split; it does not go with the Metastore
– Because it includes the data access
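The split between metadata and data access can be sketched in a few lines of Python. This is a toy model, not the HMS API: ToyMetastore, TableMetadata, read_rows, the "csv" serde name, and the /warehouse path are all hypothetical names chosen for illustration. The point is only that the metastore hands back a description (including which serde to use), while the caller has to bring its own reader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    """What a metastore-like service returns: a description, no I/O code."""
    name: str
    location: str   # hypothetical path
    serde: str      # name of the serde to use, not an implementation of it
    columns: tuple  # (column name, type name) pairs


class ToyMetastore:
    """Hypothetical in-memory stand-in for a metastore; it never reads data."""

    def __init__(self):
        self._tables = {}

    def create_table(self, meta):
        self._tables[meta.name] = meta

    def get_table(self, name):
        return self._tables[name]


def read_rows(meta, readers, raw_lines):
    """Data access is the caller's job: it must bring a reader for the serde."""
    if meta.serde not in readers:
        raise LookupError(f"no reader available for serde {meta.serde!r}")
    parse = readers[meta.serde]
    names = [name for name, _type in meta.columns]
    return [dict(zip(names, parse(line))) for line in raw_lines]


store = ToyMetastore()
store.create_table(TableMetadata(
    name="support_calls",
    location="/warehouse/support_calls",
    serde="csv",
    columns=(("userid", "long"), ("calltime", "timestamp"), ("summary", "string")),
))

meta = store.get_table("support_calls")
csv_reader = {"csv": lambda line: line.split(",")}
rows = read_rows(meta, csv_reader, ["42,2018-04-19 10:00:00,password reset"])
print(rows)
```

A client with no reader for the named serde gets a LookupError, which mirrors why the readers and writers (the HCatalog side) stay with Hive rather than moving with the Metastore.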
Beyond SQL Use Cases
Introduction to Hortonworks Schema Registry
• Provides a central repository for messages’ metadata
– Works with Apache Kafka, Apache NiFi
• Every schema has a name, e.g. temp_sensor_data
– Schema is generally tied to a Kafka topic
• Schemas can have one or more versions
– Different messages in a topic may have different versions of the schema
– Compatibility between schema versions can be none, backwards, forwards, or both
• Schema defined in JSON text
• Java/REST API for programs, UI for humans
• Apache licensed; working on contributing to Metastore now that it is separate
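The four compatibility levels can be illustrated with a small Python sketch. It assumes a simplified, Avro-style rule that is not taken from the slides or from the registry's implementation: a reader schema can read data written with a writer schema if every reader field is either present in the writer with the same type or marked optional in the reader. The field names (sensor_id, temp, unit, site) and the "optional" key are hypothetical.

```python
def can_read(reader, writer):
    """True if data written with `writer` can be read using `reader` (toy rule)."""
    for name, field_type in reader["fields"].items():
        if name in writer["fields"]:
            if writer["fields"][name] != field_type:
                return False  # same field, different type
        elif name not in reader.get("optional", ()):
            return False      # required field the writer never produced
    return True


def compatibility(old, new):
    """Classify a new schema version against the previous one."""
    backward = can_read(new, old)  # new readers can read old messages
    forward = can_read(old, new)   # old readers can read new messages
    if backward and forward:
        return "both"
    if backward:
        return "backwards"
    if forward:
        return "forwards"
    return "none"


# Versions of a hypothetical temp_sensor_data schema.
v1 = {"fields": {"sensor_id": "long", "temp": "double"}}
v2_optional_field = {
    "fields": {"sensor_id": "long", "temp": "double", "unit": "string"},
    "optional": {"unit"},
}
v2_required_field = {"fields": {"sensor_id": "long", "temp": "double", "site": "string"}}
v2_dropped_field = {"fields": {"sensor_id": "long"}}
v2_retyped_field = {"fields": {"sensor_id": "long", "temp": "string"}}

for candidate in (v2_optional_field, v2_required_field, v2_dropped_field, v2_retyped_field):
    print(compatibility(v1, candidate))
```

Under this toy rule, adding an optional field is compatible in both directions, adding a required field is forwards only, dropping a field is backwards only, and changing a field's type is neither.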
Schema Registry
Warning: Slideware ahead
Schema Registry Perspective
Use Case: Stream processing applications need access to Hive tables
Example:
• A stream userEvents
• An application that flags users who have called support in the last 24 hours
• Hive has record of support calls, Kafka does not
• App can cache this table every hour
• Do a join as events arrive to flag users who need extra attention
• Because HMS and SR are unified, streaming apps can view this as an SR Schema
Kafka topic userEvents
Schema:
{ "group": "kafka",
  "fields": [{
    "userid": "long",
    "eventtype": "string",
    ...
  }]
}
Hive table support_calls
userid long
calltime timestamp
summary string
supportCalls
Schema:
{ "group": "hive",
  "fields": [{
    "userid": "long",
    "calltime": "timestamp",
    "summary": "string"
  }]
}
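The cache-and-join idea in this example can be sketched in plain Python. This is a minimal illustration under stated assumptions, not code from the talk: the rows and events are hard-coded stand-ins for a snapshot of support_calls and for messages on userEvents, and build_call_cache and flag_events are hypothetical helper names. A real application would refresh the cache hourly and consume from Kafka.

```python
from datetime import datetime, timedelta


def build_call_cache(support_calls):
    """Map userid -> most recent calltime from a snapshot of the table."""
    latest = {}
    for row in support_calls:
        uid, called_at = row["userid"], row["calltime"]
        if uid not in latest or called_at > latest[uid]:
            latest[uid] = called_at
    return latest


def flag_events(events, call_cache, now, window=timedelta(hours=24)):
    """Join each event on userid; flag it if the user called within `window`."""
    for event in events:
        last_call = call_cache.get(event["userid"])
        recent = last_call is not None and timedelta(0) <= now - last_call <= window
        yield event, recent


now = datetime(2018, 4, 19, 12, 0)

# Stand-in for an hourly snapshot of the Hive table support_calls.
support_calls = [
    {"userid": 1, "calltime": datetime(2018, 4, 19, 9, 0), "summary": "login problem"},
    {"userid": 2, "calltime": datetime(2018, 4, 17, 9, 0), "summary": "billing question"},
]

# Stand-in for messages arriving on the Kafka topic userEvents.
user_events = [
    {"userid": 1, "eventtype": "login"},
    {"userid": 2, "eventtype": "login"},
    {"userid": 3, "eventtype": "purchase"},
]

cache = build_call_cache(support_calls)
flagged = [(event["userid"], recent) for event, recent in flag_events(user_events, cache, now)]
print(flagged)
```

Only user 1 is flagged here: user 2 called more than 24 hours before `now`, and user 3 has no support call on record.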
Hive Perspective
Use Case: Hive needs to access Kafka topics
Example:
• Hive table user_events is loaded every hour from Kafka topic userEvents
• Would like to be able to read latest events from Kafka rather than wait until it loads into Hive
• Because HMS and SR are unified, Hive can view Kafka topic as partition of its table
• Hive queries can now read Kafka topic userEvents as a partition of user_events
Hive table user_events, partitioned by event_hour
user_id long
event_type varchar(256)
event_hour datetime
Kafka topic userEvents
Schema:
{ "group": "kafka",
  "fields": [{
    "userid": "long",
    "eventtype": "string",
    ...
  }]
}
Hive table user_events, partition event_hour='latest'
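One way to picture a table whose partitions point at different kinds of storage is the toy catalog below. It is a sketch of the idea only and makes several assumptions: UnifiedCatalog and scan_partition are hypothetical names, the hourly partition value and /warehouse path are made up, and the two reader functions are dictionary lookups standing in for a file reader and a Kafka consumer. Only the 'latest' partition value and the userEvents topic name come from the slide.

```python
class UnifiedCatalog:
    """Hypothetical catalog: each partition maps to files or to a Kafka topic."""

    def __init__(self):
        self._partitions = {}

    def add_partition(self, table, value, kind, target):
        if kind not in ("files", "topic"):
            raise ValueError(f"unknown partition kind {kind!r}")
        self._partitions[(table, value)] = (kind, target)

    def resolve(self, table, value):
        return self._partitions[(table, value)]


def scan_partition(catalog, table, value, read_files, read_topic):
    """Pick the reader based on what the catalog says backs the partition."""
    kind, target = catalog.resolve(table, value)
    return read_files(target) if kind == "files" else read_topic(target)


catalog = UnifiedCatalog()
# An already-loaded hourly partition (hypothetical value and path).
catalog.add_partition("user_events", "2018-04-19-11", "files",
                      "/warehouse/user_events/2018-04-19-11")
# The not-yet-loaded data, served straight from the topic.
catalog.add_partition("user_events", "latest", "topic", "userEvents")

# Dictionary lookups standing in for real storage and a real consumer.
fake_files = {"/warehouse/user_events/2018-04-19-11": [{"user_id": 1, "event_type": "login"}]}
fake_topics = {"userEvents": [{"user_id": 2, "event_type": "purchase"}]}

loaded = scan_partition(catalog, "user_events", "2018-04-19-11", fake_files.get, fake_topics.get)
latest = scan_partition(catalog, "user_events", "latest", fake_files.get, fake_topics.get)
print(loaded, latest)
```

The query side stays the same for both partitions; only the catalog entry decides whether rows come from loaded files or from the topic.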
Notice File
• Apache Atlas, Apache Hadoop, Apache Hive, Apache Impala, Apache Kafka, Apache Pig, Apache Ranger, Apache Spark, and Apache Tez are Apache Software Foundation projects
– All are referred to herein without “Apache” for brevity
• HDFS and MapReduce are components of Apache Hadoop
Thank You


Editor's Notes

  1. Note: the picture isn’t perfect, because if you are using Spark without Hive you still have to install Hive to get the Metastore.
  2. Note: HMS et al. replaced by Schema Registry