The Yahoo! Hadoop grid uses a managed service to pull data into the clusters. When it comes to getting data out of the clusters, however, the choices are limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts away the file locations and underlying storage format of the data, and offers several other advantages, such as sharing data among MapReduce, Pig, and Hive. In this talk, we focus on how the ODBC/JDBC interface of HiveServer2 addresses the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about files, partitions, and their locations. We will also demo the data-out capabilities and walk through other useful properties of the data-out feature.
Sumeet Singh, Director, Product Management, Yahoo!
Chris Drome, Technical Yahoo!