● What is it?
● How does it work?
HCatalog – What is it?
● A set of interfaces to the Hive metastore
● Shared schema and data types for Hadoop tools
● REST interface for external data access
● Assists interoperability between
– Pig, Hive and MapReduce
● Table abstraction of data storage
● Will provide data availability notifications
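The REST interface is served by WebHCat (formerly Templeton). As a minimal sketch, the snippet below only builds the URL that describes one table; it does not send a request. The host name and user are hypothetical, and the port (50111) and path layout are assumed to match a default WebHCat installation.

```python
from urllib.parse import quote, urlencode


def webhcat_table_url(host, database, table, user, port=50111):
    """Build the WebHCat URL that describes one table (no request is sent)."""
    path = "/templeton/v1/ddl/database/{}/table/{}".format(
        quote(database, safe=""), quote(table, safe="")
    )
    query = urlencode({"user.name": user})
    return "http://{}:{}{}?{}".format(host, port, path, query)


# Hypothetical host and user, chosen only for illustration.
url = webhcat_table_url("hcat.example.com", "default", "rawevents", "sally")
print(url)
```

An HTTP GET on such a URL is expected to return the table description as JSON, so a tool outside Hadoop can read the schema without any Hive client libraries.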
HCatalog – How does it work?
● Pig
– HCatLoader + HCatStorer interface
● MapReduce
– HCatInputFormat + HCatOutputFormat interface
● Hive
– No interface necessary
– Direct access to metadata
● Notifications when data is available
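The notification idea can be illustrated with a toy, in-memory model: registering a partition both records its location and calls every subscriber for that table. This is not the HCatalog API or its JMS transport; the class `ToyCatalog` and its methods are hypothetical names used only for this sketch.

```python
from collections import defaultdict


class ToyCatalog:
    """In-memory stand-in for a catalog that announces new partitions."""

    def __init__(self):
        self.partitions = defaultdict(dict)  # table -> {partition: location}
        self.listeners = defaultdict(list)   # table -> list of callbacks

    def subscribe(self, table, callback):
        """Register a callback to run whenever a partition is added to table."""
        self.listeners[table].append(callback)

    def add_partition(self, table, partition, location):
        """Record the partition, then notify every subscriber of that table."""
        self.partitions[table][partition] = location
        for callback in self.listeners[table]:
            callback(table, partition, location)


started = []  # jobs "started" in response to a notification
catalog = ToyCatalog()
catalog.subscribe("rawevents", lambda t, p, loc: started.append((t, p)))
catalog.add_partition(
    "rawevents", "ds=20100819", "hdfs://data/rawevents/20100819/data"
)
print(started)
```

The consumer never polls and never needs to know the storage path in advance; it reacts to the table name and partition it subscribed to.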
HCatalog – Interfaces
● Interface via
– Pig
– MapReduce
– Hive
● Access data via
– ORC file
– RCFile
– Text file
– Sequence file
– Custom format
HCatalog – Example
A data flow example from hive.apache.org
First, Joe in data acquisition uses distcp to copy data onto the grid, then registers it as a new partition of the rawevents table.
hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"
Second, Sally in data processing uses Pig to cleanse and prepare the data.
Without HCatalog, Sally must either be told by Joe when the data is available or poll HDFS for it.
A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …);
B = filter A by bot_finder(zeta) == 0;
…
store Z into 'data/processedevents/20100819/data';
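The polling alternative mentioned above can be sketched as a simple loop. The function `wait_for_path` and the injected `exists` check are hypothetical; in practice `exists` would wrap a real HDFS existence test, which is replaced here by a fake so the example is self-contained.

```python
import time


def wait_for_path(exists, path, interval_seconds=60.0, max_checks=10):
    """Poll until exists(path) is true.

    Returns the number of checks used, or None if the path never appeared.
    """
    for check in range(1, max_checks + 1):
        if exists(path):
            return check
        if check < max_checks:
            time.sleep(interval_seconds)  # wait before the next check
    return None


# Fake existence check for the demo: the data "arrives" on the third check.
answers = iter([False, False, True])
checks_used = wait_for_path(
    lambda p: next(answers),
    "/data/rawevents/20100819/data",
    interval_seconds=0.0,
)
print(checks_used)
```

Every consumer has to run such a loop and hard-code the path, which is the coupling that catalog notifications are meant to remove.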
With HCatalog, a JMS message announces that the data is available, and the Pig job can then be started.
A = load 'rawevents' using HCatLoader();
B = filter A by date == '20100819' and bot_finder(zeta) == 0;
…
store Z into 'processedevents' using HCatStorer('date=20100819');
Note that the Pig job refers to the data by the table name rawevents rather than by a storage location.
Now access the data via Hive QL
select advertiser_id, count(clicks) from processedevents
where date = '20100819' group by advertiser_id;