HCatalog: Table Management for Hadoop
Alan F. Gates

Who Am I?
  • Committer and mentor for Apache HCatalog
  • Committer and PMC member for Apache Pig
  • Co-founder, Hortonworks
  • Twitter: @alanfgates
  • Photo credit: Charles Dawley

Motivation
[Diagram labels: Pig, Hive, Map Reduce, HCat, SerDe, CustomLoader, HiveColumnarLoader, RCFileInputFormat, CustomInputFormat, ColumnarSerDe, CustomSerDe, HCatLoader, HCatInputFormat, HCatalog, RCFileStorageDriver, CustomStorageDriver, Custom Format, RCFile]

More Motivation
[Diagram labels: Data Warehouse, Hive, BI Tools, Analysis, Data Factory, Pig/MapReduce, Pipelines, Iterative Processing, Research]

End User Example

Without HCatalog:
    raw     = load '/rawevents/20100819/data' using MyLoader()
              as (ts:long, user:chararray, url:chararray);
    botless = filter raw by NotABot(user);
    …
    store output into '/processedevents/20100819/data';
Processedevents consumers must be manually informed by the producer that data is available, or poll on HDFS (= bad for the NameNode).

With HCatalog:
    raw     = load 'rawevents' using HCatLoader();
    botless = filter raw by date == '20100819' and NotABot(user);
    …
    store output into 'processedevents' using HCatStorage('date=20100819');
Processedevents consumers will be notified by HCatalog that data is available and can then start their jobs.
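
The difference between polling and being notified can be illustrated with a small, self-contained Python model. This is only a sketch under stated assumptions: `ToyCatalog`, `subscribe`, and `add_partition` are hypothetical names invented for illustration and are not part of the HCatalog API. It shows a consumer registering interest in a table once and being called back when a producer publishes a new partition, instead of repeatedly listing directories.

```python
from collections import defaultdict


class ToyCatalog:
    """Hypothetical in-memory stand-in for a metadata service (not the HCatalog API)."""

    def __init__(self):
        self._partitions = defaultdict(set)    # table name -> known partition specs
        self._subscribers = defaultdict(list)  # table name -> consumer callbacks

    def subscribe(self, table, callback):
        """Register a consumer that wants to hear about new partitions of a table."""
        self._subscribers[table].append(callback)

    def add_partition(self, table, spec):
        """Record a new partition and notify every subscriber of that table."""
        self._partitions[table].add(spec)
        for callback in self._subscribers[table]:
            callback(table, spec)


# The consumer registers once; it never has to poll the file system.
started_jobs = []
catalog = ToyCatalog()
catalog.subscribe("processedevents", lambda table, spec: started_jobs.append((table, spec)))

# The producer publishes a partition, which triggers the consumer's callback.
catalog.add_partition("processedevents", "date=20100819")
print(started_jobs)
```

In this toy model the callback runs in the producer's process; a real deployment would deliver the event through a messaging system, which is outside the scope of the sketch.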

Metadata Architecture
[Diagram labels: HCatLoader, HCatStorage, HCatInputFormat, HCatOutputFormat, CLI, Notification, Hive metadata interface, Thrift server, RDBMS. Legend: Current HCatalog, Hive, Future HCatalog]

Storage Architecture
[Diagram labels: HCatLoader, HCatStorage, HCatInputFormat, HCatOutputFormat, InputStorageDriver, OutputStorageDriver, HDFS, HBase]

Project Status
  • HCatalog was accepted to the Apache Incubator last March
  • 0.1 released this month, includes:
      • Read/write from Pig
      • Read/write from MapReduce
      • Read/write from Hive
      • Works only with secure Hadoop
      • StorageDrivers for RCFile and Text

Future Plans
  • 0.2, plan to release in July
      • Notification via JMS when data is available
      • Store to multiple partitions simultaneously
      • Import/Export tools
  • Later this year
      • Storing in HBase
      • Integration with Hadoop streaming
      • Bytearray/blob type
      • RCFile compression improvements
      • High Availability for Thrift server
  • Eventually
      • Data management interfaces for archivers, cleaners, etc.
      • Statistics storage

Get Involved
  • incubator.apache.org/hcatalog
  • Join the mailing lists
      • User list: hcatalog-user@incubator.apache.org
      • Dev list: hcatalog-dev@incubator.apache.org

Questions?

HCatalog Hadoop Summit 2011

Editor's Notes

  • #4 Current situation: different data type models and notions of schema. If you’re using all three tools, you must write or obtain an IF/OF, Load/Store, and SerDe for any new format. For Pig and MR, you must understand where the file is located, what its schema is, how it is compressed, and what storage format was used. Vision: a shared data type model and schema. Write or obtain one storage driver, and it works with all tools. No need to know where data is located, what its schema is, how it is compressed, or what format was used.
  • #5 It would be nice if data created in Pig and M/R were instantly available in Hive.
  • #6 Would look the same for MR. Input changes from file to table. Partitioning of data moves from the load to the filter clause. Schema is now provided to Pig. If the data creator changes the file format tomorrow, or the admin moves the files from one path to another, the first script has to be rewritten and retested, while the second needs no changes.
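
The point in note #6, that a script addressing a table by name survives a change of path or file format, can be sketched with a toy metadata lookup in Python. Everything here is an assumption made for illustration: `TableInfo`, `ToyMetastore`, `plan_read`, the `/warehouse/rawevents` path, and the `location/partition` directory layout are hypothetical and do not describe HCatalog's actual interfaces or on-disk layout.

```python
from dataclasses import dataclass


@dataclass
class TableInfo:
    """What a reader would otherwise hard-code: location, storage format, schema."""
    location: str
    storage_format: str
    schema: tuple  # sequence of (column name, type) pairs


class ToyMetastore:
    """Hypothetical in-memory table registry (not the HCatalog API)."""

    def __init__(self):
        self._tables = {}

    def register(self, name, info):
        self._tables[name] = info

    def describe(self, name):
        return self._tables[name]


def plan_read(metastore, table, partition):
    """Resolve a table name and partition into concrete read details."""
    info = metastore.describe(table)
    return {
        # Assumed layout for this sketch: one directory per partition.
        "path": f"{info.location}/{partition}",
        "format": info.storage_format,
        "columns": [name for name, _type in info.schema],
    }


schema = (("ts", "long"), ("user", "chararray"), ("url", "chararray"))
metastore = ToyMetastore()
metastore.register("rawevents", TableInfo("/rawevents", "rcfile", schema))
before = plan_read(metastore, "rawevents", "date=20100819")

# The admin moves the data and switches the format; only the metadata changes.
metastore.register("rawevents", TableInfo("/warehouse/rawevents", "text", schema))
after = plan_read(metastore, "rawevents", "date=20100819")

print(before["path"], "->", after["path"])
```

The caller's line, `plan_read(metastore, "rawevents", "date=20100819")`, is identical before and after the move, which mirrors why the second script in the example needs no changes.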