HCatalog Hadoop Summit 2011


Alan Gates' HCatalog talk

Published in: Technology

  1. HCatalog: Table Management for Hadoop
     Alan F. Gates
  2. Who Am I?
     • Committer and mentor for Apache HCatalog
     • Committer and PMC member for Apache Pig
     • Co-founder, Hortonworks
     • Twitter: @alanfgates
     Photo credit: Charles Dawley
  3. Motivation
     [Diagram; labels only]
     • Tools: Pig, Hive, MapReduce
     • Per-tool format adapters: Hive Columnar Loader, Custom Loader, RCFile InputFormat, Custom InputFormat, Columnar SerDe, Custom SerDe
     • HCatalog interfaces: HCatLoader, HCatInputFormat, HCatSerDe
     • HCatalog storage drivers: RCFile StorageDriver, Custom StorageDriver
     • Storage formats: RCFile, Custom Format
  4. More Motivation
     • Data Warehouse (Hive): BI tools, analysis
     • Data Factory (Pig/MapReduce): pipelines, iterative processing, research
  5. End User Example
     Without HCatalog:
         raw = load '/rawevents/20100819/data' using MyLoader()
             as (ts:long, user:chararray, url:chararray);
         botless = filter raw by NotABot(user);
         …
         store output into '/processedevents/20100819/data';
     Consumers of processedevents must be manually informed by the producer that data is available, or must poll HDFS (bad for the NameNode).
     With HCatalog:
         raw = load 'rawevents' using HCatLoader();
         botless = filter raw by date == '20100819' and NotABot(user);
         …
         store output into 'processedevents'
             using HCatStorage('date=20100819');
     Consumers of processedevents will be notified by HCatalog that data is available and can then start their jobs.
  6. Metadata Architecture
     [Diagram; labels only]
     • Interfaces: HCatLoader, HCatStorage, HCatInputFormat, HCatOutputFormat, CLI, Notification
     • Hive metadata interface
     • Thrift server
     • RDBMS
     Legend: current HCatalog, Hive, future HCatalog
  7. Storage Architecture
     [Diagram; labels only]
     • Interfaces: HCatLoader, HCatStorage, HCatInputFormat, HCatOutputFormat
     • Input StorageDriver, Output StorageDriver
     • Storage: HDFS, HBase
  8. Project Status
     • HCatalog was accepted to the Apache Incubator last March
     • 0.1 released this month; it includes:
         • Read/write from Pig
         • Read/write from MapReduce
         • Read/write from Hive
         • Works only with secure Hadoop
         • StorageDrivers for RCFile and Text
  9. Future Plans
     • 0.2, planned for release in July:
         • Notification via JMS when data is available
         • Store to multiple partitions simultaneously
         • Import/export tools
     • Later this year:
         • Storing in HBase
         • Integration with Hadoop streaming
         • Bytearray/blob type
         • RCFile compression improvements
         • High availability for the Thrift server
     • Eventually:
         • Data management interfaces for archivers, cleaners, etc.
         • Statistics storage
  10. Get Involved
     • Join the mailing lists
         • User list:
         • Dev list:
  11. Questions?
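
The End User Example slide contrasts consumers polling HDFS with consumers being notified when a partition becomes available. The following is a minimal Python sketch of that notification idea only. It is not HCatalog's API: `ToyCatalog`, `subscribe`, `add_partition`, and `location` are hypothetical names, and an in-process callback stands in for the JMS-based notification that the Future Plans slide lists as planned for 0.2.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

# Hypothetical callback type: called with (table name, partition spec).
Listener = Callable[[str, str], None]


class ToyCatalog:
    """Toy in-memory table catalog; illustrative only, not the HCatalog API."""

    def __init__(self) -> None:
        # table name -> {partition spec -> storage location}
        self._partitions: DefaultDict[str, Dict[str, str]] = defaultdict(dict)
        # table name -> callbacks to run when a partition is added
        self._listeners: DefaultDict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, table: str, listener: Listener) -> None:
        """Register a consumer that wants to hear about new partitions."""
        self._listeners[table].append(listener)

    def add_partition(self, table: str, spec: str, location: str) -> None:
        """Record a new partition, then notify every subscriber of the table."""
        self._partitions[table][spec] = location
        for listener in list(self._listeners[table]):
            listener(table, spec)

    def location(self, table: str, spec: str) -> str:
        """Look up where a partition lives; raises KeyError if it is unknown."""
        return self._partitions[table][spec]


# Consumer side: register interest once instead of polling the file system.
catalog = ToyCatalog()
started = []
catalog.subscribe("processedevents", lambda table, spec: started.append((table, spec)))

# Producer side: publishing the partition triggers the consumer's callback.
catalog.add_partition("processedevents", "date=20100819", "/processedevents/20100819/data")
print(started)
```

In this sketch the consumer refers to the data by table name and partition spec, and only the catalog knows the storage path, which mirrors how the HCatalog version of the Pig script drops the hard-coded HDFS paths.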