HCatalog Hadoop Summit 2011

Alan Gates' HCatalog talk

  • Current situation: The tools have different data type models and notions of schema. If you're using all three tools, you must write or obtain an InputFormat/OutputFormat (IF/OF), Load/Store function, and SerDe for any new format. For Pig and MR, you must understand where the file is located, what its schema is, how it is compressed, and what storage format was used. Vision: A shared data type model and schema. Write or obtain one storage driver and it works with all tools. No need to know where data is located, what its schema is, how it is compressed, or what format was used.
  • It would be nice if, as data is created in Pig and M/R, it were instantly available in Hive.
  • The example would look the same for MR. Input changes from file to table. Partitioning of data moves from the load statement to the filter clause. Schema is now provided to Pig. If the data creator changes the file format tomorrow, or the admin moves the files from one path to another, the first script has to be rewritten and retested, while the second needs no changes.

    1. HCatalog: Table Management for Hadoop
       Alan F. Gates
    2. Who Am I?
       • Committer and mentor for Apache HCatalog
       • Committer and PMC member for Apache Pig
       • Co-founder, Hortonworks
       • Twitter: @alanfgates
       Photo credit: Charles Dawley
    3. Motivation
       [Diagram; only the labels survive]
       • Tools: Pig, Hive, Map Reduce
       • Per-tool format adapters: Hive Columnar Loader, Custom Loader; Columnar SerDe, Custom SerDe; RCFile InputFormat, Custom InputFormat
       • HCatalog adapters: HCatLoader, HCatSerDe, HCatInputFormat
       • HCatalog storage drivers: RCFile StorageDriver, Custom StorageDriver
       • Formats: RCFile, Custom Format
    4. More Motivation
       • Data Warehouse: Hive (BI Tools, Analysis)
       • Data Factory: Pig/MapReduce (Pipelines, Iterative Processing, Research)
    5. End User Example

       raw = load '/rawevents/20100819/data' using MyLoader()
           as (ts:long, user:chararray, url:chararray);
       botless = filter raw by NotABot(user);
       …
       store output into '/processedevents/20100819/data';

       Processedevents consumers must be manually informed by the producer that data is available, or poll on HDFS (= bad for the NameNode).

       raw = load 'rawevents' using HCatLoader();
       botless = filter raw by date == '20100819' and NotABot(user);
       …
       store output into 'processedevents'
           using HCatStorage('date=20100819');

       Processedevents consumers will be notified by HCatalog that data is available and can then start their jobs.
    6. Metadata Architecture
       [Diagram; only the labels survive]
       • Components: HCatLoader, HCatStorage, HCatInputFormat, HCatOutputFormat, CLI, Notification, Hive metadata interface, Thrift server, RDBMS
       • Legend: Current HCatalog, Hive, Future HCatalog (the legend's markers were lost in extraction)
    7. Storage Architecture
       [Diagram; only the labels survive]
       • Components: HCatLoader, HCatStorage, HCatInputFormat, HCatOutputFormat, Input StorageDriver, Output StorageDriver
       • Storage: HDFS, HBase
    8. Project Status
       • HCatalog was accepted to the Apache Incubator last March
       • 0.1 released this month, includes:
           • Read/write from Pig
           • Read/write from MapReduce
           • Read/write from Hive
           • Works only with secure Hadoop
           • StorageDrivers for RCFile and Text
    9. Future Plans
       • 0.2, plan to release in July:
           • Notification via JMS when data is available
           • Store to multiple partitions simultaneously
           • Import/Export tools
       • Later this year:
           • Storing in HBase
           • Integration with Hadoop streaming
           • Bytearray/blob type
           • RCFile compression improvements
           • High Availability for Thrift server
       • Eventually:
           • Data management interfaces for archivers, cleaners, etc.
           • Statistics storage
    10. Get Involved
       • incubator.apache.org/hcatalog
       • Join the mailing lists:
           • User list: hcatalog-user@incubator.apache.org
           • Dev list: hcatalog-dev@incubator.apache.org
    11. Questions?
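
The point of the end user example (slide 5) is that a script which names a table and a partition keeps working when files move, while a script with a hard-coded path does not. The following is a minimal Python sketch of that idea only. It is not the HCatalog API: `ToyCatalog`, `Table`, `relocate`, `resolve`, and the `/warehouse/rawevents` path are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Hypothetical catalog entry: where a table lives and how it is stored."""
    root: str                      # base directory for the table's files
    storage_format: str            # e.g. "rcfile" or "text"
    partitions: dict = field(default_factory=dict)  # partition value -> path


class ToyCatalog:
    """Toy stand-in for a metadata service (assumption; not HCatalog itself)."""

    def __init__(self):
        self._tables = {}

    def create_table(self, name, root, storage_format):
        self._tables[name] = Table(root, storage_format)

    def add_partition(self, name, date):
        table = self._tables[name]
        table.partitions[date] = f"{table.root}/{date}/data"

    def relocate(self, name, new_root):
        # The admin moves the files; only the catalog entry changes.
        table = self._tables[name]
        table.root = new_root
        table.partitions = {d: f"{new_root}/{d}/data" for d in table.partitions}

    def resolve(self, name, date):
        # Readers ask by table name and partition, never by path.
        table = self._tables[name]
        return table.partitions[date], table.storage_format


def reader_script(catalog):
    # No path or storage format is hard-coded in the reader.
    return catalog.resolve("rawevents", "20100819")


catalog = ToyCatalog()
catalog.create_table("rawevents", "/rawevents", "text")
catalog.add_partition("rawevents", "20100819")
before = reader_script(catalog)

catalog.relocate("rawevents", "/warehouse/rawevents")
after = reader_script(catalog)

print(before)
print(after)
```

The reader function is unchanged across the relocation; only the path it gets back differs. A reader that embedded '/rawevents/20100819/data' directly would have had to be edited and retested.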
