TriHUG November HCatalog Talk by Alan Gates


Published on

Published in: Technology
1 Like
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Current situation:Different data type models and notions of schemaIf you’re using all three tools must write or obtain IF/OF, Load/Store, and SerDe for any new formatFor Pig and MR must understand where file is located, what its schema is, how it is compressed, what storage format was usedVision:Shared data type model and schemaWrite/obtain one storage driver, works with all toolsNo need to know where data is located, what its schema is, how it is compressed, what format was used
  • Would look the same for MRInput changes from file to tablePartitioning of data moves from load to the filter clauseSchema is now provided to PigIf the data creator changes file format tomorrow, or the admin switches the files from one path to another, the first script has to be rewritten and retested while there are no changes in the second
  • TriHUG November HCatalog Talk by Alan Gates

    1. 1. HCatalogTable Management for HadoopAlan F. Gates
    2. 2. Motivation: Data Sharing is Hard This is analyst Joe, he usesThis is programmer Bob, he Hive to build reports anduses Pig to crunch data. answer ad-hoc queries. Joe, I need today’s data OkPhoto Credit: totalAldo via Flickr Hmm, is it done yet? Where is it? What format did you use to store it today? Is it compressed? And can you help me load it into Hive, I can never remember all the parameters I have to pass that alter table command. Dude, we need HCatalog
    3. 3. More Motivation: Each tool requires its ownTranslator Pig Hive Map Reduce Hive HCatLoader HCatSerDe RCFile Custom HCatInputFormat Custom Columnar Custom Columnar Input Input Loader SerDe SerDe Loader Format Format HCatalog RCFile Custom StorageDriver StorageDriver Custom RCFile Format
    4. 4. End User Exampleraw = load „/rawevents/20100819/data‟ using MyLoader() as (ts:long, user:chararray, url:chararray);botless = filter raw by NotABot(user);…store output into „/processedevents/20100819/data‟;Processedevents consumers must be manually informed by producer that data isavailable, or poll on HDFS (= bad for the NameNode)raw = load „rawevents‟ using HCatLoader();botless = filter raw by date = „20100819‟ and NotABot(user);…store output into „processedevents‟ using HCatStorage(“date=20100819”);Processedevents consumers will be notified by HCatalog data is available and canthen start their jobs
    5. 5. Command Line for DDL• Uses Hive SQL• Create, drop, alter table• CREATE TABLE employee ( emp_id INT, emp_name STRING, emp_start_date STRING, emp_gender STRING) PARTITIONED BY ( emp_country STRING, emp_state STRING) STORED AS RCFILE tblproperties( hcat.isd=RCFileInputDriver, hcat.osd=RCFileOutputDriver);
    6. 6. Manages Data Format and Schema Changes• Allows columns to be appended to tables in new partitions − no need to change existing data − fields not present in old data will be read as null − must do „alter table add column‟ first• Allows storage format changes − no need to change existing data, HCatalog will handle reading each partition in the appropriate format − all new partitions will be written in current format
    7. 7. Security• Uses underlying storage permissions to determine authorization − Currently only works with HDFS based storage − If user can read from the HDFS directory, then he can read the table − If user can write to the HDFS directory, then he can write to the table − If the user can write to the database level directory, he can create and drop tables − Allows users to define which group to create table as so table access can be controlled by Unix group• Authentication done via kerberos
    8. 8. Metadata Architecture HCatLoader HCatStorage HTTP HCatInputFormat HCatOutputFormat CLI Notification Hive metadata interface Thrift server RDBMS = Current HCatalog = Hive = Future HCatalog
    9. 9. Storage Architecture HCatLoader HCatStorage HCatInputFormat HCatOutputFormat Input Output StorageDriver StorageDriver HDFS HBase
    10. 10. Project Status• HCatalog was accepted to the Apache Incubator last March• 0.2 released in October, includes: − Read/write from Pig − Read/write from MapReduce − Read/write from Hive − StorageDrivers for RCFile − Notification via JMS when data is available − Store to multiple partitions simultaneously − Import/Export tools
    11. 11. HCatalog 0.3• Plan to release mid-December• Adds a Binary type (to Hive and HCatalog)• Storage drivers for JSON and text• Improved integration with Hive for custom storage formats• Web services interface
    12. 12. Future Plans• Support for HBase and other data sources for storage• RCFile compression improvements• High Availability for Thrift server• Data management interfaces for archivers, cleaners, etc.• Additional metadata storage: − statistics − lineage/provenance − user tags
    13. 13. Get Involved•• Join the mailing lists − User list: − Dev list:
    14. 14. Questions?