HCatalog Hadoop Summit 2011

Alan Gates' HCatalog talk

Published in Technology
  • Current situation: Pig, Hive, and MapReduce have different data type models and notions of schema. If you use all three tools, you must write or obtain an InputFormat/OutputFormat, a load/store function, and a SerDe for any new format. For Pig and MapReduce you must know where the file is located, what its schema is, how it is compressed, and what storage format was used.
    Vision: a shared data type model and schema. Write or obtain one storage driver and it works with all tools. There is no need to know where the data is located, what its schema is, how it is compressed, or what format was used.
  • It would be nice if data created in Pig and MapReduce were instantly available in Hive.
  • This would look the same for MapReduce. The input changes from a file to a table. Partitioning of the data moves from the load statement to the filter clause. The schema is now provided to Pig. If the data creator changes the file format tomorrow, or the admin moves the files from one path to another, the first script has to be rewritten and retested, while the second needs no changes.

Transcript

  • 1. HCatalog
    Table Management for Hadoop
    Alan F. Gates
  • 2. Who Am I?
    Committer and mentor for Apache HCatalog
    Committer and PMC member for Apache Pig
    Co-founder Hortonworks
    Twitter @alanfgates
    Photo credit: Charles Dawley
  • 3. Motivation
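    Diagram labels, today: Pig (Columnar Loader, CustomLoader), MapReduce (RCFile InputFormat, Custom InputFormat), Hive (Columnar SerDe, Custom SerDe), over RCFile and Custom Format data.
    Diagram labels, with HCatalog: Pig (HCatLoader), MapReduce (HCatInputFormat), Hive (HCatSerDe), all over HCatalog, which uses an RCFile StorageDriver and a Custom StorageDriver over RCFile and Custom Format data.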
  • 4. More Motivation
    Data Warehouse
    Hive
    BI Tools
    Analysis
    Data Factory
    Pig/MapReduce
    Pipelines
    Iterative Processing
    Research
  • 5. End User Example
    raw = load '/rawevents/20100819/data' using MyLoader()
        as (ts:long, user:chararray, url:chararray);
    botless = filter raw by NotABot(user);

    store output into '/processedevents/20100819/data';
    Consumers of processedevents must be manually informed by the producer that data is
    available, or poll HDFS (bad for the NameNode).
    raw = load 'rawevents' using HCatLoader();
    botless = filter raw by date == '20100819' and NotABot(user);

    store output into 'processedevents'
        using HCatStorage('date=20100819');
    Consumers of processedevents will be notified by HCatalog that data is available and can
    then start their jobs.
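The difference between the two scripts can be illustrated with a toy model. The sketch below is not HCatalog code and uses none of its API: `ToyCatalog`, `Partition`, `Table`, and the `"rcfile"` format label are all hypothetical names chosen for illustration. It only shows the idea that a producer registers a partition once, while consumers look data up by table name and partition value and are told when it arrives, instead of hard-coding paths or polling.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Physical details that a table-based reader no longer hard-codes."""
    path: str
    storage_format: str


@dataclass
class Table:
    schema: list                                    # (column name, type) pairs
    partitions: dict = field(default_factory=dict)  # partition value -> Partition


class ToyCatalog:
    """Hypothetical in-memory stand-in for a table metadata service."""

    def __init__(self):
        self._tables = {}
        self._listeners = []

    def create_table(self, name, schema):
        self._tables[name] = Table(schema=schema)

    def subscribe(self, callback):
        # Consumers register once instead of polling the file system.
        self._listeners.append(callback)

    def add_partition(self, table, value, path, storage_format):
        # The producer publishes a partition; every subscriber is notified.
        self._tables[table].partitions[value] = Partition(path, storage_format)
        for callback in self._listeners:
            callback(table, value)

    def resolve(self, table, value):
        # Readers ask by table and partition value and get schema plus location.
        entry = self._tables[table]
        return entry.schema, entry.partitions[value]


catalog = ToyCatalog()
catalog.create_table(
    "processedevents",
    [("ts", "long"), ("user", "chararray"), ("url", "chararray")],
)

announced = []
catalog.subscribe(lambda table, value: announced.append((table, value)))

# Producer side: the path and format are recorded in exactly one place.
catalog.add_partition(
    "processedevents", "20100819", "/processedevents/20100819/data", "rcfile"
)

# Consumer side: no path or format appears in the reading code.
schema, partition = catalog.resolve("processedevents", "20100819")
```

If the producer later moves the files or changes the format, only the `add_partition` call changes in this model; code that calls `resolve` stays the same.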
  • 6. Metadata Architecture
    HCatLoader
    HCatStorage
    HCatInputFormat
    HCatOutputFormat
    CLI
    Notification
    Hive metadata interface
    Thrift server
    RDBMS
    = Current HCatalog
    = Hive
    = Future HCatalog
  • 7. Storage Architecture
    HCatLoader
    HCatStorage
    HCatInputFormat
    HCatOutputFormat
    Input StorageDriver
    Output StorageDriver
    HDFS
    HBase
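The storage driver layer on this slide can be pictured as a small plug-in interface. The following is a minimal sketch under stated assumptions, not HCatalog's actual StorageDriver API: the class names, the `DRIVERS` registry, and the use of tab-separated text and JSON lines as the two formats are all hypothetical. It shows one reader function serving every format, with a new format needing only one new driver.

```python
import io
import json
from abc import ABC, abstractmethod


class InputStorageDriver(ABC):
    """Hypothetical driver interface: turns stored bytes into records."""

    @abstractmethod
    def read(self, stream, columns):
        """Yield one dict per record, keyed by column name."""


class TextDriver(InputStorageDriver):
    """Reads tab-separated text, one record per line."""

    def read(self, stream, columns):
        for line in stream:
            values = line.rstrip("\n").split("\t")
            yield dict(zip(columns, values))


class JsonLinesDriver(InputStorageDriver):
    """Stand-in for a custom format: one JSON object per line."""

    def read(self, stream, columns):
        for line in stream:
            record = json.loads(line)
            yield {column: record.get(column) for column in columns}


# One registry shared by every tool; adding a format means adding one entry.
DRIVERS = {"text": TextDriver(), "jsonl": JsonLinesDriver()}


def read_table(storage_format, stream, columns):
    """Tool-facing entry point that never inspects the format itself."""
    return list(DRIVERS[storage_format].read(stream, columns))


columns = ["ts", "user"]
text_rows = read_table("text", io.StringIO("1\talice\n2\tbob\n"), columns)
json_rows = read_table(
    "jsonl",
    io.StringIO('{"ts": "1", "user": "alice"}\n{"ts": "2", "user": "bob"}\n'),
    columns,
)
```

Both calls return the same records, so the caller's code is independent of how the data is stored.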
  • 8. Project Status
    HCatalog was accepted to the Apache Incubator last March
    0.1 released this month, includes:
      Read/write from Pig
      Read/write from MapReduce
      Read/write from Hive
      Works only with secure Hadoop
      StorageDrivers for RCFile and Text
  • 9. Future Plans
    0.2, planned for release in July:
      Notification via JMS when data is available
      Store to multiple partitions simultaneously
      Import/Export tools
    Later this year:
      Storing in HBase
      Integration with Hadoop streaming
      Bytearray/blob type
      RCFile compression improvements
      High Availability for Thrift server
    Eventually:
      Data management interfaces for archivers, cleaners, etc.
      Statistics storage
  • 10. Get Involved
    incubator.apache.org/hcatalog
    Join the mailing lists
    User list: hcatalog-user@incubator.apache.org
    Dev list: hcatalog-dev@incubator.apache.org
  • 11. Questions?