Future of HCatalogAlan F. Gates@alanfgates                     Page 1
Who Am I?• HCatalog committer and mentor• Co-founder of Hortonworks• Lead for Pig, Hive, and HCatalog at Hortonworks• Pig ...
Hadoop EcosystemMapReduce                                      Hive                   Pig                                 ...
Opening up Metadata to MR & Pig       MapReduce                            Hive                      Pig   HCaInputFormat/...
Templeton - REST API•  REST endpoints: databases, tables, partitions, columns, table properties•  PUT to create/update, GE...
Templeton - REST API•  REST endpoints: databases, tables, partitions, columns, table properties•  PUT to create/update, GE...
Templeton - REST API•  REST endpoints: databases, tables, partitions, columns, table properties•  PUT to create/update, GE...
Reading and Writing Data in Parallel•  Use Case: Users want   –  to read and write records in parallel between Hadoop and ...
HCatReader and HCatWriter                                   getHCatReader                   Master                        ...
Storing Semi-/Unstructured DataTable Users                          File Users Name                 Zip            {"name"...
Storing Semi-/Unstructured DataTable Users                        File Users Name                 Zip          {"name":"al...
Storing Semi-/Unstructured DataTable Users                           File Users Name                 Zip            {"name...
Hive ODBC/JDBC Today                             Issue: Have to have Hive        JDBC                 code on the client  ...
ODBC/JDBC Proposal        JDBC        ClientProvide robust open source              REST                             Hadoo...
Questions   © 2012 Hortonworks                        Page 15
Upcoming SlideShare
Loading in …5
×

Future of HCatalog

3,431 views

Published on

The initial work in HCatalog has allowed users to share their data in Hadoop regardless of the tools they use and relieved them of needing to know where and how their data is stored. But there is much more to be done to deliver on the full promise of providing metadata and table management for Hadoop clusters. It should be easy to store and process semi-structured and unstructured data via HCatalog. We need interfaces and simple implementations of data life cycle management tools. We need to deepen the integration with NoSQL and MPP data stores. And we need to be able to store larger metadata such as partition level statistics and user generated metadata. This talk will cover these areas of growth and give an overview of how they might be approached.

Published in: Technology

Future of HCatalog

  1. 1. Future of HCatalogAlan F. Gates@alanfgates Page 1
  2. 2. Who Am I?• HCatalog committer and mentor• Co-founder of Hortonworks• Lead for Pig, Hive, and HCatalog at Hortonworks• Pig committer and PMC Member• Member of Apache Software Foundation and Incubator PMC• Author of Programming Pig from O’Reilly © Hortonworks Inc. 2012 Page 2
  3. 3. Hadoop EcosystemMapReduce Hive Pig SerDeInputFormat/ InputFormat/ Load/ Metastore ClientOuputFormat OuputFormat Store HDFS Metastore © Hortonworks 2012 Page 3
  4. 4. Opening up Metadata to MR & Pig MapReduce Hive Pig HCaInputFormat/ HCatLoader/ HCatOuputFormat HCatStorer SerDe InputFormat/ Metastore Client OuputFormat HDFS Metastore © Hortonworks 2012 Page 4
  5. 5. Templeton - REST API•  REST endpoints: databases, tables, partitions, columns, table properties•  PUT to create/update, GET to list or describe, DELETE to drop Get a list of all tables in the default database: GET http://…/v1/ddl/database/default/table Hadoop/ HCatalog { "tables": ["counted","processed",], "database": "default" } © Hortonworks 2012 Page 5
  6. 6. Templeton - REST API•  REST endpoints: databases, tables, partitions, columns, table properties•  PUT to create/update, GET to list or describe, DELETE to drop Create new table “rawevents” PUT {"columns": [{ "name": "url", "type": "string" }, { "name": "user", "type": "string"}], "partitionedBy": [{ "name": "ds", "type": "string" }]} http://…/v1/ddl/database/default/table/rawevents Hadoop/ HCatalog { "table": "rawevents", "database": "default” } © Hortonworks 2012 Page 6
  7. 7. Templeton - REST API•  REST endpoints: databases, tables, partitions, columns, table properties•  PUT to create/update, GET to list or describe, DELETE to drop Describe table “rawevents” GET http://…/v1/ddl/database/default/table/rawevents Hadoop/ HCatalog { "columns": [{"name": "url","type": "string"}, {"name": "user","type": "string"}], "database": "default", "table": "rawevents" }•  Included in HDP•  Not yet checked in, but you can find the code on Apache’s JIRA HCATALOG-182 © Hortonworks 2012 Page 7
  8. 8. Reading and Writing Data in Parallel•  Use Case: Users want –  to read and write records in parallel between Hadoop and their parallel system –  driven by their system –  in a language independent way –  without needing to understand Hadoop’s file formats•  Example: an MPP data store wants to read data out of Hadoop as HCatRecords for its parallel jobs•  What exists today –  webhdfs –  Language independent –  Can move data in parallel –  Driven from the user side –  Moves only bytes, no understanding of file format –  Sqoop –  Can move data in parallel –  Understands data format –  Driven from Hadoop side –  Requires connector or JDBC © 2012 Hortonworks Page 8
  9. 9. HCatReader and HCatWriter getHCatReader Master HCatalog HCatReader readInput SlaveSplits Iterator<HCatRecord> read Slave HDFS Iterator<HCatRecord> read Slave Iterator<HCatRecord> Right now all in Java, needs to be REST © 2012 Hortonworks Page 9
  10. 10. Storing Semi-/Unstructured DataTable Users File Users Name Zip {"name":"alice","zip":"93201"} Alice 93201 {"name":"bob”,"zip":"76331"} Bob 76331 {"name":"cindy"} {"zip":"87890"} select name, zip A = load ‘Users’ as from users; (name:chararray, zip:chararray); B = foreach A generate name, zip; © Hortonworks Inc. 2012 Page 10
  11. 11. Storing Semi-/Unstructured DataTable Users File Users Name Zip {"name":"alice","zip":"93201"} Alice 93201 {"name":"bob”,"zip":"76331"} Bob 76331 {"name":"cindy"} {"zip":"87890"} A = load ‘Users’ as (name:chararray, zip:chararray); B = foreach A generate name, zip; select name, zip from users; A = load ‘Users’ B = foreach A generate name, zip; © Hortonworks Inc. 2012 Page 11
  12. 12. Storing Semi-/Unstructured DataTable Users File Users Name Zip {"name":"alice","zip":"93201"} Alice 93201 {"name":"bob”,"zip":"76331"} Bob 76331 {"name":"cindy"} {"zip":"87890"} A = load ‘Users’ as (name:chararray, zip:chararray); B = foreach A generate name, zip; select name, zip A = load ‘Users’ from users; B = foreach A generate name, zip; © Hortonworks Inc. 2012 Page 12
  13. 13. Hive ODBC/JDBC Today Issue: Have to have Hive JDBC code on the client Client Hive Hadoop Server Issues: •  Not concurrent ODBC •  Not secure Client •  Not scalableIssue: Open source versionnot easy to use © 2012 Hortonworks Page 13
  14. 14. ODBC/JDBC Proposal JDBC ClientProvide robust open source REST Hadoopimplementations Server •  Spawns job inside cluster ODBC •  Runs job as submitting user Client •  Works with security •  Scaling web services well understood © 2012 Hortonworks Page 14
  15. 15. Questions © 2012 Hortonworks Page 15

×