Future of HCatalog
Alan F. Gates
@alanfgates




Who Am I?
• HCatalog committer and mentor
• Co-founder of Hortonworks
• Lead for Pig, Hive, and HCatalog at Hortonworks
• Pig committer and PMC Member
• Member of Apache Software Foundation and Incubator PMC
• Author of Programming Pig from O’Reilly




Hadoop Ecosystem

[Diagram: MapReduce, Hive, and Pig all read and write data in HDFS. MapReduce goes through InputFormat/OutputFormat, Hive through its SerDe plus InputFormat/OutputFormat, and Pig through its Load/Store functions. Only Hive talks to the Metastore, via its Metastore client.]




Opening up Metadata to MR & Pig

[Diagram: with HCatalog, MapReduce uses HCatInputFormat/HCatOutputFormat and Pig uses HCatLoader/HCatStorer. Both sit on top of Hive's SerDe, InputFormat/OutputFormat, and Metastore client, so MapReduce, Hive, and Pig all share the same Metastore and the same data in HDFS.]


Templeton - REST API
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop

                                Get a list of all tables in the default database:




GET http://…/v1/ddl/database/default/table

Response:
{
  "tables": ["counted", "processed"],
  "database": "default"
}
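As an illustration, a minimal Java client for this call could look like the sketch below. The host, port, and path prefix ahead of /v1 are placeholders for wherever Templeton is running, and the user.name query parameter (an assumption here, typically used on unsecured clusters) identifies the caller.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ListTables {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port, and path prefix; point this at your Templeton server
        URL url = new URL("http://templetonhost:50111/templeton/v1/ddl/database/default/table"
                          + "?user.name=alice");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // Print the JSON response, e.g. {"tables":["counted","processed"],"database":"default"}
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}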




Templeton - REST API
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop
                                Create new table “rawevents”

PUT http://…/v1/ddl/database/default/table/rawevents

Request body:
{"columns": [{ "name": "url", "type": "string" },
             { "name": "user", "type": "string" }],
 "partitionedBy": [{ "name": "ds", "type": "string" }]}

Response:
{
  "table": "rawevents",
  "database": "default"
}
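A matching Java sketch for the PUT, again with a placeholder host, port, and path prefix; it simply sends the JSON table description above as the request body.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CreateRawevents {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port, and path prefix; user.name is an assumption for unsecured setups
        URL url = new URL("http://templetonhost:50111/templeton/v1/ddl/database/default/table/rawevents"
                          + "?user.name=alice");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // The table definition from the slide: two string columns, partitioned by ds
        String body = "{\"columns\": [{\"name\": \"url\", \"type\": \"string\"},"
                    + " {\"name\": \"user\", \"type\": \"string\"}],"
                    + " \"partitionedBy\": [{\"name\": \"ds\", \"type\": \"string\"}]}";
        OutputStream out = conn.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();

        // Expect a JSON reply naming the new table, e.g. {"table":"rawevents","database":"default"}
        System.out.println("HTTP " + conn.getResponseCode());
    }
}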




Templeton - REST API
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop
                                Describe table “rawevents”




GET http://…/v1/ddl/database/default/table/rawevents

Response:
{
  "columns": [{"name": "url", "type": "string"},
              {"name": "user", "type": "string"}],
  "database": "default",
  "table": "rawevents"
}
• Included in HDP
• Not yet checked in, but you can find the code on Apache’s JIRA HCATALOG-182
Reading and Writing Data in Parallel
• Use Case: Users want
   – to read and write records in parallel between Hadoop and their parallel system
   – driven by their system
   – in a language independent way
   – without needing to understand Hadoop’s file formats
• Example: an MPP data store wants to read data out of Hadoop as
  HCatRecords for its parallel jobs
• What exists today
   – webhdfs
        –   Language independent
        –   Can move data in parallel
        –   Driven from the user side
         –   Moves only bytes, no understanding of file format (see the Java sketch after this list)
   – Sqoop
        –   Can move data in parallel
        –   Understands data format
        –   Driven from Hadoop side
        –   Requires connector or JDBC
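To make the webhdfs row above concrete, here is a rough Java sketch of pulling raw bytes over webhdfs. The namenode host, port, file path, and user are placeholders, and the client is left to interpret whatever file format the bytes happen to be in, which is exactly the limitation noted in the list.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsRead {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode host/port and file path; user.name works on unsecured clusters
        URL url = new URL("http://namenode:50070/webhdfs/v1/user/alice/rawevents/part-00000"
                          + "?op=OPEN&user.name=alice");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true);   // the namenode redirects the read to a datanode

        // webhdfs hands back raw bytes; the caller has to understand the file format itself
        InputStream in = conn.getInputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            System.out.write(buf, 0, n);
        }
        in.close();
    }
}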



HCatReader and HCatWriter

[Diagram: the master calls getHCatReader against HCatalog and gets back an HCatReader plus the set of input splits; each slave then calls read for its split and receives an Iterator<HCatRecord> over the data in HDFS.]

                     Right now all in Java, needs to be REST
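A rough Java sketch of what this looks like from the master and slave sides, written along the lines of the reader half of HCatalog's data-transfer API (org.apache.hcatalog.data.transfer). The class and method names follow that API, but treat them as illustrative; the details may differ from the version you have, and the table name is a placeholder.

import java.util.Iterator;
import java.util.Map;

import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hcatalog.data.transfer.HCatReader;
import org.apache.hcatalog.data.transfer.ReadEntity;
import org.apache.hcatalog.data.transfer.ReaderContext;

public class ParallelReadSketch {

    // Master side: ask HCatalog for a reader on the table and get back a
    // context that describes the input splits.
    ReaderContext prepareOnMaster(Map<String, String> config) throws Exception {
        ReadEntity entity = new ReadEntity.Builder()
                .withDatabase("default")
                .withTable("rawevents")   // placeholder table name
                .build();
        HCatReader masterReader = DataTransferFactory.getHCatReader(entity, config);
        return masterReader.prepareRead();   // serialize this and ship it to the slaves
    }

    // Slave side: rebuild a reader for one split and iterate over its records.
    void readOnSlave(ReaderContext context, int splitNumber) throws Exception {
        HCatReader reader = DataTransferFactory.getHCatReader(context, splitNumber);
        Iterator<HCatRecord> records = reader.read();
        while (records.hasNext()) {
            HCatRecord record = records.next();
            // hand the record off to the external system's own processing
        }
    }
}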
Storing Semi-/Unstructured Data

Table Users                        File Users
 Name    Zip                       {"name":"alice","zip":"93201"}
 Alice   93201                     {"name":"bob","zip":"76331"}
 Bob     76331                     {"name":"cindy"}
                                   {"zip":"87890"}

SQL against the table:
  select name, zip
  from users;

Pig against the file:
  A = load 'Users' as (name:chararray, zip:chararray);
  B = foreach A generate name, zip;




Storing Semi-/Unstructured Data

Table Users                        File Users
 Name    Zip                       {"name":"alice","zip":"93201"}
 Alice   93201                     {"name":"bob","zip":"76331"}
 Bob     76331                     {"name":"cindy"}
                                   {"zip":"87890"}

Pig with a declared schema:
  A = load 'Users' as (name:chararray, zip:chararray);
  B = foreach A generate name, zip;

SQL against the table:
  select name, zip
  from users;

Pig with no declared schema:
  A = load 'Users';
  B = foreach A generate name, zip;


Storing Semi-/Unstructured Data

Table Users                        File Users
 Name    Zip                       {"name":"alice","zip":"93201"}
 Alice   93201                     {"name":"bob","zip":"76331"}
 Bob     76331                     {"name":"cindy"}
                                   {"zip":"87890"}

Pig with a declared schema:
  A = load 'Users' as (name:chararray, zip:chararray);
  B = foreach A generate name, zip;

SQL against the table:
  select name, zip
  from users;

Pig with no declared schema:
  A = load 'Users';
  B = foreach A generate name, zip;




Hive ODBC/JDBC Today

[Diagram: JDBC and ODBC clients connect to a Hive Server, which runs queries in Hadoop.]

• JDBC client issue: you have to have Hive code on the client
• ODBC client issue: the open source version is not easy to use
• Hive Server issues:
  – Not concurrent
  – Not secure
  – Not scalable
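For reference, this is roughly what a client looks like against the original Hive server over JDBC; the host, port, and query are placeholders, while the driver class and URL form are the standard Hive ones. The client needs the Hive and Hadoop jars on its classpath, which is the "Hive code on the client" issue noted above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        // The original HiveServer JDBC driver; requires Hive (and Hadoop) jars on the client classpath
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection("jdbc:hive://hiveserver:10000/default", "", "");

        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("select name, zip from users");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}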



ODBC/JDBC Proposal

[Diagram: JDBC and ODBC clients connect to a REST Server, which runs jobs in Hadoop.]

• Provide robust open source JDBC and ODBC implementations
• REST Server:
  – Spawns the job inside the cluster
  – Runs the job as the submitting user
  – Works with security
  – Scaling web services is well understood




Questions






Editor's Notes

  • Storing Semi-/Unstructured Data (the three slides above): SQL and traditional relational tables are focused on data warehousing, with consistently structured data (i.e. every tuple is the same). Much of the strength of Pig and Hadoop is the ability to process vast amounts of semi-/unstructured data. With HCatalog we have made it easier for Pig and MapReduce users to interact with data in the data warehouse; we need to make it go the other way as well. The good news is that most of the pieces are in place, we just need to tie a few things together. Observation: much of this semi-/unstructured data records its own structure (Protocol Buffers, Thrift, Avro, JSON, etc.).
  • Hive ODBC/JDBC Today: "Not concurrent" means it runs one query at a time; "Not secure" means it runs as the Hive user.