Future of HCatalog
Alan F. Gates
@alanfgates




Who Am I?
• HCatalog committer and mentor
• Co-founder of Hortonworks
• Lead for Pig, Hive, and HCatalog at Hortonworks
• Pig committer and PMC Member
• Member of Apache Software Foundation and Incubator PMC
• Author of Programming Pig from O’Reilly




Hadoop Ecosystem

[Diagram: MapReduce, Hive, and Pig all sit on top of HDFS. MapReduce reads and writes through InputFormat/OutputFormat; Hive through its SerDe and InputFormat/OutputFormat, with a Metastore Client talking to the Metastore; Pig through its Load/Store functions.]

Opening up Metadata to MR & Pig

[Diagram: the same stack with HCatalog added. MapReduce now reads and writes through HCatInputFormat/HCatOutputFormat, and Pig through HCatLoader/HCatStorer; both share Hive's Metastore Client, SerDe, and InputFormat/OutputFormat to reach the Metastore and HDFS.]
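
The MapReduce half of this picture looks roughly like the sketch below, based on the HCatalog 0.4-era Java API (org.apache.hcatalog.mapreduce). The table name "rawevents" is borrowed from the later slides, and the mapper, reducer, and output setup are elided.

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hcatalog.mapreduce.HCatInputFormat;
   import org.apache.hcatalog.mapreduce.InputJobInfo;

   public class ReadFromHCat {
     public static void main(String[] args) throws Exception {
       Job job = new Job(new Configuration(), "read-from-hcat");
       // Point the job at a table; schema and storage format come from
       // the metastore rather than being hardcoded in the job.
       HCatInputFormat.setInput(job,
           InputJobInfo.create("default", "rawevents", null /* no partition filter */));
       job.setInputFormatClass(HCatInputFormat.class);
       // ... set mapper/reducer and output as usual; map() receives HCatRecords
       job.waitForCompletion(true);
     }
   }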


Templeton - REST API
•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop

                                Get a list of all tables in the default database:




GET http://…/v1/ddl/database/default/table

Response from Hadoop/HCatalog:

{
    "tables": ["counted","processed"],
    "database": "default"
}
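
Any HTTP client can make the same call; below is a minimal Java sketch using HttpURLConnection. The host, port, and "/templeton" path prefix are placeholder assumptions, and authentication parameters are omitted.

   import java.io.BufferedReader;
   import java.io.InputStreamReader;
   import java.net.HttpURLConnection;
   import java.net.URL;

   public class ListTables {
     public static void main(String[] args) throws Exception {
       // Placeholder host/port and path prefix; point this at your Templeton server.
       URL url = new URL("http://localhost:50111/templeton/v1/ddl/database/default/table");
       HttpURLConnection conn = (HttpURLConnection) url.openConnection();
       conn.setRequestMethod("GET");
       BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
       for (String line = in.readLine(); line != null; line = in.readLine()) {
         System.out.println(line); // {"tables": [...], "database": "default"}
       }
       in.close();
     }
   }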




Templeton - REST API
•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop
                                Create new table “rawevents”

PUT http://…/v1/ddl/database/default/table/rawevents

{"columns": [{ "name": "url", "type": "string" },
             { "name": "user", "type": "string"}],
 "partitionedBy": [{ "name": "ds", "type": "string" }]}

Response from Hadoop/HCatalog:

{
    "table": "rawevents",
    "database": "default"
}
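
The create-table call is the same pattern with a PUT and a JSON body; again a minimal Java sketch with placeholder host, port, and path prefix, and authentication omitted.

   import java.io.OutputStream;
   import java.net.HttpURLConnection;
   import java.net.URL;

   public class CreateTable {
     public static void main(String[] args) throws Exception {
       String body = "{\"columns\": [{\"name\": \"url\", \"type\": \"string\"},"
           + " {\"name\": \"user\", \"type\": \"string\"}],"
           + " \"partitionedBy\": [{\"name\": \"ds\", \"type\": \"string\"}]}";
       // Placeholder host/port and path prefix.
       URL url = new URL(
           "http://localhost:50111/templeton/v1/ddl/database/default/table/rawevents");
       HttpURLConnection conn = (HttpURLConnection) url.openConnection();
       conn.setRequestMethod("PUT");
       conn.setRequestProperty("Content-Type", "application/json");
       conn.setDoOutput(true);
       OutputStream out = conn.getOutputStream();
       out.write(body.getBytes("UTF-8"));
       out.close();
       // Expect 200 and {"table": "rawevents", "database": "default"}
       System.out.println(conn.getResponseCode());
     }
   }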




Templeton - REST API
•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop
                                Describe table “rawevents”




GET http://…/v1/ddl/database/default/table/rawevents

Response from Hadoop/HCatalog:

{
    "columns": [{"name": "url","type": "string"},
                {"name": "user","type": "string"}],
    "database": "default",
    "table": "rawevents"
}
•  Included in HDP
•  Not yet checked in, but you can find the code on Apache’s JIRA HCATALOG-182
Reading and Writing Data in Parallel
•  Use Case: Users want
   –  to read and write records in parallel between Hadoop and their parallel system
   –  driven by their system
   –  in a language independent way
   –  without needing to understand Hadoop’s file formats
•  Example: an MPP data store wants to read data out of Hadoop as
   HCatRecords for its parallel jobs
•  What exists today
   –  webhdfs
         –  Language independent
         –  Can move data in parallel
         –  Driven from the user side
         –  Moves only bytes, no understanding of file format
   –  Sqoop
         –  Can move data in parallel
         –  Understands data format
         –  Driven from Hadoop side
         –  Requires connector or JDBC



HCatReader and HCatWriter

[Diagram: the master asks HCatalog for an HCatReader via getHCatReader and obtains input splits; each slave then calls read and consumes an Iterator<HCatRecord> backed by HDFS.]

Right now all in Java, needs to be REST.
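
A sketch of how the flow in the diagram maps onto Java. The class and method names follow HCatalog's data transfer design (org.apache.hcatalog.data.transfer), and the database and table names are borrowed from earlier slides; treat the details as illustrative rather than final.

   import java.util.HashMap;
   import java.util.Iterator;
   import org.apache.hcatalog.data.HCatRecord;
   import org.apache.hcatalog.data.transfer.DataTransferFactory;
   import org.apache.hcatalog.data.transfer.HCatReader;
   import org.apache.hcatalog.data.transfer.ReadEntity;
   import org.apache.hcatalog.data.transfer.ReaderContext;

   public class ParallelRead {
     // On the master: ask HCatalog how to split the read.
     static ReaderContext prepare() throws Exception {
       ReadEntity entity = new ReadEntity.Builder()
           .withDatabase("default").withTable("rawevents").build();
       HCatReader master = DataTransferFactory.getHCatReader(
           entity, new HashMap<String, String>());
       return master.prepareRead(); // serializable; ship it to the slaves
     }

     // On slave number 'slave': read that slave's share of the records.
     static void readSlice(ReaderContext cntxt, int slave) throws Exception {
       HCatReader reader = DataTransferFactory.getHCatReader(cntxt, slave);
       Iterator<HCatRecord> itr = reader.read();
       while (itr.hasNext()) {
         HCatRecord rec = itr.next();
         System.out.println(rec.get(0)); // first column of the record
       }
     }
   }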
Storing Semi-/Unstructured Data

Table Users:
  Name    Zip
  Alice   93201
  Bob     76331

File Users (JSON, one record per line):
  {"name":"alice","zip":"93201"}
  {"name":"bob","zip":"76331"}
  {"name":"cindy"}
  {"zip":"87890"}

SQL against the table:
  select name, zip
  from users;

Pig against the file:
  A = load 'Users' as
      (name:chararray, zip:chararray);
  B = foreach A generate name, zip;




Storing Semi-/Unstructured Data

Table Users:
  Name    Zip
  Alice   93201
  Bob     76331

File Users (JSON, one record per line):
  {"name":"alice","zip":"93201"}
  {"name":"bob","zip":"76331"}
  {"name":"cindy"}
  {"zip":"87890"}

SQL against the table:
  select name, zip
  from users;

Pig with an inline schema:
  A = load 'Users' as
      (name:chararray, zip:chararray);
  B = foreach A generate name, zip;

Pig without a schema:
  A = load 'Users';
  B = foreach A generate name, zip;


Hive ODBC/JDBC Today

[Diagram: JDBC and ODBC clients connect to a Hive Server that fronts Hadoop.]

• JDBC client issue: you have to have Hive code on the client
• Hive Server issues: not concurrent, not secure, not scalable
• ODBC client issue: the open source version is not easy to use
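
To make the client-side dependency concrete, here is a minimal sketch of a client using the original Hive JDBC driver; host and port are placeholders. The Hive jars and their dependencies must sit on the client classpath, which is exactly the issue called out above.

   import java.sql.Connection;
   import java.sql.DriverManager;
   import java.sql.ResultSet;
   import java.sql.Statement;

   public class HiveJdbcClient {
     public static void main(String[] args) throws Exception {
       // Pre-HiveServer2 driver class; requires Hive code on the client.
       Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
       Connection conn = DriverManager.getConnection(
           "jdbc:hive://localhost:10000/default", "", "");
       Statement stmt = conn.createStatement();
       ResultSet rs = stmt.executeQuery("select name, zip from users");
       while (rs.next()) {
         System.out.println(rs.getString(1) + "\t" + rs.getString(2));
       }
       conn.close();
     }
   }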



ODBC/JDBC Proposal

[Diagram: JDBC and ODBC clients connect to a REST Server that fronts Hadoop. Proposal: provide robust open source JDBC and ODBC implementations.]

• Spawns job inside cluster
• Runs job as submitting user
• Works with security
• Scaling web services well understood
Questions



