Future of HCatalog - Hadoop Summit 2012

  • SQL and traditional relational tables are focused on data warehousing, with consistently structured data (i.e. every tuple is the same). Much of the strength of Pig and Hadoop is the ability to process vast amounts of semi-/unstructured data. With HCat we have made it easier for Pig and MR users to interact with data in the data warehouse. We need to make it go the other way as well. The good news is that most of the pieces are in place; we just need to tie a few things together. Observation: much of the semi-/unstructured data records its own structure (PB, Thrift, Avro, JSON, etc.).
  • Not concurrent: runs one query at a time. Not secure: runs as Hive user.

    1. Future of HCatalog. Alan F. Gates, @alanfgates
    2. Who Am I?
       • HCatalog committer and mentor
       • Co-founder of Hortonworks
       • Lead for Pig, Hive, and HCatalog at Hortonworks
       • Pig committer and PMC Member
       • Member of Apache Software Foundation and Incubator PMC
       • Author of Programming Pig from O’Reilly
       © Hortonworks Inc. 2012
    3. Hadoop Ecosystem (diagram)
       • MapReduce: InputFormat/OutputFormat
       • Hive: SerDe, InputFormat/OutputFormat, Metastore Client
       • Pig: Load/Store
       • Bottom layer: HDFS, Metastore
    4. Opening up Metadata to MR & Pig (diagram)
       • Top layer: MapReduce, Hive, Pig
       • MapReduce: HCatInputFormat/HCatOutputFormat
       • Pig: HCatLoader/HCatStorer
       • Shared layer: SerDe, InputFormat/OutputFormat, Metastore Client
       • Bottom layer: HDFS, Metastore
    5. Templeton - REST API
       • REST endpoints: databases, tables, partitions, columns, table properties
       • PUT to create/update, GET to list or describe, DELETE to drop
       Get a list of all tables in the default database:
       Request to Hadoop/HCatalog: GET http://…/v1/ddl/database/default/table
       Response: { "tables": ["counted", "processed"], "database": "default" }
    6. Templeton - REST API
       • REST endpoints: databases, tables, partitions, columns, table properties
       • PUT to create/update, GET to list or describe, DELETE to drop
       Create new table "rawevents":
       Request to Hadoop/HCatalog: PUT http://…/v1/ddl/database/default/table/rawevents
       Request body: {"columns": [{ "name": "url", "type": "string" }, { "name": "user", "type": "string" }], "partitionedBy": [{ "name": "ds", "type": "string" }]}
       Response: { "table": "rawevents", "database": "default" }
    7. Templeton - REST API
       • REST endpoints: databases, tables, partitions, columns, table properties
       • PUT to create/update, GET to list or describe, DELETE to drop
       Describe table "rawevents":
       Request to Hadoop/HCatalog: GET http://…/v1/ddl/database/default/table/rawevents
       Response: { "columns": [{"name": "url", "type": "string"}, {"name": "user", "type": "string"}], "database": "default", "table": "rawevents" }
       • Included in HDP
       • Not yet checked in, but you can find the code on Apache’s JIRA HCATALOG-182
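The three Templeton calls on slides 5 to 7 can be sketched as request builders. This is only an illustration under stated assumptions: the slides elide the host ("http://…"), so the base URL is a caller-supplied placeholder; the function names are hypothetical; the Content-Type header is an assumption; and any authentication parameters a real deployment needs are not shown on the slides and are omitted here. The code builds the requests without sending them.

```python
import json
import urllib.request


def ddl_url(base, database, table=None):
    """Build a /v1/ddl/... URL following the path pattern shown on the slides.

    `base` is a placeholder: the slides do not give the real host.
    """
    url = f"{base}/v1/ddl/database/{database}/table"
    return f"{url}/{table}" if table else url


def list_tables_request(base, database):
    # GET on the table collection lists the tables (slide 5).
    return urllib.request.Request(ddl_url(base, database), method="GET")


def create_table_request(base, database, table, columns, partitioned_by):
    # PUT with a JSON body creates the table (slide 6).
    body = json.dumps(
        {"columns": columns, "partitionedBy": partitioned_by}
    ).encode("utf-8")
    return urllib.request.Request(
        ddl_url(base, database, table),
        data=body,
        headers={"Content-Type": "application/json"},  # assumed, not from the slides
        method="PUT",
    )


def describe_table_request(base, database, table):
    # GET on a single table describes it (slide 7).
    return urllib.request.Request(ddl_url(base, database, table), method="GET")


if __name__ == "__main__":
    base = "http://templeton.example.invalid"  # hypothetical host
    req = create_table_request(
        base,
        "default",
        "rawevents",
        columns=[
            {"name": "url", "type": "string"},
            {"name": "user", "type": "string"},
        ],
        partitioned_by=[{"name": "ds", "type": "string"}],
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing one of these objects to `urllib.request.urlopen` would send it; that step is left out because no reachable endpoint is given in the slides.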
    8. Reading and Writing Data in Parallel
       • Use case: users want
         – to read and write records in parallel between Hadoop and their parallel system
         – driven by their system
         – in a language-independent way
         – without needing to understand Hadoop’s file formats
       • Example: an MPP data store wants to read data out of Hadoop as HCatRecords for its parallel jobs
       • What exists today
         – webhdfs
           – Language independent
           – Can move data in parallel
           – Driven from the user side
           – Moves only bytes, no understanding of file format
         – Sqoop
           – Can move data in parallel
           – Understands data format
           – Driven from Hadoop side
           – Requires connector or JDBC
    9. HCatReader and HCatWriter (diagram)
       • Master: calls getHCatReader on HCatalog and receives an HCatReader
       • Master: readInputSplits, with the splits going to the slaves
       • Each of three slaves: read, returning an Iterator<HCatRecord> over data in HDFS
       Right now all in Java, needs to be REST
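The master/slave flow on slide 9 can be illustrated with a small simulation. This is not the HCatalog Java API: `make_splits`, `read_split`, and `parallel_read` are hypothetical names, the "records" are plain Python values, and threads stand in for the slave processes. It only shows the shape of the pattern, where a master computes splits and each slave iterates over its own split independently.

```python
from concurrent.futures import ThreadPoolExecutor


def make_splits(records, num_splits):
    """Master side: partition the input into num_splits interleaved splits."""
    return [records[i::num_splits] for i in range(num_splits)]


def read_split(split):
    """Slave side: return an iterator over the records of one split."""
    return iter(split)


def parallel_read(records, num_slaves=3):
    """Compute splits once, then let each worker drain its own iterator."""
    splits = make_splits(records, num_slaves)
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        # pool.map preserves split order; each worker reads only its split.
        return list(pool.map(lambda s: list(read_split(s)), splits))


if __name__ == "__main__":
    for part in parallel_read(list(range(10)), num_slaves=3):
        print(part)
```

In the design on the slide, the point of the split step is that slaves never coordinate with each other; they only need the split description the master hands them.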
    10. Storing Semi-/Unstructured Data
        Table Users:
          Name   Zip
          Alice  93201
          Bob    76331
        File Users:
          {"name":"alice","zip":"93201"}
          {"name":"bob","zip":"76331"}
          {"name":"cindy"}
          {"zip":"87890"}
        Query against the table: select name, zip from users;
        Pig Latin against the file: A = load 'Users' as (name:chararray, zip:chararray); B = foreach A generate name, zip;
    11. Storing Semi-/Unstructured Data
        Table Users:
          Name   Zip
          Alice  93201
          Bob    76331
        File Users:
          {"name":"alice","zip":"93201"}
          {"name":"bob","zip":"76331"}
          {"name":"cindy"}
          {"zip":"87890"}
        Pig Latin with a declared schema: A = load 'Users' as (name:chararray, zip:chararray); B = foreach A generate name, zip;
        Query: select name, zip from users;
        Pig Latin without a declared schema: A = load 'Users'; B = foreach A generate name, zip;
    12. Storing Semi-/Unstructured Data
        Table Users:
          Name   Zip
          Alice  93201
          Bob    76331
        File Users:
          {"name":"alice","zip":"93201"}
          {"name":"bob","zip":"76331"}
          {"name":"cindy"}
          {"zip":"87890"}
        Pig Latin with a declared schema: A = load 'Users' as (name:chararray, zip:chararray); B = foreach A generate name, zip;
        Query: select name, zip from users;
        Pig Latin without a declared schema: A = load 'Users'; B = foreach A generate name, zip;
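To make the "File Users" example concrete, here is a minimal sketch of projecting name and zip from self-describing JSON records, the same selection the queries on slides 10 to 12 express. The `project` function and the `USERS_FILE` list are hypothetical, and returning None for an absent field is an assumption; the slides do not say how missing fields would be surfaced.

```python
import json


def project(lines, fields):
    """Yield one tuple per JSON record, with None where a field is absent."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield tuple(record.get(f) for f in fields)


# The four records shown under "File Users" on the slides.
USERS_FILE = [
    '{"name":"alice","zip":"93201"}',
    '{"name":"bob","zip":"76331"}',
    '{"name":"cindy"}',
    '{"zip":"87890"}',
]

if __name__ == "__main__":
    for row in project(USERS_FILE, ["name", "zip"]):
        print(row)
```

Because each record carries its own field names, no schema has to be declared up front, which is the observation the speaker notes make about PB, Thrift, Avro, and JSON.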
    13. Hive ODBC/JDBC Today (diagram)
        JDBC Client and ODBC Client connect to Hive Server, which connects to Hadoop.
        • JDBC Client issue: have to have Hive code on the client
        • Hive Server issues:
          – Not concurrent
          – Not secure
          – Not scalable
        • ODBC Client issue: open source version not easy to use
    14. ODBC/JDBC Proposal (diagram)
        JDBC Client and ODBC Client connect to a REST Server, which connects to Hadoop.
        • Clients: provide robust open source implementations
        • REST Server:
          – Spawns job inside cluster
          – Runs job as submitting user
          – Works with security
          – Scaling web services well understood
    15. Questions