Future of HCatalog

The initial work in HCatalog has allowed users to share their data in Hadoop regardless of the tools they use and has relieved them of needing to know where and how their data is stored. But there is much more to be done to deliver on the full promise of providing metadata and table management for Hadoop clusters. It should be easy to store and process semi-structured and unstructured data via HCatalog. We need interfaces and simple implementations of data life-cycle management tools. We need to deepen the integration with NoSQL and MPP data stores. And we need to be able to store larger metadata, such as partition-level statistics and user-generated metadata. This talk will cover these areas of growth and give an overview of how they might be approached.

Transcript

  • 1. Future of HCatalog. Alan F. Gates, @alanfgates
  • 2. Who Am I?
    • HCatalog committer and mentor
    • Co-founder of Hortonworks
    • Lead for Pig, Hive, and HCatalog at Hortonworks
    • Pig committer and PMC member
    • Member of the Apache Software Foundation and Incubator PMC
    • Author of Programming Pig from O'Reilly
  • 3. Hadoop Ecosystem
    [Architecture diagram] MapReduce (via InputFormat/OutputFormat), Hive (via its SerDe and InputFormat/OutputFormat), and Pig (via Load/Store functions) all read and write data in HDFS, but only Hive reaches the Metastore, through its Metastore client.
  • 4. Opening up Metadata to MR & Pig
    [Architecture diagram] With HCatalog, MapReduce gains HCatInputFormat/HCatOutputFormat and Pig gains HCatLoader/HCatStorer; all three tools now go through the Hive SerDe, InputFormat/OutputFormat, and Metastore client, so they share the Metastore as well as HDFS. (A MapReduce usage sketch follows below.)
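To make that concrete, here is a minimal sketch of a MapReduce job that reads the table "default.rawevents" (created later in this deck) through HCatalog rather than from raw files. The InputJobInfo.create signature follows the HCatalog API of this era and changed across releases, so treat it as illustrative.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hcatalog.data.HCatRecord;
    import org.apache.hcatalog.mapreduce.HCatInputFormat;
    import org.apache.hcatalog.mapreduce.InputJobInfo;

    public class ReadRawEvents {

      // The mapper sees HCatRecords; the schema comes from the metastore,
      // so the job never names a file format or an HDFS path for its input.
      public static class UrlMapper
          extends Mapper<WritableComparable, HCatRecord, Text, NullWritable> {
        @Override
        protected void map(WritableComparable key, HCatRecord value, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(new Text((String) value.get(0)), NullWritable.get()); // column 0 = url
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "read-rawevents");
        job.setJarByClass(ReadRawEvents.class);
        // Input is named by database and table, not by path and format.
        HCatInputFormat.setInput(job, InputJobInfo.create("default", "rawevents", null));
        job.setInputFormatClass(HCatInputFormat.class);
        job.setMapperClass(UrlMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }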
  • 5. Templeton - REST API
    • REST endpoints: databases, tables, partitions, columns, table properties
    • PUT to create/update, GET to list or describe, DELETE to drop
    Get a list of all tables in the default database:
      GET http://…/v1/ddl/database/default/table
      Response: { "tables": ["counted", "processed"], "database": "default" }
  • 6. Templeton - REST API
    • REST endpoints: databases, tables, partitions, columns, table properties
    • PUT to create/update, GET to list or describe, DELETE to drop
    Create a new table "rawevents":
      PUT http://…/v1/ddl/database/default/table/rawevents
      Body: {"columns": [{"name": "url", "type": "string"},
                         {"name": "user", "type": "string"}],
             "partitionedBy": [{"name": "ds", "type": "string"}]}
      Response: { "table": "rawevents", "database": "default" }
  • 7. Templeton - REST API
    • REST endpoints: databases, tables, partitions, columns, table properties
    • PUT to create/update, GET to list or describe, DELETE to drop
    Describe table "rawevents":
      GET http://…/v1/ddl/database/default/table/rawevents
      Response: { "columns": [{"name": "url", "type": "string"},
                              {"name": "user", "type": "string"}],
                  "database": "default", "table": "rawevents" }
    • Included in HDP
    • Not yet checked in, but you can find the code on Apache's JIRA HCATALOG-182
    (A sketch of calling these endpoints from Java follows below.)
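Because the interface is plain REST, any HTTP client will do. A minimal sketch of the "list tables" call from slide 5, using nothing but the JDK; the host and port (templeton.example.com:50111) are placeholders, and the /v1/ddl path is taken from the slides:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class TempletonClientSketch {
      public static void main(String[] args) throws Exception {
        // List all tables in the default database.
        URL url = new URL(
            "http://templeton.example.com:50111/v1/ddl/database/default/table");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");  // GET lists or describes; PUT creates; DELETE drops
        BufferedReader in = new BufferedReader(
            new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
          // Expect something like {"tables": ["counted", "processed"], "database": "default"}
          System.out.println(line);
        }
        in.close();
      }
    }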
  • 8. Reading and Writing Data in Parallel
    • Use case: users want
      – to read and write records in parallel between Hadoop and their parallel system
      – driven by their system
      – in a language-independent way
      – without needing to understand Hadoop's file formats
    • Example: an MPP data store wants to read data out of Hadoop as HCatRecords for its parallel jobs
    • What exists today:
      – webhdfs: language independent; can move data in parallel; driven from the user side; but moves only bytes, with no understanding of file format
      – Sqoop: can move data in parallel; understands the data format; but driven from the Hadoop side, and requires a connector or JDBC
  • 9. HCatReader and HCatWriter
    [Sequence diagram] The master calls getHCatReader against HCatalog and receives an HCatReader plus the input splits; each slave then calls read on its split and gets back an Iterator<HCatRecord>, with the records themselves coming from HDFS.
    Right now this is all in Java; it needs to be REST. (A sketch of the master/slave read flow follows below.)
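A rough sketch of the read flow on this slide from the caller's point of view. The class and method names (ReadEntity, DataTransferFactory, prepareRead) follow the Java data-transfer API that was taking shape in HCatalog at the time; treat the packages and signatures as illustrative rather than final.

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;
    import org.apache.hcatalog.data.HCatRecord;
    import org.apache.hcatalog.data.transfer.DataTransferFactory;
    import org.apache.hcatalog.data.transfer.HCatReader;
    import org.apache.hcatalog.data.transfer.ReadEntity;
    import org.apache.hcatalog.data.transfer.ReaderContext;

    public class ParallelReadSketch {
      public static void main(String[] args) throws Exception {
        // Master side: describe what to read and ask HCatalog for a reader.
        ReadEntity entity = new ReadEntity.Builder()
            .withDatabase("default").withTable("rawevents").build();
        Map<String, String> config = new HashMap<String, String>();
        HCatReader master = DataTransferFactory.getHCatReader(entity, config);
        ReaderContext context = master.prepareRead(); // split info, shipped to the slaves

        // Slave side: each slave is handed the serialized context plus its
        // slave number and pulls its share of the records straight from HDFS.
        int slaveNumber = 0; // in a real system, one per parallel worker
        HCatReader slave = DataTransferFactory.getHCatReader(context, slaveNumber);
        Iterator<HCatRecord> records = slave.read();
        while (records.hasNext()) {
          System.out.println(records.next());
        }
      }
    }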
  • 10.-12. Storing Semi-/Unstructured Data (one example built up across three slides)
    Table Users:
      Name  | Zip
      Alice | 93201
      Bob   | 76331
    File Users (one JSON record per line, fields sometimes missing):
      {"name":"alice","zip":"93201"}
      {"name":"bob","zip":"76331"}
      {"name":"cindy"}
      {"zip":"87890"}
    The same queries should work against either representation:
      SQL:               select name, zip from users;
      Pig, with schema:  A = load 'Users' as (name:chararray, zip:chararray);
                         B = foreach A generate name, zip;
      Pig, schema-less:  A = load 'Users';
                         B = foreach A generate name, zip;
    (A sketch of the resulting null semantics follows below.)
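One way to read what these slides are asking for: once a table can be backed by semi-structured data, a record with a missing field should simply surface that field as null, exactly as a table row would. A hypothetical sketch, assuming some future JSON-capable HCatalog storage handler backs the Users file; the null handling is the point, not the API.

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hcatalog.data.HCatRecord;

    public class UsersMapperSketch
        extends Mapper<WritableComparable, HCatRecord, Text, Text> {
      @Override
      protected void map(WritableComparable key, HCatRecord value, Context ctx)
          throws IOException, InterruptedException {
        String name = (String) value.get(0); // null for the {"zip":"87890"} record
        String zip  = (String) value.get(1); // null for cindy, who has no zip
        ctx.write(new Text(name == null ? "(no name)" : name),
                  new Text(zip == null ? "(no zip)" : zip));
      }
    }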
  • 13. Hive ODBC/JDBC Today
    [Architecture diagram] A JDBC client and an ODBC client connect to a Hive server in front of Hadoop.
    • Issue with the JDBC client: you have to have Hive JDBC code on the client (see the sketch below)
    • Issue with the ODBC client: the open source version is not easy to use
    • Issues with the Hive server: not concurrent, not secure, not scalable
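For context, this is roughly what "today" looks like from a Java client: the Hive JDBC driver jar, and the Hadoop and Hive client libraries it drags in, must all live on the client machine. A minimal sketch; the host and port are placeholders, and the driver class and URL scheme follow the Hive JDBC driver of this era.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcToday {
      public static void main(String[] args) throws Exception {
        // The whole Hive JDBC stack must be on the client's classpath.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
            "jdbc:hive://hiveserver.example.com:10000/default");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("select name, zip from users");
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        rs.close();
        stmt.close();
        conn.close();
      }
    }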
  • 14. ODBC/JDBC Proposal
    [Architecture diagram] JDBC and ODBC clients talk to a REST server sitting in front of Hadoop.
    Provide robust open source implementations where the REST server:
    • Spawns the job inside the cluster
    • Runs the job as the submitting user
    • Works with security
    • Scales: scaling web services is well understood
  • 15. Questions