Future of HCatalog - Hadoop Summit 2012

Slide notes

  • SQL and traditional relational tables are focused on data warehousing, with consistently structured data (i.e., every tuple is the same). Much of the strength of Pig and Hadoop is the ability to process vast amounts of semi- and unstructured data. With HCat we have made it easier for Pig and MapReduce users to interact with data in the data warehouse; we need to make it go the other way as well. The good news is that most of the pieces are in place; we just need to tie a few things together. Observation: much of the semi- and unstructured data records its own structure (Protocol Buffers, Thrift, Avro, JSON, etc.).
  • Not concurrent: runs one query at a time. Not secure: runs as the Hive user.

Future of HCatalog - Hadoop Summit 2012: Presentation Transcript

  • Future of HCatalog. Alan F. Gates, @alanfgates
  • Who Am I?
    - HCatalog committer and mentor
    - Co-founder of Hortonworks
    - Lead for Pig, Hive, and HCatalog at Hortonworks
    - Pig committer and PMC member
    - Member of the Apache Software Foundation and Incubator PMC
    - Author of Programming Pig from O'Reilly
  • Hadoop Ecosystem (diagram): MapReduce reads and writes HDFS files through InputFormat/OutputFormat. Pig does so through Load/Store functions, and Hive through its SerDe plus InputFormat/OutputFormat. Only Hive has a Metastore client and can reach the Metastore; MapReduce and Pig see just the bytes in HDFS.
  • Opening up Metadata to MR & Pig (diagram): MapReduce gains HCatInputFormat/HCatOutputFormat and Pig gains HCatLoader/HCatStorer. Both are built on Hive's SerDe, InputFormat/OutputFormat, and Metastore client, so all three tools now share both the data in HDFS and the table metadata in the Metastore.
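
    To make this concrete, below is a minimal sketch of a MapReduce job that reads a table through HCatInputFormat, written against the HCatalog 0.4-era Java API (names such as InputJobInfo.create varied across releases, and the table "rawevents" is borrowed from the Templeton slides, so treat it as illustrative rather than definitive):

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.io.WritableComparable;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
        import org.apache.hcatalog.data.HCatRecord;
        import org.apache.hcatalog.mapreduce.HCatInputFormat;
        import org.apache.hcatalog.mapreduce.InputJobInfo;

        public class CountEvents {

          // HCatInputFormat hands the mapper HCatRecords with the table's
          // schema already applied; no knowledge of the file format needed.
          public static class EventMapper
              extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
            @Override
            protected void map(WritableComparable key, HCatRecord value, Context ctx)
                throws IOException, InterruptedException {
              // Field 0 is "url" in the example table from the slides.
              ctx.write(new Text(value.get(0).toString()), new IntWritable(1));
            }
          }

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "count-events");
            job.setJarByClass(CountEvents.class);
            // Ask the metastore for table "rawevents" in database "default";
            // the third argument is an optional partition filter.
            HCatInputFormat.setInput(job,
                InputJobInfo.create("default", "rawevents", null));
            job.setInputFormatClass(HCatInputFormat.class);
            job.setMapperClass(EventMapper.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }
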
  • Templeton - REST API
    - REST endpoints: databases, tables, partitions, columns, table properties
    - PUT to create/update, GET to list or describe, DELETE to drop
    - Example: get a list of all tables in the default database:

          GET http://…/v1/ddl/database/default/table

      Hadoop/HCatalog responds:

          { "tables": ["counted","processed"], "database": "default" }
  • Templeton - REST API (continued): create a new table "rawevents":

        PUT http://…/v1/ddl/database/default/table/rawevents

        {"columns": [{ "name": "url", "type": "string" },
                     { "name": "user", "type": "string"}],
         "partitionedBy": [{ "name": "ds", "type": "string" }]}

    Hadoop/HCatalog responds:

        { "table": "rawevents", "database": "default" }
  • Templeton - REST API (continued): describe table "rawevents":

        GET http://…/v1/ddl/database/default/table/rawevents

    Hadoop/HCatalog responds:

        { "columns": [{"name": "url", "type": "string"},
                      {"name": "user", "type": "string"}],
          "database": "default",
          "table": "rawevents" }

    - Included in HDP
    - Not yet checked in, but you can find the code on Apache's JIRA HCATALOG-182
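
    Because the API is plain REST, any language with an HTTP library can drive it. A hypothetical Java client for the describe call above (the host is a placeholder; port 50111 and the user.name parameter follow Templeton's usual defaults, but verify them against your deployment):

        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import java.net.HttpURLConnection;
        import java.net.URL;

        public class DescribeTable {
          public static void main(String[] args) throws Exception {
            // Host is a placeholder; user.name identifies the caller
            // in unsecured setups.
            URL url = new URL("http://templeton.example.com:50111"
                + "/templeton/v1/ddl/database/default/table/rawevents"
                + "?user.name=alan");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            // The response body is the JSON document shown on the slide.
            BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
            for (String line; (line = in.readLine()) != null; ) {
              System.out.println(line);
            }
            in.close();
          }
        }
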
  • Reading and Writing Data in Parallel
    - Use case: users want to read and write records in parallel between Hadoop and their parallel system, driven by their system, in a language-independent way, without needing to understand Hadoop's file formats
    - Example: an MPP data store wants to read data out of Hadoop as HCatRecords for its parallel jobs
    - What exists today:
      - webhdfs: language independent; can move data in parallel; driven from the user side; but it moves only bytes, with no understanding of the file format (see the sketch below)
      - Sqoop: can move data in parallel and understands the data format; but it is driven from the Hadoop side and requires a connector or JDBC
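
    A quick illustration of the webhdfs limitation called out above: op=OPEN streams the raw bytes of a file, so the caller still has to parse whatever format is inside (host, port, and path here are placeholders):

        import java.io.InputStream;
        import java.net.HttpURLConnection;
        import java.net.URL;

        public class WebHdfsCat {
          public static void main(String[] args) throws Exception {
            // op=OPEN redirects to a datanode and streams the raw file;
            // webhdfs knows nothing about the records inside it.
            URL url = new URL("http://namenode.example.com:50070"
                + "/webhdfs/v1/user/alan/rawevents/part-00000?op=OPEN");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setInstanceFollowRedirects(true);
            InputStream in = conn.getInputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) > 0; ) {
              System.out.write(buf, 0, n);  // just bytes: parsing is on you
            }
            System.out.flush();
            in.close();
          }
        }
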
  • HCatReader and HCatWriter (diagram): the master asks HCatalog for an HCatReader (getHCatReader), reads the input splits, and hands one split to each slave; each slave then calls read and gets an Iterator<HCatRecord> over its slice of the data in HDFS. Right now this is all in Java; it needs to be REST.
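
    In Java this flow corresponds roughly to the HCatalog data-transfer API that was under development at the time; the sketch below uses the class names from that proposal (DataTransferFactory, ReadEntity, ReaderContext), which may differ from what finally shipped:

        import java.util.Iterator;
        import java.util.Map;
        import org.apache.hcatalog.data.HCatRecord;
        import org.apache.hcatalog.data.transfer.DataTransferFactory;
        import org.apache.hcatalog.data.transfer.HCatReader;
        import org.apache.hcatalog.data.transfer.ReadEntity;
        import org.apache.hcatalog.data.transfer.ReaderContext;

        public class ParallelRead {

          // On the master: ask HCatalog for a reader and ship the
          // resulting context to the slaves (it carries the splits).
          public static ReaderContext prepare(Map<String, String> config)
              throws Exception {
            ReadEntity entity = new ReadEntity.Builder()
                .withDatabase("default")
                .withTable("rawevents")
                .build();
            HCatReader reader = DataTransferFactory.getHCatReader(entity, config);
            return reader.prepareRead();
          }

          // On each slave: rebuild a reader for one split and iterate.
          public static void readSplit(ReaderContext cntxt, int slaveNum)
              throws Exception {
            HCatReader reader = DataTransferFactory.getHCatReader(cntxt, slaveNum);
            Iterator<HCatRecord> itr = reader.read();
            while (itr.hasNext()) {
              HCatRecord record = itr.next();
              // hand the record to the MPP system's ingest path
            }
          }
        }
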
  • Storing Semi-/Unstructured Data (three build slides): the same users, stored two ways.

    Table Users:

        Name   Zip
        Alice  93201
        Bob    76331

    File Users (self-describing JSON, one record per line):

        {"name":"alice","zip":"93201"}
        {"name":"bob","zip":"76331"}
        {"name":"cindy"}
        {"zip":"87890"}

    SQL against the table:

        select name, zip from users;

    Pig against the file, with a schema declared in the load:

        A = load 'Users' as (name:chararray, zip:chararray);
        B = foreach A generate name, zip;

    Pig against the file, letting each record describe itself:

        A = load 'Users';
        B = foreach A generate name, zip;
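
    The file side works because each JSON record names its own fields. As a small schema-on-read illustration outside of Pig, here is a hypothetical Java version using the Jackson library (the fields name and zip come from the slide; records that lack a field, like cindy's, simply yield null):

        import java.io.BufferedReader;
        import java.io.FileReader;
        import com.fasterxml.jackson.databind.JsonNode;
        import com.fasterxml.jackson.databind.ObjectMapper;

        public class ProjectNameZip {
          public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            BufferedReader in = new BufferedReader(new FileReader("Users"));
            for (String line; (line = in.readLine()) != null; ) {
              // Each line describes its own structure; no table schema needed.
              JsonNode rec = mapper.readTree(line);
              String name = rec.has("name") ? rec.get("name").asText() : null;
              String zip  = rec.has("zip")  ? rec.get("zip").asText()  : null;
              System.out.println(name + "\t" + zip);  // missing fields print as null
            }
            in.close();
          }
        }
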
  • Hive ODBC/JDBC Today (diagram):
    - A JDBC client talks to the Hive Server inside Hadoop; issue: the client has to have Hive JDBC code (see the sketch below)
    - Hive Server issues: not concurrent, not secure, not scalable
    - ODBC client issue: the open source version is not easy to use
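
    The "client has to have Hive JDBC code" issue looks like this in practice. The sketch below is a standard HiveServer1-era JDBC client (driver class and URL scheme are from that era; host and port are placeholders); every machine that runs it must carry the Hive driver jar and its Hadoop dependencies:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class HiveJdbcClient {
          public static void main(String[] args) throws Exception {
            // The driver jar (and its transitive Hadoop/Thrift deps) must
            // be on every client's classpath: the problem the slide calls out.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                "jdbc:hive://hiveserver.example.com:10000/default", "", "");
            Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery("select name, zip from users");
            while (rs.next()) {
              System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
            rs.close();
            stmt.close();
            con.close();
          }
        }
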
  • ODBC/JDBC Proposal (diagram): provide robust open source JDBC and ODBC implementations that speak REST to a server in the Hadoop cluster
    - Spawns the job inside the cluster
    - Runs the job as the submitting user
    - Works with security
    - Scaling web services is well understood
  • Questions