HCatalog: Table Management for Hadoop - CHUG - 20120917
1. HCatalog: Table Management for Hadoop
   Alan F. Gates (@alanfgates)
2. Who Am I?
   • HCatalog committer and mentor
   • Co-founder of Hortonworks
   • Lead for Pig, Hive, and HCatalog at Hortonworks
   • Pig committer and PMC member
   • Member of the Apache Software Foundation and Incubator PMC
   • Author of Programming Pig from O'Reilly
   © Hortonworks Inc. 2012
3. Who Are You?
4. Many Data Tools
   • MapReduce: early adopters, non-relational algorithms, performance-sensitive applications
   • Pig: ETL, data modeling, iterative algorithms
   • Hive: analysis, connectors to BI tools
   Strength: pick the right tool for your application.
   Weakness: hard for users to share their data.
5. Users: Data Sharing is Hard
   This is programmer Bob; he uses Pig to crunch data. This is analyst Joe; he uses Hive to build reports and answer ad hoc queries.
   Joe: "Bob, I need today's data."
   Bob: "Ok."
   Joe: "Hmm, is it done yet? Where is it? What format did you use to store it today? Is it compressed? And can you help me load it into Hive? I can never remember all the parameters I have to pass to that alter table command."
   Bob: "Dude, we need HCatalog."
   (Photo credit: totalAldo via Flickr)
6. Tool Comparison
   Feature        | MapReduce       | Pig                                   | Hive
   Record format  | Key-value pairs | Tuple                                 | Record
   Data model     | User defined    | int, float, string, bytes, maps,      | int, float, string, maps,
                  |                 | tuples, bags                          | structs, lists
   Schema         | Encoded in app  | Declared in script or read by loader  | Read from metadata
   Data location  | Encoded in app  | Declared in script                    | Read from metadata
   Data format    | Encoded in app  | Declared in script                    | Read from metadata
   • Pig and MR users need to know a lot to write their apps
   • When data schema, location, or format change, Pig and MR apps must be rewritten, retested, and redeployed
   • Hive users have to load data from Pig/MR users to have access to it
7. How Hadoop Apps Access Data (diagram)
   • MapReduce: InputFormat/OutputFormat, reading and writing HDFS directly
   • Hive: SerDe plus InputFormat/OutputFormat for data, and a Metastore Client talking to the Metastore for metadata
   • Pig: Load/Store functions, reading and writing HDFS directly
8. Opening up Metadata to MR & Pig (diagram)
   • MapReduce gains HCatInputFormat/HCatOutputFormat
   • Pig gains HCatLoader/HCatStorer
   • Both layer on Hive's SerDe, InputFormat/OutputFormat, and Metastore Client, so all three tools share data in HDFS and metadata in the Metastore
9. Tools With HCatalog
   Feature        | MapReduce + HCatalog        | Pig + HCatalog              | Hive
   Record format  | Record                      | Tuple                       | Record
   Data model     | int, float, string, maps,   | int, float, string, bytes,  | int, float, string, maps,
                  | structs, lists              | maps, tuples, bags          | structs, lists
   Schema         | Read from metadata          | Read from metadata          | Read from metadata
   Data location  | Read from metadata          | Read from metadata          | Read from metadata
   Data format    | Read from metadata          | Read from metadata          | Read from metadata
   • Pig/MR users can read schema from metadata
   • Pig/MR users are insulated from schema, location, and format changes
   • All users have access to other users' data as soon as it is committed
10. Pig Example
    Assume you want to count how many times each of your users went to each of your URLs:
      raw     = load '/data/rawevents/20120530' as (url, user);
      botless = filter raw by myudfs.NotABot(user);
      grpd    = group botless by (url, user);
      cntd    = foreach grpd generate flatten(group), COUNT(botless);
      store cntd into '/data/counted/20120530';
11. Pig Example, Using HCatalog
    The same count, rewritten with HCatalog: no need to know the file location, no need to declare a schema, and a partition filter on ds replaces the date-encoded path:
      raw     = load 'rawevents' using HCatLoader();
      botless = filter raw by myudfs.NotABot(user) and ds == '20120530';
      grpd    = group botless by (url, user);
      cntd    = foreach grpd generate flatten(group), COUNT(botless);
      store cntd into 'counted' using HCatStorer();
12. Working with HCatalog in MapReduce
    Setting input (the database and table to read from, and a filter specifying which partitions to read):
      HCatInputFormat.setInput(job,
          InputJobInfo.create(dbName, tableName, filter));
    Setting output (partitionSpec specifies which partition to write):
      HCatOutputFormat.setOutput(job,
          OutputJobInfo.create(dbName, tableName, partitionSpec));
    Obtaining the schema:
      schema = HCatInputFormat.getOutputSchema();
    The key is unused; the value is an HCatRecord, an ordered list with get/set that can access fields by name:
      String url = value.get("url", schema);
      output.set("cnt", schema, cnt);
13. Relationship to Hive
    • If you are a Hive user, you can transition to HCatalog with no metadata modifications (true starting with version 0.4)
      – Uses Hive's SerDes for data translation (starting in 0.4)
      – Uses Hive SQL for data definition (DDL)
    • For non-Hive users a command line tool is provided that uses Hive DDL:
        hcat -e "create table rawevents (url string, user string);"
      – Starting in Pig 0.11, you will be able to issue DDL commands from Pig
    • Provides a different security implementation; security is delegated to the underlying storage
      – Currently integrated with HDFS
      – Integration with HBase in progress
      – In the near future you will be able to choose to use the Hive authorization model if you wish
14. Architecture (diagram)
    HCatInputFormat, HCatLoader, and the HCat CLI all talk to the Hive Metastore Server, which stores its metadata in MySQL.
15. WebHCat REST API
    • FKA Templeton
    • REST endpoints: databases, tables, partitions, columns, table properties
    • PUT to create/update, GET to list or describe, DELETE to drop
    Create a new table "rawevents" in the default database:
      PUT http://…/v1/ddl/database/default/table/rawevents
      {"columns": [{"name": "url", "type": "string"},
                   {"name": "user", "type": "string"}],
       "partitionedBy": [{"name": "ds", "type": "string"}]}
      response: {"table": "rawevents", "database": "default"}
    Get a list of all tables in the default database:
      GET http://…/v1/ddl/database/default/table
      response: {"tables": ["counted", "processed"], "database": "default"}
    Describe table "rawevents":
      GET http://…/v1/ddl/database/default/table/rawevents
      response: {"columns": [{"name": "url", "type": "string"},
                             {"name": "user", "type": "string"}],
                 "database": "default", "table": "rawevents"}
    • Included in HDP
    • Checked into trunk; will be part of HCatalog in the 0.5 release
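The PUT body above is plain JSON, so clients can build it programmatically. A minimal sketch in Python, assuming the table layout from the slide; the helper name `table_ddl` is mine, not part of WebHCat:

```python
import json

def table_ddl(columns, partitioned_by=None):
    """Build a WebHCat create-table request body from (name, type) pairs."""
    body = {"columns": [{"name": n, "type": t} for n, t in columns]}
    if partitioned_by:
        body["partitionedBy"] = [{"name": n, "type": t} for n, t in partitioned_by]
    return json.dumps(body)

# The "rawevents" table from the slide: two string columns, partitioned by ds.
payload = table_ddl([("url", "string"), ("user", "string")],
                    partitioned_by=[("ds", "string")])
```

An HTTP client would PUT this payload to the v1/ddl/database/default/table/rawevents endpoint; the call itself is omitted here because the slide elides the server host.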
16. HCatalog Project
    • HCatalog is an Apache Incubator project
    • Version 0.4.0-incubating released May 2012
      – Hive/Pig/MapReduce integration
      – Support for any data format with a SerDe (Text, Sequence, RCFile, and JSON SerDes included)
      – Notification via JMS
      – Initial HBase integration
17. Current Work
    • Improve HCat security
      – Many operations in Hive are not checked for authorization (e.g. create index, create view)
      – Security checks are in the client
      – Hive grant/revoke operations do not work properly with the HCat implementation
    • Make the Hive Thrift metastore server highly available (diagram: Hive clients → Metastore Server → MySQL)
18. Hive ODBC/JDBC Today (diagram)
    JDBC and ODBC clients connect to a Hive Server in front of Hadoop.
    • JDBC issue: you have to have Hive code on the client
    • ODBC issue: the open source version is not easy to use
    • Hive Server issues: not concurrent, not secure, not scalable
19. ODBC/JDBC Proposal (diagram)
    Provide robust open source JDBC and ODBC implementations that talk to a REST server in front of Hadoop:
    • Spawns the job inside the cluster
    • Runs the job as the submitting user
    • Works with security
    • Scaling web services is well understood
20. Reading and Writing Data in Parallel
    • Use case: users want
      – to read and write records in parallel between Hadoop and their parallel system
      – driven by their system
      – in a language-independent way
      – without needing to understand Hadoop's file formats
    • Example: an MPP data store wants to read data out of Hadoop as HCatRecords for its parallel jobs
    • What exists today:
      – webhdfs: language independent, can move data in parallel, driven from the user side, but moves only bytes with no understanding of file format
      – Sqoop: can move data in parallel and understands data format, but is driven from the Hadoop side and requires a connector or JDBC
21. HCatReader and HCatWriter (diagram)
    The master calls getHCatReader on HCatalog and receives an HCatReader; readInput yields input splits, which the master hands to the slaves. Each slave then calls read on its split and gets an Iterator<HCatRecord> over the data in HDFS.
    Right now this is all in Java; it needs to be REST.
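The control flow in the diagram, a master dividing the dataset into splits and each worker iterating the records of its split, can be sketched generically. This is an illustration of the pattern, not the HCatalog API; all names and the in-memory "dataset" are mine:

```python
# Illustrative sketch of the parallel-reader pattern: a master obtains
# splits describing disjoint slices of a dataset, and each worker turns
# its split into an iterator of records.

def get_splits(records, num_workers):
    """Master side: divide the dataset into one split per worker."""
    return [records[i::num_workers] for i in range(num_workers)]

def read(split):
    """Worker side: iterate over the records in one split."""
    return iter(split)

data = [{"url": "a", "user": "alice"},
        {"url": "b", "user": "bob"},
        {"url": "c", "user": "cindy"}]
splits = get_splits(data, 2)
# Every record is seen exactly once across all workers.
seen = [rec for split in splits for rec in read(split)]
```

In the real system the splits describe HDFS file regions rather than list slices, and the iterators yield HCatRecords, which is why the slide notes the interface needs a language-independent (REST) form.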
22–24. Storing Semi-/Unstructured Data (three build-up slides with the same content)
    Table Users:
      Name  | Zip
      Alice | 93201
      Bob   | 76331
    File Users (JSON, one record per line):
      {"name":"alice","zip":"93201"}
      {"name":"bob","zip":"76331"}
      {"name":"cindy"}
      {"zip":"87890"}
    The same data can be queried in Hive:
      select name, zip from users;
    or in Pig, with or without a declared schema:
      A = load 'Users' as (name:chararray, zip:chararray);
      B = foreach A generate name, zip;

      A = load 'Users';
      B = foreach A generate name, zip;
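The point of these slides is that records missing a field still fit the declared schema, with nulls in the gaps. A small Python sketch of that projection, using the file contents from the slide; the loader logic is illustrative, not HCatalog's JSON SerDe:

```python
import json

SCHEMA = ("name", "zip")

def load_users(lines):
    """Project each JSON record onto the declared schema; absent fields become None."""
    return [tuple(json.loads(line).get(col) for col in SCHEMA) for line in lines]

file_users = [
    '{"name":"alice","zip":"93201"}',
    '{"name":"bob","zip":"76331"}',
    '{"name":"cindy"}',
    '{"zip":"87890"}',
]
rows = load_users(file_users)
```

The record with no zip and the record with no name both load cleanly, which is what lets a fixed table schema sit over semi-structured files.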
25. Data Lifecycle Management (diagram): cleaning, archiving, replication, compaction
26. Questions