HCatalog: Table Management for Hadoop - CHUG - 20120917
Presentation Transcript

• HCatalog: Table Management for Hadoop. Alan F. Gates, @alanfgates
• Who Am I?
  – HCatalog committer and mentor
  – Co-founder of Hortonworks
  – Lead for Pig, Hive, and HCatalog at Hortonworks
  – Pig committer and PMC member
  – Member of the Apache Software Foundation and Incubator PMC
  – Author of Programming Pig from O'Reilly
  © Hortonworks Inc. 2012
• Who Are You?
• Many Data Tools
  – MapReduce: early adopters, non-relational algorithms, performance-sensitive applications
  – Pig: ETL, data modeling, iterative algorithms
  – Hive: analysis, connectors to BI tools
  Strength: pick the right tool for your application
  Weakness: hard for users to share their data
• Users: Data Sharing is Hard
  This is programmer Bob; he uses Pig to crunch data. This is analyst Joe; he uses Hive to build reports and answer ad-hoc queries.
  Joe: "Bob, I need today's data." Bob: "Ok." Joe, later: "Hmm, is it done yet? Where is it? What format did you use to store it today? Is it compressed? And can you help me load it into Hive? I can never remember all the parameters I have to pass to that alter table command." Bob: "Dude, we need HCatalog."
  Photo credit: totalAldo via Flickr
• Tool Comparison

  Feature         MapReduce         Pig                   Hive
  Record format   Key-value pairs   Tuple                 Record
  Data model      User defined      int, float, string,   int, float, string,
                                    bytes, maps,          maps, structs, lists
                                    tuples, bags
  Schema          Encoded in app    Declared in script    Read from metadata
                                    or read by loader
  Data location   Encoded in app    Declared in script    Read from metadata
  Data format     Encoded in app    Declared in script    Read from metadata

  – Pig and MR users need to know a lot to write their apps
  – When data schema, location, or format change, Pig and MR apps must be rewritten, retested, and redeployed
  – Hive users have to load data from Pig/MR users to have access to it
• How Hadoop Apps Access Data
  MapReduce talks to HDFS through InputFormat/OutputFormat. Pig talks to HDFS through its Load/Store functions. Hive talks to HDFS through its SerDe plus InputFormat/OutputFormat, and reaches the Metastore through the Metastore client. Only Hive sees the metadata.
• Opening up Metadata to MR & Pig
  With HCatalog, MapReduce uses HCatInputFormat/HCatOutputFormat and Pig uses HCatLoader/HCatStorer. Both are layered on Hive's SerDe, InputFormat/OutputFormat, and Metastore client, so all three tools share the data in HDFS and the metadata in the Metastore.
• Tools With HCatalog

  Feature         MapReduce + HCatalog   Pig + HCatalog        Hive
  Record format   Record                 Tuple                 Record
  Data model      int, float, string,    int, float, string,   int, float, string,
                  maps, structs, lists   bytes, maps,          maps, structs, lists
                                         tuples, bags
  Schema          Read from metadata     Read from metadata    Read from metadata
  Data location   Read from metadata     Read from metadata    Read from metadata
  Data format     Read from metadata     Read from metadata    Read from metadata

  – Pig/MR users can read schema from metadata
  – Pig/MR users are insulated from schema, location, and format changes
  – All users have access to other users' data as soon as it is committed
• Pig Example
  Assume you want to count how many times each of your users went to each of your URLs:

  raw = load '/data/rawevents/20120530' as (url, user);
  botless = filter raw by myudfs.NotABot(user);
  grpd = group botless by (url, user);
  cntd = foreach grpd generate flatten(group), COUNT(botless);
  store cntd into '/data/counted/20120530';
• Pig Example, Using HCatalog
  The same count with HCatalog: no need to know the file location, no need to declare a schema, and the date predicate becomes a partition filter:

  raw = load 'rawevents' using HCatLoader();
  botless = filter raw by myudfs.NotABot(user) and ds == '20120530';
  grpd = group botless by (url, user);
  cntd = foreach grpd generate flatten(group), COUNT(botless);
  store cntd into 'counted' using HCatStorer();
• Working with HCatalog in MapReduce
  Setting input (the database and table to read, plus a filter specifying which partitions to read):

  HCatInputFormat.setInput(job,
      InputJobInfo.create(dbName, tableName, filter));

  Setting output (the partition spec specifies which partition to write):

  HCatOutputFormat.setOutput(job,
      OutputJobInfo.create(dbName, tableName, partitionSpec));

  Obtaining the schema:

  schema = HCatInputFormat.getOutputSchema();

  The key is unused; the value is an HCatRecord, an ordered list with get/set that can access fields by name through the schema:

  String url = value.get("url", schema);
  output.set("cnt", schema, cnt);
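The name-based field access above can be sketched outside Java. A minimal Python illustration follows; this is not the real HCatRecord API, and the class names and layout here are invented for illustration only: a record stays an ordered list of values, and a schema maps field names to positions.

```python
# Illustrative sketch (not the real HCatalog Java API): an HCatRecord-like
# positional record, where a schema maps field names to positions so that
# code reads fields by name and survives column reordering.

class Schema:
    def __init__(self, field_names):
        # Map each field name to its position in the record.
        self.pos = {name: i for i, name in enumerate(field_names)}

class Record:
    def __init__(self, values):
        self.values = list(values)

    def get(self, name, schema):
        # Look up the field's position in the schema, then index the record.
        return self.values[schema.pos[name]]

    def set(self, name, schema, value):
        self.values[schema.pos[name]] = value

schema = Schema(["url", "user", "cnt"])
rec = Record(["http://example.com", "alice", 0])
rec.set("cnt", schema, 42)
print(rec.get("url", schema), rec.get("cnt", schema))
```

The point of the indirection is that the application names fields; only the schema, read from the metastore, knows where each field sits.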
• Relationship to Hive
  – If you are a Hive user, you can transition to HCatalog with no metadata modifications (true starting with version 0.4)
    – Uses Hive's SerDes for data translation (starting in 0.4)
    – Uses Hive SQL for data definition (DDL)
  – For non-Hive users a command line tool is provided that uses Hive DDL:
    hcat -e "create table rawevents (url string, user string);"
  – Starting in Pig 0.11, you will be able to issue DDL commands from Pig
  – Provides a different security implementation; security is delegated to the underlying storage
    – Currently integrated with HDFS
    – Integration with HBase in progress
    – In the near future you will be able to choose the Hive authorization model if you wish
• Architecture
  HCatInputFormat, HCatLoader, and the HCat CLI all talk to the Hive Metastore server, which stores its metadata in MySQL.
• WebHCat REST API
  – Formerly known as Templeton
  – REST endpoints: databases, tables, partitions, columns, table properties
  – PUT to create/update, GET to list or describe, DELETE to drop

  Create a new table "rawevents" in the default database:

  PUT http://…/v1/ddl/database/default/table/rawevents
  {"columns": [{"name": "url", "type": "string"},
               {"name": "user", "type": "string"}],
   "partitionedBy": [{"name": "ds", "type": "string"}]}

  Get a list of all tables in the default database:

  GET http://…/v1/ddl/database/default/table
  {"tables": ["counted", "processed", "rawevents"],
   "database": "default"}

  Describe table "rawevents":

  GET http://…/v1/ddl/database/default/table/rawevents
  {"columns": [{"name": "url", "type": "string"},
               {"name": "user", "type": "string"}],
   "database": "default",
   "table": "rawevents"}

  – Included in HDP
  – Checked into trunk; will be part of the HCatalog 0.5 release
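The PUT request above can be assembled with any HTTP client. A minimal Python sketch of building the URL and JSON body is below; the host and port are hypothetical, and only the path and payload follow the slide:

```python
import json

# A minimal sketch of building the WebHCat (Templeton) table-creation
# request shown on the slide. The base URL is a placeholder.

base = "http://webhcat.example.com:50111"  # hypothetical host and port
path = "/v1/ddl/database/default/table/rawevents"

body = {
    "columns": [
        {"name": "url", "type": "string"},
        {"name": "user", "type": "string"},
    ],
    # Partition column; partitioned reads filter on this.
    "partitionedBy": [{"name": "ds", "type": "string"}],
}

payload = json.dumps(body)
print("PUT", base + path)
print(payload)
```

Sending the request (e.g. with urllib or curl) is then a plain HTTP PUT with this body and a JSON content type.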
• HCatalog Project
  – HCatalog is an Apache Incubator project
  – Version 0.4.0-incubating released May 2012
    – Hive/Pig/MapReduce integration
    – Support for any data format with a SerDe (Text, Sequence, RCFile, and JSON SerDes included)
    – Notification via JMS
    – Initial HBase integration
• Current Work
  – Improve HCat security
    – Many operations in Hive are not checked for authorization (e.g. create index, create view)
    – Security checks are in the client
    – Hive grant/revoke operations do not work properly with the HCat implementation
  – Make the Hive Thrift metastore server highly available (Hive clients talk to the Metastore server, which stores metadata in MySQL)
• Hive ODBC/JDBC Today
  JDBC and ODBC clients connect to the Hive server in front of Hadoop. Issue on the JDBC side: the client has to have the Hive JDBC code. Issues with the Hive server: not concurrent, not secure, not scalable. Issue on the ODBC side: the open-source version is not easy to use.
• ODBC/JDBC Proposal
  Provide robust open-source implementations in which JDBC and ODBC clients talk to a REST server in front of Hadoop. The REST server:
  – spawns the job inside the cluster
  – runs the job as the submitting user
  – works with security
  – scales, since scaling web services is well understood
• Reading and Writing Data in Parallel
  – Use case: users want
    – to read and write records in parallel between Hadoop and their parallel system
    – driven by their system
    – in a language-independent way
    – without needing to understand Hadoop's file formats
  – Example: an MPP data store wants to read data out of Hadoop as HCatRecords for its parallel jobs
  – What exists today:
    – webhdfs: language independent, can move data in parallel, driven from the user side, but moves only bytes with no understanding of file format
    – Sqoop: can move data in parallel and understands the data format, but is driven from the Hadoop side and requires a connector or JDBC
• HCatReader and HCatWriter
  The master calls getHCatReader against HCatalog and receives input splits; each slave then uses an HCatReader to read its own split from HDFS as an Iterator<HCatRecord>. Right now this is all in Java; it needs to be REST.
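The master/slave flow on this slide can be sketched in Python; the real HCatReader API is Java, and the function names below are invented for illustration. The master partitions the data into splits, and each worker independently iterates the records of its own split:

```python
# Illustrative sketch of the split-based parallel read pattern: a master
# divides the dataset into splits, and each worker iterates one split.

def get_splits(records, n_workers):
    # Master side: divide the record list into contiguous splits,
    # one per worker (the last split may be shorter).
    size = (len(records) + n_workers - 1) // n_workers
    return [records[i:i + size] for i in range(0, len(records), size)]

def read_split(split):
    # Worker side: an iterator over the records of one split,
    # analogous to Iterator<HCatRecord>.
    for record in split:
        yield record

data = [("url%d" % i, "user%d" % i) for i in range(10)]
splits = get_splits(data, 3)
counts = [sum(1 for _ in read_split(s)) for s in splits]
print(counts)  # → [4, 4, 2]
```

The design point is that only split assignment is centralized; the record iteration happens on each worker with no coordination, which is what lets an external parallel system drive the transfer.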
• Storing Semi-/Unstructured Data
  Table Users:

  Name   Zip
  Alice  93201
  Bob    76331

  File Users:

  {"name":"alice","zip":"93201"}
  {"name":"bob","zip":"76331"}
  {"name":"cindy"}
  {"zip":"87890"}

  Hive against the table:

  select name, zip from users;

  Pig against the file:

  A = load 'Users' as (name:chararray, zip:chararray);
  B = foreach A generate name, zip;
• Storing Semi-/Unstructured Data (continued)
  With the schema kept in metadata, the Pig script no longer needs to declare it:

  A = load 'Users';
  B = foreach A generate name, zip;

  select name, zip from users;

  Records that omit a field (cindy has no zip; the last record has no name) yield nulls for that field.
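The missing-field behavior of the JSON file above can be illustrated with a short Python sketch; the projection logic here is illustrative, not the actual Pig or Hive loader:

```python
import json

# A small sketch of the semi-structured "Users" file above: JSON records may
# omit fields, and projecting (name, zip) yields None where a field is
# missing, as a loader reading this data without a fixed schema would.

lines = [
    '{"name":"alice","zip":"93201"}',
    '{"name":"bob","zip":"76331"}',
    '{"name":"cindy"}',
    '{"zip":"87890"}',
]

rows = [(rec.get("name"), rec.get("zip"))
        for rec in (json.loads(line) for line in lines)]
print(rows)
```

Every record still produces a row; absent fields simply come back as nulls rather than errors.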
• Data Lifecycle Management
  – Cleaning
  – Archiving
  – Replication
  – Compaction
• Questions