Future of HCatalog
Alan F. Gates
@alanfgates




Who Am I?
• HCatalog committer and mentor
• Co-founder of Hortonworks
• Lead for Pig, Hive, and HCatalog at Hortonworks
• Pig committer and PMC Member
• Member of Apache Software Foundation and Incubator PMC
• Author of Programming Pig from O’Reilly




Hadoop Ecosystem

[Diagram: MapReduce, Hive, and Pig all read and write data in HDFS. MapReduce goes through InputFormat/OutputFormat, Hive through its SerDe plus InputFormat/OutputFormat, and Pig through its Load/Store functions. Only Hive talks to the Metastore, via its Metastore client.]




Opening up Metadata to MR & Pig

[Diagram: with HCatalog, MapReduce uses HCatInputFormat/HCatOutputFormat and Pig uses HCatLoader/HCatStorer. Both sit on top of Hive's SerDe, InputFormat/OutputFormat, and Metastore client, so MapReduce, Hive, and Pig all share the same Metastore and the same data in HDFS.]


Templeton - REST API
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop

                                Get a list of all tables in the default database:




GET http://…/v1/ddl/database/default/table

Response:
{
  "tables": ["counted", "processed"],
  "database": "default"
}
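As an illustration, a minimal Java client for this call could look like the sketch below. The host, port, and path prefix ahead of /v1 are placeholders for wherever Templeton is running, and the user.name query parameter (an assumption here, typically used on unsecured clusters) identifies the caller.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class ListTables {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port, and path prefix; point this at your Templeton server
        URL url = new URL("http://templetonhost:50111/templeton/v1/ddl/database/default/table"
                          + "?user.name=alice");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // Print the JSON response, e.g. {"tables":["counted","processed"],"database":"default"}
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}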




Templeton - REST API
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop
                                Create new table “rawevents”

PUT http://…/v1/ddl/database/default/table/rawevents

Request body:
{"columns": [{ "name": "url", "type": "string" },
             { "name": "user", "type": "string" }],
 "partitionedBy": [{ "name": "ds", "type": "string" }]}

Response:
{
  "table": "rawevents",
  "database": "default"
}
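A matching Java sketch for the PUT, again with a placeholder host, port, and path prefix; it simply sends the JSON table description above as the request body.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CreateRawevents {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port, and path prefix; user.name is an assumption for unsecured setups
        URL url = new URL("http://templetonhost:50111/templeton/v1/ddl/database/default/table/rawevents"
                          + "?user.name=alice");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // The table definition from the slide: two string columns, partitioned by ds
        String body = "{\"columns\": [{\"name\": \"url\", \"type\": \"string\"},"
                    + " {\"name\": \"user\", \"type\": \"string\"}],"
                    + " \"partitionedBy\": [{\"name\": \"ds\", \"type\": \"string\"}]}";
        OutputStream out = conn.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();

        // Expect a JSON reply naming the new table, e.g. {"table":"rawevents","database":"default"}
        System.out.println("HTTP " + conn.getResponseCode());
    }
}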




Templeton - REST API
• REST endpoints: databases, tables, partitions, columns, table properties
• PUT to create/update, GET to list or describe, DELETE to drop
                                Describe table “rawevents”




GET http://…/v1/ddl/database/default/table/rawevents

Response:
{
  "columns": [{"name": "url", "type": "string"},
              {"name": "user", "type": "string"}],
  "database": "default",
  "table": "rawevents"
}
• Included in HDP
• Not yet checked in, but you can find the code on Apache’s JIRA HCATALOG-182
Reading and Writing Data in Parallel
• Use Case: Users want
   – to read and write records in parallel between Hadoop and their parallel system
   – driven by their system
   – in a language independent way
   – without needing to understand Hadoop’s file formats
• Example: an MPP data store wants to read data out of Hadoop as
  HCatRecords for its parallel jobs
• What exists today
   – webhdfs
        –   Language independent
        –   Can move data in parallel
        –   Driven from the user side
         –   Moves only bytes, no understanding of file format (see the Java sketch after this list)
   – Sqoop
        –   Can move data in parallel
        –   Understands data format
        –   Driven from Hadoop side
        –   Requires connector or JDBC
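To make the webhdfs row above concrete, here is a rough Java sketch of pulling raw bytes over webhdfs. The namenode host, port, file path, and user are placeholders, and the client is left to interpret whatever file format the bytes happen to be in, which is exactly the limitation noted in the list.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsRead {
    public static void main(String[] args) throws Exception {
        // Placeholder namenode host/port and file path; user.name works on unsecured clusters
        URL url = new URL("http://namenode:50070/webhdfs/v1/user/alice/rawevents/part-00000"
                          + "?op=OPEN&user.name=alice");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true);   // the namenode redirects the read to a datanode

        // webhdfs hands back raw bytes; the caller has to understand the file format itself
        InputStream in = conn.getInputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            System.out.write(buf, 0, n);
        }
        in.close();
    }
}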



HCatReader and HCatWriter

[Diagram: the master calls getHCatReader against HCatalog and gets back an HCatReader plus the set of input splits; each slave then calls read for its split and receives an Iterator<HCatRecord> over the data in HDFS.]

                     Right now all in Java, needs to be REST
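A rough Java sketch of what this looks like from the master and slave sides, written along the lines of the reader half of HCatalog's data-transfer API (org.apache.hcatalog.data.transfer). The class and method names follow that API, but treat them as illustrative; the details may differ from the version you have, and the table name is a placeholder.

import java.util.Iterator;
import java.util.Map;

import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.transfer.DataTransferFactory;
import org.apache.hcatalog.data.transfer.HCatReader;
import org.apache.hcatalog.data.transfer.ReadEntity;
import org.apache.hcatalog.data.transfer.ReaderContext;

public class ParallelReadSketch {

    // Master side: ask HCatalog for a reader on the table and get back a
    // context that describes the input splits.
    ReaderContext prepareOnMaster(Map<String, String> config) throws Exception {
        ReadEntity entity = new ReadEntity.Builder()
                .withDatabase("default")
                .withTable("rawevents")   // placeholder table name
                .build();
        HCatReader masterReader = DataTransferFactory.getHCatReader(entity, config);
        return masterReader.prepareRead();   // serialize this and ship it to the slaves
    }

    // Slave side: rebuild a reader for one split and iterate over its records.
    void readOnSlave(ReaderContext context, int splitNumber) throws Exception {
        HCatReader reader = DataTransferFactory.getHCatReader(context, splitNumber);
        Iterator<HCatRecord> records = reader.read();
        while (records.hasNext()) {
            HCatRecord record = records.next();
            // hand the record off to the external system's own processing
        }
    }
}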
Storing Semi-/Unstructured Data

Table Users                        File Users
 Name    Zip                       {"name":"alice","zip":"93201"}
 Alice   93201                     {"name":"bob","zip":"76331"}
 Bob     76331                     {"name":"cindy"}
                                   {"zip":"87890"}

SQL against the table:
  select name, zip
  from users;

Pig against the file:
  A = load 'Users' as (name:chararray, zip:chararray);
  B = foreach A generate name, zip;




Storing Semi-/Unstructured Data

Table Users                        File Users
 Name    Zip                       {"name":"alice","zip":"93201"}
 Alice   93201                     {"name":"bob","zip":"76331"}
 Bob     76331                     {"name":"cindy"}
                                   {"zip":"87890"}

Pig with a declared schema:
  A = load 'Users' as (name:chararray, zip:chararray);
  B = foreach A generate name, zip;

SQL against the table:
  select name, zip
  from users;

Pig with no declared schema:
  A = load 'Users';
  B = foreach A generate name, zip;


Storing Semi-/Unstructured Data

Table Users                        File Users
 Name    Zip                       {"name":"alice","zip":"93201"}
 Alice   93201                     {"name":"bob","zip":"76331"}
 Bob     76331                     {"name":"cindy"}
                                   {"zip":"87890"}

Pig with a declared schema:
  A = load 'Users' as (name:chararray, zip:chararray);
  B = foreach A generate name, zip;

SQL against the table:
  select name, zip
  from users;

Pig with no declared schema:
  A = load 'Users';
  B = foreach A generate name, zip;




Hive ODBC/JDBC Today

[Diagram: JDBC and ODBC clients connect to a Hive Server, which runs queries in Hadoop.]

• JDBC client issue: you have to have Hive code on the client
• ODBC client issue: the open source version is not easy to use
• Hive Server issues:
  – Not concurrent
  – Not secure
  – Not scalable
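For reference, this is roughly what a client looks like against the original Hive server over JDBC; the host, port, and query are placeholders, while the driver class and URL form are the standard Hive ones. The client needs the Hive and Hadoop jars on its classpath, which is the "Hive code on the client" issue noted above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcClient {
    public static void main(String[] args) throws Exception {
        // The original HiveServer JDBC driver; requires Hive (and Hadoop) jars on the client classpath
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection("jdbc:hive://hiveserver:10000/default", "", "");

        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("select name, zip from users");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}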



ODBC/JDBC Proposal

[Diagram: JDBC and ODBC clients connect to a REST Server, which runs jobs in Hadoop.]

• Provide robust open source JDBC and ODBC implementations
• REST Server:
  – Spawns the job inside the cluster
  – Runs the job as the submitting user
  – Works with security
  – Scaling web services is well understood




Questions






Editor's Notes

  • Storing Semi-/Unstructured Data (the three slides above): SQL and traditional relational tables are focused on data warehousing, with consistently structured data (i.e. every tuple is the same). Much of the strength of Pig and Hadoop is the ability to process vast amounts of semi-/unstructured data. With HCatalog we have made it easier for Pig and MapReduce users to interact with data in the data warehouse; we need to make it go the other way as well. The good news is that most of the pieces are in place, we just need to tie a few things together. Observation: much of this semi-/unstructured data records its own structure (Protocol Buffers, Thrift, Avro, JSON, etc.).
  • Hive ODBC/JDBC Today: "Not concurrent" means it runs one query at a time; "Not secure" means it runs as the Hive user.