Future of HCatalog
Alan F. Gates
@alanfgates




Who Am I?
• HCatalog committer and mentor
• Co-founder of Hortonworks
• Lead for Pig, Hive, and HCatalog at Hortonworks
• Pig committer and PMC Member
• Member of Apache Software Foundation and Incubator PMC
• Author of Programming Pig from O’Reilly




Hadoop Ecosystem

[Diagram: MapReduce, Hive, and Pig all sit on top of HDFS. MapReduce reads and writes through InputFormat/OutputFormat; Hive through its SerDe and InputFormat/OutputFormat, with a Metastore Client talking to the Metastore; Pig through its Load/Store functions.]

Opening up Metadata to MR & Pig

[Diagram: the same stack with HCatalog added. MapReduce now reads and writes through HCatInputFormat/HCatOutputFormat, and Pig through HCatLoader/HCatStorer; both share Hive's Metastore Client, SerDe, and InputFormat/OutputFormat to reach the Metastore and HDFS.]
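
The MapReduce half of this picture looks roughly like the sketch below, based on the HCatalog 0.4-era Java API (org.apache.hcatalog.mapreduce). The table name "rawevents" is borrowed from the later slides, and the mapper, reducer, and output setup are elided.

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hcatalog.mapreduce.HCatInputFormat;
   import org.apache.hcatalog.mapreduce.InputJobInfo;

   public class ReadFromHCat {
     public static void main(String[] args) throws Exception {
       Job job = new Job(new Configuration(), "read-from-hcat");
       // Point the job at a table; schema and storage format come from
       // the metastore rather than being hardcoded in the job.
       HCatInputFormat.setInput(job,
           InputJobInfo.create("default", "rawevents", null /* no partition filter */));
       job.setInputFormatClass(HCatInputFormat.class);
       // ... set mapper/reducer and output as usual; map() receives HCatRecords
       job.waitForCompletion(true);
     }
   }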


Templeton - REST API
•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop

                                Get a list of all tables in the default database:




GET http://…/v1/ddl/database/default/table

Response from Hadoop/HCatalog:

{
    "tables": ["counted","processed"],
    "database": "default"
}
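
Any HTTP client can make the same call; below is a minimal Java sketch using HttpURLConnection. The host, port, and "/templeton" path prefix are placeholder assumptions, and authentication parameters are omitted.

   import java.io.BufferedReader;
   import java.io.InputStreamReader;
   import java.net.HttpURLConnection;
   import java.net.URL;

   public class ListTables {
     public static void main(String[] args) throws Exception {
       // Placeholder host/port and path prefix; point this at your Templeton server.
       URL url = new URL("http://localhost:50111/templeton/v1/ddl/database/default/table");
       HttpURLConnection conn = (HttpURLConnection) url.openConnection();
       conn.setRequestMethod("GET");
       BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
       for (String line = in.readLine(); line != null; line = in.readLine()) {
         System.out.println(line); // {"tables": [...], "database": "default"}
       }
       in.close();
     }
   }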




Templeton - REST API
•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop
                                Create new table “rawevents”

PUT http://…/v1/ddl/database/default/table/rawevents

{"columns": [{ "name": "url", "type": "string" },
             { "name": "user", "type": "string"}],
 "partitionedBy": [{ "name": "ds", "type": "string" }]}

Response from Hadoop/HCatalog:

{
    "table": "rawevents",
    "database": "default"
}
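
The create-table call is the same pattern with a PUT and a JSON body; again a minimal Java sketch with placeholder host, port, and path prefix, and authentication omitted.

   import java.io.OutputStream;
   import java.net.HttpURLConnection;
   import java.net.URL;

   public class CreateTable {
     public static void main(String[] args) throws Exception {
       String body = "{\"columns\": [{\"name\": \"url\", \"type\": \"string\"},"
           + " {\"name\": \"user\", \"type\": \"string\"}],"
           + " \"partitionedBy\": [{\"name\": \"ds\", \"type\": \"string\"}]}";
       // Placeholder host/port and path prefix.
       URL url = new URL(
           "http://localhost:50111/templeton/v1/ddl/database/default/table/rawevents");
       HttpURLConnection conn = (HttpURLConnection) url.openConnection();
       conn.setRequestMethod("PUT");
       conn.setRequestProperty("Content-Type", "application/json");
       conn.setDoOutput(true);
       OutputStream out = conn.getOutputStream();
       out.write(body.getBytes("UTF-8"));
       out.close();
       // Expect 200 and {"table": "rawevents", "database": "default"}
       System.out.println(conn.getResponseCode());
     }
   }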




Templeton - REST API
•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop
                                Describe table “rawevents”




GET http://…/v1/ddl/database/default/table/rawevents

Response from Hadoop/HCatalog:

{
    "columns": [{"name": "url","type": "string"},
                {"name": "user","type": "string"}],
    "database": "default",
    "table": "rawevents"
}
•  Included in HDP
•  Not yet checked in, but you can find the code on Apache’s JIRA HCATALOG-182
Reading and Writing Data in Parallel
•  Use Case: Users want
   –  to read and write records in parallel between Hadoop and their parallel system
   –  driven by their system
   –  in a language independent way
   –  without needing to understand Hadoop’s file formats
•  Example: an MPP data store wants to read data out of Hadoop as
   HCatRecords for its parallel jobs
•  What exists today
   –  webhdfs
         –  Language independent
         –  Can move data in parallel
         –  Driven from the user side
         –  Moves only bytes, no understanding of file format
   –  Sqoop
         –  Can move data in parallel
         –  Understands data format
         –  Driven from Hadoop side
         –  Requires connector or JDBC



HCatReader and HCatWriter

[Diagram: the master asks HCatalog for an HCatReader via getHCatReader and obtains input splits; each slave then calls read and consumes an Iterator<HCatRecord> backed by HDFS.]

Right now all in Java, needs to be REST.
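
A sketch of how the flow in the diagram maps onto Java. The class and method names follow HCatalog's data transfer design (org.apache.hcatalog.data.transfer), and the database and table names are borrowed from earlier slides; treat the details as illustrative rather than final.

   import java.util.HashMap;
   import java.util.Iterator;
   import org.apache.hcatalog.data.HCatRecord;
   import org.apache.hcatalog.data.transfer.DataTransferFactory;
   import org.apache.hcatalog.data.transfer.HCatReader;
   import org.apache.hcatalog.data.transfer.ReadEntity;
   import org.apache.hcatalog.data.transfer.ReaderContext;

   public class ParallelRead {
     // On the master: ask HCatalog how to split the read.
     static ReaderContext prepare() throws Exception {
       ReadEntity entity = new ReadEntity.Builder()
           .withDatabase("default").withTable("rawevents").build();
       HCatReader master = DataTransferFactory.getHCatReader(
           entity, new HashMap<String, String>());
       return master.prepareRead(); // serializable; ship it to the slaves
     }

     // On slave number 'slave': read that slave's share of the records.
     static void readSlice(ReaderContext cntxt, int slave) throws Exception {
       HCatReader reader = DataTransferFactory.getHCatReader(cntxt, slave);
       Iterator<HCatRecord> itr = reader.read();
       while (itr.hasNext()) {
         HCatRecord rec = itr.next();
         System.out.println(rec.get(0)); // first column of the record
       }
     }
   }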
Storing Semi-/Unstructured Data

Table Users:
  Name    Zip
  Alice   93201
  Bob     76331

File Users (JSON, one record per line):
  {"name":"alice","zip":"93201"}
  {"name":"bob","zip":"76331"}
  {"name":"cindy"}
  {"zip":"87890"}

SQL against the table:
  select name, zip
  from users;

Pig against the file:
  A = load 'Users' as
      (name:chararray, zip:chararray);
  B = foreach A generate name, zip;




Storing Semi-/Unstructured Data

Table Users:
  Name    Zip
  Alice   93201
  Bob     76331

File Users (JSON, one record per line):
  {"name":"alice","zip":"93201"}
  {"name":"bob","zip":"76331"}
  {"name":"cindy"}
  {"zip":"87890"}

SQL against the table:
  select name, zip
  from users;

Pig with an inline schema:
  A = load 'Users' as
      (name:chararray, zip:chararray);
  B = foreach A generate name, zip;

Pig without a schema:
  A = load 'Users';
  B = foreach A generate name, zip;


Hive ODBC/JDBC Today

[Diagram: JDBC and ODBC clients connect to a Hive Server that fronts Hadoop.]

• JDBC client issue: you have to have Hive code on the client
• Hive Server issues: not concurrent, not secure, not scalable
• ODBC client issue: the open source version is not easy to use
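
To make the client-side dependency concrete, here is a minimal sketch of a client using the original Hive JDBC driver; host and port are placeholders. The Hive jars and their dependencies must sit on the client classpath, which is exactly the issue called out above.

   import java.sql.Connection;
   import java.sql.DriverManager;
   import java.sql.ResultSet;
   import java.sql.Statement;

   public class HiveJdbcClient {
     public static void main(String[] args) throws Exception {
       // Pre-HiveServer2 driver class; requires Hive code on the client.
       Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
       Connection conn = DriverManager.getConnection(
           "jdbc:hive://localhost:10000/default", "", "");
       Statement stmt = conn.createStatement();
       ResultSet rs = stmt.executeQuery("select name, zip from users");
       while (rs.next()) {
         System.out.println(rs.getString(1) + "\t" + rs.getString(2));
       }
       conn.close();
     }
   }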



ODBC/JDBC Proposal

[Diagram: JDBC and ODBC clients connect to a REST Server that fronts Hadoop. Proposal: provide robust open source JDBC and ODBC implementations.]

• Spawns job inside cluster
• Runs job as submitting user
• Works with security
• Scaling web services well understood
Questions



