Jan 2012 HUG: HCatalog

HCatalog (and friends)

Sushanth Sowmyan
Committer, Apache HCatalog
sush@hortonworks.com
@khorgath

© Hortonworks Inc. 2011 Page 1

Let's think about data for a bit...

From Wikipedia:

Data (/ˈdeɪtə/ day-tə, /ˈdætə/ da-tə, or /ˈdɑːtə/ dah-tə)

Qualitative or quantitative attributes of a variable or set of
variables. Data are typically the results of measurements and
can be the basis of graphs, images, or observations of a set
of variables. Data are often viewed as the lowest level of
abstraction from which information and then knowledge are
derived. Raw data, i.e., unprocessed data, refers to a
collection of numbers, characters, images or other outputs
from devices that collect information to convert physical
quantities into symbols.

Architecting the Future of Big Data

So what is needed to make Data useful?

l  Arguably, tools to convert data into information.

Arguably also, knowledge about the data, so that the
l 

tools can then make use of the data in a meaningful sense,
to extract information from it.


So what are the characteristics of a Data Warehouse?

l  Data is present, organized, recorded, and catalogued.

l  Tools exist that are able to operate on the data.

So what do tools need to be able to operate on data?


Finding it

Photo credit :
© dkeats on flickr


Finding it

l  Knowing where data is.

l  Evolve : Knowing which data is where – “naming” data,

Evolve : Organization to support various data modeling
l 

concepts (table, partitions, columns, records)

l  Evolve : “done” semantics, existence semantics


Reading it

Photo credit :
© kylesteed on flickr

Reading it

•  Each tool having its own “storage space”, its own private world

•  Evolve : Abstracting away storage mechanism and having tools sit on top of file formats and mechanisms, so now, suddenly, tools have interoperability.

•  Evolve : Having a storage abstraction that adapts to existing storage mechanisms in an easy to develop manner


Who are the various actors in a data ecosystem?

l  Analyst – uses sql (hive) and/or jdbc-based tools

l  Programmer – cares about data transformation - uses Pig or M/R

Project owner - cares about amount of resources used, data
l 

portability, data connectors

Ops - needs to manage data storage, cluster management, need
l 

to control data expiry, replication, import and export.


(stealing slide from Alan's TriHUG talk)


Also :

People who help aforementioned people:

•  Tool Writer – wants abstractions to deal with variances, wants to be able to store and retrieve relevant metadata and data, so they can focus on their user

•  Storage subsystem writer – wants standardization so that they can be used by other actors.


What do they all want?

l  Need it Working – Correctness

l  Speed, Efficiency

l  Interoperability, Convenience.


Did somebody say Interoperability?


Making Your Structured Data
Available to the MapReduce Engine

MapReduce, Pig, Hive

HCatalog

HDFS, HBase, MPP Store

•  Users can query data with Pig, Hive, or custom MapReduce jobs
•  Standard HDFS formats available Q1 2012
•  HBase data by early Q2 2012


HCatalog underlying architecture

HCatLoader, HCatStorer

HCatInputFormat, HCatOutputFormat, CLI, Notification

Hive MetaStore Client

Generated Thrift Client

Hive MetaStore, RDBMS


Problem: Need to Know Where Data Is

PIG, HIVE, MapReduce

hdfs:///user/wilber/projectA
hdfs:///user/hive/warehouse/queenB
hdfs:///user/mamouth/foul/dataset1

Storage


Solution: Register Through HCatalog

PIG, HIVE, MapReduce

ProjectA
Foul1

HCatalog

Storage


Problem: Data in variety of formats

•  Data files may be organized in different formats
•  Data files may contain different formats in different partitions

Storage (HDFS, HBase, etc.)


Solution: HCat provides common abstraction

Hadoop Application
•  Registered Data w/ Schema
•  HCat normalizes data to application

HCatalog

Storage
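
To make the "normalizes data to application" idea concrete, here is a minimal sketch in which two partitions of one table are stored in different formats, yet the application sees uniformly typed rows. This illustrates the concept only, not how HCatalog implements it; the schema, the format names (`csv`, `jsonl`), and the sample data are hypothetical.

```python
import csv
import io
import json

# Hypothetical registered schema: column name -> type the application expects.
SCHEMA = [("user", str), ("clicks", int)]


def read_csv(raw):
    # Each row becomes a dict keyed by the CSV header; values are strings.
    return list(csv.DictReader(io.StringIO(raw)))


def read_jsonl(raw):
    # One JSON object per non-empty line.
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


# Per-format readers; adding a format means adding one entry here.
READERS = {"csv": read_csv, "jsonl": read_jsonl}


def read_table(partitions):
    """Yield rows as tuples typed per SCHEMA, whatever format each partition uses."""
    for fmt, raw in partitions:
        for record in READERS[fmt](raw):
            yield tuple(cast(record[name]) for name, cast in SCHEMA)


# Two partitions of the same table, stored differently (sample data is made up).
PARTITIONS = [
    ("csv", "user,clicks\nalice,3\nbob,5\n"),
    ("jsonl", '{"user": "carol", "clicks": 7}\n'),
]

rows = list(read_table(PARTITIONS))
print(rows)
```

The application code only ever touches `read_table`; which reader runs for which partition is decided by the registered metadata, not by the tool.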


Getting Involved

Incubator site : http://incubator.apache.org/hcatalog

User list: hcatalog-user@incubator.apache.org

Dev list: hcatalog-dev@incubator.apache.org


TODO

HCATALOG-8 : HCatalog needs a logo

HBase integration, trying to nail down a better table metaphor

Hive integration – interoperability between the notion of
StorageDriver and StorageHandler, project dependency
management

HCATALOG-182 : Improve the “and friends” bit.


Waitaminnit... what was that
about “friends” ?


Templeton
A Webservices API
for Hadoop

Photo credit :
© PKMousie on flickr

Templeton: ISV Front-door for Hadoop

•  Insulation from interface changes release to release
•  Opens the door to languages other than Java
•  Thin clients through webservices vs forced fat-clients in gateway

•  Still prototyping! But see a common need.


Templeton Specific Support

Move data directly into/out-of HDFS through WebHDFS

Webservice calls to HCatalog
–  Register table relationships for data (e.g., createTable, createDatabase)
–  Adjust tables (e.g., AlterTable)
–  Look at statistics (e.g., ShowTable)

Webservice calls to start work
–  MapReduce, Pig, Hive
–  Poll for job status
–  Notification URL when job completes (optional)

Stateless Server
–  Horizontally scale for load
–  Configurable for HA
–  Currently requires ZooKeeper to track job status info
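
The "start work, then poll for job status" flow could look roughly like the sketch below from a thin client's point of view. Since Templeton was still being prototyped, nothing here reflects its actual endpoints: the status values, the job id, and the `fetch_status` callable (which stands in for an HTTP call) are all assumptions, and a fake server replaces the network so the example runs on its own.

```python
import time


def wait_for_job(job_id, fetch_status, poll_seconds=1.0, max_polls=60, sleep=time.sleep):
    """Poll until the job reaches a terminal state, then return that state.

    fetch_status is any callable taking a job id and returning a state string;
    in a real client it would wrap a webservice request (hypothetical here).
    """
    for _ in range(max_polls):
        state = fetch_status(job_id)
        if state in ("SUCCEEDED", "FAILED"):  # assumed terminal state names
            return state
        sleep(poll_seconds)
    raise TimeoutError("job %s did not finish after %d polls" % (job_id, max_polls))


def make_fake_server(states):
    """Return a fetch_status stub that replays the given states in order."""
    remaining = iter(states)

    def fetch_status(job_id):
        return next(remaining)

    return fetch_status


# Simulated run: the job is pending, then running, then done.
fake = make_fake_server(["PENDING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_job("job_0001", fake, sleep=lambda seconds: None)
print(final_state)
```

Because the server is stateless, any instance behind a load balancer could answer each poll; the optional notification URL mentioned above would let a client skip polling altogether.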


ANY QUESTIONS ?

