Your SlideShare is downloading. ×
0
Apache HCatalog
● What is it ?
● How does it work ?
● Interfaces
● Architecture
● Example
www.semtech-solutions.co.nz info...
HCatalog – What is it ?
● A Hive metastore interface set
● Shared schema and data types for Hadoop tools
● Rest interface ...
HCatalog – How does it work ?
● Pig
– HCatLoader + HCatStorer interface
● Map Reduce
– HCatInputFormat + HCatOutputFormat ...
HCatalog – Interfaces
● Interface via
– Pig
– Map Reduce
– Hive
– Streaming
● Access data via
– Orc file
– RC file
– Text ...
HCatalog – Interfaces
www.semtech-solutions.co.nz info@semtech-solutions.co.nz
HCatalog – Architecture
www.semtech-solutions.co.nz info@semtech-solutions.co.nz
HCatalog – Example
A data flow example from hive.apache.org
First Joe in data acquisition uses distcp to get data onto the...
Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project...
Upcoming SlideShare
Loading in...5
×

An introduction to Apache HCatalog

2,314

Published on

An introduction to Apache HCatalog, what is it ?
Why is it useful and how can it help Pig, Hive and
MapReduce users on Hadoop share data ?

Published in: Technology, Business

Transcript of "An introduction to Apache HCatalog"

  1. 1. Apache HCatalog ● What is it ? ● How does it work ? ● Interfaces ● Architecture ● Example www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  2. 2. HCatalog – What is it ? ● A Hive metastore interface set ● Shared schema and data types for Hadoop tools ● Rest interface for external data access ● Assists inter operability between – Pig, Hive and Map Reduce ● Table abstraction of data storage ● Will provide data availability notifications www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  3. 3. HCatalog – How does it work ? ● Pig – HCatLoader + HCatStorer interface ● Map Reduce – HCatInputFormat + HCatOutputFormat interface ● Hive – No interface necessary – Direct access to meta data ● Notifications when data available www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  4. 4. HCatalog – Interfaces ● Interface via – Pig – Map Reduce – Hive – Streaming ● Access data via – Orc file – RC file – Text file – Sequence file – Custom format www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  5. 5. HCatalog – Interfaces www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  6. 6. HCatalog – Architecture www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  7. 7. HCatalog – Example A data flow example from hive.apache.org First Joe in data acquisition uses distcp to get data onto the grid. hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data hcat "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'" Second Sally in data processing uses Pig to cleanse and prepare the data. Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS. A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …); B = filter A by bot_finder(zeta) = 0; … store Z into 'data/processedevents/20100819/data'; With HCatalog, HCatalog will send a JMS message that data is available. The Pig job can then be started. A = load 'rawevents' using HCatLoader(); B = filter A by date = '20100819' and by bot_finder(zeta) = 0; … store Z into 'processedevents' using HcatStorer("date=20100819"); Note that the pig job refers to the data by name rawevents rather than a location Now access the data via Hive QL select advertiser_id, count(clicks) from processedevents where date = ‘20100819’ group by advertiser_id; www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  8. 8. Contact Us ● Feel free to contact us at – www.semtech-solutions.co.nz – info@semtech-solutions.co.nz ● We offer IT project consultancy ● We are happy to hear about your problems ● You can just pay for those hours that you need ● To solve your problems
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×