An introduction to Apache HCatalog

An introduction to Apache HCatalog: what is it, why is it useful,
and how can it help Pig, Hive and MapReduce users on Hadoop
share data?

Transcript

  • 1. Apache HCatalog
      ● What is it?
      ● How does it work?
      ● Interfaces
      ● Architecture
      ● Example
    www.semtech-solutions.co.nz info@semtech-solutions.co.nz
  • 2. HCatalog – What is it?
      ● A set of interfaces to the Hive metastore
      ● Shared schema and data types for Hadoop tools
      ● REST interface for external data access
      ● Assists interoperability between
          – Pig, Hive and MapReduce
      ● Table abstraction of data storage
      ● Will provide data availability notifications
  • 3. HCatalog – How does it work?
      ● Pig
          – HCatLoader + HCatStorer interface
      ● MapReduce
          – HCatInputFormat + HCatOutputFormat interface
      ● Hive
          – No interface necessary
          – Direct access to metadata
      ● Notifications when data is available
  • 4. HCatalog – Interfaces
      ● Interface via
          – Pig
          – MapReduce
          – Hive
          – Streaming
      ● Access data via
          – ORC file
          – RCFile
          – Text file
          – SequenceFile
          – Custom format
  • 5. HCatalog – Interfaces
  • 6. HCatalog – Architecture
  • 7. HCatalog – Example
    A data flow example from hive.apache.org.

    First, Joe in data acquisition uses distcp to get data onto the grid, then registers it as a new partition:

        hadoop distcp file:///file.dat hdfs://data/rawevents/20100819/data
        hcat -e "alter table rawevents add partition (ds='20100819') location 'hdfs://data/rawevents/20100819/data'"

    Second, Sally in data processing uses Pig to cleanse and prepare the data. Without HCatalog, Sally must be manually informed by Joe when data is available, or poll HDFS:

        A = load '/data/rawevents/20100819/data' as (alpha:int, beta:chararray, …);
        B = filter A by bot_finder(zeta) == 0;
        …
        store Z into 'data/processedevents/20100819/data';

    With HCatalog, a JMS message is sent when the data is available, and the Pig job can then be started:

        A = load 'rawevents' using HCatLoader();
        B = filter A by date == '20100819' and bot_finder(zeta) == 0;
        …
        store Z into 'processedevents' using HCatStorer('date=20100819');

    Note that the Pig job refers to the data by table name (rawevents) rather than by location.

    The processed data can now be accessed via Hive QL:

        select advertiser_id, count(clicks) from processedevents where date = '20100819' group by advertiser_id;
  • 8. Contact Us
      ● Feel free to contact us at
          – www.semtech-solutions.co.nz
          – info@semtech-solutions.co.nz
      ● We offer IT project consultancy
      ● We are happy to hear about your problems
      ● You pay only for the hours you need to solve your problems
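
The two ideas behind the slide 7 example, referring to data by table name rather than by path and being notified when a partition arrives, can be illustrated with a toy model. This is a minimal sketch only: ToyCatalog, subscribe, add_partition, locate and start_cleansing_job are hypothetical names invented for illustration, not part of the HCatalog API, and a plain Python callback stands in for the JMS message.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyCatalog:
    """Hypothetical in-memory stand-in for a metastore (not the HCatalog API)."""

    # (table name, partition value) -> storage location
    partitions: Dict[Tuple[str, str], str] = field(default_factory=dict)
    # table name -> callbacks to run when a partition is added
    listeners: Dict[str, List[Callable[[str, str], None]]] = field(default_factory=dict)

    def subscribe(self, table: str, callback: Callable[[str, str], None]) -> None:
        """Register interest in new partitions of a table."""
        self.listeners.setdefault(table, []).append(callback)

    def add_partition(self, table: str, value: str, location: str) -> None:
        """Record where a partition lives, then notify subscribers."""
        self.partitions[(table, value)] = location
        for callback in self.listeners.get(table, []):
            callback(table, value)

    def locate(self, table: str, value: str) -> str:
        """Resolve a table name and partition value to a storage location."""
        return self.partitions[(table, value)]


catalog = ToyCatalog()
started = []  # records the jobs that were triggered


def start_cleansing_job(table: str, value: str) -> None:
    # Sally's job learns the location from the catalog, not from Joe.
    started.append((table, value, catalog.locate(table, value)))


# Sally subscribes once; Joe later registers the partition he copied in.
catalog.subscribe("rawevents", start_cleansing_job)
catalog.add_partition(
    "rawevents", "20100819", "hdfs://data/rawevents/20100819/data"
)
```

In this sketch the consumer never hard-codes a path: if the partition were registered at a different location, start_cleansing_job would pick up the new location unchanged.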