HUG Meetup 2013: HCatalog / Hive Data Out

HCatalog/Hive DataOut
Bay Area Hadoop User Group Meetup
May 15, 2013

Moving Data Out of Hadoop Clusters Today
Yahoo! Presentation, Confidential
[Diagram, three columns: Clients | Multi-tenant Hadoop Clusters | Managed Data-loading. Components: client's machine, HTTP client, HTTP server, launcher/gateway, filers, HDFS Proxy (1), HTTP proxy, custom proxy, HDFS, M/R on YARN, DistCp. Links: Hadoop RPC, SSH, HTTPS.]
(1) Similar to HttpFS Gateway/Hoop in Hadoop 2.0 – Hadoop HDFS over HTTP

Typical Data Out Scenario
§  Data (to be pulled out) is stored in a predefined directory structure as files
§  Client determines (through a custom interface) if a particular data feed of interest is committed or not
§  If committed, client gets the list of files first, and then pulls them out (file-by-file) through HDFSProxy
[Diagram components: HDFS, HDFSProxy, custom interface, cURL (data copy), Filer (delimited files), SQLLDR, Oracle DB (Ext. Table, Temp Table, INSERT, Main Table)]
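The three steps above (check for commit, list files, pull file by file) can be sketched as follows. This is only an illustration: the slides do not describe the custom interface, so `FeedSource`, `pullIfCommitted`, `fakeSource`, and the `/feeds/example` path are hypothetical names, and an in-memory fake stands in for HDFSProxy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FeedPull {
    /** Hypothetical stand-in for the custom interface and HDFSProxy; not a real API. */
    interface FeedSource {
        boolean isCommitted(String feed);     // is the feed complete and safe to read?
        List<String> listFiles(String feed);  // files that make up a committed feed
        String fetch(String path);            // pull one file (cURL through HDFSProxy in the slides)
    }

    /** Returns the contents of every file in the feed, or an empty list if it is not committed. */
    static List<String> pullIfCommitted(FeedSource source, String feed) {
        List<String> contents = new ArrayList<>();
        if (!source.isCommitted(feed)) {
            return contents; // the caller is expected to poll again later
        }
        for (String path : source.listFiles(feed)) {
            contents.add(source.fetch(path)); // file-by-file pull
        }
        return contents;
    }

    /** In-memory fake used only to exercise the flow. */
    static FeedSource fakeSource(final boolean committed) {
        return new FeedSource() {
            public boolean isCommitted(String feed) { return committed; }
            public List<String> listFiles(String feed) {
                return Arrays.asList(feed + "/part-00000", feed + "/part-00001");
            }
            public String fetch(String path) { return "rows of " + path; }
        };
    }

    public static void main(String[] args) {
        System.out.println(pullIfCommitted(fakeSource(false), "/feeds/example"));
        System.out.println(pullIfCommitted(fakeSource(true), "/feeds/example"));
    }
}
```

The polling implied by the empty-list return is the data discovery cost that the deck goes on to list under Cons.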

Pros and Cons of the Data Out Approach
Pros
§  Security of DB passwords – password not stored in the grid
§  Compression – cross-colo network bandwidth is expensive and compression is not possible with JDBC drivers
§  Encryption – data out of the grids has to be encrypted as it may be cross-colo
§  ACLs – DB hosts are not accessible from grid nodes, hence the need for the proxy
Cons
§  Directory structure – has to be predefined and known to downstream consumers of data
§  Data discovery – availability of data for consumption requires polling or other hooks
§  Overhead – use of DONE files
§  Maintenance – separate schema files and schema file formats
The introduction of HCatalog and JMS notifications solves these problems

Hadoop – One Platform, Many Tools
[Diagram components: Metastore, HDFS, Hive (Metastore Client, SerDe, InputFormat/OutputFormat), MapReduce (InputFormat/OutputFormat), Pig (Load/Store)]
Source: Alan Gates on HCatalog, Hadoop Summit, 2012
MapReduce/Pig
§  Pipelines
§  Iterative Processing
§  Research
Data Warehouse
Hive
§  BI Tools
§  Analysis

HCatalog – Opening Up the Hive Metastore
[Diagram components: Metastore, HDFS, Hive (Metastore Client, SerDe, InputFormat/OutputFormat), MapReduce (HCatInputFormat/HCatOutputFormat), Pig (HCatLoader/HCatStorer), External System (REST)]

HCatalog Value Proposition
§  Centralized metadata service for Hadoop
§  Facilitates interoperability among tools such as Pig, Hive, M/R, allows for sharing of data
§  Provides DB-like abstractions (databases, tables, and partitions) and supports schema evolution
§  Abstracts out the file storage format and data location

HiveServer2 with HCatalog
[Diagram components: HDFS, HiveServer2 (ODBC/JDBC), Data Out Client (JDBC), HCatalog Server (Metastore), Messaging Service (ActiveMQ), HiveServer2 Jobs, Hive Jobs (CLI), HCat Jobs (Pig, M/R). Links: doAs(user), JMS notification (Producer), Notification (Consumer).]

Issues Solved
✔ Directory structure – has to be predefined and known to downstream consumers of data
✔ Data discovery – availability of data for consumption requires polling or other hooks
✔ Overhead – Use of DONE files
✔ Maintenance – Separate schema files and schema file formats

DataOut Motivation
§  Many ways to load and manage data on the grid
    §  HCatalog/Hive
    §  Pig
    §  Hadoop MR
    §  Sqoop
    §  GDM
§  Fewer ways of getting data off the cluster
    §  Sqoop
    §  HDFSProxy
    §  HDFS copy to local file system
    §  distcp between clusters
§  Challenges
    §  Underlying file format
    §  Size of data
    §  SLA

DataOut Overview
§  What is DataOut?
    §  Efficient method of moving data off the grid
    §  API exposes a programmatic interface
§  What are the advantages of DataOut?
    §  API based on well-known JDBC API
    §  Works with HCatalog/Hive
    §  Agnostic to the underlying storage format
    §  Parts of the whole data can be pulled in parallel
§  What are the limitations of DataOut?
    §  Queries must be SELECT * FROM type queries

DataOut Deployment
[Diagram: DataOut Client exchanging Query and Data with a row of HS2 instances (HS2 HS2 … HS2 HS2) on top of HDFS]

How DataOut Works
[Diagram: HiveServer2 with a master (M); three HiveSplits, each paired with a slave (S) and an FS/DB. Steps shown: Execute Query, Prepare Splits, Fetch Splits.]
Legend: M – Master, S – Slave, FS/DB – Filesystem/Database
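The master/slave flow in this slide (prepare splits, fetch them in parallel, then wait for all fetches) can be sketched with a plain thread pool. This is a minimal sketch under stated assumptions, not the DataOut implementation: `Split` is a hypothetical stand-in for `HiveSplit` that just holds a row range, and `fetchSplit` only counts rows instead of talking to HiveServer2 or writing to a filesystem or database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SplitFanOut {
    /** Hypothetical stand-in for a HiveSplit: a half-open range of row ids [start, end). */
    static final class Split {
        final int start;
        final int end;
        Split(int start, int end) { this.start = start; this.end = end; }
    }

    /** Master step: carve totalRows into pieces of at most splitSize rows. */
    static List<Split> prepareSplits(int totalRows, int splitSize) {
        List<Split> splits = new ArrayList<>();
        for (int start = 0; start < totalRows; start += splitSize) {
            splits.add(new Split(start, Math.min(start + splitSize, totalRows)));
        }
        return splits;
    }

    /** Slave step: "fetch" one split; here it only reports how many rows it covers. */
    static int fetchSplit(Split split) {
        return split.end - split.start;
    }

    /** Hand each split to a worker, then block until every fetch has finished. */
    static int fetchAll(List<Split> splits, int workers) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (final Split split : splits) {
                pending.add(pool.submit(() -> fetchSplit(split)));
            }
            int rows = 0;
            for (Future<Integer> f : pending) {
                rows += f.get(); // synchronize on the fetch jobs
            }
            return rows;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Split> splits = prepareSplits(1000, 300);
        System.out.println(splits.size() + " splits, " + fetchAll(splits, 3) + " rows fetched");
    }
}
```

In a real deployment the workers would more likely be separate processes or jobs, so the thread pool here only illustrates the fan-out and the final synchronization.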

Code to Prepare the HiveSplits
DataOut dataout = new DataOut();
HiveConnection c = dataout.getConnection();
Statement s = c.createGenerateSplitStatement();
ResultSet rs = s.executeQuery(sql);
while (rs.next()) {
    HiveSplit split = (HiveSplit) rs.getObject(1);
    /* Launch job to fetch the split data. */
}
/* Synchronize on fetch jobs. */
rs.close();
s.close();
c.close();

Code to Retrieve the HiveSplits
DataOut dataout = new DataOut();
HiveConnection c = dataout.getConnection();
PreparedStatement ps = c.prepareFetchSplitStatement(split);
ResultSet rs = ps.executeQuery();
while (rs.next()) {
    /* Process row data. */
}
rs.close();
ps.close();
c.close();
/* Communicate with master process. */

DataOut Demo

HS2 Performance – Single Client Connection

HS2 Performance – Five Concurrent Clients

HS2 Performance Summary
§  Throughput scales linearly
    §  Single client: 1GB: 60s, 5GB: 250s, 10GB: 500s
    §  Multiple clients: 1GB: 120s, 5GB: 600s, 10GB: 1200s
§  Throughput is affected by fetch size
    §  Sweet spot around ~200 rows
    §  Average row size may affect this number (pending further testing)
§  HiveServer2 is capable of handling multiple clients
    §  Throughput of 10GB in ~20 minutes with five client connections
    §  Drop-off in throughput is expected and reasonable
    §  5x increase in concurrent connections = 2x increase in transfer time
§  Goal of 50GB in 5min
    §  Achievable with ~10 HiveServer2 instances streaming data
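The "~10 instances" estimate can be checked against the single-client timings above with a little arithmetic. The sketch below rests on assumptions not stated in the slides: 1 GB is taken as 1000 MB, each HiveServer2 instance is assumed to sustain the single-client rate, and instances are assumed not to slow each other down. The class and method names are illustrative only.

```java
public class DataOutCapacity {
    /** Throughput in MB/s for moving sizeGb gigabytes in the given seconds (1 GB = 1000 MB). */
    static double throughputMbPerSec(double sizeGb, double seconds) {
        return sizeGb * 1000.0 / seconds;
    }

    /** Parallel streams needed to move targetGb in targetSeconds at perStreamMbPerSec each. */
    static int streamsNeeded(double targetGb, double targetSeconds, double perStreamMbPerSec) {
        double required = throughputMbPerSec(targetGb, targetSeconds);
        return (int) Math.ceil(required / perStreamMbPerSec);
    }

    public static void main(String[] args) {
        // Single-client timings from the summary: {size in GB, seconds}.
        double[][] singleClient = { {1, 60}, {5, 250}, {10, 500} };
        for (double[] run : singleClient) {
            double rate = throughputMbPerSec(run[0], run[1]);
            System.out.printf("%.0f GB in %.0f s = %.1f MB/s; streams for 50 GB in 5 min: %d%n",
                    run[0], run[1], rate, streamsNeeded(50, 300, rate));
        }
    }
}
```

Under these assumptions the single-client rates of roughly 17 to 20 MB/s call for 9 to 10 parallel streams, which is in line with the figure on the slide.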

