2013 feb 20_thug_h_catalog

Toronto Hadoop User Group
February 20, 2013

Apache HCatalog Overview
& Next-gen Hive
Adam Muise

© Hortonworks Inc. 2012 Page 1

HCatalog


Apache HCatalog
• Incubator Project at Apache.org
• Good adoption
• Will likely merge with Hive project as it adds
important functionality to metastore
• Allows for a “schema-on-read” approach to Big Data
in HDFS
• Seat of a lot of innovation in Data Management
• Used by platform partners to enhance Hadoop
integration
• Will likely be used to enhance existing Data
Management products in the Enterprise & create new
products


Great Tooling Options
MapReduce
•  Early adopters
•  Non-relational algorithms
•  Performance sensitive applications

Pig
•  ETL
•  Data modeling
•  Iterative algorithms

Hive
•  Analysis
•  Connectors to BI tools

Strength: Right tool for right application
Weakness: Hard to share their data

HCatalog Changes the Game
Apache HCatalog provides flexible metadata
services across tools and external access
•  Consistency of metadata and data models across the
Enterprise (MapReduce, Pig, Hbase, Hive, External Systems)
•  Accessibility: share data as tables in and out of HDFS
•  Availability: enables flexible, thin-client access via REST API

HCatalog Shared table
and schema
management
•  Raw Hadoop data Table access opens the
•  Inconsistent, unknown Aligned metadata platform
•  Tool specific access REST API


Options == Complexity
Feature MapReduce Pig Hive
Record format Key value pairs Tuple Record
Data model User defined int, float, string, int, float, string,
bytes, maps, maps, structs, lists
tuples, bags
Schema Encoded in app Declared in script Read from
or read by loader metadata
Data location Encoded in app Declared in script Read from
metadata
Data format Encoded in app Declared in script Read from
metadata

•  Pig and MR users need to know a lot to write their apps
•  When data schema, location, or format change Pig and MR apps must be
rewritten, retested, and redeployed
•  Hive users have to load data from Pig/MR users to have access to it
© Hortonworks 2012
Page 6

Hcatalog == Simple, Consistent
Feature MapReduce + Pig + HCatalog Hive
HCatalog
Record format Record Tuple Record
Data model int, float, string, int, float, string, int, float, string,
maps, structs, lists bytes, maps, maps, structs, lists
tuples, bags
Schema Read from Read from Read from
metadata metadata metadata
Data location Read from Read from Read from
Data format Read from Read from Read from

•  Pig/MR users can read schema from metadata
•  Pig/MR users are insulated from schema, location, and format changes
•  All users have access to other users’ data as soon as it is committed
© Hortonworks 2012
Page 7

Hadoop Ecosystem

Hive Pig MapReduce
(SQL) (scripting) (Java)

Interface: Interface: Interface:
SQL SerDe Load/Store

DML

Input/ Input/ Input/
OutputFormat OutputFormat OutputFormat
DDL

metastore dn1 dn2 dn3 . . .
- tables
- partitions
- ﬁles . . . . . .
- types

. . . . dnN

HDFS

© Hortonworks Inc. 2012

Opening up Metadata to MR & Pig

Hive Pig MapReduce
(SQL) (scripting) (Java)

HCat Metadata layer
Interface: Interface:
SQL HCatLoad/Store

Interface:
SerDe

HCatInput/OutputFormat

metastore dn1 dn2 dn3 . . .
- tables
- partitions
- ﬁles . . . . . .
- types

. . . . dnN

HDFS

© Hortonworks Inc. 2012

Pig Example
Assume you want to count how many time each of your users went to each of your URLs

raw = load '/data/rawevents/20120530' as (url, user);
botless = filter raw by myudfs.NotABot(user);
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(url, user), COUNT(botless);
No need to know No need to
store cntd into '/data/counted/20120530';
file location declare schema
Using HCatalog:

raw = load 'rawevents' using HCatLoader();
botless = filter raw by myudfs.NotABot(user) and ds == '20120530';
grpd = group botless by (url, user);
cntd = foreach grpd generate flatten(url, user), COUNT(botless);
store cntd into 'counted' using HCatStorer();
Partition filter

© Hortonworks 2012
Page 10

Working with HCatalog in MapReduce
Setting input: table to read from
HCatInputFormat.setInput(job,
InputJobInfo.create(dbname, tableName, filter));

database to read specify which
Setting output:
from partitions to read
HCatOutputFormat.setOutput(job,
OutputJobInfo.create(dbname, tableName, partitionSpec));

specify which
Obtaining schema: partition to write
schema = HCatInputFormat.getOutputSchema();
access fields by
Key is unused, Value is HCatRecord: name
String url = value.get("url", schema);
output.set("cnt", schema, cnt);

© Hortonworks 2012
Page 11

Managing Metadata
•  If you are a Hive user, you can use your Hive metastore with no modifications

•  If not, you can use the HCatalog command line tool to issue Hive DDL (Data
Definition Language) commands:
> /usr/bin/hcat -e ”create table rawevents (url string,
user string) partitioned by (ds string);";

•  Starting in Pig 0.11, you will be able to issue DDL commands from Pig

© Hortonworks 2012
Page 12

Templeton - REST API
•  REST endpoints: databases, tables, partitions, columns, table properties
•  PUT to create/update, GET to list or describe, DELETE to drop

Get a list of all tables in the default database:

Create new table “rawevents”
Hadoop/
HCatalog
Describe table “rawevents”

© Hortonworks 2012
Page 13

HCatalog is in Hortonworks Data Platform…

OPERATIONAL
SERVICES
DATA

SERVICES

AMBARI
FLUME
PIG
HIVE

HBASE

SQOOP
HCATALOG

OOZIE

WEBHDFS
MAP
REDUCE

HADOOP
CORE

HDFS
YARN
(in
2.0)

Enterprise Readiness
PLATFORM
SERVICES
High Availability, Disaster Recovery, Snapshots, Security, etc…

HORTONWORKS

DATA
PLATFORM
(HDP)

OS
Cloud
VM
Appliance


Key 2013 “Enterprise Hadoop” Initiatives

Invest In:
Hive / “Stinger”
Interactive Query
– Platform Services
Ambari HBase – DR, Snapshot, …
Manage & Operate Online Data
OPERATIONAL
DATA

SERVICES
SERVICES

HADOOP
CORE

– Data Services
PLATFORM
SERVICES
– In support of Refine,
“Knox” HORTONWORKS

“Herd” Explore, Enrich
Secure Access
DATA
PLATFORM
(HDP)
Data Integration

– Operational Services
“Continuum” – Manageability,
Biz Continuity
Security, …


Hive/Stinger: Interactive Query
Near-realtime queries in good old Hive…


Top BI Vendors Support Hive Today


Goal: Enhance Hive for BI Use Cases
Enterprise Reports Parameterized Reports

Dashboard / Scorecard

Data Mining Visualization

More SQL
&
Better Performance

Batch Interactive


Differing Needs For Scale / Interaction
Interactivity Scalability and Reliability
is key are key

Non-
Interactive Batch
Interactive
•  Parameterized •  Data preparation •  Operational batch
Reports •  Incremental batch processing
•  Drilldown processing •  Enterprise
•  Visualization •  Dashboards / Reports
•  Exploration Scorecards •  Data Mining

5s – 1m 1m – 1h 1h+

Data Size


Stinger: Make Hive Best for All Needs

Non-
Interactive Batch
Interactive
•  Parameterized •  Data preparation •  Operational batch
Reports •  Incremental batch processing
•  Drilldown processing •  Enterprise
•  Visualization •  Dashboards / Reports
•  Exploration Scorecards •  Data Mining

5s – 1m 1m – 1h 1h+

Data Size

Improve Latency & Throughput Extend Deep Analytical Ability
•  Query engine improvements •  Analytics functions
•  New “Optimized RCFile” column store •  Improved SQL coverage
•  Next-gen runtime (elim’s M/R latency) •  Continued focus on core Hive use cases


Analytic Function Use Cases
• OVER
– Rankings, top 10, bottom 10
– Running balances
– Statistics within time windows (e.g. last 3 months, last 6 months)
• LEAD / LAG
– Trend identification
– Sessionization
– Forecasting / prediction
• Distributions
– Histograms and bucketing
• Good for Enterprise Reports, Dashboards, Data
Mining and Business Processing.


Stinger 2013 Roadmap Summary
• HDP 1.x (aka Hadoop 1.x …)
– Additional SQL Types
– SQL Analytic Functions (OVER, Subqueries in WHERE, etc.)
– Modern Optimized Column Store (ORC file)
– Hive Query Enhancements
– Startup time, star joins, optimize M/R DAGs, vectorization, etc.

• HDP 2.x (aka Hadoop 2.x …)
– Features in HDP 1.3 & 1.4
– Next-gen runtime that eliminates startup time
– Persistent function registry
– Other features


Questions?


2013 feb 20_thug_h_catalog

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (9)

Similar to 2013 feb 20_thug_h_catalog

Similar to 2013 feb 20_thug_h_catalog (20)

More from Adam Muise

More from Adam Muise (16)

Recently uploaded

Recently uploaded (20)

2013 feb 20_thug_h_catalog