Data Discovery on Hadoop
PRESENTED BY Sumeet Singh, Thiruvel Thirumoolan ⎪ February 19, 2015
Strata Conference + Hadoop World 2015, San Jose
Introduction
2
§  Developer in the Hive-HCatalog team, and active
contributor to Apache Hive
§  Responsible for Hive, HiveServer2 and HCatalog
across all Hadoop clusters and ensuring they work
at scale for the usage patterns of Yahoo
§  Loves mining the trove of Hadoop logs for usage
patterns and insights
§  Bachelor's degree from Anna University
Thiruvel Thirumoolan
Principal Engineer
Hadoop and Big Data Platforms
Platforms and Personalization Products
701 First Avenue,
Sunnyvale, CA 94089 USA
@thiruvel
§  Manages Hadoop products team at Yahoo
§  Responsible for Product Management, Strategy and
Customer Engagements
§  Managed Cloud Services products team and headed
Strategy functions for the Cloud Platform Group at
Yahoo
§  MBA from UCLA and MS from Rensselaer
Polytechnic Institute (RPI)
Sumeet Singh
Sr. Director, Product Management
Cloud and Big Data Platforms
Platforms and Personalization Products
701 First Avenue,
Sunnyvale, CA 94089 USA
@sumeetksingh
Agenda
3
1  The Data Management Challenge
2  Apache HCatalog to the Rescue
3  Data Registration and Discovery
4  Opening up Access to Data
5  Q&A
Hadoop as the Source of Truth for All Data
4
[Illustrative diagram: data from TVs, PCs, phones, and tablets, plus pushed and pulled data, web crawl, social, email, and 3rd-party content, flows over the Data Highway into the Hadoop Grid; the grid feeds BI, reporting, and ad-hoc analytics as well as NoSQL serving stores that serve data, content, and ads.]
Growth in HDFS1 in the Last 10 Years
5
[Chart: number of servers and raw HDFS storage (in PB) by year, 2006-2015, reaching 42,300 servers, 600 PB, and 1.5 billion files and directories; recent annual storage growth of +70 PB, +66 PB, +64 PB, +80 PB, and +100 PB; last 5 years: 22.2% CAGR.]
1 Across all Hadoop (18 clusters, 40,080 servers, 565 PB) and HBase (9 clusters, 2,250 servers, 35 PB) clusters, Feb 16, 2015
Processing and Analyzing Data with Hadoop…Then
6
[Diagram: Pig, Hive, Java MR API, Hadoop Streaming, and Oozie jobs run on MapReduce (YARN) over HDFS; Java MR uses InputFormat/OutputFormat, Pig uses Load/Store functions, and Hive uses SerDes plus a MetaStore client talking to the Hive MetaStore.]
Processing and Analyzing Data with HBase…Then
7
[Diagram: Pig, Hive, and Java MR APIs access HBase over HDFS; Java MR uses TableInputFormat/TableOutputFormat, Pig uses HBaseStorage, and Hive uses the HBaseStorageHandler with a MetaStore client talking to the Hive MetaStore, with Oozie for orchestration.]
Hadoop Jobs on the Platform Today
8
Job Distribution (Jan 2015): 28.9 M jobs in total (100%), split among Pig, Oozie launcher, Java MR, Hive, GDM, and Streaming/distcp/Spark at 47%, 35%, 7%, 5%, 4%, and 1% respectively.
Challenges in Managing Data on Multi-tenant Platforms
9
Data Producers
Platform Services
Data Consumers
§  Data shared across tools such as MR, Pig, and Hive
§  Schema and semantics knowledge across the
company
§  Support for schema evolution and downstream
change communication
§  Fine-grained access controls (row / column) vs. all
or nothing
§  Clear ownership of data
§  Data lineage and integrity
§  Audits and compliance (e.g. SOX)
§  Retention, duplication, and waste
Data Economy Challenges
Apache HCatalog & Data Discovery
Apache HCatalog in the Technology Stack
10
[Stack diagram: compute services (Pig, Hive, Oozie, HCatalog, Storm, Spark, Tez, Grid UI, GDM & proxies) run on YARN and MapReduce over HDFS and HBase storage, supported by infrastructure services such as Zookeeper, Support Shop, Monitoring, Starling, and the Messaging Service.]
HCatalog Facilitates Interoperability…Now
11
[Diagram: Pig, Hive, and Java MR APIs interoperate on MapReduce (YARN) over HDFS through HCatalog; Java MR uses HCatInputFormat/HCatOutputFormat, Pig uses HCatLoader/HCatStorer, and Hive talks to the shared MetaStore directly; the HCatalog MetaStore abstracts SerDes and storage handlers over HDFS and HBase and emits notifications that Oozie consumes.]
Data Model
12
A database (namespace) contains tables (each with a schema); a table is divided into partitions, and partitions may be bucketed and marked skewed or unskewed, all optional per table.
Partitions, buckets, and skews facilitate faster, more direct access to data.
Sample Table Registration
13
Select project database
USE xyz;

Create table
CREATE EXTERNAL TABLE search (
  bcookie    string COMMENT 'Standard browser cookie',
  time_stamp int    COMMENT 'DD-MON-YYYY HH:MI:SS (AM/PM)',
  uid        string COMMENT 'User id',
  ip         string COMMENT '...',
  pg_spaceid string COMMENT '...',
  ...)
PARTITIONED BY (
  locale    string COMMENT 'Country of origin',
  datestamp string COMMENT 'Date in YYYYMMDD format')
STORED AS ORC
LOCATION '/projects/search/...';

Add partitions manually (if you choose to)
ALTER TABLE search ADD PARTITION (locale='US', datestamp='20130201')
LOCATION '/projects/search/...';
Your company’s data (metadata) can be registered with HCatalog irrespective of the tool used
Getting Data into HCatalog – DML and DDL
14
LOAD Files into tables
Copy / move data from HDFS or local filesystem into HCatalog tables
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)];
INSERT data from a query into tables
Query results can be inserted into tables or filesystem directories by using the INSERT clause.
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]
select_statement1 FROM from_statement;

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
select_statement1 FROM from_statement;
HCatalog also supports multiple inserts in the same statement or dynamic partition inserts.
ALTER TABLE ADD PARTITIONS
ALTER TABLE table_name ADD PARTITION (partCol = 'value1') LOCATION 'loc1';
Getting Data into HCatalog – HCatalog APIs
15
Pig
HCatLoader and HCatStorer are used in Pig scripts to read data from and write data to HCatalog-managed tables.
A = LOAD '$DB.$TABLE' USING org.apache.hcatalog.pig.HCatLoader();
B = FILTER A BY $FILTER;
C = FOREACH B GENERATE foo, bar;
STORE C INTO '$OUTPUT_DB.$OUTPUT_TABLE' USING org.apache.hcatalog.pig.HCatStorer('$OUTPUT_PARTITION');
MapReduce
HCatInputFormat and HCatOutputFormat are used with MapReduce to read data from and write data to HCatalog-managed tables.
Map<String, String> partitionValues = new HashMap<String, String>();
partitionValues.put("a", "1");
partitionValues.put("b", "1");
HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
HCatOutputFormat.setOutput(job, info);
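The write path above has a matching read path. As a minimal, hedged sketch (not from the deck): newer Hive releases ship HCatalog under org.apache.hive.hcatalog, where HCatInputFormat.setInput(job, db, table) points a MapReduce job at an HCatalog-managed table and the mapper receives HCatRecord values. The column position and output path below are illustrative assumptions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class ReadSearchTable {

  // Emits (locale, 1) for every record read from the HCatalog-managed table.
  public static class SearchMapper
      extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
    @Override
    protected void map(WritableComparable key, HCatRecord value, Context ctx)
        throws IOException, InterruptedException {
      // Position 5 is assumed to be the 'locale' partition column here;
      // real code would look the position up from the table's HCatSchema.
      String locale = (String) value.get(5);
      ctx.write(new Text(locale), new IntWritable(1));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "read-hcat-search");
    job.setJarByClass(ReadSearchTable.class);

    // Read the 'search' table registered earlier; a partition filter string
    // (e.g. "datestamp='20130201'") can be passed with the 4-argument variant.
    HCatInputFormat.setInput(job, "xyz", "search");
    job.setInputFormatClass(HCatInputFormat.class);

    job.setMapperClass(SearchMapper.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(0);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}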
  
	
  
	
  
	
  
HCatalog Integration with Data Mgmt. Platform (GDM)
16
[Diagram: GDM (Grid Data Management) acquires feeds onto Cluster 1 (Colo 1) and replicates them to Cluster 2 (Colo 2). Feed datasets are registered as partitioned external tables, and Growl extracts the schema for backfill. On each cluster, GDM calls HCatClient.addPartitions(…) and marks LOAD_DONE, which raises an add_partition event notification from the MetaStore; partitions are dropped with HCatClient.dropPartitions(…) after retention expiration, with a drop_partition notification.]
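A rough sketch (not taken from GDM's code) of how a registrar could add a feed partition through the HCatalog client API; the database, table, partition values, and location follow the earlier examples, and exact HCatClient/HCatAddPartitionDesc signatures vary slightly across Hive releases.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hive.hcatalog.api.HCatAddPartitionDesc;
import org.apache.hive.hcatalog.api.HCatClient;

public class RegisterFeedPartition {
  public static void main(String[] args) throws Exception {
    // HiveConf picks up hive-site.xml, which carries this cluster's metastore URI.
    HCatClient client = HCatClient.create(new HiveConf());
    try {
      Map<String, String> partSpec = new HashMap<String, String>();
      partSpec.put("locale", "US");
      partSpec.put("datestamp", "20130201");

      // Register the newly landed feed directory as a partition of the external
      // 'search' table (the location path is illustrative); the MetaStore then
      // emits an add_partition notification that downstream consumers can act on.
      HCatAddPartitionDesc desc = HCatAddPartitionDesc
          .create("xyz", "search", "/projects/search/US/20130201", partSpec)
          .build();
      client.addPartitions(Arrays.asList(desc));
    } finally {
      client.close();
    }
  }
}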
HCatalog Notifications
17
Namespace: e.g. "hcat.thebestcluster"
JMS topic: e.g. "<dbname>.<tablename>"

Sample JMS notification:
{
  "timestamp" : 1360272556,
  "eventType" : "ADD_PARTITION",
  "server" : "thebestcluster-hcat.dc1.grid.yahoo.com",
  "servicePrincipal" : "hcat/thebestcluster-hcat.dc1.grid.yahoo.com@GRID.YAHOO.COM",
  "db" : "xyz",
  "table" : "search",
  "partitions" : [
    { "locale" : "US", "datestamp" : "20140602" },
    { "locale" : "UK", "datestamp" : "20140602" },
    { "locale" : "IN", "datestamp" : "20140602" }
  ]
}
HCatalog uses JMS (ActiveMQ) notifications that can be sent for add_database, add_table, add_partition, drop_partition,
drop_table, and drop_database. Notifications can be extended for schema change communication
[Diagram: the HCat MetaStore registers a channel with the ActiveMQ server and publishes notifications to listener channels (topics) that subscribers consume; HCat clients trigger the events through the MetaStore.]
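As a minimal sketch (the broker URL and topic name are illustrative, and the payload is assumed to be the JSON text shown above), a subscriber can watch a table's topic with the plain JMS API against the ActiveMQ server:

import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;

import org.apache.activemq.ActiveMQConnectionFactory;

public class PartitionNotificationListener {
  public static void main(String[] args) throws Exception {
    // Illustrative broker URL; in practice this is the grid's ActiveMQ endpoint.
    ActiveMQConnectionFactory factory =
        new ActiveMQConnectionFactory("tcp://activemq.example.com:61616");
    Connection connection = factory.createConnection();
    connection.start();

    Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
    // Topic names follow the "<dbname>.<tablename>" convention above.
    Topic topic = session.createTopic("xyz.search");
    MessageConsumer consumer = session.createConsumer(topic);

    // Block for notifications; each ADD_PARTITION / DROP_PARTITION event arrives
    // on the table's topic and can trigger downstream processing.
    while (true) {
      Message message = consumer.receive();
      if (message instanceof TextMessage) {
        System.out.println(((TextMessage) message).getText());
      }
    }
  }
}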
Oozie, HCatalog, and Messaging Integration
18
[Diagram: a data producer writes data to HDFS (via distcp, Pig, M/R, ...) under /data/click/2014/06/02 and updates metadata with ALTER TABLE click ADD PARTITION (data='2014/06/02') LOCATION 'hdfs://data/click/2014/06/02'. Oozie (1) queries/polls HCatalog for the partition and (2) registers a topic on the message bus; when the partition lands, HCatalog (3) pushes a new-partition notification and the message bus (4) notifies Oozie, which starts the workflow.]
Data Discovery with HCatalog
19
§  Unified metadata store for all data at Yahoo
§  Discovery is about
o  Browsing / inspecting metadata and data
o  Searching for datasets
§  It helps address
o  Schema knowledge across the company
o  Ownership
o  Data type (dev or prod)
o  Understanding the data
o  Schema evolution
o  Lineage
Data Discovery Features
20
§  Browsing
o  Tables / Databases
o  Schema, format, properties
o  Partitions and metadata about each partition
§  Searches for tables
o  Table name (regex) or Comments
o  Column name or comments
o  Ownership, File format
o  Location
o  Properties (Dev/Prod)
Data Discovery UI in Production
21
[Illustrative screenshot: the Global HCatalog Dashboard lets users search HCat tables and browse databases by cluster (e.g., audience_db, tumblr_db, user_db, flickr_db on "The Best Cluster"); selecting a database lists its tables with descriptions, e.g. page_clicks (hourly clickstream table), ad_clicks (hourly ad clicks table), user_info (user registration info), session_info (session feed info), audience_info (primary audience table), with paging through search or browse results.]
Table Display UI
22
[Illustrative screenshot: the table display page shows the HCat instance (The Best Cluster), database (audience_db), table (page_clicks), and owner (Awesome Yahoo), along with the schema (e.g. bcookie string, standard browser cookie; timestamp string, DD-MON-YYYY HH:MI:SS (AM/PM); ...) and the partition columns (e.g. dt string, date in YYYY_MM_DD format).]
Data Discovery Physical View
23
[Illustrative diagram: the Discovery UI presents a global view of all data in the metastores by talking to a webserver that exposes the HCat API alongside each MetaStore (MS) on every cluster (DC1-C1, DC1-C2, ..., DC2-C1, DC2-C2, ...) across data centers.]
Data Discovery Design
24
§  A single web interface connects to all Metastore instances (all datacenters)
§  Select an appropriate cluster and browse all metadata
o  A webserver runs on each Metastore
o  All reads audited
o  ACLs (future)
§  Search functionality will be added to web interface and Metastore
o  New Thrift interface to support search
o  All searches audited
§  Long term design
o  Load on production
o  Read and Write HCatalog instances
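As a rough sketch of the browse path (the search RPC mentioned above is the new piece), the webserver next to each Metastore can serve most of this through the standard Thrift metastore client; the database, name pattern, and table below come from the illustrative UI, not from Yahoo's actual implementation.

import java.util.List;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Table;

public class BrowseMetastore {
  public static void main(String[] args) throws Exception {
    // hive.metastore.uris in hive-site.xml selects this cluster's MetaStore.
    HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
    try {
      // Browse databases, as the discovery UI's left-hand pane does.
      for (String db : client.getAllDatabases()) {
        System.out.println("db: " + db);
      }
      // Simple name-pattern lookup within a database.
      List<String> tables = client.getTables("audience_db", "*click*");
      for (String t : tables) {
        Table table = client.getTable("audience_db", t);
        System.out.println(t + " owner=" + table.getOwner());
        for (FieldSchema col : table.getSd().getCols()) {
          System.out.println("  " + col.getName() + " " + col.getType()
              + " : " + col.getComment());
        }
      }
    } finally {
      client.close();
    }
  }
}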
Data Discovery Design – APIs
25
§  Search
o  Searches across various fields in order
o  Simple ranking
o  Search order for multiple keywords
o  Optimized implementation for database
o  Will be contributed back
§  Unique partition values
o  One or more partition keys
o  Filtering and Ordering supported
o  HIVE-7604 (https://issues.apache.org/jira/browse/HIVE-7604)
Data Discovery Design – Optimizations
26
§  Allows peeking into the data (SELECT * ... LIMIT n, sketched below)
§  Existing implementations are costly
o  Consume too many client and server resources
o  Timeouts and failures
§  Optimized partition objects and used names
§  The new implementation takes a few seconds at most
§  HIVE-9573 (https://issues.apache.org/jira/browse/HIVE-9573)
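For reference, this is the kind of "peek" a discovery UI can issue through HiveServer2 over JDBC; the connection URL, user, database, and table are illustrative placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class PeekTable {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 JDBC endpoint; URL is illustrative.
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://hs2.example.com:10000/audience_db", "user", "");
    try {
      Statement stmt = conn.createStatement();
      // Peek at a handful of rows rather than scanning the table.
      ResultSet rs = stmt.executeQuery("SELECT * FROM page_clicks LIMIT 10");
      ResultSetMetaData md = rs.getMetaData();
      while (rs.next()) {
        StringBuilder row = new StringBuilder();
        for (int i = 1; i <= md.getColumnCount(); i++) {
          row.append(md.getColumnName(i)).append('=')
             .append(rs.getString(i)).append(' ');
        }
        System.out.println(row);
      }
    } finally {
      conn.close();
    }
  }
}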
Going Forward – Lineage
27
Advantages and challenges to weigh: bottleneck, ownership, quality, offline vs. real time, data flow vs. control flow, software stack.
Going Forward – Lineage
28
Statistics can drive heuristics instead of requiring a job run.
[Illustrative diagram: lineage across stages - Table 1 / Partition 1 (Stage-1) is copied hourly from HBase by distcp or an external registrar into an ORC table partition (Stage-2), which together with a dimension table feeds a statistics/aggregation table (Stage-3) and then a daily stats table (Stage-4).]
Going Forward – Schema Versioning
29
[Illustrative view: the schema (bcookie string, standard browser cookie; timestamp string, DD-MON-YYYY HH:MI:SS (AM/PM); uid string, user id), the file format (ORC), and table properties such as compression type (zlib) and the external flag.]
§  User ‘awesome_yahoo’
added ‘foo string’ to the
table on May 29, 2014 at
‘1:10 AM’
§  User ‘me_too’ added table
properties
‘orc.compress=ZLIB’ on
May 30, 2014 at ‘9:00 AM’
§  User ‘me_too’ changed the
file format from ‘RCFile’ to
‘ORC’ on Jun 1, 2014 at
‘10:30 AM’
HCatalog is Part of a Broader Solution Set
30
Product and its role in the grid stack:
Hive: data warehousing to facilitate querying and managing large datasets in HDFS; a mechanism to project structure onto HDFS data and query it using a SQL-like language.
HiveServer2: server process (Thrift-based RPC) for concurrent clients connecting over ODBC/JDBC; authentication and authorization for ODBC/JDBC clients for metadata access.
HCatalog: table and storage management layer for Hadoop tools to easily share data; relational view of data, storage location and format abstraction, notifications of availability.
Starling: Hadoop log warehouse for analytics on grid usage (job history, tasks, job counters, etc.); 1 TB of raw logs processed per day, 24 TB of processed data.
Deployment Layout
31
[Illustrative diagram: batch and interactive SQL (HiveQL) from clients and the CLI runs as MapReduce and Tez jobs on YARN over HDFS; HiveServer2 sits behind a load balancer for ODBC/JDBC clients, and the HCatalog Thrift service sits behind its own load balancer serving Pig M/R and Data Out clients; both share an RDBMS-backed metastore; jobs are submitted through a launcher gateway, with cloud messaging, BI, reporting, DataOut, and a dev UI (Data to Desktop, D2D) integrating on top.]
Data Governance
32
[Diagram: data access classifications (Public, Non-sensitive, Financial ($), Restricted) map to increasing governance requirements, from no additional requirement through LMS integration, Stock Admin integration, and owner review, with manager approval and employee acknowledgement in the access workflow.]
SQL-based Authorization for Controlled Access
33
§  SQL-compliant authorization model (Users, Roles, Privileges, Objects)
§  Fine-grain authorization and access control patterns (row and column in conjunction with views)
§  Can be used in conjunction with storage-based authorization
Privileges
§  Objects consist of databases, tables, and views
§  Privileges are GRANTed on objects
o  SELECT: read access to an object
o  INSERT: write (insert) access to an object
o  UPDATE: write (update) access to an object
o  DELETE: delete access for an object
o  ALL PRIVILEGES: all privileges
Access Control
§  Roles can be associated with objects
§  Privileges are associated with roles
§  CREATE, DROP, and SET ROLE statements manipulate roles and membership
§  SUPERUSER role for databases can grant access control to users or roles (not limited to HDFS permissions)
§  PUBLIC role includes all users
§  Prevents undesirable operations on objects by unauthorized users
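A hedged example of how these primitives combine (issued through a HiveServer2 connection with SQL standard authorization enabled; the role, view, and user names are illustrative): a role is granted read access to a view that carries the row/column restrictions, and users are then granted the role.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class GrantExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Must connect as a user holding the admin role; URL is illustrative.
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://hs2.example.com:10000/audience_db", "admin_user", "");
    try {
      Statement stmt = conn.createStatement();
      // Create a role, grant it read access to a restricted view (the row and
      // column filtering lives in the view definition), then hand out the role.
      stmt.execute("CREATE ROLE click_analysts");
      stmt.execute("GRANT SELECT ON TABLE page_clicks_us_view TO ROLE click_analysts");
      stmt.execute("GRANT ROLE click_analysts TO USER alice");
    } finally {
      conn.close();
    }
  }
}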
Audits, Compliance, and Efficiency
34
[Diagram: Starling, the log warehouse, collects FS, job, and task logs from each Hadoop cluster; CF, region, action, and query stats from each HBase cluster; DB, table, partition, and column access stats from each MetaStore (MS); and data definition, flow, feed, and source information from GDM.]
In Summary
35
✔ Data shared across tools such as MR, Pig, and Hive → Apache HCatalog
✔ Schema and semantics knowledge across the company → Data Discovery
✔ Support for schema evolution and downstream change communication → Apache HCatalog
✔ Fine-grained access controls (row / column) vs. all or nothing → SQL-based Authorization
✔ Clear ownership of data → Data Discovery
✔ Data lineage and integrity → Data Discovery / Starling
✔ Audits and compliance (e.g. SOX) → Data Discovery / Starling
✔ Retention, duplication, and waste → Data Discovery / Starling
Thank You
@sumeetksingh
@thiruvel
Office Hours
4:00pm–4:40pm, Table A
