Hadoop has allowed us to move towards a unified source of truth for all of an organization’s data. Managing data location, schema knowledge and evolution, fine-grained, business-rule-based access control, and audit and compliance needs becomes critical as the scale of operations grows.
In this talk, we will share an approach to tackling the above challenges. We will explain how to register existing HDFS files, provide broader but controlled access to data through a data discovery tool with schema browse and search functionality, and leverage existing Hadoop ecosystem components such as Pig, Hive, HBase, and Oozie to seamlessly share data across applications. Integration with data movement tools automates the availability of new data. In addition, the approach opens up easy ad hoc access to analyze and visualize data through SQL on Hadoop and popular BI tools. As we discuss our approach, we will also highlight how it minimizes data duplication, eliminates wasteful data retention, and addresses data provenance, lineage, and integrity.
URL: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/38768
Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop
1. Data Discovery on Hadoop
PRESENTED BY Sumeet Singh, Thiruvel Thirumoolan | February 19, 2015
Strata Conference + Hadoop World 2015, San Jose
2. Introduction
§ Developer in the Hive-HCatalog team, and active
contributor to Apache Hive
§ Responsible for Hive, HiveServer2 and HCatalog
across all Hadoop clusters and ensuring they work
at scale for the usage patterns of Yahoo
§ Loves mining the trove of Hadoop logs for usage
patterns and insights
§ Bachelor’s degree from Anna University
Thiruvel Thirumoolan
Principal Engineer
Hadoop and Big Data Platforms
Platforms and Personalization Products
701 First Avenue,
Sunnyvale, CA 94089 USA
@thiruvel
§ Manages Hadoop products team at Yahoo
§ Responsible for Product Management, Strategy and
Customer Engagements
§ Managed Cloud Services products team and headed
Strategy functions for the Cloud Platform Group at
Yahoo
§ MBA from UCLA and MS from Rensselaer
Polytechnic Institute (RPI)
Sumeet Singh
Sr. Director, Product Management
Cloud and Big Data Platforms
Platforms and Personalization Products
701 First Avenue,
Sunnyvale, CA 94089 USA
@sumeetksingh
3. Agenda
1. The Data Management Challenge
2. Apache HCatalog to the Rescue
3. Data Registration and Discovery
4. Opening up Access to Data
5. Q&A
4. Hadoop as the Source of Truth for All Data
ILLUSTRATIVE: Data from TV, PC, phone, and tablet devices, pushed and pulled data, web crawl, social, email, and 3rd-party content flow through the Data Highway into the Hadoop Grid, which feeds BI, reporting, and ad hoc analytics, as well as data, content, and ads serving through No-SQL serving stores.
6. Processing and Analyzing Data with Hadoop…Then
Stack diagram: Pig (Load/Store functions), Java MR APIs (InputFormat/OutputFormat), and Hive (SerDe, with a MetaStore client backed by the Hive MetaStore) run over MapReduce (YARN) on HDFS, alongside Hadoop Streaming, with Oozie for orchestration.
7. Processing and Analyzing Data with HBase…Then
Stack diagram: Pig (HBaseStorage), Java MR APIs (TableInputFormat/TableOutputFormat), and Hive (HBaseStorageHandler, with a MetaStore client backed by the Hive MetaStore) run over HBase on HDFS, with Oozie for orchestration.
8. Hadoop Jobs on the Platform Today
Job distribution (Jan 2015), 28.9 M jobs total: Pig 47%, Oozie Launcher 35%, Java MR 7%, Hive 5%, GDM 4%, Streaming/distcp/Spark 1%.
9. Challenges in Managing Data on Multi-tenant Platforms
Data producers, platform services, and data consumers all share the platform. Data economy challenges:
§ Data shared across tools such as MR, Pig, and Hive
§ Schema and semantics knowledge across the company
§ Support for schema evolution and downstream change communication
§ Fine-grained access controls (row / column) vs. all or nothing
§ Clear ownership of data
§ Data lineage and integrity
§ Audits and compliance (e.g. SOX)
§ Retention, duplication, and waste
These are the challenges Apache HCatalog and Data Discovery address.
10. Apache HCatalog in the Technology Stack
Stack diagram: compute services (Pig, Hive, Oozie, Grid UI, GDM & proxies, Support Shop, Monitoring, Starling, Messaging Service, HCatalog, Storm, Spark, Tez) run over YARN and MapReduce, on HDFS and HBase storage, with Zookeeper as an infrastructure service. HCatalog sits in this stack as the shared table and metadata layer.
13. Sample Table Registration
Select project database
  USE xyz;

Create table
  CREATE EXTERNAL TABLE search (
    bcookie string COMMENT 'Standard browser cookie',
    time_stamp int COMMENT 'DD-MON-YYYY HH:MI:SS (AM/PM)',
    uid string COMMENT 'User id',
    ip string COMMENT '...',
    pg_spaceid string COMMENT '...',
    ...)
  PARTITIONED BY (
    locale string COMMENT 'Country of origin',
    datestamp string COMMENT 'Date in YYYYMMDD format')
  STORED AS ORC
  LOCATION '/projects/search/...';

Add partitions manually (if you choose to)
  ALTER TABLE search ADD PARTITION (locale='US', datestamp='20130201')
  LOCATION '/projects/search/...';
Your company’s data (metadata) can be registered with HCatalog irrespective of the tool used
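Under the default Hive layout, each partition’s data sits beneath the table location in key=value directories, which is what ADD PARTITION locations conventionally point at. A minimal sketch of that convention (the class and method names are ours, not an HCatalog API; registered external partitions may point anywhere):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: builds the conventional Hive-style partition
 *  directory suffix (key=value/key=value) for a partition spec. */
public class PartitionPath {
    static String partitionSuffix(Map<String, String> spec) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : spec.entrySet()) {
            if (sb.length() > 0) sb.append('/');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Order matters: partition keys as declared in PARTITIONED BY.
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("locale", "US");
        spec.put("datestamp", "20130201");
        System.out.println(partitionSuffix(spec)); // locale=US/datestamp=20130201
    }
}
```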
14. Getting Data into HCatalog – DML and DDL
LOAD files into tables
Copy / move data from HDFS or the local filesystem into HCatalog tables.

  LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
  INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)];

INSERT data from a query into tables
Query results can be inserted into tables or file system directories by using the INSERT clause.

  INSERT OVERWRITE TABLE tablename1
  [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]
  select_statement1 FROM from_statement;

  INSERT INTO TABLE tablename1
  [PARTITION (partcol1=val1, partcol2=val2 ...)]
  select_statement1 FROM from_statement;

HCatalog also supports multiple inserts in the same statement and dynamic partition inserts.

ALTER TABLE ADD PARTITIONS

  ALTER TABLE table_name ADD PARTITION (partCol = 'value1') LOCATION 'loc1';
15. Getting Data into HCatalog – HCatalog APIs
Pig
HCatLoader and HCatStorer are used in Pig scripts to read data from and write data to HCatalog-managed tables.

  A = LOAD '$DB.$TABLE' USING org.apache.hcatalog.pig.HCatLoader();
  B = FILTER A BY $FILTER;
  C = FOREACH B GENERATE foo, bar;
  STORE C INTO '$OUTPUT_DB.$OUTPUT_TABLE'
    USING org.apache.hcatalog.pig.HCatStorer('$OUTPUT_PARTITION');

MapReduce
HCatInputFormat and HCatOutputFormat are used with MapReduce to read data from and write data to HCatalog-managed tables.

  Map<String, String> partitionValues = new HashMap<String, String>();
  partitionValues.put("a", "1");
  partitionValues.put("b", "1");
  HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
  HCatOutputFormat.setOutput(job, info);
16. HCatalog Integration with Data Mgmt. Platform (GDM)
Diagram: Grid Data Management (GDM) acquires feeds and replicates them across clusters (Cluster 1 in Colo 1, Cluster 2 in Colo 2), each with its own HDFS and MetaStore. Feed datasets are registered as partitioned external tables, and Growl extracts the schema for backfill. As data lands on each cluster, GDM calls HCatClient.addPartitions(…) and marks LOAD_DONE, which fires an add_partition event notification. After retention expires, partitions are dropped with HCatClient.dropPartitions(…), which fires a drop_partition notification.
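The retention step hinges on deciding which datestamp partitions have expired. A minimal sketch of that cutoff logic, assuming the YYYYMMDD datestamp convention from the earlier table registration (the class and method names are hypothetical; the actual drop goes through HCatClient.dropPartitions(…)):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical sketch: pick the cutoff datestamp for a retention
 *  policy; partitions with datestamp < cutoff are candidates to drop. */
public class RetentionCutoff {
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMdd");

    static String cutoff(LocalDate today, int retentionDays) {
        return today.minusDays(retentionDays).format(FMT);
    }

    static boolean expired(String datestamp, String cutoff) {
        // YYYYMMDD strings sort lexicographically in date order.
        return datestamp.compareTo(cutoff) < 0;
    }

    public static void main(String[] args) {
        String cut = cutoff(LocalDate.of(2015, 2, 19), 30);
        System.out.println(cut);                       // 20150120
        System.out.println(expired("20141231", cut));  // true
        System.out.println(expired("20150201", cut));  // false
    }
}
```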
17. HCatalog Notifications
Namespace: e.g. "hcat.thebestcluster"
JMS Topic: e.g. "<dbname>.<tablename>"

Sample JMS notification

  {
    "timestamp" : 1360272556,
    "eventType" : "ADD_PARTITION",
    "server" : "thebestcluster-hcat.dc1.grid.yahoo.com",
    "servicePrincipal" : "hcat/thebestcluster-hcat.dc1.grid.yahoo.com@GRID.YAHOO.COM",
    "db" : "xyz",
    "table" : "search",
    "partitions" : [
      { "locale" : "US", "datestamp" : "20140602" },
      { "locale" : "UK", "datestamp" : "20140602" },
      { "locale" : "IN", "datestamp" : "20140602" }
    ]
  }

HCatalog uses JMS (ActiveMQ) notifications that can be sent for add_database, add_table, add_partition, drop_partition, drop_table, and drop_database. Notifications can be extended for schema change communication.

Diagram: the HCat client registers a channel with the HCat MetaStore, which publishes to listener channels on the ActiveMQ server; subscribers consume from those channels.
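A subscriber that receives the payload above needs to pull out the new partition values before acting on them. A minimal illustration of that step (a real consumer would use a JMS listener and a proper JSON library; this string scan and the class name are ours):

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch: pull the values of one field (e.g. "locale") out of
 *  an ADD_PARTITION notification payload. For illustration only; a
 *  real consumer would parse the JSON properly. */
public class NotificationFields {
    static List<String> values(String json, String field) {
        List<String> out = new ArrayList<>();
        String needle = "\"" + field + "\"";
        int i = json.indexOf(needle);
        while (i >= 0) {
            int colon = json.indexOf(':', i + needle.length());
            int q1 = json.indexOf('"', colon + 1);   // opening quote of the value
            int q2 = json.indexOf('"', q1 + 1);      // closing quote of the value
            out.add(json.substring(q1 + 1, q2));
            i = json.indexOf(needle, q2 + 1);
        }
        return out;
    }

    public static void main(String[] args) {
        String payload = "{ \"eventType\" : \"ADD_PARTITION\", \"table\" : \"search\","
            + " \"partitions\": [ { \"locale\" : \"US\", \"datestamp\" : \"20140602\" },"
            + " { \"locale\" : \"UK\", \"datestamp\" : \"20140602\" } ] }";
        System.out.println(values(payload, "locale")); // [US, UK]
    }
}
```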
18. Oozie, HCatalog, and Messaging Integration
A data producer writes data (distcp, pig, M/R, …) to HDFS (e.g. /data/click/2014/06/02) and updates the metadata:

  ALTER TABLE click ADD PARTITION (data='2014/06/02')
  LOCATION 'hdfs://data/click/2014/06/02';

1. Oozie queries/polls HCatalog for the partition
2. Oozie registers a topic on the message bus
3. HCatalog pushes a <New Partition> notification to the message bus
4. The message bus notifies Oozie of the new partition, and Oozie starts the workflow
19. Data Discovery with HCatalog
§ Unified metadata store for all data at Yahoo
§ Discovery is about
o Browsing / inspecting metadata and data
o Searching for datasets
§ It helps to solve
o Schema knowledge across the company
o Ownership
o Data type (dev or prod)
o Understanding data
o Schema evolution
o Lineage
20. Data Discovery Features
§ Browsing
o Tables / Databases
o Schema, format, properties
o Partitions and metadata about each partition
§ Searches for tables
o Table name (regex) or Comments
o Column name or comments
o Ownership, File format
o Location
o Properties (Dev/Prod)
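The table-name (regex) search above can be sketched as a simple filter over candidate names (the class and method names are hypothetical; the production search runs against the Metastore):

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical sketch of the "table name (regex)" search:
 *  filter a list of table names by a user-supplied pattern. */
public class TableSearch {
    static List<String> byName(List<String> tables, String regex) {
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        return tables.stream()
                     .filter(t -> p.matcher(t).find())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tables = List.of("page_clicks", "ad_clicks", "user_info", "session_info");
        System.out.println(byName(tables, ".*_clicks")); // [page_clicks, ad_clicks]
    }
}
```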
21. Data Discovery UI in Production
ILLUSTRATIVE screenshot of the GLOBAL HCATALOG DASHBOARD: a table search box ("Search the HCat tables"), a cluster selector ("The Best Cluster"), and paginated panes for browsing the DBs by cluster and for search or browse results. Available databases: audience_db, tumblr_db, user_db, flickr_db. Available tables in audience_db: page_clicks (hourly clickstream table), ad_clicks (hourly ad clicks table), user_info (user registration info), session_info (session feed info), audience_info (primary audience table).
22. Table Display UI
ILLUSTRATIVE table view in the GLOBAL HCATALOG DASHBOARD: HCat instance "The Best Cluster", database audience_db, table page_clicks, owner "Awesome Yahoo".

Schema
  Column     Type    Description
  bcookie    string  Standard browser cookie
  timestamp  string  DD-MON-YYYY HH:MI:SS (AM/PM)
  ...

Partitions
  Column  Type    Description
  dt      string  Date in YYYY_MM_DD format
23. Data Discovery Physical View
ILLUSTRATIVE: The Discovery UI holds a global view of all data in the metastores. Each data center (Data Center 1, Data Center 2, …) hosts clusters (DC1-C1, DC1-C2, …, DCn-Cn; DC2-C1, DC2-C2, …, DCm-Cm), and each cluster's MetaStore (MS) runs a webserver exposing the HCat API that the UI queries.
24. Data Discovery Design
§ A single web interface connects to all Metastore instances (all datacenters)
§ Select an appropriate cluster and browse all metadata
o A webserver runs on each Metastore
o All reads audited
o ACLs (future)
§ Search functionality will be added to web interface and Metastore
o New Thrift interface to support search
o All searches audited
§ Long term design
o Load on production
o Read and Write HCatalog instances
25. Data Discovery Design – APIs
§ Search
o Searches across various fields in order
o Simple ranking
o Search order for multiple keywords
o Optimized implementation for database
o Will be contributed back
§ Unique partition values
o One or more partition keys
o Filtering and Ordering supported
o HIVE-7604 (https://issues.apache.org/jira/browse/HIVE-7604)
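The "fields in order" ranking above can be sketched as: score each table by the first field that matches the keyword, then sort by that score (all names here are hypothetical; the real implementation is an optimized Metastore/database query):

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical sketch of "searches across various fields in order"
 *  with simple ranking: a hit on an earlier field ranks higher. */
public class SimpleRanking {
    // Field values for one table, in search priority order:
    // table name, table comment, column names/comments, ...
    record Entry(String name, List<String> fields) {}

    static int rank(Entry e, String keyword) {
        for (int i = 0; i < e.fields().size(); i++) {
            if (e.fields().get(i).toLowerCase().contains(keyword.toLowerCase())) {
                return i; // lower is better
            }
        }
        return Integer.MAX_VALUE; // no match
    }

    static List<String> search(List<Entry> entries, String keyword) {
        return entries.stream()
                      .filter(e -> rank(e, keyword) != Integer.MAX_VALUE)
                      .sorted(Comparator.comparingInt(e -> rank(e, keyword)))
                      .map(Entry::name)
                      .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
            new Entry("click_log", List.of("click_log", "raw clicks", "ts,url")),
            new Entry("page_views", List.of("page_views", "hourly click rollup", "dt,page")));
        System.out.println(search(entries, "click")); // [click_log, page_views]
    }
}
```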
26. Data Discovery Design – Optimizations
§ Allows peeking into the data (select * limit n)
§ Existing implementations are costly
o Consume too many client and server resources
o Timeouts and failures
§ Optimized partition objects and used names
§ The new implementation takes a few seconds at most
§ HIVE-9573 (https://issues.apache.org/jira/browse/HIVE-9573)
27. Going Forward – Lineage
Advantages and challenges are weighed along several dimensions:
§ Bottleneck
§ Ownership
§ Quality
§ Offline / Real Time
§ Data Flow / Control Flow
§ Software Stack
28. Going Forward – Lineage
Statistics help in heuristics instead of running a job.

ILLUSTRATIVE lineage graph across stages: Table 1 / Partition 1 (Stage-1, hourly) feeds HBase; an ORC table partition (Stage-2) is copied by distcp / an external registrar; together with a dimension table it produces a statistics / aggregation table (Stage-3) and finally a daily stats table (Stage-4).
29. Going Forward – Schema Versioning
ILLUSTRATIVE: the left pane shows the current table definition; the right pane shows the change history.

Schema
  Column     Type    Description
  bcookie    string  Standard browser cookie
  timestamp  string  DD-MON-YYYY HH:MI:SS (AM/PM)
  uid        string  User id

File Format: ORC
Table Properties: Compression Type zlib, External, ...

Change history
§ User 'awesome_yahoo' added 'foo string' to the table on May 29, 2014 at 1:10 AM
§ User 'me_too' added table property 'orc.compress=ZLIB' on May 30, 2014 at 9:00 AM
§ User 'me_too' changed the file format from 'RCFile' to 'ORC' on Jun 1, 2014 at 10:30 AM
30. HCatalog is Part of a Broader Solution Set
Product and role in the grid stack:

Hive
§ Data warehousing to facilitate querying and managing large datasets in HDFS
§ Mechanism to project structure onto HDFS data and query using a SQL-like language

HiveServer2
§ Server process (Thrift-based RPC) for concurrent clients connecting over ODBC/JDBC
§ Authentication and authorization for ODBC/JDBC clients for metadata access

HCatalog
§ Table and storage management layer for Hadoop tools to easily share data
§ Relational view of data, storage location and format abstraction, notifications of availability

Starling
§ Hadoop log warehouse for analytics on grid usage (job history, tasks, job counters, etc.)
§ 1 TB of raw logs processed / day, 24 TB of processed data
31. Deployment Layout
ILLUSTRATIVE deployment layout: load balancers front both the HCatalog Thrift service and HiveServer2 (ODBC/JDBC). HiveQL from the client/CLI becomes M/R / Tez jobs, giving batch and interactive SQL on Tez and MapReduce on YARN + HDFS. Pig M/R jobs go through a launcher gateway, and the Data Out client talks to HCatalog as well. An RDBMS backs the metastore, cloud messaging carries notifications, and consumers include BI, reporting, DataOut, and the Dev UI for Data to Desktop (D2D).
33. SQL-based Authorization for Controlled Access
§ SQL-compliant authorization model (Users, Roles, Privileges, Objects)
§ Fine-grain authorization and access control patterns (row and column in conjunction with views)
§ Can be used in conjunction with storage-based authorization

Privileges
§ Objects consist of databases, tables, and views
§ Privileges are GRANTed on objects
o SELECT: read access to an object
o INSERT: write (insert) access to an object
o UPDATE: write (update) access to an object
o DELETE: delete access for an object
o ALL PRIVILEGES: all privileges

Access Control
§ Roles can be associated with objects
§ Privileges are associated with roles
§ CREATE, DROP, and SET ROLE statements manipulate roles and membership
§ SUPERUSER role for databases can grant access control to users or roles (not limited to HDFS permissions)
§ PUBLIC role includes all users
§ Prevents undesirable operations on objects by unauthorized users
34. Audits, Compliance, and Efficiency
Starling, the log warehouse, aggregates from several log sources: FS, job, and task logs from the Hadoop clusters (Cluster 1 … Cluster n); CF, region, action, and query stats from the HBase clusters; DB, table, partition, and column access stats from the metastores (MS 1 … MS n); and data definition, flow, feed, and source information from GDM (F 1 … F n).
35. In Summary
✔ Data shared across tools such as MR, Pig, and Hive: Apache HCatalog
✔ Schema and semantics knowledge across the company: Data Discovery
✔ Support for schema evolution and downstream change communication: Apache HCatalog
✔ Fine-grained access controls (row / column) vs. all or nothing: SQL-based Authorization
✔ Clear ownership of data: Data Discovery
✔ Data lineage and integrity: Data Discovery / Starling
✔ Audits and compliance (e.g. SOX): Data Discovery / Starling
✔ Retention, duplication, and waste: Data Discovery / Starling