Hadoop Summit San Jose 2014: Data Discovery on Hadoop

Data Discovery on Hadoop -
Realizing the Full Potential of Your Data
Presented by Thiruvel Thirumoolan, Sumeet Singh | June 3, 2014
2014 Hadoop Summit, San Jose, California

Introduction
Sumeet Singh
Senior Director, Product Management
Hadoop and Big Data Platforms
Cloud Engineering Group
§  Manages Hadoop products team at Yahoo!
§  Responsible for Product Management, Strategy and Customer Engagements
§  Managed Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo
§  M.B.A. from UCLA and M.S. from Rensselaer (RPI)
701 First Avenue, Sunnyvale, CA 94089 USA
@sumeetksingh
Thiruvel Thirumoolan
Principal Engineer
Hadoop and Big Data Platforms
Cloud Engineering Group
§  Developer in the Hive-HCatalog team, and active contributor to Apache Hive
§  Responsible for Hive, HiveServer2 and HCatalog across all Hadoop clusters and ensuring they work at scale for the usage patterns of Yahoo
§  Loves mining the trove of Hadoop logs for usage patterns and insights
§  Bachelor's degree from Anna University
701 First Avenue, Sunnyvale, CA 94089 USA
@thiruvel

Agenda
1 The Data Management Challenge
2 Apache HCatalog to Rescue
3 Data Registration and Discovery
4 Opening Up Adhoc Access to Data
5 Summary and Q&A

Hadoop Grid as the Source of Truth for Data
TV
PC
Phone
Tablet
Pushed Data
Pulled Data
Web Crawl
Social
Email
3rd Party Content
Data
Advertising
Content
User Profiles /
No-SQL
Serving Stores
Serving
Data Highway
Feeds
Hadoop Grid
BI, Reporting, Adhoc Analytics
ILLUSTRATIVE

Growth in HDFS¹
[Chart: Raw HDFS Storage (in PB) and Number of Servers by Year, 2006 to 2014]
34,000 servers
478 PB
1.25 billion files & dir
¹ Across all Hadoop (16 clusters, 32,500 servers, 455 PB) and HBase (7 clusters, 1,500 servers, 23 PB) clusters, May 23, 2014
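The headline figures are the Hadoop and HBase numbers from the footnote added together. A minimal check of that arithmetic, using only the values quoted on the slide:

```python
# Figures copied from the slide footnote (May 23, 2014).
hadoop = {"clusters": 16, "servers": 32_500, "raw_pb": 455}
hbase = {"clusters": 7, "servers": 1_500, "raw_pb": 23}

# Sum each field across the two cluster families.
total = {key: hadoop[key] + hbase[key] for key in hadoop}

print(total)
```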

Processing and Analyzing Data with Hadoop…Then
HDFS
MapReduce (YARN)
Pig Hive
Java MR
APIs
InputFormat/ OutputFormat
Load / Store SerDe
MetaStore
Client
Hive
MetaStore
Hadoop
Streaming
Oozie

Processing and Analyzing Data with HBase…Then
HDFS
HBase
Pig Hive Java MR APIs
TableInputFormat/
TableOutputFormat
HBaseStorage MetaStore
Client
Hive
MetaStore
HBaseStorage
Handler
Oozie

Hadoop Jobs on the Platform Today
Job Distribution (May 1 – May 26, 2014)
All Jobs: 100% (21.5 M)
Job types on the chart: Pig, Oozie Launcher, Java MR, Hive, GDM, Streaming / distcp / Spark
Shares on the chart: 1%, 4%, 9%, 10%, 31%, 45%

Challenges in Managing Data on Multi-tenant Platforms
Data Producers
Platform Services
Data Consumers
§  Data shared across tools such as MR,
Pig, and Hive
§  Schema and semantics knowledge
across the company
§  Support for schema evolution and
downstream change communication
§  Fine-grained access controls (row /
column) vs. all or nothing
§  Clear ownership of data
§  Data lineage and integrity
§  Audits and compliance (e.g. SOX)
§  Retention, duplication, and waste
Data Economy Challenges
Apache
HCatalog
&
Data Discovery

Apache HCatalog in the Technology Stack at Yahoo
Compute
Services
Storage
Infrastructure
Services
Hive Pig Oozie HDFS Proxy GDM
YARN MapReduce
HDFS HBase
Zookeeper
Support
Shop
Monitoring Starling
Messaging
Service
HCatalog
Storm Spark Tez

HCatalog Facilitates Interoperability…Now
HDFS
MapReduce (YARN)
Pig Hive Java MR APIs
InputFormat/ OutputFormat
SerDe & Storage Handler
MetaStore
Client
HCatalog
MetaStore
HCatInputFormat /
HCatOutputFormat
HCatLoader/
HCatStorer
HDFS
HBase
Notifications
Oozie

Data Model
Database
(namespace)
Table
(schema)
Table
(schema)
Partitions Partitions
Buckets
Buckets
Skewed Unskewed
Optional
per table
Partitions, buckets, and skews facilitate faster, more direct access to data
Note on Buckets
§  It is hard to guess the right number of buckets, which can also change over time, and hard to coordinate and align bucket counts for joins
§  Community is working on dynamic bucketing that would have the same benefit without the need for static partitioning
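To make the data model concrete, here is a minimal sketch of how a partition spec maps to a directory and how a row lands in a bucket. It assumes Hive's default `key=value` directory layout for partitions; the table location is hypothetical, and CRC32 is only a stand-in for Hive's own bucketing hash.

```python
import zlib

def partition_path(table_location, partition_spec):
    """Build a Hive-style partition directory: one key=value level per column."""
    parts = [f"{key}={value}" for key, value in partition_spec]
    return "/".join([table_location.rstrip("/")] + parts)

def bucket_for(key, num_buckets):
    """Assign a row to a bucket by hashing its bucketing column.

    CRC32 is a stand-in here; Hive uses its own hash function.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets

# Hypothetical table location, chosen for illustration only.
path = partition_path("/warehouse/xyz/search",
                      [("locale", "US"), ("datestamp", "20130201")])
print(path)
print(bucket_for("some-bcookie", 32))
```

Because the bucket is a function of the key and the bucket count, changing the count reshuffles rows, which is one reason a static bucket count is hard to get right.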

Sample Table Registration
Select project database
  USE xyz;
Create table
  CREATE EXTERNAL TABLE search (
    bcookie     string  COMMENT 'Standard browser cookie',
    time_stamp  int     COMMENT 'DD-MON-YYYY HH:MI:SS (AM/PM)',
    uid         string  COMMENT 'User id',
    ip          string  COMMENT '...',
    pg_spaceid  string  COMMENT '...',
    ...)
  PARTITIONED BY (
    locale      string  COMMENT 'Country of origin',
    datestamp   string  COMMENT 'Date in YYYYMMDD format')
  STORED AS ORC
  LOCATION '/projects/search/...';
Add partitions manually (if you choose to)
  ALTER TABLE search ADD PARTITION (locale='US', datestamp='20130201')
  LOCATION '/projects/search/...';
All your company’s data (metadata) can be registered with HCatalog irrespective of the tool used.

Getting Data into HCatalog – DML and DDL
LOAD Files into tables
Load operations are copy/move operations from HDFS or the local filesystem that move data files into locations corresponding to HCat tables. File format must agree with the table format.
  LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
    [PARTITION (partcol1=val1, partcol2=val2 ...)];
INSERT data from a query into tables
Query results can be inserted into tables or file system directories by using the insert clause.
  INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]
    select_statement1 FROM from_statement;
  INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
    select_statement1 FROM from_statement;
HCat also supports multiple inserts in the same statement or dynamic partition inserts.
ALTER TABLE ADD PARTITIONS
You can use ALTER TABLE ADD PARTITION to add partitions to a table. The location must be a directory inside of which data files reside. If new partitions are added directly to HDFS, HCat will not be aware of them.
  ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1';
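Since directories written straight to HDFS stay invisible until they are registered, a small reconciliation step can list what is missing. The sketch below is illustrative only: both input lists are hypothetical stand-ins for an HDFS directory listing and a metastore partition listing, and the location prefix is made up.

```python
def unregistered_partitions(directories, registered):
    """Return partition directories present on disk but unknown to the metastore."""
    known = set(registered)
    return sorted(d for d in directories if d not in known)

# Hypothetical inputs: in practice the first list would come from an HDFS
# directory listing and the second from the metastore.
on_hdfs = ["datestamp=20140601", "datestamp=20140602", "datestamp=20140603"]
in_hcat = ["datestamp=20140601"]

for part in unregistered_partitions(on_hdfs, in_hcat):
    col, value = part.split("=", 1)
    # The location prefix is an assumption made for this example.
    print(f"ALTER TABLE table_name ADD PARTITION ({col} = '{value}') "
          f"LOCATION '/example/table_name/{part}';")
```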

Getting Data into HCatalog – HCat APIs
Pig
HCatLoader is used with Pig scripts to read data from HCatalog-managed tables, and HCatStorer is used with Pig scripts to write data to HCatalog-managed tables.
  A = load '$DB.$TABLE' using org.apache.hcatalog.pig.HCatLoader();
  B = FILTER A BY $FILTER;
  C = foreach B generate foo, bar;
  store C into '$OUTPUT_DB.$OUTPUT_TABLE' USING org.apache.hcatalog.pig.HCatStorer('$OUTPUT_PARTITION');
MapReduce
The HCatInputFormat is used with MapReduce jobs to read data from HCatalog-managed tables. HCatOutputFormat is used with MapReduce jobs to write data to HCatalog-managed tables.
  Map<String, String> partitionValues = new HashMap<String, String>();
  partitionValues.put("a", "1");
  partitionValues.put("b", "1");
  HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
  HCatOutputFormat.setOutput(job, info);

HCatalog Integration with Data Mgmt. Platform (GDM)
HCatalog
MetaStore
Cluster 1 - Colo 1
HDFS
Cluster 2 – Colo 2
HDFS
Grid Data
Management
Feed Acquisition
Feed
Replication
HCatalog
MetaStore
Feed datasets as partitioned external tables
Growl extracts schema for backfill
HCatClient.addPartitions(…)
Mark LOAD_DONE
HCatClient.addPartitions(…)
Mark LOAD_DONE
Partitions are dropped with HCatClient.dropPartitions(…) after retention expiration, with a drop_partition notification
add_partition event notification
add_partition event notification
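The retention step in the diagram comes down to a date comparison per partition. A minimal sketch, assuming partitions carry a `YYYYMMDD` datestamp as in the sample table; the retention period and the function name are hypothetical, and the actual drop would go through `HCatClient.dropPartitions(…)`.

```python
from datetime import date, datetime, timedelta

def expired_partitions(datestamps, today, retention_days):
    """Return YYYYMMDD datestamps that fall before the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return [d for d in datestamps
            if datetime.strptime(d, "%Y%m%d").date() < cutoff]

parts = ["20140401", "20140520", "20140602"]
expired = expired_partitions(parts, date(2014, 6, 3), retention_days=30)
print(expired)
```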

HCatalog Notification
Namespace: e.g. "hcat.thebestcluster"
JMS Topic: e.g. "<dbname>.<tablename>"
Sample JMS Notification
{
  "timestamp" : 1360272556,
  "eventType" : "ADD_PARTITION",
  "server" : "thebestcluster-hcat.dc1.grid.yahoo.com",
  "servicePrincipal" : "hcat/thebestcluster-hcat.dc1.grid.yahoo.com@GRID.YAHOO.COM",
  "db" : "xyz",
  "table" : "search",
  "partitions": [
    { "locale" : "US", "datestamp" : "20140602" },
    { "locale" : "UK", "datestamp" : "20140602" },
    { "locale" : "IN", "datestamp" : "20140602" }
  ]
}

§  HCatalog uses JMS (ActiveMQ) notifications that can be sent for add_database,
add_table, add_partition, drop_partition, drop_table, and drop_database
§  Notifications can be extended for schema change notifications (proposed)
HCat
Client
HCat
MetaStore
ActiveMQ
Server
Register Channel Publish to listener channels
Subscribers
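A subscriber mostly needs to parse the message body and react to the event type. The sketch below handles a payload shaped like the sample notification; it uses only the standard `json` module and leaves out the JMS connection itself, so the function name and the trimmed payload are illustrative.

```python
import json

SAMPLE = '''
{
  "timestamp": 1360272556,
  "eventType": "ADD_PARTITION",
  "db": "xyz",
  "table": "search",
  "partitions": [
    {"locale": "US", "datestamp": "20140602"},
    {"locale": "UK", "datestamp": "20140602"}
  ]
}
'''

def new_partitions(message_text):
    """Return ("<db>.<table>", partition specs) for ADD_PARTITION events, else None."""
    msg = json.loads(message_text)
    if msg["eventType"] != "ADD_PARTITION":
        return None
    return f'{msg["db"]}.{msg["table"]}', msg["partitions"]

topic, parts = new_partitions(SAMPLE)
print(topic, [p["locale"] for p in parts])
```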

Oozie, HCatalog, and Messaging Integration
Oozie
Message Bus
HCatalog
Data Producer
HDFS
Produce data (distcp, pig, M/R..)
/data/click/2014/06/02
Update metadata
(ALTER TABLE click ADD PARTITION(data='2014/06/02') location 'hdfs://data/click/2014/06/02')
1. Query/Poll Partition
2. Register Topic
3. Push notification <New Partition>
4. Notify New Partition
Start workflow
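The four numbered steps amount to "check once, then wait for a push". The following in-memory sketch mimics that flow; the class, the workflow names, and the partition keys are hypothetical, and no real Oozie, HCatalog, or JMS API is involved.

```python
from collections import defaultdict

class PartitionWatcher:
    """In-memory stand-in for the poll-then-subscribe flow in the diagram."""

    def __init__(self, known_partitions):
        self.known = set(known_partitions)   # what the metastore already has
        self.waiting = defaultdict(list)     # partition -> workflows to start
        self.started = []

    def depend_on(self, workflow, partition):
        # Step 1: query once; start immediately if the partition exists.
        if partition in self.known:
            self.started.append(workflow)
        else:
            # Step 2: otherwise register interest and wait for a push.
            self.waiting[partition].append(workflow)

    def on_add_partition(self, partition):
        # Steps 3 and 4: a notification arrives and releases waiting workflows.
        self.known.add(partition)
        self.started.extend(self.waiting.pop(partition, []))

watcher = PartitionWatcher(["click/2014/06/01"])
watcher.depend_on("wf-a", "click/2014/06/01")
watcher.depend_on("wf-b", "click/2014/06/02")
watcher.on_add_partition("click/2014/06/02")
print(watcher.started)
```

The push model avoids repeated polling of the metastore while a workflow waits for its input.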

Data Discovery with HCatalog
§  HCatalog instances become a unifying metastore for all data at
Yahoo
§  Discovery is about
o  Browsing / inspecting metadata
o  Searching for datasets
§  It helps to solve
o  Schema knowledge across the company
o  Schema evolution
o  Lineage
o  Ownership
o  Data type – dev or prod

Data Discovery Physical View
Discovery UI: Global View of All Data in HCatalog
Data Center 1: DC1-C1, DC1-C2, ..., DCn-Cn, each behind HCat REST (Templeton)
Data Center 2: DC2-C1, DC2-C2, ..., DCm-Cm, each behind HCat REST (Templeton)
ILLUSTRATIVE

Data Discovery Features
§  Browsing
o  Tables / Databases
o  Schema, format, properties
o  Partitions and metadata about each partition
§  Searches for tables
o  Table name (regex) or Comments
o  Column name or comments
o  Ownership, File format
o  Location
o  Properties (Dev/Prod)

Discovery UI
GLOBAL HCATALOG DASHBOARD
Search Tables [Search] (search the HCat tables)
The Best Cluster (browse the DBs by cluster)
Available Databases (1 2 Next)
  audience_db
  tumblr_db
  user_db
  adv_warehouse
  flickr_db
Available Tables (audience_db) (search results or browse db results) (1 2 Next)
  page_clicks    Hourly clickstream table
  ad_clicks      Hourly ad clicks table
  user_info      User registration info
  session_info   Session feed info
  audience_info  Primary audience table
ILLUSTRATIVE

Table Display UI
ILLUSTRATIVE
GLOBAL HCATALOG DASHBOARD
HCat Instance: The Best Cluster
Database: audience_db
Table: page_clicks
Owner: Awesome Yahoo
Schema
  Column     Type    Description
  bcookie    string  Standard browser cookie
  timestamp  string  DD-MON-YYYY HH:MI:SS (AM/PM)
  uid        string  User id
  ...
…more table information and properties (e.g. data format etc.)
Partitions
…list of partitions

Data Discovery Design Approach
§  A single web interface connects to all HCatalog instances (same and cross-colo)
§  Select an appropriate HCat instance and browse all metadata
o  Each HCatalog instance runs a webserver (Templeton / WebHCat) to read metadata
o  All reads are audited
o  ACLs apply
§  Search functionality will be added to Templeton and HCatalog
o  New Thrift interface to support search
o  All searches are audited
o  ACLs apply
§  Long term design
o  Read and Write HCatalog instances
Data Discovery Going Forward
§  Lineage
o  Source datasets
o  Derived datasets
§  Data Quality
o  Statistics support heuristic checks without running a job
Table 1 /
Partition 1
HBase
ORC Table
Partition 1
Dimension
Table
Statistics/
Agg. Table
Daily Stats
Table
Copied by
distcp / external
registrar
Hourly
ILLUSTRATIVE
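Lineage of this kind is a directed graph from source datasets to derived ones, so "what depends on this table" is a graph walk. A minimal sketch, with dataset names loosely based on the illustrative diagram; the edges themselves are assumptions made for the example.

```python
def derived_from(lineage, dataset):
    """Return every dataset downstream of `dataset`, breadth-first."""
    seen, queue = [], [dataset]
    while queue:
        current = queue.pop(0)
        for child in lineage.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen

# Edges point from a source dataset to the datasets derived from it.
lineage = {
    "hourly_table": ["orc_table"],
    "orc_table": ["daily_stats"],
    "dimension_table": ["daily_stats"],
}
downstream = derived_from(lineage, "hourly_table")
print(downstream)
```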

Data Discovery Going Forward (cont’d)
ILLUSTRATIVE
Schema
  Column     Type    Description
  bcookie    string  Standard browser cookie
  timestamp  string  DD-MON-YYYY HH:MI:SS (AM/PM)
  uid        string  User id
File Format
ORC
Table Properties
Compression
Type
zlib
External
§  User ‘awesome_yahoo’ added ‘foo string’ to the table on May 29, 2014 at ‘1:10 AM’
§  User ‘me_too’ added table properties ‘orc.compress=ZLIB’ on May 30, 2014 at ‘9:00 AM’
§  User ‘me_too’ changed the file format from ‘RCFile’ to ‘ORC’ on Jun 1, 2014 at ‘10:30 AM’
...
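A change history like the one mocked up here could be derived by comparing metadata snapshots before and after each alteration. The sketch below is one possible shape for that comparison; the snapshot layout and function name are hypothetical and do not reflect an existing HCatalog feature.

```python
def describe_changes(before, after, user, when):
    """Compare two table-metadata snapshots and return change-log lines."""
    changes = []
    for col, col_type in after["columns"].items():
        if col not in before["columns"]:
            changes.append(
                f"User '{user}' added '{col} {col_type}' to the table on {when}")
    for key, value in after["properties"].items():
        if before["properties"].get(key) != value:
            changes.append(
                f"User '{user}' added table properties '{key}={value}' on {when}")
    if before["format"] != after["format"]:
        changes.append(
            f"User '{user}' changed the file format from "
            f"'{before['format']}' to '{after['format']}' on {when}")
    return changes

old = {"columns": {"bcookie": "string"}, "properties": {}, "format": "RCFile"}
new = {"columns": {"bcookie": "string", "foo": "string"},
       "properties": {"orc.compress": "ZLIB"}, "format": "ORC"}
log = describe_changes(old, new, "me_too", "Jun 1, 2014")
for line in log:
    print(line)
```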

HCatalog is Part of a Broader Solution Set
Product: Role in the Grid Stack
Hive
§  Data warehousing software that facilitates querying and managing large datasets in HDFS
§  Provides a mechanism to project structure onto HDFS data and query the data using a SQL-like language called HiveQL
HiveServer2
§  Server process (Thrift-based RPC interface) to support concurrent clients connecting over ODBC/JDBC
§  Provides authentication and enforces authorization for ODBC/JDBC clients for metadata access
HCatalog
§  Table and storage management layer that enables users with different tools (Pig, M/R, and Hive) to more easily share data
§  Presents a relational view of data in HDFS, abstracts where or in what format data is stored, and enables notifications of data availability
Starling
§  Hadoop log warehouse for analytics on grid usage (job history, tasks, job counters etc.)
§  1 TB of raw logs processed / day, 24 TB of processed data

Deployment Layout
Tez and MapReduce
on YARN
+
HDFS
Oracle
DBMS
LoadBalancer
HCatalog
Thrift
HS2
ODBC/JDBC
Launcher Gateway
LoadBalancer
Data Out Client
Client/ CLI
HiveQL
M/R Jobs
Pig M/R
Cloud
Messaging
ActiveMQ
notifications
HiveServer2
Hadoop
Hive
HCatalog

Hive for Both Batch and Interactive Adhoc Analytics
Improved Latency and Throughput
Tez
§  Computation expressed as a dataflow graph with reusable primitives
§  No intermediate outputs to HDFS
§  Built on top of YARN
§  Hive generates Tez plans for lower latency
Query Engine Improvements
§  Cost-based optimizations
§  In-memory joins
§  Caching hot tables
§  Vectorized processing
Better Columnar Store
§  ORCFile with predicate pushdown
§  Built for both speed and storage efficiency
Tez Service
§  Always-on pool of AMs / container re-use
Extended Analytical Ability
Analytics Functions
§  SQL 2003 Compliant
§  OVER with PARTITION BY and ORDER BY
§  Wide variety of windowing functions:
o  RANK
o  LEAD/LAG
o  ROW_NUMBER
o  FIRST_VALUE
o  LAST_VALUE
o  Many more
§  Aligns well with BI ecosystem
Improving SQL Coverage
§  Non-correlated sub-queries using IN in WHERE
§  Expanded SQL types including DATETIME, VARCHAR, etc.
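For readers less familiar with windowing, the semantics of `ROW_NUMBER`, `RANK`, and `LAG` under `OVER (PARTITION BY ... ORDER BY ...)` can be imitated in a few lines. This is a plain-Python illustration of the semantics only, not how Hive evaluates them; the column names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def windowed(rows, partition_by, order_by):
    """Add ROW_NUMBER, RANK and LAG(order_by) per partition, like OVER (...)."""
    out = []
    rows = sorted(rows, key=itemgetter(partition_by, order_by))
    for _, group in groupby(rows, key=itemgetter(partition_by)):
        prev, rank = None, 0
        for row_number, row in enumerate(group, start=1):
            value = row[order_by]
            if value != prev:      # ties share a rank; RANK leaves gaps after ties
                rank = row_number
            out.append({**row, "row_number": row_number, "rank": rank, "lag": prev})
            prev = value
    return out

rows = [
    {"locale": "US", "clicks": 10},
    {"locale": "US", "clicks": 30},
    {"locale": "UK", "clicks": 5},
    {"locale": "US", "clicks": 10},
]
result = windowed(rows, "locale", "clicks")
for r in result:
    print(r)
```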

HiveServer2 as ODBC / JDBC Endpoint
§  Gateway that Hive clients
can talk to
§  Supports concurrent clients
§  User/ global session/
configuration information
§  Support for secure clusters
and encryption
§  DoAs support allows Hive
queries to run as the
requester

Data to Desktop (D2D) – BI and Reporting on ODBC
HiveServer2
Hive
Hadoop
Desktop Web
Intelligence Server
Metadata Database
Grid ODBC driver

DataOut – Data to Any Off-Grid Destination on JDBC
HiveServer2
M
S, HiveSplit, FS/DB (three instances)
Execute Query
Prepare Splits
Fetch Splits
Legend: M – Master, S – Slave, FS/DB – Filesystem/Database
§  DataOut is an efficient
method of moving data off
the grid
§  Advantages:
o  API based on well-known
JDBC interface
o  Works with HCatalog / Hive
o  Agnostic to the underlying
storage format
o  Parts of the whole data can
be pulled in parallel
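The "parts of the whole data can be pulled in parallel" idea can be sketched with a thread pool: one side divides the result into splits, and workers fetch them concurrently. Everything here is a stand-in; the function names are hypothetical and no DataOut or JDBC API is used.

```python
from concurrent.futures import ThreadPoolExecutor

def prepare_splits(rows, num_splits):
    """Master side: divide the result set into roughly equal splits."""
    return [rows[i::num_splits] for i in range(num_splits)]

def fetch_split(split):
    """Worker side: pull one split; a real worker would write to a file or database."""
    return list(split)

rows = list(range(10))                      # stand-in for query results
splits = prepare_splits(rows, num_splits=3)
with ThreadPoolExecutor(max_workers=3) as pool:
    fetched = list(pool.map(fetch_split, splits))

# Every row arrives exactly once, regardless of which worker fetched it.
print(sorted(r for part in fetched for r in part) == rows)
```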

SQL-based Authorization for Controlled Access
§  SQL-compliant authorization model (Users, Roles, Privileges, Objects)
§  Fine-grained authorization and access control patterns (row and column in conjunction with views)
§  Can be used in conjunction with storage-based authorization
Privileges
§  Objects consist of databases, tables, and views
§  Privileges are GRANTed on objects
o  SELECT: read access to an object
o  INSERT: write (insert) access to an object
o  UPDATE: write (update) access to an object
o  DELETE: delete access for an object
o  ALL PRIVILEGES: all privileges
Access Control
§  Roles can be associated with objects
§  Privileges are associated with roles
§  CREATE, DROP, and SET ROLE statements manipulate roles and membership
§  SUPERUSER role for databases can grant access control to users or roles (not limited to HDFS permissions)
§  PUBLIC role includes all users
§  Prevents undesirable operations on objects by unauthorized users
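The users / roles / privileges / objects model can be illustrated with a toy checker. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, and it ignores role hierarchies, grant options, and the SUPERUSER role.

```python
class Authorizer:
    """Toy model of users, roles, privileges and objects."""

    def __init__(self):
        self.role_members = {"PUBLIC": set()}   # role -> explicit members
        self.grants = set()                     # (role, privilege, object)

    def create_role(self, role):
        self.role_members.setdefault(role, set())

    def grant_role(self, role, user):
        self.role_members[role].add(user)

    def grant(self, privilege, obj, role):
        self.grants.add((role, privilege, obj))

    def allowed(self, user, privilege, obj):
        # Every user is implicitly a member of PUBLIC.
        roles = {r for r, users in self.role_members.items() if user in users}
        roles.add("PUBLIC")
        return any((r, p, obj) in self.grants
                   for r in roles for p in (privilege, "ALL PRIVILEGES"))

auth = Authorizer()
auth.create_role("analyst")
auth.grant_role("analyst", "alice")
auth.grant("SELECT", "xyz.search", "analyst")
print(auth.allowed("alice", "SELECT", "xyz.search"),
      auth.allowed("bob", "SELECT", "xyz.search"))
```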

Starling (Log Warehouse) for Historical Analysis and Trends
Cluster 1 Cluster 2 Cluster 3 Cluster N
Oozie
HCatalog HDFS
Hive
Starling
Dashboard
Discovery
Portal
Query
Server
Source
Clusters
Warehouse
Clusters

SQL on Hadoop the Fastest Growing Product on Grid
[Chart: All Grid Jobs (in Millions) and Hive Jobs (% of All Jobs), by month, Mar-13 to May-14]
2.5 million queries

In Summary
✔ Data shared across tools such as MR, Pig, and Hive: Apache HCatalog
✔ Schema and semantics knowledge across the company: Data Discovery
✔ Support for schema evolution and downstream change communication: Apache HCatalog
✔ Fine-grained access controls (row / column) vs. all or nothing: SQL-based Authorization
✔ Clear ownership of data: Data Discovery
✔ Data lineage and integrity: Data Discovery / Starling
✔ Audits and compliance (e.g. SOX): Data Discovery / Starling
✔ Retention, duplication, and waste: Data Discovery / Starling

Acknowledge
1 Apache Hive (and HiveServer2, HCatalog) Community
http://hive.apache.org/people.html
2 HCatalog and Hive Development Team at Yahoo
Olga Natkovich Annie Lin Fangyue Wang
Chris Drome Jin Sun Selina Zhang
Mithun Radhakrishnan Viraj Bhat
3 Oozie Development Team
Rohini Palaniswamy Ryota Egashira Purshotam Shah
Mona Chitnis Michelle Chiang
4 Grid Data Management (GDM) Team
Mark Holderbaugh Aaron Gresch Lawrence Prem Kumar
Scott Preece Yan Braun
5 Service Engineering and Data Operations
Rob Realini David Kuder Chuck Sheldon
Rajiv Chittajallu Vineeth Vadrevu Andy Rhee
6 Product Management
Sid Shaik Amrit Lal Kimsukh Kundu

Thank You
@thiruvel
@sumeetksingh
We are hiring!
Stop by Kiosk P9
or reach out to us at
bigdata@yahoo-inc.com.
