Data Discovery on Hadoop -
Realizing the Full Potential of Your Data
Presented by Thiruvel Thirumoolan, Sumeet Singh | June 3, 2014
2014 Hadoop Summit, San Jose, California
Introduction
Sumeet Singh
Senior Director, Product Management
Hadoop and Big Data Platforms
Cloud Engineering Group
§  Manages Hadoop products team at Yahoo!
§  Responsible for Product Management, Strategy and Customer Engagements
§  Managed Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo
§  M.B.A. from UCLA and M.S. from Rensselaer (RPI)
701 First Avenue, Sunnyvale, CA 94089 USA
@sumeetksingh
Thiruvel Thirumoolan
Principal Engineer
Hadoop and Big Data Platforms
Cloud Engineering Group
§  Developer in the Hive-HCatalog team, and active contributor to Apache Hive
§  Responsible for Hive, HiveServer2 and HCatalog across all Hadoop clusters and ensuring they work at scale for the usage patterns of Yahoo
§  Loves mining the trove of Hadoop logs for usage patterns and insights
§  Bachelor's degree from Anna University
701 First Avenue, Sunnyvale, CA 94089 USA
@thiruvel
Agenda
1. The Data Management Challenge
2. Apache HCatalog to the Rescue
3. Data Registration and Discovery
4. Opening Up Ad Hoc Access to Data
5. Summary and Q&A
Hadoop Grid as the Source of Truth for Data
[Diagram (illustrative). Data sources: pushed data and pulled data (TV, PC, phone, tablet, web crawl, social, email, 3rd-party content). Ingest: Data Highway, feeds. Center: Hadoop Grid. Consumers: serving stores (advertising, content, user profiles / NoSQL); BI, reporting, ad hoc analytics.]
Growth in HDFS (1)
34,000 servers, 478 PB of raw HDFS storage, 1.25 billion files and directories
[Chart: number of servers and raw HDFS storage (in PB) by year, 2006 to 2014]
(1) Across all Hadoop (16 clusters, 32,500 servers, 455 PB) and HBase (7 clusters, 1,500 servers, 23 PB) clusters, May 23, 2014
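The headline figures on this slide are the sum of the Hadoop and HBase numbers in the footnote. A minimal check of that arithmetic, using only the footnote's numbers (the variable names are illustrative):

```python
# Figures copied from the slide footnote (as of May 23, 2014).
hadoop = {"clusters": 16, "servers": 32_500, "raw_pb": 455}
hbase = {"clusters": 7, "servers": 1_500, "raw_pb": 23}

# The headline numbers are the Hadoop and HBase figures added together.
total_clusters = hadoop["clusters"] + hbase["clusters"]
total_servers = hadoop["servers"] + hbase["servers"]
total_pb = hadoop["raw_pb"] + hbase["raw_pb"]

print(f"{total_clusters} clusters, {total_servers:,} servers, {total_pb} PB")
```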
Processing and Analyzing Data with Hadoop…Then
[Stack diagram: Pig (Load / Store), Hive (SerDe, MetaStore Client, Hive MetaStore), Java MR APIs (InputFormat / OutputFormat), and Hadoop Streaming on MapReduce (YARN) over HDFS, with Oozie alongside]
Processing and Analyzing Data with HBase…Then
[Stack diagram: Pig (HBaseStorage), Hive (HBaseStorage Handler, MetaStore Client, Hive MetaStore), and Java MR APIs (TableInputFormat / TableOutputFormat) on HBase over HDFS, with Oozie alongside]
Hadoop Jobs on the Platform Today
Job Distribution (May 1 – May 26, 2014)
All Jobs: 100% (21.5 M)
Pig: 45%
Oozie Launcher: 31%
Java MR: 10%
Hive: 9%
GDM: 4%
Streaming, distcp, Spark: 1%
Challenges in Managing Data on Multi-tenant Platforms
Data Producers
Platform Services
Data Consumers
§  Data shared across tools such as MR,
Pig, and Hive
§  Schema and semantics knowledge
across the company
§  Support for schema evolution and
downstream change communication
§  Fine-grained access controls (row /
column) vs. all or nothing
§  Clear ownership of data
§  Data lineage and integrity
§  Audits and compliance (e.g. SOX)
§  Retention, duplication, and waste
Data Economy Challenges
Apache HCatalog & Data Discovery
Apache HCatalog in the Technology Stack at Yahoo
[Stack diagram with layers Services, Compute, Storage, and Infrastructure Services. Components shown: Hive, Pig, Oozie, HDFS Proxy, GDM, HCatalog; YARN, MapReduce, Storm, Spark, Tez; HDFS, HBase; Zookeeper, Support Shop, Monitoring, Starling, Messaging Service]
HCatalog Facilitates Interoperability…Now
[Stack diagram: Pig (HCatLoader / HCatStorer), Hive (SerDe & Storage Handler, MetaStore Client), and Java MR APIs (HCatInputFormat / HCatOutputFormat, InputFormat / OutputFormat) share the HCatalog MetaStore on MapReduce (YARN) over HDFS and HBase; notifications go to Oozie]
Data Model
[Diagram: a database (namespace) contains tables (schema); each table has partitions, which contain buckets, skewed or unskewed; marked "optional per table"]
Partitions, buckets, and skews facilitate faster, more direct access to data
Note on Buckets
§  It is hard to guess the right number of buckets, which can also change over time, and hard to coordinate and align bucket counts for joins
§  Community is working on dynamic bucketing that would have the same benefit without the need for static partitioning
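As a rough illustration of why partitions and buckets give more direct access to data, the sketch below maps partition values to a directory and a bucketing value to a bucket number. It is a sketch under stated assumptions: the `key=value` directory naming is the common Hive layout (external tables can point anywhere), the `/warehouse/search` root is hypothetical, and `zlib.crc32` is only a stand-in for Hive's own hash function.

```python
import zlib

def partition_path(table_root, partition_spec):
    """Build a key=value directory path for an ordered partition spec."""
    parts = [f"{key}={value}" for key, value in partition_spec]
    return "/".join([table_root.rstrip("/")] + parts)

def bucket_for(value, num_buckets):
    """Pick a bucket by hashing the bucketing column (crc32 is a stand-in hash)."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets

# Hypothetical table root; partition columns follow the sample table in this deck.
path = partition_path("/warehouse/search", [("locale", "US"), ("datestamp", "20130201")])
bucket = bucket_for("some-bcookie-value", 32)
print(path, bucket)
```

A query that filters on locale and datestamp only has to read that one directory, and a lookup on the bucketing column only one file within it. Changing the bucket count changes every value's bucket, which is the coordination problem the note on buckets describes.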
Sample Table Registration
Select project database
  USE xyz;
Create table
  CREATE EXTERNAL TABLE search (
    bcookie     string  COMMENT 'Standard browser cookie',
    time_stamp  int     COMMENT 'DD-MON-YYYY HH:MI:SS (AM/PM)',
    uid         string  COMMENT 'User id',
    ip          string  COMMENT '...',
    pg_spaceid  string  COMMENT '...',
    ...)
  PARTITIONED BY (
    locale      string  COMMENT 'Country of origin',
    datestamp   string  COMMENT 'Date in YYYYMMDD format')
  STORED AS ORC
  LOCATION '/projects/search/...';
Add partitions manually (if you choose to)
  ALTER TABLE search ADD PARTITION (locale='US', datestamp='20130201')
  LOCATION '/projects/search/...';
All your company’s data (metadata) can be registered with HCatalog irrespective of the
tool used.
Getting Data into HCatalog – DML and DDL
LOAD Files into tables
Load operations are copy/move operations from HDFS or local filesystem that move datafiles into locations
corresponding to HCat tables. File format must agree with the table format.
  LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)];
INSERT data from a query into tables
Query results can be inserted into tables or filesystem directories by using the insert clause.
  INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]
  select_statement1 FROM from_statement;

  INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
  select_statement1 FROM from_statement;
HCat also supports multiple inserts in the same statement or dynamic partition inserts.
ALTER TABLE ADD PARTITIONS
You can use ALTER TABLE ADD PARTITION to add partitions to a table. The location must be a directory
inside of which data files reside. If new partitions are directly added to HDFS, HCat will not be aware of
these.
  ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1';
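Because HCat is not aware of partitions written directly to HDFS, a registration step has to issue the ALTER TABLE statements. The sketch below, a hypothetical helper rather than part of HCatalog, compares the partitions found on HDFS with those already registered and renders the missing statements. The inventories and the `/data/search` location are made up for illustration, and values are not escaped.

```python
def add_partition_ddl(table, partition_spec, location):
    """Render an ALTER TABLE ... ADD PARTITION statement for one partition."""
    spec = ", ".join(f"{col}='{val}'" for col, val in partition_spec)
    return f"ALTER TABLE {table} ADD PARTITION ({spec}) LOCATION '{location}';"

def missing_partitions(on_hdfs, registered):
    """Return partition specs that exist as directories but are not registered."""
    known = set(registered)
    return [spec for spec in on_hdfs if spec not in known]

# Hypothetical inventories: each spec is a tuple of (column, value) pairs.
on_hdfs = [
    (("locale", "US"), ("datestamp", "20130201")),
    (("locale", "UK"), ("datestamp", "20130201")),
]
registered = [(("locale", "US"), ("datestamp", "20130201"))]

statements = [
    add_partition_ddl("search", spec, "/data/search/" + "/".join(v for _, v in spec))
    for spec in missing_partitions(on_hdfs, registered)
]
print("\n".join(statements))
```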
  
Getting Data into HCatalog – HCat APIs
Pig
HCatLoader is used with Pig scripts to read data from HCatalog-managed tables, and HCatStorer is used
with Pig scripts to write data to HCatalog-managed tables.
	
  	
  A = load '$DB.$TABLE' using org.apache.hcatalog.pig.HCatLoader();
  B = FILTER A BY $FILTER;
  C = foreach B generate foo, bar;
  store C into '$OUTPUT_DB.$OUTPUT_TABLE' USING org.apache.hcatalog.pig.HCatStorer('$OUTPUT_PARTITION');
MapReduce
The HCatInputFormat is used with MapReduce jobs to read data from HCatalog-managed tables.
HCatOutputFormat is used with MapReduce jobs to write data to HCatalog-managed tables.
  Map<String, String> partitionValues = new HashMap<String, String>();
  partitionValues.put("a", "1");
  partitionValues.put("b", "1");
  HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
  HCatOutputFormat.setOutput(job, info);
HCatalog Integration with Data Mgmt. Platform (GDM)
[Diagram: Grid Data Management (GDM) performs feed acquisition into Cluster 1 (Colo 1) and feed replication to Cluster 2 (Colo 2); each cluster has HDFS and an HCatalog MetaStore]
§  Feed datasets are registered as partitioned external tables
§  Growl extracts schema for backfill
§  On each cluster: HCatClient.addPartitions(…), mark LOAD_DONE, then an add_partition event notification
§  Partitions are dropped with HCatClient.dropPartitions(…) after retention expiration, with a drop_partition notification
HCatalog Notification
Namespace: E.g. “hcat.thebestcluster”
JMS Topic: E.g. “<dbname>.<tablename>”
Sample JMS Notification
{
  "timestamp" : 1360272556,
  "eventType" : "ADD_PARTITION",
  "server" : "thebestcluster-hcat.dc1.grid.yahoo.com",
  "servicePrincipal" : "hcat/thebestcluster-hcat.dc1.grid.yahoo.com@GRID.YAHOO.COM",
  "db" : "xyz",
  "table" : "search",
  "partitions": [
    { "locale" : "US", "datestamp" : "20140602" },
    { "locale" : "UK", "datestamp" : "20140602" },
    { "locale" : "IN", "datestamp" : "20140602" }
  ]
}
§  HCatalog uses JMS (ActiveMQ) notifications that can be sent for add_database,
add_table, add_partition, drop_partition, drop_table, and drop_database
§  Notifications can be extended for schema change notifications (proposed)
[Diagram: HCat Client -> HCat MetaStore -> ActiveMQ Server (register channel, publish to listener channels) -> Subscribers]
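A subscriber mostly needs to decode the message body and pick out the new partitions. The sketch below parses the sample notification from this slide with the standard `json` module. The `handle_notification` function is hypothetical, and a real subscriber would receive the payload from a JMS client rather than from a string constant.

```python
import json

SAMPLE = """
{
  "timestamp": 1360272556,
  "eventType": "ADD_PARTITION",
  "server": "thebestcluster-hcat.dc1.grid.yahoo.com",
  "servicePrincipal": "hcat/thebestcluster-hcat.dc1.grid.yahoo.com@GRID.YAHOO.COM",
  "db": "xyz",
  "table": "search",
  "partitions": [
    {"locale": "US", "datestamp": "20140602"},
    {"locale": "UK", "datestamp": "20140602"},
    {"locale": "IN", "datestamp": "20140602"}
  ]
}
"""

def handle_notification(payload):
    """Return (topic, partition specs) for ADD_PARTITION events, else None."""
    event = json.loads(payload)
    if event.get("eventType") != "ADD_PARTITION":
        return None
    # Topic follows the <dbname>.<tablename> pattern shown on the slide.
    topic = f"{event['db']}.{event['table']}"
    specs = [
        "/".join(f"{key}={value}" for key, value in partition.items())
        for partition in event.get("partitions", [])
    ]
    return topic, specs

topic, specs = handle_notification(SAMPLE)
print(topic, specs)
```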
Oozie, HCatalog, and Messaging Integration
[Flow diagram: Oozie, HCatalog, Message Bus, Data Producer, HDFS]
Data Producer: produce data (distcp, pig, M/R..) into HDFS at /data/click/2014/06/02, then update metadata
  (ALTER TABLE click ADD PARTITION (data='2014/06/02') location 'hdfs://data/click/2014/06/02')
1. Query/Poll Partition (Oozie to HCatalog)
2. Register Topic (Oozie to Message Bus)
3. Push notification <New Partition> (HCatalog to Message Bus)
4. Notify New Partition (Message Bus to Oozie); Oozie starts the workflow
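The four numbered steps can be walked through with a small in-memory simulation. Everything here is a stand-in: `MessageBus`, `Catalog`, and `Coordinator` are hypothetical classes that mimic the roles of the message bus, HCatalog, and an Oozie coordinator, not their real APIs.

```python
class MessageBus:
    """Tiny in-memory stand-in for the JMS message bus."""
    def __init__(self):
        self.subscribers = {}

    def register(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers.get(topic, []):
            callback(message)

class Catalog:
    """Stand-in for HCatalog: tracks partitions and publishes add events."""
    def __init__(self, bus):
        self.bus = bus
        self.partitions = set()

    def has_partition(self, table, partition):
        return (table, partition) in self.partitions

    def add_partition(self, table, partition):
        self.partitions.add((table, partition))
        self.bus.publish(table, partition)            # step 3: push notification

class Coordinator:
    """Stand-in for an Oozie coordinator waiting on one partition."""
    def __init__(self, catalog, bus, table, partition):
        self.partition = partition
        self.started = False
        if catalog.has_partition(table, partition):   # step 1: query/poll partition
            self.start_workflow()
        else:
            bus.register(table, self.on_message)      # step 2: register topic

    def on_message(self, partition):                  # step 4: notified of new partition
        if partition == self.partition and not self.started:
            self.start_workflow()

    def start_workflow(self):
        self.started = True

bus = MessageBus()
catalog = Catalog(bus)
job = Coordinator(catalog, bus, "click", "2014/06/02")
started_before = job.started
catalog.add_partition("click", "2014/06/02")          # data producer updates metadata
print(started_before, job.started)
```

Polling first and subscribing second covers the case where the partition was registered before the coordinator started waiting.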
Data Discovery with HCatalog
§  HCatalog instances become a unifying metastore for all data at
Yahoo
§  Discovery is about
o  Browsing / inspecting metadata
o  Searching for datasets
§  It helps to solve
o  Schema knowledge across the company
o  Schema evolution
o  Lineage
o  Ownerships
o  Data type – dev or prod
Data Discovery Physical View
[Diagram (illustrative): a Discovery UI gives a global view of all data in HCatalog by calling the HCat REST (Templeton) endpoint of each cluster: DC1-C1, DC1-C2, ... DCn-Cn in Data Center 1 and DC2-C1, DC2-C2, ... DCm-Cm in Data Center 2]
Data Discovery Features
§  Browsing
o  Tables / Databases
o  Schema, format, properties
o  Partitions and metadata about each partition
§  Searches for tables
o  Table name (regex) or Comments
o  Column name or comments
o  Ownership, File format
o  Location
o  Properties (Dev/Prod)
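The search criteria listed above amount to filters over table metadata. A minimal sketch, assuming metadata has already been fetched into plain dictionaries: the records, the `ad_id` and `signup_date` columns, the formats, and the `env` property are all hypothetical, and the table names echo the illustrative dashboard in this deck.

```python
import re

# Hypothetical metadata records, as a discovery service might cache them.
TABLES = [
    {"db": "audience_db", "name": "page_clicks", "comment": "Hourly clickstream table",
     "columns": ["bcookie", "timestamp", "uid"], "format": "ORC", "env": "prod"},
    {"db": "audience_db", "name": "ad_clicks", "comment": "Hourly ad clicks table",
     "columns": ["bcookie", "ad_id"], "format": "RCFile", "env": "dev"},
    {"db": "audience_db", "name": "user_info", "comment": "User registration info",
     "columns": ["uid", "signup_date"], "format": "ORC", "env": "prod"},
]

def search_tables(tables, name_regex=None, column=None, file_format=None, env=None):
    """Return db.table names matching every filter that was supplied."""
    pattern = re.compile(name_regex) if name_regex else None
    hits = []
    for table in tables:
        if pattern and not pattern.search(table["name"]):
            continue
        if column and column not in table["columns"]:
            continue
        if file_format and table["format"] != file_format:
            continue
        if env and table["env"] != env:
            continue
        hits.append(f"{table['db']}.{table['name']}")
    return hits

by_name = search_tables(TABLES, name_regex=r"_clicks$")
by_column = search_tables(TABLES, column="uid", file_format="ORC")
print(by_name, by_column)
```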
Discovery UI
GLOBAL HCATALOG DASHBOARD (illustrative mock-up)
Search the HCat tables: [Search Tables] [Search]
Browse the DBs by cluster: The Best Cluster
Available Databases (pages: 1 2 Next)
  audience_db
  tumblr_db
  user_db
  adv_warehouse
  flickr_db
Available Tables (audience_db), from search results or browse db results (pages: 1 2 Next)
  page_clicks    Hourly clickstream table
  ad_clicks      Hourly ad clicks table
  user_info      User registration info
  session_info   Session feed info
  audience_info  Primary audience table
Table Display UI
GLOBAL HCATALOG DASHBOARD (illustrative mock-up)
HCat Instance: The Best Cluster
Database: audience_db
Table: page_clicks
Owner: Awesome Yahoo
Schema
  Column     Type    Description
  bcookie    string  Standard browser cookie
  timestamp  string  DD-MON-YYYY HH:MI:SS (AM/PM)
  uid        string  User id
  ...
…more table information and properties (e.g. data format etc.)
Partitions
…list of partitions
Data Discovery Design Approach
§  A single web interface connects to all HCatalog instances (same and
cross-colo)
§  Select an appropriate HCat instance and browse all metadata
o  Each HCatalog instance runs a webserver (Templeton/ WebHCat) to read
metadata
o  All reads audited
o  ACLs apply
§  Search functionality will be added to Templeton and HCatalog
o  New Thrift interface to support search
o  All searches audited
o  ACLs apply
§  Long term design
o  Read and Write HCatalog instances
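Browsing across instances comes down to sending the same read request to each instance's Templeton / WebHCat server. The sketch below only builds the URLs and makes no network calls. The host names and instance keys are hypothetical; the `/templeton/v1/ddl/database/<db>/table/<table>` path, the `user.name` parameter, and port 50111 follow the WebHCat reference defaults, and a secure cluster would authenticate differently.

```python
from urllib.parse import quote, urlencode

# Hypothetical WebHCat endpoints, one per HCatalog instance.
INSTANCES = {
    "dc1-c1": "http://webhcat.dc1-c1.example.com:50111",
    "dc2-c1": "http://webhcat.dc2-c1.example.com:50111",
}

def describe_table_url(base_url, db, table, user):
    """Build the WebHCat URL that describes one table."""
    path = f"/templeton/v1/ddl/database/{quote(db)}/table/{quote(table)}"
    query = urlencode({"user.name": user})
    return f"{base_url}{path}?{query}"

def urls_for_all_instances(db, table, user):
    """Fan the same describe request out to every configured instance."""
    return {name: describe_table_url(base, db, table, user)
            for name, base in INSTANCES.items()}

urls = urls_for_all_instances("audience_db", "page_clicks", "awesome_yahoo")
for name, url in urls.items():
    print(name, url)
```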
Data Discovery Going Forward
§  Lineage
o  Source datasets
o  Derived datasets
§  Data Quality
o  Statistics help in heuristics instead of running a job
[Lineage diagram (illustrative). Nodes: Table 1 / Partition 1; HBase; ORC Table Partition 1; Dimension Table; Statistics / Agg. Table; Daily Stats Table. Labels: copied by distcp / external registrar; hourly]
Data Discovery Going Forward (cont’d)
[Table display mock-up (illustrative)]
Schema
  Column     Type    Description
  bcookie    string  Standard browser cookie
  timestamp  string  DD-MON-YYYY HH:MI:SS (AM/PM)
  uid        string  User id
File Format: ORC
Table Properties
  Compression: zlib
  Type: External
§  User ‘awesome_yahoo’ added ‘foo string’ to the table on May 29, 2014 at ‘1:10 AM’
§  User ‘me_too’ added table properties ‘orc.compress=ZLIB’ on May 30, 2014 at ‘9:00 AM’
§  User ‘me_too’ changed the file format from ‘RCFile’ to ‘ORC’ on Jun 1, 2014 at ‘10:30 AM’
HCatalog is Part of a Broader Solution Set
Product: Role in the Grid Stack
Hive
§  Data warehousing software that facilitates querying and managing large datasets in HDFS
§  Provides a mechanism to project structure onto HDFS data and query the data using a SQL-like language called HiveQL
HiveServer2
§  Server process (Thrift-based RPC interface) to support concurrent clients connecting over ODBC/JDBC
§  Provides authentication and enforces authorization for ODBC/JDBC clients for metadata access
HCatalog
§  Table and storage management layer that enables users with different tools (Pig, M/R, and Hive) to more easily share data
§  Presents a relational view of data in HDFS, abstracts where or in what format data is stored, and enables notifications of data availability
Starling
§  Hadoop log warehouse for analytics on grid usage (job history, tasks, job counters etc.)
§  1 TB of raw logs processed per day, 24 TB of processed data
Deployment Layout
[Deployment diagram. Components: Hive (client / CLI, HiveQL) on a launcher gateway; HiveServer2 (HS2, ODBC/JDBC) behind a load balancer, with a Data Out client; HCatalog (Thrift) behind a load balancer, backed by an Oracle DBMS; Pig and M/R jobs; Cloud Messaging (ActiveMQ notifications); Hadoop (Tez and MapReduce on YARN + HDFS)]
Hive for Both Batch and Interactive Adhoc Analytics
Tez
§  Computation expressed as a dataflow graph
with reusable primitives
§  No intermediate outputs to HDFS
§  Built on top of YARN
§  Hive generates Tez plans for lower latency
Query Engine Improvements
§  Cost-based optimizations
§  In-memory joins
§  Caching hot tables
§  Vectorized processing
Better Columnar Store
§  ORCFile with predicate pushdown
§  Built for both speed and storage efficiency
Tez Service
§  Always-on pool of AMs / container re-use
Improved Latency and Throughput
Analytics Functions
§  SQL 2003 Compliant
§  OVER with PARTITION BY and ORDER BY
§  Wide variety of windowing functions:
o  RANK
o  LEAD/LAG
o  ROW_NUMBER
o  FIRST_VALUE
o  LAST_VALUE
o  Many more
§  Aligns well with BI ecosystem
Improving SQL Coverage
§  Non-correlated sub-queries using IN in
WHERE
§  Expanded SQL types including DATETIME,
VARCHAR, etc.
Extended Analytical Ability
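To make OVER with PARTITION BY and ORDER BY concrete, the sketch below reproduces what ROW_NUMBER and RANK return, in plain Python rather than HiveQL. The `window_rank` function and the sample rows are hypothetical; in Hive the same result would come from a single SELECT with the two window functions.

```python
from itertools import groupby
from operator import itemgetter

def window_rank(rows, partition_by, order_by):
    """Add row_number and rank per partition, ordered ascending by order_by."""
    output = []
    ordered = sorted(rows, key=itemgetter(partition_by, order_by))
    for _, group in groupby(ordered, key=itemgetter(partition_by)):
        rank = 0
        previous = object()  # sentinel that never equals a real value
        for row_number, row in enumerate(group, start=1):
            if row[order_by] != previous:
                rank = row_number          # RANK leaves gaps after ties
                previous = row[order_by]
            output.append({**row, "row_number": row_number, "rank": rank})
    return output

rows = [
    {"locale": "US", "uid": "a", "clicks": 5},
    {"locale": "US", "uid": "b", "clicks": 5},
    {"locale": "US", "uid": "c", "clicks": 9},
    {"locale": "UK", "uid": "d", "clicks": 2},
]
ranked = window_rank(rows, "locale", "clicks")
for row in ranked:
    print(row)
```

The two tied US rows share rank 1 and the next row gets rank 3, while row_number keeps counting 1, 2, 3 within the partition.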
HiveServer2 as ODBC / JDBC Endpoint
§  Gateway that Hive clients
can talk to
§  Supports concurrent clients
§  User/ global session/
configuration information
§  Support for secure clusters
and encryption
§  DoAs support allows Hive
queries to run as the
requester
Data to Desktop (D2D) – BI and Reporting on ODBC
[Diagram: desktop and web BI clients, an Intelligence Server with its metadata database, and the grid ODBC driver connecting to HiveServer2, Hive, and Hadoop]
DataOut – Data to Any Off-Grid Destination on JDBC
[Diagram: a master (M) executes the query on HiveServer2 and prepares HiveSplits; slaves (S) fetch the splits and write to a filesystem or database (FS/DB)]
Legend: M – Master, S – Slave, FS/DB – Filesystem / Database
§  DataOut is an efficient
method of moving data off
the grid
§  Advantages:
o  API based on well-known
JDBC interface
o  Works with HCatalog / Hive
o  Agnostic to the underlying
storage format
o  Parts of the whole data can
be pulled in parallel
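The last advantage, pulling parts of the result in parallel, can be sketched as a master that describes splits and workers that fetch them concurrently. This is an illustration under stated assumptions, not the DataOut API: `prepare_splits`, `fetch_split`, and the in-memory `RESULT_ROWS` are hypothetical, and a real worker would read its split over JDBC.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical result set standing in for the output of an executed query.
RESULT_ROWS = [{"id": i} for i in range(10)]

def prepare_splits(row_count, split_size):
    """Master side: describe the result as (start, end) row ranges."""
    return [(start, min(start + split_size, row_count))
            for start in range(0, row_count, split_size)]

def fetch_split(split):
    """Worker side: fetch one split; here it just slices the in-memory rows."""
    start, end = split
    return RESULT_ROWS[start:end]

splits = prepare_splits(len(RESULT_ROWS), split_size=4)
with ThreadPoolExecutor(max_workers=3) as pool:
    chunks = list(pool.map(fetch_split, splits))   # map preserves split order

rows = [row for chunk in chunks for row in chunk]
print(splits, len(rows))
```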
SQL-based Authorization for Controlled Access
§  SQL-compliant authorization model (Users, Roles, Privileges, Objects)
§  Fine-grained authorization and access control patterns (row and column, in conjunction with views)
§  Can be used in conjunction with storage-based authorization
Privileges
§  Objects consist of databases, tables, and views
§  Privileges are GRANTed on objects
o  SELECT: read access to an object
o  INSERT: write (insert) access to an object
o  UPDATE: write (update) access to an object
o  DELETE: delete access for an object
o  ALL PRIVILEGES: all privileges
Access Control
§  Roles can be associated with objects
§  Privileges are associated with roles
§  CREATE, DROP, and SET ROLE statements manipulate roles and membership
§  SUPERUSER role for databases can grant access control to users or roles (not limited to HDFS permissions)
§  PUBLIC role includes all users
§  Prevents undesirable operations on objects by unauthorized users
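The users, roles, privileges, and objects model can be illustrated with a toy permission check. This is a sketch of the model only, not Hive's implementation: the `Authorizer` class, the role name `analyst`, and the object names are hypothetical.

```python
class Authorizer:
    """Toy users/roles/privileges model; not Hive's implementation."""
    def __init__(self):
        self.role_members = {"PUBLIC": set()}
        self.grants = set()   # (role, privilege, object)

    def create_role(self, role):
        self.role_members.setdefault(role, set())

    def grant_role(self, role, user):
        self.role_members[role].add(user)

    def grant(self, privilege, obj, role):
        self.grants.add((role, privilege, obj))

    def roles_of(self, user):
        roles = {r for r, members in self.role_members.items() if user in members}
        roles.add("PUBLIC")   # PUBLIC includes all users
        return roles

    def is_allowed(self, user, privilege, obj):
        """A user is allowed if any of their roles holds the privilege on the object."""
        return any((role, p, obj) in self.grants
                   for role in self.roles_of(user)
                   for p in (privilege, "ALL PRIVILEGES"))

auth = Authorizer()
auth.create_role("analyst")
auth.grant_role("analyst", "me_too")
auth.grant("SELECT", "audience_db.page_clicks", "analyst")
auth.grant("SELECT", "audience_db.page_clicks_view", "PUBLIC")

print(auth.is_allowed("me_too", "SELECT", "audience_db.page_clicks"))
print(auth.is_allowed("me_too", "INSERT", "audience_db.page_clicks"))
print(auth.is_allowed("someone_else", "SELECT", "audience_db.page_clicks_view"))
```

Granting SELECT on a view to PUBLIC while keeping the base table restricted is how row and column level control is usually expressed in this model.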
Starling (Log Warehouse) for Historical Analysis and Trends
[Diagram: source clusters (Cluster 1, Cluster 2, Cluster 3, ... Cluster N); warehouse clusters with Oozie, HCatalog, HDFS, and Hive; Starling with a dashboard, discovery portal, and query server]
SQL on Hadoop the Fastest Growing Product on Grid
[Chart: all grid jobs (in millions) and Hive jobs (% of all jobs) by month, Mar-13 to May-14. Callout: 2.5 million queries]
In Summary
✔  Data shared across tools such as MR, Pig, and Hive: Apache HCatalog
✔  Schema and semantics knowledge across the company: Data Discovery
✔  Support for schema evolution and downstream change communication: Apache HCatalog
✔  Fine-grained access controls (row / column) vs. all or nothing: SQL-based Authorization
✔  Clear ownership of data: Data Discovery
✔  Data lineage and integrity: Data Discovery / Starling
✔  Audits and compliance (e.g. SOX): Data Discovery / Starling
✔  Retention, duplication, and waste: Data Discovery / Starling
Acknowledge
1 Apache Hive (and HiveServer2, HCatalog) Community
http://hive.apache.org/people.html
2 HCatalog and Hive Development Team at Yahoo
Olga Natkovich, Annie Lin, Fangyue Wang, Chris Drome, Jin Sun, Selina Zhang, Mithun Radhakrishnan, Viraj Bhat
3 Oozie Development Team
Rohini Palaniswamy, Ryota Egashira, Purshotam Shah, Mona Chitnis, Michelle Chiang
4 Grid Data Management (GDM) Team
Mark Holderbaugh, Aaron Gresch, Lawrence Prem Kumar, Scott Preece, Yan Braun
5 Service Engineering and Data Operations
Rob Realini, David Kuder, Chuck Sheldon, Rajiv Chittajallu, Vineeth Vadrevu, Andy Rhee
6 Product Management
Sid Shaik, Amrit Lal, Kimsukh Kundu
Thank You
@thiruvel
@sumeetksingh
We are hiring!
Stop by Kiosk P9
or reach out to us at
bigdata@yahoo-inc.com.

More Related Content

What's hot

report on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivereport on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivesiddharthboora
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache DrillCharles Givre
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache HadoopOleksiy Krotov
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 HiveNamit Jain
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Hive Apachecon 2008
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008athusoo
 
Rethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache DrillRethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache DrillMapR Technologies
 
Leveraging Hadoop in your PostgreSQL Environment
Leveraging Hadoop in your PostgreSQL EnvironmentLeveraging Hadoop in your PostgreSQL Environment
Leveraging Hadoop in your PostgreSQL EnvironmentJim Mlodgenski
 
Hive Percona 2009
Hive Percona 2009Hive Percona 2009
Hive Percona 2009prasadc
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Hive : WareHousing Over hadoop
Hive :  WareHousing Over hadoopHive :  WareHousing Over hadoop
Hive : WareHousing Over hadoopChirag Ahuja
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesDavid Tjahjono,MD,MBA(UK)
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Trainingstratapps
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillMapR Technologies
 
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Edureka!
 

What's hot (20)

report on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivereport on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hive
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache Drill
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
BIG DATA: Apache Hadoop
BIG DATA: Apache HadoopBIG DATA: Apache Hadoop
BIG DATA: Apache Hadoop
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 Hive
 
SQL in Hadoop
SQL in HadoopSQL in Hadoop
SQL in Hadoop
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hive Apachecon 2008
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008
 
20081030linkedin
20081030linkedin20081030linkedin
20081030linkedin
 
Rethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache DrillRethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache Drill
 
Leveraging Hadoop in your PostgreSQL Environment
Leveraging Hadoop in your PostgreSQL EnvironmentLeveraging Hadoop in your PostgreSQL Environment
Leveraging Hadoop in your PostgreSQL Environment
 
Hive Percona 2009
Hive Percona 2009Hive Percona 2009
Hive Percona 2009
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Hive hcatalog
Hive hcatalogHive hcatalog
Hive hcatalog
 
Hive : WareHousing Over hadoop
Hive :  WareHousing Over hadoopHive :  WareHousing Over hadoop
Hive : WareHousing Over hadoop
 
Harnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution TimesHarnessing Hadoop and Big Data to Reduce Execution Times
Harnessing Hadoop and Big Data to Reduce Execution Times
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Training
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
Hadoop Ecosystem | Big Data Analytics Tools | Hadoop Tutorial | Edureka
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 

Similar to Hadoop Summit San Jose 2014: Data Discovery on Hadoop

Data Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataData Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataDataWorks Summit
 
HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out Sumeet Singh
 
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainApache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainYahoo Developer Network
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoopmarklpollack
 
SQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQSQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQpivotalny
 
Analysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRAAnalysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRABhadra Gowdra
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesDataWorks Summit
 
Intro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseIntro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseJonathan Bloom
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop GuideSimplilearn
 
Basic of Big Data
Basic of Big Data Basic of Big Data
Basic of Big Data Amar kumar
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big pictureJ S Jodha
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
Hypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.comHypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.comEdward D. Kim
 

Similar to Hadoop Summit San Jose 2014: Data Discovery on Hadoop (20)

Data Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataData Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your Data
 
HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
 
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit JainApache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoop
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
SQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQSQL and Machine Learning on Hadoop using HAWQ
SQL and Machine Learning on Hadoop using HAWQ
 
Analysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRAAnalysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRA
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
January 2011 HUG: Howl Presentation
January 2011 HUG: Howl PresentationJanuary 2011 HUG: Howl Presentation
January 2011 HUG: Howl Presentation
 
Intro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseIntro to Hybrid Data Warehouse
Intro to Hybrid Data Warehouse
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Basic of Big Data
Basic of Big Data Basic of Big Data
Basic of Big Data
 
Hadoop Big Data A big picture
Hadoop Big Data A big pictureHadoop Big Data A big picture
Hadoop Big Data A big picture
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Hypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.comHypertable Distilled by edydkim.github.com
Hypertable Distilled by edydkim.github.com
 
What is hadoop
What is hadoopWhat is hadoop
What is hadoop
 

Hadoop Summit San Jose 2014: Data Discovery on Hadoop

  • 1. Data Discovery on Hadoop - Realizing the Full Potential of Your Data. Presented by Thiruvel Thirumoolan, Sumeet Singh | June 3, 2014. 2014 Hadoop Summit, San Jose, California
  • 2. Introduction
      Thiruvel Thirumoolan (@thiruvel), Principal Engineer, Hadoop and Big Data Platforms, Cloud Engineering Group. 701 First Avenue, Sunnyvale, CA 94089 USA
      §  Developer in the Hive-HCatalog team, and active contributor to Apache Hive
      §  Responsible for Hive, HiveServer2 and HCatalog across all Hadoop clusters and ensuring they work at scale for the usage patterns of Yahoo
      §  Loves mining the trove of Hadoop logs for usage patterns and insights
      §  Bachelor's degree from Anna University
      Sumeet Singh (@sumeetksingh), Senior Director, Product Management, Hadoop and Big Data Platforms, Cloud Engineering Group. 701 First Avenue, Sunnyvale, CA 94089 USA
      §  Manages the Hadoop products team at Yahoo!
      §  Responsible for Product Management, Strategy and Customer Engagements
      §  Managed the Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo
      §  M.B.A. from UCLA and M.S. from Rensselaer (RPI)
  • 3. Agenda
      1. The Data Management Challenge
      2. Apache HCatalog to Rescue
      3. Data Registration and Discovery
      4. Opening Up Adhoc Access to Data
      5. Summary and Q&A
  • 4. Hadoop Grid as the Source of Truth for Data (illustrative diagram)
      Diagram labels: TV, PC, Phone, Tablet; Pushed Data, Pulled Data; Web Crawl, Social, Email, 3rd Party Content; Data Highway, Feeds; Hadoop Grid; BI, Reporting, Adhoc Analytics; User Profiles / No-SQL Serving Stores; Serving; Data; Advertising, Content
  • 5. Growth in HDFS [1]: 34,000 servers, 478 PB, 1.25 billion files & dir
      Chart: Number of Servers and Raw HDFS Storage (in PB) by Year, 2006 to 2014
      [1] Across all Hadoop (16 clusters, 32,500 servers, 455 PB) and HBase (7 clusters, 1,500 servers, 23 PB) clusters, May 23, 2014
  • 6. Processing and Analyzing Data with Hadoop…Then
      Diagram labels: HDFS; MapReduce (YARN); Pig; Hive; Java MR APIs; InputFormat / OutputFormat; Load / Store; SerDe; MetaStore Client; Hive MetaStore; Hadoop Streaming; Oozie
  • 7. Processing and Analyzing Data with HBase…Then
      Diagram labels: HDFS; HBase; Pig; Hive; Java MR APIs; TableInputFormat / TableOutputFormat; HBaseStorage; MetaStore Client; Hive MetaStore; HBaseStorage Handler; Oozie
  • 8. Hadoop Jobs on the Platform Today
      Job Distribution (May 1 - May 26, 2014). All Jobs: 100% (21.5 M)
      Job types: Pig; Oozie Launcher; Java MR; Hive; GDM; Streaming, distcp, Spark
      Chart segments: 1%, 4%, 9%, 10%, 31%, 45%
  • 9. Challenges in Managing Data on Multi-tenant Platforms
      Data Economy: Data Producers, Platform Services, Data Consumers
      Challenges:
      §  Data shared across tools such as MR, Pig, and Hive
      §  Schema and semantics knowledge across the company
      §  Support for schema evolution and downstream change communication
      §  Fine-grained access controls (row / column) vs. all or nothing
      §  Clear ownership of data
      §  Data lineage and integrity
      §  Audits and compliance (e.g. SOX)
      §  Retention, duplication, and waste
      Apache HCatalog & Data Discovery
  • 10. Apache HCatalog in the Technology Stack at Yahoo
      Layer labels: Compute, Services, Storage, Infrastructure Services
      Component labels: Hive, Pig, Oozie, HDFS Proxy, GDM, YARN, MapReduce, HDFS, HBase, Zookeeper, Support Shop, Monitoring, Starling, Messaging Service, HCatalog, Storm, Spark, Tez
  • 11. HCatalog Facilitates Interoperability…Now
      Diagram labels: HDFS; HBase; MapReduce (YARN); Pig; Hive; Java MR APIs; InputFormat / OutputFormat; SerDe & Storage Handler; MetaStore Client; HCatalog MetaStore; HCatInputFormat / HCatOutputFormat; HCatLoader / HCatStorer; Notifications; Oozie
  • 12. Data Model
      Hierarchy: Database (namespace) > Table (schema) > Partitions > Buckets > Skewed / Unskewed. Optional per table.
      Partitions, buckets, and skews facilitate faster, more direct access to data.
      Note on Buckets
      §  It is hard to guess the right number of buckets, which can also change over time, and hard to coordinate and align buckets for joins
      §  The community is working on dynamic bucketing that would have the same benefit without the need for static partitioning
  • 13. Sample Table Registration
      Select project database:
          USE xyz;
      Create table:
          CREATE EXTERNAL TABLE search (
            bcookie string COMMENT 'Standard browser cookie',
            time_stamp int COMMENT 'DD-MON-YYYY HH:MI:SS (AM/PM)',
            uid string COMMENT 'User id',
            ip string COMMENT '...',
            pg_spaceid string COMMENT '...',
            ...)
          PARTITIONED BY (
            locale string COMMENT 'Country of origin',
            datestamp string COMMENT 'Date in YYYYMMDD format')
          STORED AS ORC
          LOCATION '/projects/search/...';
      Add partitions manually (if you choose to):
          ALTER TABLE search ADD PARTITION (locale='US', datestamp='20130201')
          LOCATION '/projects/search/...';
      All your company's data (metadata) can be registered with HCatalog irrespective of the tool used.
  • 14. Getting Data into HCatalog – DML and DDL
      LOAD files into tables
      Load operations are copy/move operations from HDFS or the local filesystem that move data files into locations corresponding to HCat tables. The file format must agree with the table format.
          LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
          [PARTITION (partcol1=val1, partcol2=val2 ...)];
      INSERT data from a query into tables
      Query results can be inserted into tables or file system directories by using the insert clause.
          INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]
          select_statement1 FROM from_statement;
          INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM
          from_statement;
      HCat also supports multiple inserts in the same statement and dynamic partition inserts.
      ALTER TABLE ADD PARTITION
      You can use ALTER TABLE ADD PARTITION to add partitions to a table. The location must be a directory inside of which data files reside. If new partitions are added directly to HDFS, HCat will not be aware of them.
          ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1';
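The last point on slide 14, that partitions written directly to HDFS stay invisible to HCat until they are registered, can be sketched as a small reconciliation step. This is a minimal Python sketch under stated assumptions: the `key=value` directory layout, the table root, the helper names, and the sample paths are all hypothetical and not taken from the deck. It only builds the `ALTER TABLE ... ADD PARTITION` statements; running them against a metastore is left out.

```python
def parse_partition_dir(root, path):
    """Turn '<root>/locale=US/datestamp=20130201' into a list of (key, value) pairs."""
    relative = path[len(root):].strip("/")
    return [tuple(part.split("=", 1)) for part in relative.split("/")]


def missing_partition_ddl(table, root, hdfs_dirs, registered):
    """Build ALTER TABLE statements for directories the metastore does not know about.

    hdfs_dirs:  partition directories found on HDFS (hypothetical listing).
    registered: set of partition specs already registered, each a tuple of (key, value) pairs.
    """
    statements = []
    for path in sorted(hdfs_dirs):
        spec = tuple(parse_partition_dir(root, path))
        if spec in registered:
            continue  # already known to the metastore
        spec_sql = ", ".join("{}='{}'".format(key, value) for key, value in spec)
        statements.append(
            "ALTER TABLE {} ADD PARTITION ({}) LOCATION '{}';".format(table, spec_sql, path)
        )
    return statements


root = "/projects/search/data"  # hypothetical table root
dirs = [
    root + "/locale=US/datestamp=20130201",
    root + "/locale=US/datestamp=20130202",
]
known = {(("locale", "US"), ("datestamp", "20130201"))}
ddl = missing_partition_ddl("search", root, dirs, known)
print("\n".join(ddl))
```

Only the second directory produces a statement, because the first is already in the `known` set.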
  • 15. Getting Data into HCatalog – HCat APIs
      Pig
      HCatLoader is used with Pig scripts to read data from HCatalog-managed tables, and HCatStorer is used with Pig scripts to write data to HCatalog-managed tables.
          A = load '$DB.$TABLE' using org.apache.hcatalog.pig.HCatLoader();
          B = FILTER A BY $FILTER;
          C = foreach B generate foo, bar;
          store C into '$OUTPUT_DB.$OUTPUT_TABLE' USING org.apache.hcatalog.pig.HCatStorer('$OUTPUT_PARTITION');
      MapReduce
      The HCatInputFormat is used with MapReduce jobs to read data from HCatalog-managed tables. HCatOutputFormat is used with MapReduce jobs to write data to HCatalog-managed tables.
          Map<String, String> partitionValues = new HashMap<String, String>();
          partitionValues.put("a", "1");
          partitionValues.put("b", "1");
          HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
          HCatOutputFormat.setOutput(job, info);
  • 16. HCatalog Integration with Data Mgmt. Platform (GDM)
      Diagram: Grid Data Management (Feed Acquisition, Feed Replication) with Cluster 1 - Colo 1 and Cluster 2 - Colo 2, each with HDFS and an HCatalog MetaStore
      §  Feed datasets as partitioned external tables
      §  Growl extracts schema for backfill
      §  On each cluster: HCatClient.addPartitions(…), mark LOAD_DONE, add_partition event notification
      §  Partitions are dropped with HCatClient.dropPartitions(…) after retention expiration, with a drop_partition notification
  • 17. HCatalog Notification
      Namespace: e.g. "hcat.thebestcluster"
      JMS Topic: e.g. "<dbname>.<tablename>"
      Sample JMS Notification:
          {
            "timestamp" : 1360272556,
            "eventType" : "ADD_PARTITION",
            "server" : "thebestcluster-hcat.dc1.grid.yahoo.com",
            "servicePrincipal" : "hcat/thebestcluster-hcat.dc1.grid.yahoo.com@GRID.YAHOO.COM",
            "db" : "xyz",
            "table" : "search",
            "partitions": [
              { "locale" : "US", "datestamp" : "20140602" },
              { "locale" : "UK", "datestamp" : "20140602" },
              { "locale" : "IN", "datestamp" : "20140602" }
            ]
          }
      §  HCatalog uses JMS (ActiveMQ) notifications that can be sent for add_database, add_table, add_partition, drop_partition, drop_table, and drop_database
      §  Notifications can be extended for schema change notifications (proposed)
      Diagram labels: HCat Client; HCat MetaStore; ActiveMQ Server; Register Channel; Publish to listener channels; Subscribers
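To make the notification payload on slide 17 concrete, here is a minimal Python sketch of what a subscriber might do with the message body once it has been received. It covers only the JSON handling: receiving the message from ActiveMQ is out of scope, the sample is abridged from the slide, and the helper names `topic_for` and `new_partitions` are hypothetical.

```python
import json

# Abridged copy of the sample notification shown on the slide.
SAMPLE_MESSAGE = """
{
  "timestamp": 1360272556,
  "eventType": "ADD_PARTITION",
  "server": "thebestcluster-hcat.dc1.grid.yahoo.com",
  "db": "xyz",
  "table": "search",
  "partitions": [
    {"locale": "US", "datestamp": "20140602"},
    {"locale": "UK", "datestamp": "20140602"},
    {"locale": "IN", "datestamp": "20140602"}
  ]
}
"""


def topic_for(message):
    """Topic name in the "<dbname>.<tablename>" form given on the slide."""
    return "{}.{}".format(message["db"], message["table"])


def new_partitions(raw_message, wanted_locales):
    """Return the added partitions a consumer cares about; ignore other event types."""
    message = json.loads(raw_message)
    if message.get("eventType") != "ADD_PARTITION":
        return []
    return [p for p in message.get("partitions", []) if p.get("locale") in wanted_locales]


topic = topic_for(json.loads(SAMPLE_MESSAGE))
us_only = new_partitions(SAMPLE_MESSAGE, {"US"})
print(topic, us_only)
```

Filtering on `eventType` first keeps a consumer that only cares about new data from reacting to drop events published on the same topic.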
  • 18. Oozie, HCatalog, and Messaging Integration
      Components: Oozie, Message Bus, HCatalog, Data Producer, HDFS
      Data Producer: produce data (distcp, pig, M/R..) at /data/click/2014/06/02 on HDFS, then update metadata (ALTER TABLE click ADD PARTITION(data='2014/06/02') location 'hdfs://data/click/2014/06/02')
      1. Query/Poll Partition
      2. Register Topic
      3. Push notification <New Partition>
      4. Notify New Partition
      Start workflow
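The flow on slide 18 ends with a workflow starting once the partitions it depends on have been announced. A minimal Python sketch of that trigger logic follows. It is an illustration under stated assumptions rather than how Oozie is implemented: the `PartitionWaiter` class and its method names are hypothetical, and notifications are passed in as plain function calls instead of arriving over a message bus.

```python
class PartitionWaiter:
    """Tracks the partitions a workflow needs and reports when all have arrived."""

    def __init__(self, required):
        self.pending = set(required)  # (table, partition) pairs still missing
        self.started = False

    def on_notification(self, table, partition):
        """Handle one 'new partition' notification; return True when the workflow starts."""
        self.pending.discard((table, partition))
        if not self.pending and not self.started:
            self.started = True  # start exactly once, even if notifications repeat
            return True
        return False


waiter = PartitionWaiter({("click", "2014/06/01"), ("click", "2014/06/02")})
first = waiter.on_notification("click", "2014/06/01")   # one dependency still missing
second = waiter.on_notification("click", "2014/06/02")  # last dependency arrives
repeat = waiter.on_notification("click", "2014/06/02")  # duplicate notification
print(first, second, repeat)
```

Reacting to pushed notifications this way avoids repeatedly polling the metastore for partitions that have not been produced yet.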
  • 19. Data Discovery with HCatalog
      §  HCatalog instances become a unifying metastore for all data at Yahoo
      §  Discovery is about
          o  Browsing / inspecting metadata
          o  Searching for datasets
      §  It helps to solve
          o  Schema knowledge across the company
          o  Schema evolution
          o  Lineage
          o  Ownership
          o  Data type (dev or prod)
  • 20. Data Discovery Physical View (illustrative diagram)
      A Discovery UI provides a global view of all data in HCatalog.
      Data Center 1: clusters DC1-C1, DC1-C2, ... DCn-Cn
      Data Center 2: clusters DC2-C1, DC2-C2, ... DCm-Cm
      Each cluster is reached through HCat REST (Templeton).
  • 21. Data Discovery Features
      §  Browsing
          o  Tables / Databases
          o  Schema, format, properties
          o  Partitions and metadata about each partition
      §  Searches for tables
          o  Table name (regex) or comments
          o  Column name or comments
          o  Ownership, file format
          o  Location
          o  Properties (Dev/Prod)
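The search criteria listed on slide 21 (table name by regex, column name, file format) can be sketched over an in-memory catalog. This is a minimal Python sketch: the catalog entries reuse table names from the deck's illustrative UI, but the columns, formats, and the `search_tables` helper are hypothetical, and a real deployment would query HCatalog rather than a list of dictionaries.

```python
import re

# Hypothetical metadata records standing in for what a metastore would return.
CATALOG = [
    {"db": "audience_db", "table": "page_clicks", "format": "ORC",
     "columns": {"bcookie": "Standard browser cookie", "uid": "User id"}},
    {"db": "audience_db", "table": "ad_clicks", "format": "RCFile",
     "columns": {"bcookie": "Standard browser cookie", "ad_id": "Ad identifier"}},
    {"db": "audience_db", "table": "user_info", "format": "ORC",
     "columns": {"uid": "User id", "locale": "Country of origin"}},
]


def search_tables(catalog, name_regex=None, column=None, file_format=None):
    """Return 'db.table' names matching every criterion that was supplied."""
    results = []
    for entry in catalog:
        if name_regex and not re.search(name_regex, entry["table"]):
            continue
        if column and column not in entry["columns"]:
            continue
        if file_format and entry["format"].lower() != file_format.lower():
            continue
        results.append("{}.{}".format(entry["db"], entry["table"]))
    return results


click_tables = search_tables(CATALOG, name_regex=r"_clicks$")
orc_with_uid = search_tables(CATALOG, column="uid", file_format="orc")
print(click_tables, orc_with_uid)
```

Criteria are combined with AND, so each extra filter narrows the result set.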
  • 22. Discovery UI (illustrative mockup: Global HCatalog Dashboard)
      Search Tables: search the HCat tables
      Cluster selector (The Best Cluster): browse the DBs by cluster
      Available Databases: audience_db, tumblr_db, user_db, adv_warehouse, flickr_db
      Available Tables (audience_db), shown as search results or browse db results:
          page_clicks: Hourly clickstream table
          ad_clicks: Hourly ad clicks table
          user_info: User registration info
          session_info: Session feed info
          audience_info: Primary audience table
  • 23. Table Display UI (illustrative mockup: Global HCatalog Dashboard)
      HCat Instance: The Best Cluster
      Database: audience_db
      Table: page_clicks
      Owner: Awesome Yahoo
      Schema (Column | Type | Description):
          bcookie | string | Standard browser cookie
          timestamp | string | DD-MON-YYYY HH:MI:SS (AM/PM)
          uid | string | User id
          . . .
      …more table information and properties (e.g. data format etc.)
      Partitions: …list of partitions
  • 24. Data Discovery Design Approach
      §  A single web interface connects to all HCatalog instances (same and cross-colo)
      §  Select an appropriate HCat instance and browse all metadata
          o  Each HCatalog instance runs a webserver (Templeton / WebHCat) to read metadata
          o  All reads audited
          o  ACLs apply
      §  Search functionality will be added to Templeton and HCatalog
          o  New Thrift interface to support search
          o  All searches audited
          o  ACLs apply
      §  Long term design
          o  Read and Write HCatalog instances
  • 25. Data Discovery Going Forward
      §  Lineage
          o  Source datasets
          o  Derived datasets
      §  Data Quality
          o  Statistics help in heuristics instead of running a job
      Illustrative lineage diagram labels: Table 1 / Partition 1; HBase; ORC Table Partition 1; Dimension Table; Statistics / Agg. Table; Daily Stats Table; Copied by distcp / external registrar; Hourly
  • 26. Data Discovery Going Forward (cont'd) (illustrative mockup)
      Schema (Column | Type | Description):
          bcookie | string | Standard browser cookie
          timestamp | string | DD-MON-YYYY HH:MI:SS (AM/PM)
          uid | string | User id
          . . .
      File Format: ORC
      Table Properties: Compression Type zlib; External
      Change history:
      §  User 'awesome_yahoo' added 'foo string' to the table on May 29, 2014 at '1:10 AM'
      §  User 'me_too' added table properties 'orc.compress=ZLIB' on May 30, 2014 at '9:00 AM'
      §  User 'me_too' changed the file format from 'RCFile' to 'ORC' on Jun 1, 2014 at '10:30 AM'
      . . .
  • 27. HCatalog is Part of a Broader Solution Set (Product: Role in the Grid Stack)
      Hive
      §  Data warehousing software that facilitates querying and managing large datasets in HDFS
      §  Provides a mechanism to project structure onto HDFS data and query the data using a SQL-like language called HiveQL
      HiveServer2
      §  Server process (Thrift-based RPC interface) to support concurrent clients connecting over ODBC/JDBC
      §  Provides authentication and enforces authorization for ODBC/JDBC clients for metadata access
      HCatalog
      §  Table and storage management layer that enables users with different tools (Pig, M/R, and Hive) to more easily share data
      §  Presents a relational view of data in HDFS, abstracts where or in what format data is stored, and enables notifications of data availability
      Starling
      §  Hadoop log warehouse for analytics on grid usage (job history, tasks, job counters etc.)
      §  1 TB of raw logs processed / day, 24 TB of processed data
  • 28. Deployment Layout
    [Deployment diagram; labels: Tez and MapReduce on YARN + HDFS Oracle DBMS LoadBalancer HCatalog Thrift HS2 ODBC/JDBC Launcher Gateway LoadBalancer Data Out Client Client/ CLI HiveQL M/R Jobs Pig M/R Cloud Messaging ActiveMQ notifications HiveServer2 Hadoop Hive HCatalog]
  • 29. Hive for Both Batch and Interactive Adhoc Analytics
    Improved Latency and Throughput
        Tez
            §  Computation expressed as a dataflow graph with reusable primitives
            §  No intermediate outputs to HDFS
            §  Built on top of YARN
            §  Hive generates Tez plans for lower latency
        Query Engine Improvements
            §  Cost-based optimizations
            §  In-memory joins
            §  Caching hot tables
            §  Vectorized processing
        Better Columnar Store
            §  ORCFile with predicate pushdown
            §  Built for both speed and storage efficiency
        Tez Service
            §  Always-on pool of AMs / container re-use
    Extended Analytical Ability
        Analytics Functions
            §  SQL 2003 compliant
            §  OVER with PARTITION BY and ORDER BY
            §  Wide variety of windowing functions:
                o  RANK
                o  LEAD/LAG
                o  ROW_NUMBER
                o  FIRST_VALUE
                o  LAST_VALUE
                o  Many more
            §  Aligns well with BI ecosystem
        Improving SQL Coverage
            §  Non-correlated sub-queries using IN in WHERE
            §  Expanded SQL types including DATETIME, VARCHAR, etc.
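The OVER clause with PARTITION BY and ORDER BY listed on this slide can be illustrated with a small query. This is a sketch under assumptions: Python's built-in `sqlite3` module stands in for Hive purely so the example runs locally (it needs SQLite 3.25 or newer for window functions), and the `page_views` table and its columns are hypothetical. The clause shape is standard SQL; whether the statement runs unchanged on a particular Hive release is not verified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite as a local stand-in, not Hive
conn.execute("CREATE TABLE page_views (uid TEXT, day TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("u1", "2014-05-01", 10),
        ("u1", "2014-05-02", 30),
        ("u1", "2014-05-03", 20),
        ("u2", "2014-05-01", 5),
        ("u2", "2014-05-02", 7),
    ],
)

# RANK orders each user's days by view count; LAG looks back one day
# within the same user's partition (NULL on the first day).
QUERY = """
SELECT uid, day, views,
       RANK() OVER (PARTITION BY uid ORDER BY views DESC) AS views_rank,
       LAG(views) OVER (PARTITION BY uid ORDER BY day) AS prev_views
FROM page_views
ORDER BY uid, day
"""

rows = conn.execute(QUERY).fetchall()

if __name__ == "__main__":
    for row in rows:
        print(row)
```

Each partition is ranked and lagged independently, which is what makes these functions a natural fit for the per-user and per-day reports that BI tools generate.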
  • 30. HiveServer2 as ODBC / JDBC Endpoint
    §  Gateway that Hive clients can talk to
    §  Supports concurrent clients
    §  User, global session, and configuration information
    §  Support for secure clusters and encryption
    §  DoAs support allows Hive queries to run as the requester
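A client reaches this endpoint through a `jdbc:hive2://` connection URL. The helper below is a minimal sketch that only assembles such a URL: the host name and Kerberos realm are hypothetical, while the `principal` and `ssl` session variables and port 10000 follow the Hive JDBC documentation. DoAs is a server-side setting (`hive.server2.enable.doAs`) and so does not appear in the URL.

```python
def hs2_jdbc_url(host, port=10000, database="default", principal=None, ssl=False):
    """Assemble a HiveServer2 JDBC URL with optional security settings."""
    session_vars = []
    if principal:
        # Kerberos service principal of HiveServer2, for secure clusters
        session_vars.append(f"principal={principal}")
    if ssl:
        # Encrypt the client-to-server connection
        session_vars.append("ssl=true")
    url = f"jdbc:hive2://{host}:{port}/{database}"
    for var in session_vars:
        url += ";" + var
    return url


if __name__ == "__main__":
    # Hypothetical host and realm, for illustration only
    print(hs2_jdbc_url("hs2.example.com", principal="hive/_HOST@EXAMPLE.COM", ssl=True))
```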
  • 31. Data to Desktop (D2D) – BI and Reporting on ODBC
    [Diagram; labels: HiveServer2 Hive Hadoop Desktop Web Intelligence Server Metadata Database Grid ODBC driver]
  • 32. DataOut – Data to Any Off-Grid Destination on JDBC
    [Diagram; labels: HiveServer2, one M node, three S nodes each with a HiveSplit and an FS/DB; steps: Execute Query, Prepare Splits, Fetch Splits]
    Legend: M – Master, S – Slave, FS/DB – Filesystem/Database
    §  DataOut is an efficient method of moving data off the grid
    §  Advantages:
        o  API based on well-known JDBC interface
        o  Works with HCatalog / Hive
        o  Agnostic to the underlying storage format
        o  Parts of the whole data can be pulled in parallel
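The three steps named on this slide (execute the query, prepare splits, fetch splits in parallel) can be sketched in a few lines. This is an illustration of the pattern only, not the DataOut API: every name below is hypothetical, an in-memory list stands in for the query result, and threads stand in for the slave processes that would write to a filesystem or database.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the result of an executed query (hypothetical data).
ROWS = [(i, f"user{i}") for i in range(10)]


def prepare_splits(num_rows, num_splits):
    """Master step: divide the result into contiguous half-open ranges."""
    base, extra = divmod(num_rows, num_splits)
    splits, start = [], 0
    for i in range(num_splits):
        size = base + (1 if i < extra else 0)
        splits.append((start, start + size))
        start += size
    return splits


def fetch_split(split):
    """Slave step: pull one split; a real worker would write it to FS/DB."""
    start, end = split
    return ROWS[start:end]


def data_out(num_splits=3):
    """Prepare the splits, then fetch them in parallel and reassemble in order."""
    splits = prepare_splits(len(ROWS), num_splits)
    with ThreadPoolExecutor(max_workers=num_splits) as pool:
        parts = list(pool.map(fetch_split, splits))
    return [row for part in parts for row in part]


if __name__ == "__main__":
    print(prepare_splits(len(ROWS), 3))
    print(len(data_out(3)))
```

Because each split is an independent range, the workers never coordinate with one another, which is what allows parts of the whole data to be pulled in parallel.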
  • 33. SQL-based Authorization for Controlled Access
    §  SQL-compliant authorization model (Users, Roles, Privileges, Objects)
    §  Fine-grained authorization and access control patterns (row and column in conjunction with views)
    §  Can be used in conjunction with storage-based authorization
    Privileges
        §  Objects consist of databases, tables, and views
        §  Privileges are GRANTed on objects
            o  SELECT: read access to an object
            o  INSERT: write (insert) access to an object
            o  UPDATE: write (update) access to an object
            o  DELETE: delete access for an object
            o  ALL PRIVILEGES: all privileges
    Access Control
        §  Roles can be associated with objects
        §  Privileges are associated with roles
        §  CREATE, DROP, and SET ROLE statements manipulate roles and membership
        §  SUPERUSER role for databases can grant access control to users or roles (not limited to HDFS permissions)
        §  PUBLIC role includes all users
        §  Prevents undesirable operations on objects by unauthorized users
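The relationship between users, roles, privileges, and objects on this slide can be modeled in a few lines. This is a toy sketch of the model, not Hive's implementation: the `Authorizer` class, its methods, and the object names are hypothetical, and only the privilege names and the rule that PUBLIC includes every user come from the slide.

```python
from collections import defaultdict

PRIVILEGES = {"SELECT", "INSERT", "UPDATE", "DELETE"}


class Authorizer:
    """Toy model: privileges are granted to roles on objects; users hold roles."""

    def __init__(self):
        self.role_members = defaultdict(set)  # role -> set of users
        self.grants = defaultdict(set)        # (role, object) -> set of privileges

    def grant_role(self, role, user):
        self.role_members[role].add(user)

    def grant(self, privilege, obj, role):
        privs = set(PRIVILEGES) if privilege == "ALL PRIVILEGES" else {privilege}
        if not privs <= PRIVILEGES:
            raise ValueError(f"unknown privilege: {privilege}")
        self.grants[(role, obj)] |= privs

    def is_allowed(self, user, privilege, obj):
        # Every user is implicitly a member of PUBLIC.
        roles = {"PUBLIC"} | {
            role for role, members in self.role_members.items() if user in members
        }
        return any(privilege in self.grants.get((role, obj), ()) for role in roles)


# Hypothetical setup: analysts may read a view, everyone may read a dimension table.
auth = Authorizer()
auth.grant_role("analysts", "alice")
auth.grant("SELECT", "db.page_views_view", "analysts")
auth.grant("SELECT", "db.dim_table", "PUBLIC")

if __name__ == "__main__":
    print(auth.is_allowed("alice", "SELECT", "db.page_views_view"))
    print(auth.is_allowed("bob", "SELECT", "db.page_views_view"))
```

Granting on a view rather than on the base table is how the row and column restrictions mentioned on the slide are typically expressed: the view selects the permitted rows and columns, and the role receives SELECT on the view only.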
  • 34. Starling (Log Warehouse) for Historical Analysis and Trends
    [Architecture diagram; labels: Cluster 1 Cluster 2 Cluster 3 Cluster N Oozie HCatalog HDFS Hive Starling Dashboard Discovery Portal Query Server Source Clusters Warehouse Clusters]
  • 35. SQL on Hadoop the Fastest Growing Product on Grid
    [Chart, Mar-13 to May-14: All Grid Jobs (in millions, axis 0 to 30) and Hive Jobs (% of all jobs, axis 0.0% to 10.0%); callout: 2.5 million queries]
  • 36. In Summary
    ✔  Data shared across tools such as MR, Pig, and Hive: Apache HCatalog
    ✔  Schema and semantics knowledge across the company: Data Discovery
    ✔  Support for schema evolution and downstream change communication: Apache HCatalog
    ✔  Fine-grained access controls (row / column) vs. all or nothing: SQL-based Authorization
    ✔  Clear ownership of data: Data Discovery
    ✔  Data lineage and integrity: Data Discovery / Starling
    ✔  Audits and compliance (e.g. SOX): Data Discovery / Starling
    ✔  Retention, duplication, and waste: Data Discovery / Starling
  • 37. Acknowledgements
    1  Apache Hive (and HiveServer2, HCatalog) Community: http://hive.apache.org/people.html
    2  HCatalog and Hive Development Team at Yahoo: Olga Natkovich Annie Lin Fangyue Wang Chris Drome Jin Sun Selina Zhang Mithun Radhakrishnan Viraj Bhat
    3  Oozie Development Team: Rohini Palaniswamy Ryota Egashira Purshotam Shah Mona Chitnis Michelle Chiang
    4  Grid Data Management (GDM) Team: Mark Holderbaugh Aaron Gresch Lawrence Prem Kumar Scott Preece Yan Braun
    5  Service Engineering and Data Operations: Rob Realini David Kuder Chuck Sheldon Rajiv Chittajallu Vineeth Vadrevu Andy Rhee
    6  Product Management: Sid Shaik Amrit Lal Kimsukh Kundu
  • 38. Thank You @thiruvel @sumeetksingh We are hiring! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.