Data Discovery on Hadoop
PRESENTED BY Sumeet Singh, Thiruvel Thirumoolan ⎪ February 19, 2015
Strata Conference + Hadoop World 2015, San Jose
Introduction
Thiruvel Thirumoolan
Principal Engineer
Hadoop and Big Data Platforms
Platforms and Personalization Products
701 First Avenue, Sunnyvale, CA 94089 USA
@thiruvel
§  Developer in the Hive-HCatalog team, and active contributor to Apache Hive
§  Responsible for Hive, HiveServer2 and HCatalog across all Hadoop clusters and ensuring they work at scale for the usage patterns of Yahoo
§  Loves mining the trove of Hadoop logs for usage patterns and insights
§  Bachelor's degree from Anna University
Sumeet Singh
Sr. Director, Product Management
Cloud and Big Data Platforms
Platforms and Personalization Products
701 First Avenue, Sunnyvale, CA 94089 USA
@sumeetksingh
§  Manages Hadoop products team at Yahoo
§  Responsible for Product Management, Strategy and Customer Engagements
§  Managed Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo
§  MBA from UCLA and MS from Rensselaer Polytechnic Institute (RPI)
Agenda
1  The Data Management Challenge
2  Apache HCatalog to Rescue
3  Data Registration and Discovery
4  Opening up Access to Data
5  Q&A
Hadoop as the Source of Truth for All Data
TV
PC
Phone
Tablet
Pushed Data
Pulled Data
Web Crawl
Social
Email
3rd Party Content
Data Highway
Hadoop Grid
BI, Reporting, Adhoc Analytics
Data
Content
Ads
No-SQL
Serving Stores
Serving
ILLUSTRATIVE
Growth in HDFS (1) in the Last 10 Years
[Chart: number of servers (0 to 45,000) and raw HDFS storage in PB (0 to 700) by year, 2006 to 2015]
42,300 servers, 600 PB, 1.5 billion files & dir
Storage increments labeled on the chart: +70 PB, +66 PB, +64 PB, +80 PB, +100 PB
Last 5 years: 22.2% CAGR
(1) Across all Hadoop (18 clusters, 40,080 servers, 565 PB) and HBase (9 clusters, 2,250 servers, 35 PB) clusters, Feb 16, 2015
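The 22.2% CAGR on this slide can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes the five "+N PB" labels are consecutive year-over-year additions ending at the 600 PB total; the slide does not state that mapping explicitly, and the variable names are illustrative.

```python
# Assumption: the five "+N PB" chart labels are consecutive yearly additions
# that end at the 600 PB total reported on the slide.
increments_pb = [70, 66, 64, 80, 100]
end_pb = 600

# Storage five years earlier, under the assumption above.
start_pb = end_pb - sum(increments_pb)
years = len(increments_pb)

# Compound annual growth rate: (end / start) ** (1 / years) - 1
cagr = (end_pb / start_pb) ** (1 / years) - 1

print(f"start={start_pb} PB, end={end_pb} PB, CAGR={cagr:.1%}")
```

Under that assumption the implied starting point is 220 PB, and the computed rate rounds to the 22.2% quoted on the slide.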
Processing and Analyzing Data with Hadoop…Then
HDFS
MapReduce (YARN)
Pig, Hive, Java MR APIs
InputFormat/ OutputFormat
Load / Store SerDe
MetaStore
Client
Hive
MetaStore
Hadoop
Streaming
Oozie
Processing and Analyzing Data with HBase…Then
HDFS
HBase
Pig, Hive, Java MR APIs
TableInputFormat/
TableOutputFormat
MetaStore
Client
Hive
MetaStore
Oozie
HBaseStorage
HBaseStorage
Handler
Hadoop Jobs on the Platform Today
Job Distribution (Jan 2015)
All Jobs: 100% (28.9 M)
Pig: 47%
Oozie Launcher: 35%
Java MR: 7%
Hive: 5%
GDM: 4%
Streaming, distcp, Spark: 1%
Challenges in Managing Data on Multi-tenant Platforms
Data Producers
Platform Services
Data Consumers
§  Data shared across tools such as MR, Pig, and Hive
§  Schema and semantics knowledge across the company
§  Support for schema evolution and downstream change communication
§  Fine-grained access controls (row / column) vs. all or nothing
§  Clear ownership of data
§  Data lineage and integrity
§  Audits and compliance (e.g. SOX)
§  Retention, duplication, and waste
Data Economy Challenges
Apache HCatalog & Data Discovery
Apache HCatalog in the Technology Stack
Compute
Services
Storage
Infrastructure
Services
Hive, Pig, Oozie, Grid UI
GDM &
Proxies
YARN MapReduce
HDFS HBase
Zookeeper
Support
Shop
Monitoring Starling
Messaging
Service
HCatalog
Storm, Spark, Tez
HCatalog Facilitates Interoperability…Now
HDFS
MapReduce (YARN)
Pig, Hive, Java MR APIs
InputFormat/ OutputFormat
SerDe & Storage Handler
MetaStore
Client
HCatalog
MetaStore
HCatInputFormat /
HCatOutputFormat
HCatLoader/
HCatStorer
HDFS
HBase Notifications
Oozie
Data Model
Database
(namespace)
Table
(schema)
Table
(schema)
Partitions Partitions
Buckets
Buckets
Skewed Unskewed
Optional
per table
Partitions, buckets, and skews facilitate faster, more direct access to data
Sample Table Registration
Select project database
  USE xyz;
Create table
  CREATE EXTERNAL TABLE search (
    bcookie     string  COMMENT 'Standard browser cookie',
    time_stamp  int     COMMENT 'DD-MON-YYYY HH:MI:SS (AM/PM)',
    uid         string  COMMENT 'User id',
    ip          string  COMMENT '...',
    pg_spaceid  string  COMMENT '...',
    ...)
  PARTITIONED BY (
    locale      string  COMMENT 'Country of origin',
    datestamp   string  COMMENT 'Date in YYYYMMDD format')
  STORED AS ORC
  LOCATION '/projects/search/...';
Add partitions manually (if you choose to)
  ALTER TABLE search ADD PARTITION (locale='US', datestamp='20130201')
  LOCATION '/projects/search/...';
Your company’s data (metadata) can be registered with HCatalog irrespective of the tool used
Getting Data into HCatalog – DML and DDL
LOAD files into tables
Copy / move data from HDFS or local filesystem into HCatalog tables
  LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
  [PARTITION (partcol1=val1, partcol2=val2 ...)];
INSERT data from a query into tables
Query results can be inserted into tables or file system directories by using the insert clause.
  INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]
  select_statement1 FROM from_statement;

  INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
  select_statement1 FROM from_statement;
HCatalog also supports multiple inserts in the same statement or dynamic partition inserts.
ALTER TABLE ADD PARTITIONS
  ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1';
Getting Data into HCatalog – HCatalog APIs
Pig
HCatLoader and HCatStorer are used with Pig scripts to read from and write data to HCatalog-managed tables.
  A = load '$DB.$TABLE' using org.apache.hcatalog.pig.HCatLoader();
  B = FILTER A BY $FILTER;
  C = foreach B generate foo, bar;
  store C into '$OUTPUT_DB.$OUTPUT_TABLE' USING org.apache.hcatalog.pig.HCatStorer('$OUTPUT_PARTITION');
MapReduce
HCatInputFormat and HCatOutputFormat are used with MapReduce to read from and write data to HCatalog-managed tables.
  Map<String, String> partitionValues = new HashMap<String, String>();
  partitionValues.put("a", "1");
  partitionValues.put("b", "1");
  HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
  HCatOutputFormat.setOutput(job, info);
HCatalog Integration with Data Mgmt. Platform (GDM)
Grid Data Management: Feed Acquisition, Feed Replication
Cluster 1 - Colo 1: HDFS, MetaStore
Cluster 2 - Colo 2: HDFS, MetaStore
Feed datasets as partitioned external tables
Growl extracts schema for backfill
HCatClient.addPartitions(…), then mark LOAD_DONE (shown for both clusters)
add_partition event notification (shown for both clusters)
Partitions are dropped with HCatClient.dropPartitions(…) after retention expiration, with a drop_partition notification
HCatalog Notifications
Namespace: E.g. "hcat.thebestcluster"
JMS Topic: E.g. "<dbname>.<tablename>"
Sample JMS Notification
  {
    "timestamp" : 1360272556,
    "eventType" : "ADD_PARTITION",
    "server" : "thebestcluster-hcat.dc1.grid.yahoo.com",
    "servicePrincipal" : "hcat/thebestcluster-hcat.dc1.grid.yahoo.com@GRID.YAHOO.COM",
    "db" : "xyz",
    "table" : "search",
    "partitions": [
      { "locale" : "US", "datestamp" : "20140602" },
      { "locale" : "UK", "datestamp" : "20140602" },
      { "locale" : "IN", "datestamp" : "20140602" }
    ]
  }
HCatalog uses JMS (ActiveMQ) notifications that can be sent for add_database, add_table, add_partition, drop_partition, drop_table, and drop_database. Notifications can be extended for schema change communication.
HCat Client
HCat MetaStore
ActiveMQ Server
Register Channel
Publish to listener channels
Subscribers
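A subscriber mostly needs to decode the notification payload and work out which partitions arrived. The sketch below parses the sample notification from this slide using only the standard library; the function name `new_partitions` is illustrative, and the JMS/ActiveMQ subscription that would deliver the message text is deliberately left out.

```python
import json

# Payload copied from the sample notification on the slide. A real consumer
# would receive this text from a JMS/ActiveMQ subscription (not shown here).
SAMPLE = """
{
  "timestamp" : 1360272556,
  "eventType" : "ADD_PARTITION",
  "server" : "thebestcluster-hcat.dc1.grid.yahoo.com",
  "servicePrincipal" : "hcat/thebestcluster-hcat.dc1.grid.yahoo.com@GRID.YAHOO.COM",
  "db" : "xyz",
  "table" : "search",
  "partitions": [
    { "locale" : "US", "datestamp" : "20140602" },
    { "locale" : "UK", "datestamp" : "20140602" },
    { "locale" : "IN", "datestamp" : "20140602" }
  ]
}
"""


def new_partitions(message_text):
    """Return (topic, partition specs) for an ADD_PARTITION event, else None."""
    event = json.loads(message_text)
    if event.get("eventType") != "ADD_PARTITION":
        return None
    # Topic naming follows the "<dbname>.<tablename>" pattern from the slide.
    topic = f'{event["db"]}.{event["table"]}'
    specs = [
        "/".join(f"{key}={value}" for key, value in partition.items())
        for partition in event["partitions"]
    ]
    return topic, specs


topic, specs = new_partitions(SAMPLE)
print(topic)
for spec in specs:
    print(spec)
```

Events of other types return `None`, so a consumer can ignore anything it does not handle.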
Oozie, HCatalog, and Messaging Integration
Oozie, Message Bus, HCatalog, Data Producer, HDFS
1. Query/Poll Partition
2. Register Topic
3. Push notification <New Partition>
4. Notify New Partition
Start workflow
Produce data (distcp, pig, M/R..) into /data/click/2014/06/02
Update metadata (ALTER TABLE click ADD PARTITION(data='2014/06/02') location 'hdfs://data/click/2014/06/02')
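The four numbered steps on this slide can be sketched as a toy, in-memory flow. Nothing here uses the real Oozie, HCatalog, or JMS APIs; the classes `MessageBus`, `Catalog`, and `Coordinator` are hypothetical stand-ins that only show the order of interactions.

```python
from collections import defaultdict


class MessageBus:
    """In-memory stand-in for the message bus."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def register(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, partition):
        for callback in self._subscribers[topic]:
            callback(partition)


class Catalog:
    """In-memory stand-in for HCatalog: records partitions and announces them."""

    def __init__(self, bus):
        self._bus = bus
        self._partitions = defaultdict(set)

    def has_partition(self, table, partition):
        return partition in self._partitions[table]

    def add_partition(self, table, partition):
        self._partitions[table].add(partition)
        self._bus.publish(table, partition)  # step 3: push notification


class Coordinator:
    """Stand-in for Oozie: starts a workflow once the awaited partition exists."""

    def __init__(self, catalog, bus):
        self.catalog = catalog
        self.bus = bus
        self.started = []

    def wait_for(self, table, partition):
        if self.catalog.has_partition(table, partition):  # step 1: query/poll
            self.started.append((table, partition))
            return

        def on_new_partition(added):  # step 4: notified of a new partition
            if added == partition and (table, partition) not in self.started:
                self.started.append((table, partition))

        self.bus.register(table, on_new_partition)  # step 2: register topic


bus = MessageBus()
catalog = Catalog(bus)
oozie = Coordinator(catalog, bus)

oozie.wait_for("click", "2014/06/02")
started_before = list(oozie.started)
catalog.add_partition("click", "2014/06/02")  # the producer updates metadata
print(started_before, oozie.started)
```

The point of the design is that the coordinator polls once and then waits for a push, rather than polling repeatedly for data that has not arrived yet.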
Data Discovery with HCatalog
§  Unified metadata store for all data at Yahoo
§  Discovery is about
o  Browsing / inspecting metadata and data
o  Searching for datasets
§  It helps to solve
o  Schema knowledge across the company
o  Ownerships
o  Data type – dev or prod
o  Understand data
o  Schema evolution
o  Lineage
Data Discovery Features
§  Browsing
o  Tables / Databases
o  Schema, format, properties
o  Partitions and metadata about each partition
§  Searches for tables
o  Table name (regex) or Comments
o  Column name or comments
o  Ownership, File format
o  Location
o  Properties (Dev/Prod)
Data Discovery UI in Production
GLOBAL HCATALOG DASHBOARD (ILLUSTRATIVE)
Search the HCat tables: Search Tables [Search]
Browse the DBs by cluster: The Best Cluster
Available Databases: audience_db, tumblr_db, user_db, flickr_db
Available Tables (audience_db), shown as search results or browse db results:
  page_clicks    Hourly clickstream table
  ad_clicks      Hourly ad clicks table
  user_info      User registration info
  session_info   Session feed info
  audience_info  Primary audience table
Table Display UI
GLOBAL HCATALOG DASHBOARD (ILLUSTRATIVE)
HCat Instance: The Best Cluster
Database: audience_db
Table: page_clicks
Owner: Awesome Yahoo
Schema
  Column     Type    Description
  bcookie    string  Standard browser cookie
  timestamp  string  DD-MON-YYYY HH:MI:SS (AM/PM)
  ...
Partitions
  Column     Type    Description
  dt         string  Date in YYYY_MM_DD format
Data Discovery Physical View
Discovery UI: Global View of All Data in Metastore
Data Center 1: DC1-C1, DC1-C2, ... DCn-Cn
Data Center 2: DC2-C1, DC2-C2, ... DCm-Cm
Each cluster: MS (Metastore) with a WebServer and HCat API
ILLUSTRATIVE
Data Discovery Design
§  A single web interface connects to all Metastore instances (all datacenters)
§  Select an appropriate cluster and browse all metadata
o  A webserver runs on each Metastore
o  All reads audited
o  ACLs (future)
§  Search functionality will be added to web interface and Metastore
o  New Thrift interface to support search
o  All searches audited
§  Long term design
o  Load on production
o  Read and Write HCatalog instances
Data Discovery Design – APIs
§  Search
o  Searches across various fields in order
o  Simple ranking
o  Search order for multiple keywords
o  Optimized implementation for database
o  Will be contributed back
§  Unique partition values
o  One or more partition keys
o  Filtering and Ordering supported
o  HIVE-7604 (https://issues.apache.org/jira/browse/HIVE-7604)
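The idea of searching fields in a fixed order with simple ranking can be illustrated on in-memory metadata. This is only a sketch: the field order, the scoring rule, and the `TableMeta` type are assumptions made for illustration, not the Thrift interface or the database-optimized implementation the slide refers to. Table names and comments are borrowed from the illustrative UI slides.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TableMeta:
    name: str
    comment: str = ""
    columns: dict = field(default_factory=dict)  # column name -> column comment
    owner: str = ""


# Fields are checked in this order; a match in an earlier field ranks higher.
# The order itself is an illustrative guess.
FIELD_ORDER = ("name", "comment", "column", "owner")


def first_matching_field(table, pattern):
    """Return the index of the first field the pattern matches, or None."""
    values = {
        "name": [table.name],
        "comment": [table.comment],
        "column": list(table.columns) + list(table.columns.values()),
        "owner": [table.owner],
    }
    for rank, field_name in enumerate(FIELD_ORDER):
        if any(pattern.search(value) for value in values[field_name]):
            return rank
    return None


def search(tables, keywords):
    """Rank tables by keywords matched (more first), then by field order."""
    patterns = [re.compile(keyword, re.IGNORECASE) for keyword in keywords]
    scored = []
    for table in tables:
        ranks = [first_matching_field(table, pattern) for pattern in patterns]
        hits = [rank for rank in ranks if rank is not None]
        if hits:
            scored.append((-len(hits), sum(hits), table.name))
    return [name for _, _, name in sorted(scored)]


tables = [
    TableMeta("page_clicks", "Hourly clickstream table",
              {"bcookie": "Standard browser cookie"}, "awesome_yahoo"),
    TableMeta("ad_clicks", "Hourly ad clicks table",
              {"bcookie": "Standard browser cookie"}, "me_too"),
    TableMeta("user_info", "User registration info",
              {"uid": "User id"}, "me_too"),
]

print(search(tables, ["user"]))
print(search(tables, ["cookie", "ad"]))
```

With several keywords, a table that matches more of them sorts first; ties fall back to how early in the field order the matches occurred, then to the table name.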
Data Discovery Design – Optimizations
§  Lets users peek into the data (select * limit n)
§  Existing implementations costly
o  Too much client and server resources
o  Timeouts and failures
§  Optimized partition objects and used names
§  New implementation takes a few seconds at most
§  HIVE-9573 (https://issues.apache.org/jira/browse/HIVE-9573)
Going Forward – Lineage
Advantages Challenges
Bottleneck
Ownership
Quality
Offline / Real Time
Data Flow / Control Flow
Software Stack
Going Forward – Lineage
Statistics help in heuristics instead of running a job
Table 1 /
Partition 1
(Stage-1)
HBase
ORC Table
Partition 1
(Stage-2)
Dimension
Table
Statistics/
Agg. Table
(Stage-3)
Daily Stats
Table
(Stage-4)
Copied by
distcp / external
registrar
Hourly
ILLUSTRATIVE
Going Forward – Schema Versioning
Schema
  Column     Type    Description
  bcookie    string  Standard browser cookie
  timestamp  string  DD-MON-YYYY HH:MI:SS (AM/PM)
  uid        string  User id
File Format
  ORC
Table Properties
  Compression  zlib
  Type         External
  ...
§  User ‘awesome_yahoo’ added ‘foo string’ to the table on May 29, 2014 at ‘1:10 AM’
§  User ‘me_too’ added table properties ‘orc.compress=ZLIB’ on May 30, 2014 at ‘9:00 AM’
§  User ‘me_too’ changed the file format from ‘RCFile’ to ‘ORC’ on Jun 1, 2014 at ‘10:30 AM’
...
ILLUSTRATIVE
HCatalog is Part of a Broader Solution Set
Product: Role in the Grid Stack
Hive
§  Data warehousing to facilitate querying and managing large datasets in HDFS
§  Mechanism to project structure onto HDFS data and query using a SQL-like language
HiveServer2
§  Server process (Thrift-based RPC) for concurrent clients connecting over ODBC/JDBC
§  Authentication and authorization for ODBC/JDBC clients for metadata access
HCatalog
§  Table and storage management layer for Hadoop tools to easily share data
§  Relational view of data, storage location and format abstraction, notifications of availability
Starling
§  Hadoop log warehouse for analytics on grid usage (job history, tasks, job counters etc.)
§  1 TB of raw logs processed / day, 24 TB of processed data
Deployment Layout
ILLUSTRATIVE
Batch &
Interactive SQL
Tez and MapReduce
on YARN
+
HDFS
RDBMS
LoadBalancer
HCatalog
Thrift
HS2
ODBC/JDBC
Launcher Gateway
LoadBalancer
Data Out Client
Client/ CLI
HiveQL
M/R / Tez Jobs
Pig M/R
Cloud
Messaging
HiveServer2
Hadoop
Hive
HCatalog
BI, Reporting,
DataOut,
Dev UI –
Data to
Desktop (D2D)
Data Governance
Data Access
Public
Non-sensitive
Financial
Restricted
$
Governance
Classification
No addn. reqmt.
LMS Integration
Stock Admin
Integration
Owner Review
Manager
approves
Employee
acknowledges
SQL-based Authorization for Controlled Access
§  SQL-compliant authorization model (Users, Roles, Privileges, Objects)
§  Fine-grained authorization and access control patterns (row and column in conjunction with views)
§  Can be used in conjunction with storage-based authorization
Privileges
§  Objects consist of databases, tables, and views
§  Privileges are GRANTed on objects
o  SELECT: read access to an object
o  INSERT: write (insert) access to an object
o  UPDATE: write (update) access to an object
o  DELETE: delete access for an object
o  ALL PRIVILEGES: all privileges
Access Control
§  Roles can be associated with objects
§  Privileges are associated with roles
§  CREATE, DROP, and SET ROLE statements manipulate roles and membership
§  SUPERUSER role for databases can grant access control to users or roles (not limited to HDFS permissions)
§  PUBLIC role includes all users
§  Prevents undesirable operations on objects by unauthorized users
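The relationship between users, roles, privileges, and objects described on this slide can be modeled in a few lines. This is a toy model for illustration only, not how Hive implements authorization; the `Authorizer` class and the role name `etl_writers` are hypothetical, while the user and table names come from the illustrative slides.

```python
from collections import defaultdict

PRIVILEGES = {"SELECT", "INSERT", "UPDATE", "DELETE"}


class Authorizer:
    """Toy model: privileges are granted to roles on objects; users hold roles."""

    def __init__(self):
        self._grants = defaultdict(set)   # (role, object) -> privileges
        self._members = defaultdict(set)  # role -> users

    def grant(self, role, obj, *privileges):
        for privilege in privileges:
            if privilege == "ALL PRIVILEGES":
                self._grants[(role, obj)] |= PRIVILEGES
            elif privilege in PRIVILEGES:
                self._grants[(role, obj)].add(privilege)
            else:
                raise ValueError(f"unknown privilege: {privilege}")

    def add_member(self, role, user):
        self._members[role].add(user)

    def roles_of(self, user):
        held = {role for role, users in self._members.items() if user in users}
        return held | {"PUBLIC"}  # the PUBLIC role includes all users

    def is_allowed(self, user, obj, privilege):
        return any(
            privilege in self._grants.get((role, obj), set())
            for role in self.roles_of(user)
        )


auth = Authorizer()
auth.grant("PUBLIC", "audience_db.page_clicks", "SELECT")
auth.grant("etl_writers", "audience_db.page_clicks", "ALL PRIVILEGES")
auth.add_member("etl_writers", "me_too")

print(auth.is_allowed("awesome_yahoo", "audience_db.page_clicks", "SELECT"))
print(auth.is_allowed("awesome_yahoo", "audience_db.page_clicks", "INSERT"))
print(auth.is_allowed("me_too", "audience_db.page_clicks", "INSERT"))
```

Every user can read the table through PUBLIC, but only members of the writer role can modify it.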
Audits, Compliance, and Efficiency
Starling
FS, Job, Task logs
Cluster 1 Cluster 2 Cluster n...
CF, Region, Action, Query Stats
Cluster 1 Cluster 2 Cluster n...
DB, Tbl., Part., Colmn. Access Stats
...MS 1 MS 2 MS n
GDM
Data Defn., Flow, Feed, Source
F 1 F 2 F n
Log Warehouse
Log Sources
In Summary
✔ Data shared across tools such as MR, Pig, and Hive: Apache HCatalog
✔ Schema and semantics knowledge across the company: Data Discovery
✔ Support for schema evolution and downstream change communication: Apache HCatalog
✔ Fine-grained access controls (row / column) vs. all or nothing: SQL-based Authorization
✔ Clear ownership of data: Data Discovery
✔ Data lineage and integrity: Data Discovery / Starling
✔ Audits and compliance (e.g. SOX): Data Discovery / Starling
✔ Retention, duplication, and waste: Data Discovery / Starling
Thank You
@sumeetksingh
@thiruvel
Office Hours
4:00pm–4:40pm, Table A

More Related Content

What's hot

Hive Demo Paper at VLDB 2009
Hive Demo Paper at VLDB 2009Hive Demo Paper at VLDB 2009
Hive Demo Paper at VLDB 2009Namit Jain
 
Hive Apachecon 2008
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008athusoo
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Hive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamHive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamZheng Shao
 
Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012Hortonworks
 
Spatial query on vanilla databases
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databasesJulian Hyde
 
HCatalog Hadoop Summit 2011
HCatalog Hadoop Summit 2011HCatalog Hadoop Summit 2011
HCatalog Hadoop Summit 2011Hortonworks
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010nzhang
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebookragho
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Jonathan Seidman
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 HiveZheng Shao
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache DrillCharles Givre
 
Hive User Meeting 2009 8 Facebook
Hive User Meeting 2009 8 FacebookHive User Meeting 2009 8 Facebook
Hive User Meeting 2009 8 FacebookZheng Shao
 

What's hot (20)

Hive hcatalog
Hive hcatalogHive hcatalog
Hive hcatalog
 
Hive Demo Paper at VLDB 2009
Hive Demo Paper at VLDB 2009Hive Demo Paper at VLDB 2009
Hive Demo Paper at VLDB 2009
 
Hive Apachecon 2008
Hive Apachecon 2008Hive Apachecon 2008
Hive Apachecon 2008
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Apache Hive
Apache HiveApache Hive
Apache Hive
 
SQL in Hadoop
SQL in HadoopSQL in Hadoop
SQL in Hadoop
 
Hive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive TeamHive User Meeting March 2010 - Hive Team
Hive User Meeting March 2010 - Hive Team
 
Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012
 
Spatial query on vanilla databases
Spatial query on vanilla databasesSpatial query on vanilla databases
Spatial query on vanilla databases
 
20081030linkedin
20081030linkedin20081030linkedin
20081030linkedin
 
HCatalog Hadoop Summit 2011
HCatalog Hadoop Summit 2011HCatalog Hadoop Summit 2011
HCatalog Hadoop Summit 2011
 
2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao
 
Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010Hive @ Hadoop day seattle_2010
Hive @ Hadoop day seattle_2010
 
Hive User Meeting August 2009 Facebook
Hive User Meeting August 2009 FacebookHive User Meeting August 2009 Facebook
Hive User Meeting August 2009 Facebook
 
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
Data Analysis with Hadoop and Hive, ChicagoDB 2/21/2011
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 Hive
 
Killing ETL with Apache Drill
Killing ETL with Apache DrillKilling ETL with Apache Drill
Killing ETL with Apache Drill
 
Hive User Meeting 2009 8 Facebook
Hive User Meeting 2009 8 FacebookHive User Meeting 2009 8 Facebook
Hive User Meeting 2009 8 Facebook
 
May 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data OutMay 2013 HUG: HCatalog/Hive Data Out
May 2013 HUG: HCatalog/Hive Data Out
 
Jan 2012 HUG: HCatalog
Jan 2012 HUG: HCatalogJan 2012 HUG: HCatalog
Jan 2012 HUG: HCatalog
 

Similar to Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop

Data Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataData Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataDataWorks Summit
 
HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out Sumeet Singh
 
Analysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRAAnalysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRABhadra Gowdra
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesDataWorks Summit
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoopmarklpollack
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Casesnzhang
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zingzingopen
 
Hadoop & Zing
Hadoop & ZingHadoop & Zing
Hadoop & ZingLong Dao
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry PerspectiveCloudera, Inc.
 
ETL big data with apache hadoop
ETL big data with apache hadoopETL big data with apache hadoop
ETL big data with apache hadoopMaulik Thaker
 
Valtech - Big Data & NoSQL : au-delà du nouveau buzz
Valtech  - Big Data & NoSQL : au-delà du nouveau buzzValtech  - Big Data & NoSQL : au-delà du nouveau buzz
Valtech - Big Data & NoSQL : au-delà du nouveau buzzValtech
 
Basic of Big Data
Basic of Big Data Basic of Big Data
Basic of Big Data Amar kumar
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 
Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseHenk van der Valk
 
Intro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseIntro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseJonathan Bloom
 
Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014rpbrehm
 

Similar to Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop (20)

Data Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your DataData Discovery on Hadoop - Realizing the Full Potential of your Data
Data Discovery on Hadoop - Realizing the Full Potential of your Data
 
HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out HUG Meetup 2013: HCatalog / Hive Data Out
HUG Meetup 2013: HCatalog / Hive Data Out
 
Analysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRAAnalysis of historical movie data by BHADRA
Analysis of historical movie data by BHADRA
 
January 2011 HUG: Howl Presentation
January 2011 HUG: Howl PresentationJanuary 2011 HUG: Howl Presentation
January 2011 HUG: Howl Presentation
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoop
 
Hive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use CasesHive Training -- Motivations and Real World Use Cases
Hive Training -- Motivations and Real World Use Cases
 
hadoop&zing
hadoop&zinghadoop&zing
hadoop&zing
 
Hadoop & Zing
Hadoop & ZingHadoop & Zing
Hadoop & Zing
 
Hadoop: An Industry Perspective
Hadoop: An Industry PerspectiveHadoop: An Industry Perspective
Hadoop: An Industry Perspective
 
ETL big data with apache hadoop
ETL big data with apache hadoopETL big data with apache hadoop
ETL big data with apache hadoop
 
Valtech - Big Data & NoSQL : au-delà du nouveau buzz
Valtech  - Big Data & NoSQL : au-delà du nouveau buzzValtech  - Big Data & NoSQL : au-delà du nouveau buzz
Valtech - Big Data & NoSQL : au-delà du nouveau buzz
 
Basic of Big Data
Basic of Big Data Basic of Big Data
Basic of Big Data
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Exploring sql server 2016 bi
Exploring sql server 2016 biExploring sql server 2016 bi
Exploring sql server 2016 bi
 
Get started with Microsoft SQL Polybase
Get started with Microsoft SQL PolybaseGet started with Microsoft SQL Polybase
Get started with Microsoft SQL Polybase
 
Intro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseIntro to Hybrid Data Warehouse
Intro to Hybrid Data Warehouse
 
BigData_Krishna Kumar Sharma
BigData_Krishna Kumar SharmaBigData_Krishna Kumar Sharma
BigData_Krishna Kumar Sharma
 
Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014Recommender.system.presentation.pjug.05.20.2014
Recommender.system.presentation.pjug.05.20.2014
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 

More from Sumeet Singh

Hadoop Summit Kiosk Deck
Hadoop Summit Kiosk DeckHadoop Summit Kiosk Deck
Hadoop Summit Kiosk DeckSumeet Singh
 
Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting...
Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting...Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting...
Keynote Hadoop Summit San Jose 2017 : Shaping Data Platform To Create Lasting...Sumeet Singh
 
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review
Hadoop Summit Dublin 2016: Hadoop Platform at Yahoo - A Year in Review Sumeet Singh
 
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...
Keynote Hadoop Summit Dublin 2016: Hadoop Platform Innovations - Pushing The ...Sumeet Singh
 
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...
Strata Conference + Hadoop World NY 2016: Lessons learned building a scalable...Sumeet Singh
 
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...
Hadoop Summit Amsterdam 2014: Capacity Planning In Multi-tenant Hadoop Deploy...Sumeet Singh
 
Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Hadoop Summit San Jose 2014: Costing Your Big Data Operations
Hadoop Summit San Jose 2014: Costing Your Big Data Operations Sumeet Singh
 
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...
Hadoop Summit Brussels 2015: Architecting a Scalable Hadoop Platform - Top 10...Sumeet Singh
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
 
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters Sumeet Singh
 
Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeo...
Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeo...Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeo...
Hadoop Summit San Jose 2013: Compression Options in Hadoop - A Tale of Tradeo...Sumeet Singh
 
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo!
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo! SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo!
SAP Technology Services Conference 2013: Big Data and The Cloud at Yahoo! Sumeet Singh
 
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...
Strata Conference + Hadoop World NY 2013: Running On-premise Hadoop as a Busi...Sumeet Singh
 
HBaseCon 2013: Multi-tenant Apache HBase at Yahoo!
HBaseCon 2013: Multi-tenant Apache HBase at Yahoo! HBaseCon 2013: Multi-tenant Apache HBase at Yahoo!
HBaseCon 2013: Multi-tenant Apache HBase at Yahoo! Sumeet Singh
 

Strata Conference + Hadoop World San Jose 2015: Data Discovery on Hadoop

  • 1. Data Discovery on Hadoop. Presented by Sumeet Singh and Thiruvel Thirumoolan, February 19, 2015. Strata Conference + Hadoop World 2015, San Jose
  • 2. Introduction
      Thiruvel Thirumoolan (@thiruvel), Principal Engineer, Hadoop and Big Data Platforms, Platforms and Personalization Products, 701 First Avenue, Sunnyvale, CA 94089 USA
      §  Developer in the Hive-HCatalog team, and active contributor to Apache Hive
      §  Responsible for Hive, HiveServer2 and HCatalog across all Hadoop clusters and for ensuring they work at scale for the usage patterns of Yahoo
      §  Loves mining the trove of Hadoop logs for usage patterns and insights
      §  Bachelor's degree from Anna University
      Sumeet Singh (@sumeetksingh), Sr. Director, Product Management, Cloud and Big Data Platforms, Platforms and Personalization Products, 701 First Avenue, Sunnyvale, CA 94089 USA
      §  Manages the Hadoop products team at Yahoo
      §  Responsible for Product Management, Strategy and Customer Engagements
      §  Managed the Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo
      §  MBA from UCLA and MS from Rensselaer Polytechnic Institute (RPI)
  • 3. Agenda
      1. The Data Management Challenge
      2. Apache HCatalog to Rescue
      3. Data Registration and Discovery
      4. Opening up Access to Data
      5. Q&A
  • 4. Hadoop as the Source of Truth for All Data (illustrative diagram)
      Diagram labels: TV, PC, Phone, Tablet; Pushed Data; Pulled Data (Web Crawl, Social, Email, 3rd Party Content); Data Highway; Hadoop Grid; BI, Reporting, Adhoc Analytics; Data, Content, Ads; No-SQL Serving Stores; Serving
  • 5. Growth in HDFS in the Last 10 Years (chart: number of servers and raw HDFS storage in PB, by year, 2006 to 2015)
      §  42,300 servers, 600 PB of raw HDFS storage, 1.5 billion files and directories
      §  Storage increases annotated on the chart: +70 PB, +66 PB, +64 PB, +80 PB, +100 PB
      §  Last 5 years: 22.2% CAGR
      §  Footnote 1: across all Hadoop (18 clusters, 40,080 servers, 565 PB) and HBase (9 clusters, 2,250 servers, 35 PB) clusters, Feb 16, 2015
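The 22.2% CAGR on this slide can be reproduced from the other numbers on the chart. The sketch below assumes that the five annotated increases are the five most recent annual increments and that they end at the 600 PB total; the transcript does not say which year each increment belongs to, so treat this as a consistency check rather than the presenters' own calculation.

```python
# Figures quoted on the slide (raw HDFS storage, in PB).
end_storage_pb = 600
# Assumption: these are the five most recent annual increments.
yearly_increases_pb = [70, 66, 64, 80, 100]

# Assumption: storage five years earlier = current total minus the five increments.
start_storage_pb = end_storage_pb - sum(yearly_increases_pb)
years = len(yearly_increases_pb)

# Compound annual growth rate: (end / start) ** (1 / years) - 1
cagr = (end_storage_pb / start_storage_pb) ** (1 / years) - 1

print(f"start = {start_storage_pb} PB, CAGR = {cagr:.1%}")
```

Under these assumptions the starting point is 220 PB, and growing 220 PB to 600 PB over five years matches the rate quoted on the slide.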
  • 6. Processing and Analyzing Data with Hadoop…Then
      Diagram labels: HDFS; MapReduce (YARN); Pig, Hive, Java MR APIs; InputFormat / OutputFormat; Load / Store; SerDe; MetaStore Client; Hive MetaStore; Hadoop Streaming; Oozie
  • 7. Processing and Analyzing Data with HBase…Then
      Diagram labels: HDFS; HBase; Pig, Hive, Java MR APIs; TableInputFormat / TableOutputFormat; HBaseStorage; HBaseStorage Handler; MetaStore Client; Hive MetaStore; Oozie
  • 8. Hadoop Jobs on the Platform Today
      Job Distribution (Jan 2015), chart: all jobs = 100% (28.9 M)
      Categories, as listed: Pig, Oozie Launcher, Java MR, Hive, GDM, and Streaming / distcp / Spark
      Shares, as listed: 1%, 4%, 5%, 7%, 35%, 47%
  • 9. Challenges in Managing Data on Multi-tenant Platforms
      Parties: Data Producers, Platform Services, Data Consumers
      Data Economy Challenges:
      §  Data shared across tools such as MR, Pig, and Hive
      §  Schema and semantics knowledge across the company
      §  Support for schema evolution and downstream change communication
      §  Fine-grained access controls (row / column) vs. all or nothing
      §  Clear ownership of data
      §  Data lineage and integrity
      §  Audits and compliance (e.g. SOX)
      §  Retention, duplication, and waste
      Addressed by: Apache HCatalog & Data Discovery
  • 10. Apache HCatalog in the Technology Stack
      Layers: Compute Services, Storage, Infrastructure Services
      Components shown: Hive, Pig, Oozie, Grid UI, GDM & Proxies, YARN, MapReduce, HDFS, HBase, ZooKeeper, Support Shop, Monitoring, Starling, Messaging Service, HCatalog, Storm, Spark, Tez
  • 11. HCatalog Facilitates Interoperability…Now
      Diagram labels: HDFS; HBase; MapReduce (YARN); Pig, Hive, Java MR APIs; InputFormat / OutputFormat; SerDe & Storage Handler; MetaStore Client; HCatalog MetaStore; HCatInputFormat / HCatOutputFormat; HCatLoader / HCatStorer; Notifications; Oozie
  • 12. Data Model
      Database (namespace)
        Table (schema)
          Partitions (optional per table)
            Buckets (optional per table)
              Skewed / Unskewed (optional per table)
      Partitions, buckets, and skews facilitate faster, more direct access to data
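To make the "faster, more direct access" point concrete, here is a minimal sketch of how partition key values can map to directories and how a filter on partition keys avoids touching unrelated data. It assumes the common `key=value` directory layout that Hive uses by default; external tables may register partitions at arbitrary locations. The warehouse path and partition values are hypothetical.

```python
def partition_path(table_location, partition_spec):
    """Build a directory path in the default key=value layout (assumed here)."""
    parts = [f"{key}={value}" for key, value in partition_spec.items()]
    return "/".join([table_location.rstrip("/")] + parts)


def prune(partitions, **wanted):
    """Keep only the partitions whose key values match the filter."""
    return [p for p in partitions
            if all(p.get(key) == value for key, value in wanted.items())]


# Hypothetical table location and partitions, for illustration only.
table_location = "/warehouse/xyz.db/search"
partitions = [
    {"locale": "US", "datestamp": "20130201"},
    {"locale": "UK", "datestamp": "20130201"},
    {"locale": "US", "datestamp": "20130202"},
]

# A query filtered on locale='US' only needs the matching directories.
selected = [partition_path(table_location, p) for p in prune(partitions, locale="US")]
for path in selected:
    print(path)
```

Only two of the three directories are read for the `locale='US'` filter, which is the effect partitioning is meant to provide.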
  • 13. Sample Table Registration
      Select project database:
        USE xyz;
      Create table:
        CREATE EXTERNAL TABLE search (
          bcookie string COMMENT 'Standard browser cookie',
          time_stamp int COMMENT 'DD-MON-YYYY HH:MI:SS (AM/PM)',
          uid string COMMENT 'User id',
          ip string COMMENT '...',
          pg_spaceid string COMMENT '...',
          ...)
        PARTITIONED BY (
          locale string COMMENT 'Country of origin',
          datestamp string COMMENT 'Date in YYYYMMDD format')
        STORED AS ORC
        LOCATION '/projects/search/...';
      Add partitions manually (if you choose to):
        ALTER TABLE search ADD PARTITION (locale='US', datestamp='20130201')
          LOCATION '/projects/search/...';
      Your company's data (metadata) can be registered with HCatalog irrespective of the tool used
  • 14. Getting Data into HCatalog – DML and DDL
      LOAD files into tables: copy / move data from HDFS or the local filesystem into HCatalog tables
        LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
          [PARTITION (partcol1=val1, partcol2=val2 ...)];
      INSERT data from a query into tables: query results can be inserted into tables or file system directories by using the INSERT clause
        INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]
          select_statement1 FROM from_statement;
        INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
          select_statement1 FROM from_statement;
      HCatalog also supports multiple inserts in the same statement or dynamic partition inserts.
      ALTER TABLE ADD PARTITIONS
        ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1';
  • 15. Getting Data into HCatalog – HCatalog APIs
      Pig: HCatLoader and HCatStorer are used with Pig scripts to read data from and write data to HCatalog-managed tables
        A = load '$DB.$TABLE' using org.apache.hcatalog.pig.HCatLoader();
        B = FILTER A BY $FILTER;
        C = foreach B generate foo, bar;
        store C into '$OUTPUT_DB.$OUTPUT_TABLE' USING org.apache.hcatalog.pig.HCatStorer('$OUTPUT_PARTITION');
      MapReduce: HCatInputFormat and HCatOutputFormat are used with MapReduce to read data from and write data to HCatalog-managed tables
        Map<String, String> partitionValues = new HashMap<String, String>();
        partitionValues.put("a", "1");
        partitionValues.put("b", "1");
        HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
        HCatOutputFormat.setOutput(job, info);
  • 16. HCatalog Integration with Data Mgmt. Platform (GDM)
      Diagram: Cluster 1 - Colo 1 and Cluster 2 - Colo 2, each with HDFS and a MetaStore; Grid Data Management performs Feed Acquisition and Feed Replication
      §  Feed datasets as partitioned external tables
      §  Growl extracts schema for backfill
      §  On each cluster: HCatClient.addPartitions(…), mark LOAD_DONE, add_partition event notification
      §  Partitions are dropped with HCatClient.dropPartitions(…) after retention expiration, with a drop_partition notification
  • 17. HCatalog Notifications
      Namespace: e.g. "hcat.thebestcluster"
      JMS Topic: e.g. "<dbname>.<tablename>"
      Sample JMS Notification:
        {
          "timestamp" : 1360272556,
          "eventType" : "ADD_PARTITION",
          "server" : "thebestcluster-hcat.dc1.grid.yahoo.com",
          "servicePrincipal" : "hcat/thebestcluster-hcat.dc1.grid.yahoo.com@GRID.YAHOO.COM",
          "db" : "xyz",
          "table" : "search",
          "partitions": [
            { "locale" : "US", "datestamp" : "20140602" },
            { "locale" : "UK", "datestamp" : "20140602" },
            { "locale" : "IN", "datestamp" : "20140602" }
          ]
        }
      HCatalog uses JMS (ActiveMQ) notifications that can be sent for add_database, add_table, add_partition, drop_partition, drop_table, and drop_database. Notifications can be extended for schema change communication
      Diagram: HCat Client, HCat MetaStore, ActiveMQ Server (Register Channel, Publish to listener channels), Subscribers
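A subscriber mostly needs two things from such a message: which table it concerns and which partitions arrived. The sketch below parses the sample payload from the slide using only the standard library; the helper names `topic_name` and `new_partitions` are hypothetical, and no JMS client is involved (how the message body reaches the subscriber is out of scope here).

```python
import json

# Sample payload copied from the slide.
SAMPLE_NOTIFICATION = """
{
  "timestamp" : 1360272556,
  "eventType" : "ADD_PARTITION",
  "server" : "thebestcluster-hcat.dc1.grid.yahoo.com",
  "servicePrincipal" : "hcat/thebestcluster-hcat.dc1.grid.yahoo.com@GRID.YAHOO.COM",
  "db" : "xyz",
  "table" : "search",
  "partitions": [
    { "locale" : "US", "datestamp" : "20140602" },
    { "locale" : "UK", "datestamp" : "20140602" },
    { "locale" : "IN", "datestamp" : "20140602" }
  ]
}
"""


def topic_name(event):
    """Topic in the "<dbname>.<tablename>" form shown on the slide."""
    return f'{event["db"]}.{event["table"]}'


def new_partitions(event):
    """Return the partition specs for ADD_PARTITION events, else an empty list."""
    if event.get("eventType") != "ADD_PARTITION":
        return []
    return event.get("partitions", [])


event = json.loads(SAMPLE_NOTIFICATION)
print(topic_name(event))
for spec in new_partitions(event):
    print(spec["locale"], spec["datestamp"])
```

A downstream job could use the returned partition specs to decide whether all locales for a given datestamp have landed before starting.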
  • 18. Oozie, HCatalog, and Messaging Integration
      Components: Oozie, Message Bus, HCatalog, Data Producer, HDFS
      Data Producer:
      §  Produce data (distcp, pig, M/R..) at /data/click/2014/06/02
      §  Update metadata (ALTER TABLE click ADD PARTITION(data='2014/06/02') location 'hdfs://data/click/2014/06/02')
      Flow:
      1. Query / Poll Partition
      2. Register Topic
      3. Push notification <New Partition>
      4. Notify New Partition
      Then: Start workflow
  • 19. Data Discovery with HCatalog
      §  Unified metadata store for all data at Yahoo
      §  Discovery is about
          o  Browsing / inspecting metadata and data
          o  Searching for datasets
      §  It helps to solve
          o  Schema knowledge across the company
          o  Ownerships
          o  Data type – dev or prod
          o  Understand data
          o  Schema evolution
          o  Lineage
  • 20. Data Discovery Features
      §  Browsing
          o  Tables / Databases
          o  Schema, format, properties
          o  Partitions and metadata about each partition
      §  Searches for tables
          o  Table name (regex) or Comments
          o  Column name or comments
          o  Ownership, File format
          o  Location
          o  Properties (Dev/Prod)
  • 21. Data Discovery UI in Production (illustrative mock-up of the Global HCatalog Dashboard)
      §  Search the HCat tables (Search Tables box)
      §  Browse the DBs by cluster. Available Databases on The Best Cluster: audience_db, tumblr_db, user_db, flickr_db
      §  Search results or browse db results. Available Tables (audience_db):
          o  page_clicks: Hourly clickstream table
          o  ad_clicks: Hourly ad clicks table
          o  user_info: User registration info
          o  session_info: Session feed info
          o  audience_info: Primary audience table
  • 22. Table Display UI (illustrative mock-up of the Global HCatalog Dashboard)
      HCat Instance: The Best Cluster
      Database: audience_db
      Table: page_clicks
      Owner: Awesome Yahoo
      Schema (Column, Type, Description):
          bcookie, string, Standard browser cookie
          timestamp, string, DD-MON-YYYY HH:MI:SS (AM/PM)
          . .
      Partitions (Column, Type, Description):
          dt, string, Date in YYYY_MM_DD format
  • 23. Data Discovery Physical View (illustrative diagram)
      Discovery UI: Global View of All Data in Metastore
      Data Center 1: clusters DC1-C1, DC1-C2, . . ., DCn-Cn
      Data Center 2: clusters DC2-C1, DC2-C2, . . ., DCm-Cm
      Each cluster: MS (Metastore) with a WebServer and HCat API
  • 24. Data Discovery Design
      §  A single web interface connects to all Metastore instances (all datacenters)
      §  Select an appropriate cluster and browse all metadata
          o  A webserver runs on each Metastore
          o  All reads audited
          o  ACLs (future)
      §  Search functionality will be added to web interface and Metastore
          o  New Thrift interface to support search
          o  All searches audited
      §  Long term design
          o  Load on production
          o  Read and Write HCatalog instances
  • 25. Data Discovery Design – APIs
      §  Search
          o  Searches across various fields in order
          o  Simple ranking
          o  Search order for multiple keywords
          o  Optimized implementation for database
          o  Will be contributed back
      §  Unique partition values
          o  One or more partition keys
          o  Filtering and Ordering supported
          o  HIVE-7604 (https://issues.apache.org/jira/browse/HIVE-7604)
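One way to read "searches across various fields in order" with "simple ranking" is that a match in an earlier field outranks a match in a later one. The sketch below illustrates that idea on an in-memory catalog; it is not the Metastore or Thrift search interface described in the slides. The field order (table name, table comment, column names, column comments) follows the search fields listed on the features slide, the table names and comments come from the UI mock-up, and the column entries are hypothetical.

```python
import re

# Hypothetical in-memory catalog; column names and comments are invented for illustration.
CATALOG = [
    {"table": "page_clicks", "comment": "Hourly clickstream table",
     "columns": {"bcookie": "Standard browser cookie"}},
    {"table": "ad_clicks", "comment": "Hourly ad clicks table",
     "columns": {"ad_id": "Advertisement id"}},
    {"table": "user_info", "comment": "User registration info",
     "columns": {"uid": "User id", "clicks_total": "Lifetime clicks"}},
]


def field_groups(entry):
    """Fields in search order: earlier groups rank higher."""
    return [
        [entry["table"]],
        [entry["comment"]],
        list(entry["columns"].keys()),
        list(entry["columns"].values()),
    ]


def search(catalog, pattern):
    """Return table names ranked by the first field group that matches the regex."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for entry in catalog:
        for rank, values in enumerate(field_groups(entry)):
            if any(rx.search(value) for value in values):
                hits.append((rank, entry["table"]))
                break  # only the best-ranked match per table counts
    return [name for _, name in sorted(hits)]


print(search(CATALOG, "click"))
print(search(CATALOG, "cookie"))
```

Tables whose names match come first, and a table that only matches through a column name is listed after them; ties are broken alphabetically in this sketch.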
  • 26. Data Discovery Design – Optimizations
      §  Allows peeking into the data (select * limit n)
      §  Existing implementations are costly
          o  Too many client and server resources
          o  Timeouts and failures
      §  Optimized partition objects and used names
      §  New implementation takes a few seconds at most
      §  HIVE-9573 (https://issues.apache.org/jira/browse/HIVE-9573)
  • 27. Going Forward – Lineage
      Advantages: Bottleneck, Ownership, Quality
      Challenges: Offline / Real Time, Data Flow / Control Flow, Software Stack
  • 28. Going Forward – Lineage (illustrative diagram)
      Statistics help in heuristics instead of running a job
      Stages shown: Table 1 / Partition 1 (Stage-1); HBase; ORC Table Partition 1 (Stage-2); Dimension Table; Statistics / Agg. Table (Stage-3); Daily Stats Table (Stage-4)
      Other labels: Copied by distcp / external registrar; Hourly
  • 29. Going Forward – Schema Versioning (illustrative)
      Schema (Column, Type, Description):
          bcookie, string, Standard browser cookie
          timestamp, string, DD-MON-YYYY HH:MI:SS (AM/PM)
          uid, string, User id
      File Format: ORC
      Table Properties: Compression Type zlib; External
      Change history:
      §  User 'awesome_yahoo' added 'foo string' to the table on May 29, 2014 at 1:10 AM
      §  User 'me_too' added table properties 'orc.compress=ZLIB' on May 30, 2014 at 9:00 AM
      §  User 'me_too' changed the file format from 'RCFile' to 'ORC' on Jun 1, 2014 at 10:30 AM
      . . .
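A change history like the one on this slide can be derived by comparing successive schema versions. The sketch below is one simple way to do that for columns only; it assumes each version is available as a mapping of column name to type, which is an illustrative representation and not how HCatalog stores schemas. The column names are taken from the slide.

```python
def diff_schema(old, new):
    """Describe column-level changes between two schema versions (name -> type)."""
    changes = []
    for name, col_type in new.items():
        if name not in old:
            changes.append(f"added column '{name} {col_type}'")
        elif old[name] != col_type:
            changes.append(f"changed type of '{name}' from {old[name]} to {col_type}")
    for name in old:
        if name not in new:
            changes.append(f"dropped column '{name}'")
    return changes


# Version before and after the 'foo string' change mentioned on the slide.
v1 = {"bcookie": "string", "timestamp": "string", "uid": "string"}
v2 = dict(v1, foo="string")

for change in diff_schema(v1, v2):
    print(change)
```

Attaching the user and timestamp of each change, as shown on the slide, would additionally require an audit record of who issued the DDL; that part is not sketched here.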
  • 30. HCatalog is Part of a Broader Solution Set (Product: Role in the Grid Stack)
      Hive
      §  Data warehousing to facilitate querying and managing large datasets in HDFS
      §  Mechanism to project structure onto HDFS data and query using a SQL-like language
      HiveServer2
      §  Server process (Thrift-based RPC) for concurrent clients connecting over ODBC/JDBC
      §  Authentication and authorization for ODBC/JDBC clients for metadata access
      HCatalog
      §  Table and storage management layer for Hadoop tools to easily share data
      §  Relational view of data, storage location and format abstraction, notifications of availability
      Starling
      §  Hadoop log warehouse for analytics on grid usage (job history, tasks, job counters etc.)
      §  1 TB of raw logs processed / day, 24 TB of processed data
  • 31. Deployment Layout (illustrative diagram)
      Diagram labels: Batch & Interactive SQL; Tez and MapReduce on YARN + HDFS; RDBMS; LoadBalancer; HCatalog Thrift; HS2 ODBC/JDBC; Launcher Gateway; Data Out Client; Client / CLI; HiveQL; M/R / Tez Jobs; Pig M/R; Cloud Messaging; HiveServer2; Hadoop; Hive; HCatalog; BI, Reporting, DataOut, Dev UI; Data to Desktop (D2D)
  • 32. Data Governance
      Data Access classification: Public, Non-sensitive, Financial ($), Restricted
      Governance: No addn. reqmt.; LMS Integration; Stock Admin Integration; Owner Review
      Approval labels: Manager approves; Employee acknowledges
  • 33. SQL-based Authorization for Controlled Access
      §  SQL-compliant authorization model (Users, Roles, Privileges, Objects)
      §  Fine-grain authorization and access control patterns (row and column in conjunction with views)
      §  Can be used in conjunction with storage-based authorization
      Privileges
      §  Objects consist of databases, tables, and views
      §  Privileges are GRANTed on objects
          o  SELECT: read access to an object
          o  INSERT: write (insert) access to an object
          o  UPDATE: write (update) access to an object
          o  DELETE: delete access for an object
          o  ALL PRIVILEGES: all privileges
      Access Control
      §  Roles can be associated with objects
      §  Privileges are associated with roles
      §  CREATE, DROP, and SET ROLE statements manipulate roles and membership
      §  SUPERUSER role for databases can grant access control to users or roles (not limited to HDFS permissions)
      §  PUBLIC role includes all users
      §  Prevents undesirable operations on objects by unauthorized users
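The relationship between users, roles, privileges, and objects described above can be sketched as a small in-memory model. This is a simplified illustration of the concepts only, not Hive's authorization implementation: the `Authorizer` class, its method names, and the user names are hypothetical, and details such as SUPERUSER, SET ROLE, and views are left out.

```python
class Authorizer:
    """Toy role-based model: privileges are granted to roles on objects."""

    def __init__(self):
        self.role_members = {"PUBLIC": set()}  # role -> explicit members
        self.grants = set()                    # (role, privilege, object)

    def create_role(self, role):
        self.role_members.setdefault(role, set())

    def grant_role(self, role, user):
        self.role_members[role].add(user)

    def grant(self, privilege, obj, role):
        self.grants.add((role, privilege, obj))

    def roles_of(self, user):
        roles = {r for r, members in self.role_members.items() if user in members}
        roles.add("PUBLIC")  # the PUBLIC role includes all users
        return roles

    def is_allowed(self, user, privilege, obj):
        return any((role, p, obj) in self.grants
                   for role in self.roles_of(user)
                   for p in (privilege, "ALL PRIVILEGES"))


auth = Authorizer()
auth.create_role("analyst")
auth.grant_role("analyst", "alice")
auth.grant("SELECT", "audience_db.page_clicks", "analyst")
auth.grant("SELECT", "audience_db.ad_clicks", "PUBLIC")

print(auth.is_allowed("alice", "SELECT", "audience_db.page_clicks"))
print(auth.is_allowed("bob", "SELECT", "audience_db.page_clicks"))
```

Granting to roles rather than to individual users keeps the number of grants small on a multi-tenant platform: membership changes do not require touching the objects themselves.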
  • 34. Audits, Compliance, and Efficiency
      Log Warehouse: Starling
      Log Sources:
      §  FS, Job, Task logs (Cluster 1, Cluster 2, ... Cluster n)
      §  CF, Region, Action, Query Stats (Cluster 1, Cluster 2, ... Cluster n)
      §  DB, Tbl., Part., Colmn. Access Stats (MS 1, MS 2, ... MS n)
      §  GDM: Data Defn., Flow, Feed, Source (F 1, F 2, ... F n)
  • 35. In Summary (challenge: solution)
      ✔  Data shared across tools such as MR, Pig, and Hive: Apache HCatalog
      ✔  Schema and semantics knowledge across the company: Data Discovery
      ✔  Support for schema evolution and downstream change communication: Apache HCatalog
      ✔  Fine-grained access controls (row / column) vs. all or nothing: SQL-based Authorization
      ✔  Clear ownership of data: Data Discovery
      ✔  Data lineage and integrity: Data Discovery / Starling
      ✔  Audits and compliance (e.g. SOX): Data Discovery / Starling
      ✔  Retention, duplication, and waste: Data Discovery / Starling