Data Discovery on Hadoop - Realizing the Full Potential of your Data
  • (30 sec) Welcome to Data Discovery on Hadoop. We will explain our approach to realizing the full potential, or value, of the data in your organization, whether you are a Hadoop-driven business or want to become one.
  • (30 sec) – 1 min Before we begin, let us introduce ourselves. My name is Sumeet Singh; I am a Sr. Director of Product Management and I head the PM functions for Hadoop at Yahoo. Thiruvel is a Principal Engineer on the Hive team at Yahoo who works across Hive, HCatalog, HiveServer2, and Starling. With that, let's get into the details.
  • (1 min) – 2 min Let me walk you through our agenda. We will explain the challenges of data management, particularly as it relates to Hadoop at Yahoo. We will then introduce HCatalog and explain why it is a great solution for the challenges we face in data management. We will then describe how to get all your data into HCatalog, and the specifics of data discovery. Once your data is in a central repository and you can discover it, we will explain ways you can open it up by exercising controlled access to that data so that the entire organization can benefit. We will then summarize and open it up for Q&A.
  • (1 min) – 3 min Yahoo products and properties across devices generate a lot of data that is of immense value to us in driving new and interesting user experiences across devices. All that data comes to Hadoop, which acts as a single source of truth for all data at Yahoo. A wide variety of other data also gets pulled into the Hadoop Grid from various sources as shown. The idea is to consolidate data from across the company, from disparate sources, in one place so that it can be (a) shared, (b) enriched, (c) de-duped, and (d) kept up to date. That data, once processed, is applied back or served as value to our consumers in the form of personalized experiences across our products and properties, and of course is used for reporting and analytics. All of this is done while keeping web-scale economics and cost in mind.
  • (30 sec) – 3.5 min As a result, our infrastructure, and in particular HDFS, continues to grow, and as of last month accounts for almost 480 PB across Hadoop and HBase clusters. There are 1.25 billion files and directories on this infrastructure as of last month (the NameNode keeps track of these along with blocks). I am not sure you want to know all about those 1.25 billion files, but you should have the ability to if you wanted to. This talk is really all about that.
  • (30 sec) – 4 min All that data in HDFS gets processed and analyzed through a variety of tools such as MapReduce, Pig, and Hive. Most of our users use Oozie to automate the scheduling of these jobs. Pig and MR have the schema, format, and location encoded in the application or the script. Hive, on the other hand, introduced an additional component, the metastore, so that it can read the data from metadata.
  • (30 sec) – 4.5 min HBase also provided these tools access to its stored data through TableInputFormat/TableOutputFormat and storage handlers, but the story largely stayed the same as Hadoop.
  • (30 sec) – 5 min Just to put things in perspective, this is how the Hadoop platform and the data in HDFS get used in terms of the jobs that are run on the platform to process data. A wide variety of tools on Hadoop, such as Pig, MapReduce, and Hive, all read and write data on HDFS. Pig, MR, and Hive continue to dominate the job mix at Yahoo, with Oozie scheduling most of those jobs, which is why you see the Oozie launcher numbers so high in the total job mix.
  • (1.5 min) – 6.5 min Just so you understand the data economy (I like that term): producers with ETL produce the data on the platform, which is then consumed off the platform by downstream consumers for analytics or serving. And managing a platform of this magnitude, and the data volume at scale, of course has its challenges: sharing of data, schema and semantics knowledge across the company, schema evolution and change awareness, access controls, lineage, integrity or data quality, audits and compliance, and finally reducing HDFS waste. We believe that HCatalog and Data Discovery solve almost all of these, letting you take full advantage of the company's data for research, insights, driving product performance, and coming up with new user experiences.
  • (30 sec) – 10.5 min Registrations are generally external tables, as we are bringing legacy HDFS data into Hive; the data is not managed by Hive. External tables are useful for sharing data, e.g. data created by Pig but queried with Hive, without giving ownership to Hive. They are also useful when data is already processed and in a usable state in HDFS. Dropping such tables does not delete the data, so clean up manually after dropping tables/partitions. Partitions can be added manually or through automation with the data movement tools that I will describe in just a second.
  • (1 min) – 11.5 min Getting data registered using HCat DML (external tables); new data can be internal from then on. DML: LOAD, and INSERT from a query. DDL: ADD PARTITION.
  • (1 min) – 12.5 min Explicit data paths: when the data organization on HDFS changes, scripts need modification. Explicit file/record format: prone to change; needs script changes. Explicit schema during consumption: when the schema evolves, this needs to change.
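The contrast these notes describe can be sketched in Pig (the path, field names, and database/table names here are illustrative, not a real feed):

```pig
-- Without HCatalog: path, file format, and schema are all hard-coded
-- in the script and must be edited whenever any of them changes.
raw = LOAD '/projects/search/US/20140602' USING PigStorage('\t')
      AS (bcookie:chararray, time_stamp:int, uid:chararray);

-- With HCatalog: all three are resolved from the metastore at run time.
raw = LOAD 'xyz.search' USING org.apache.hcatalog.pig.HCatLoader();
```

With HCatLoader, a change to the data's location, format, or schema is absorbed by the metastore rather than rippling through every consuming script.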
  • (1.5 min) – 14 min Talk about what GDM is. Approaches here: how GDM-to-HCat registration is accomplished; backfill of new and old partitions. Add JIRA numbers for the work Yahoo has done. Feed registrations and partition-availability publication. Extracting schema info from existing HDFS files (e.g. using Growl).
  • (1 min) – 15 min Partition-message consumption. A federation layer atop ActiveMQ arranges ActiveMQ servers into "namespaces" and manages access control for messages sent to topics in a namespace.
  • (1 min) – 16 min GDM Acquisition copies data onto one cluster. Oozie consumes by polling, say every hour for a daily feed, in a directory. This puts load on the NameNode: 2,000 GDM feeds at 5-minute polling frequencies. Notion of done: _DONE_ files vs. the existence of data in a directory as the source of truth. When data is available, launch the appropriate workflow. The NameNode is hammered, and data-consumption latency must be balanced against that load; worst-case latency == poll_freq. Explain completeness of a dataset instance (which differs from the notion of a partition) -> partition-set support in HCatalog. Problems with using empty file markers: further NN pressure. Instead, Oozie is notified of partition availability via JMS messages to trigger workflows immediately; Oozie and HCatalog interoperate via the Cloud Messaging System (CMS) for messaging.
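The polling arithmetic in these notes can be made concrete with a small sketch (the 2,000 feeds and 5-minute frequency are the figures cited above; the helper functions are hypothetical):

```python
def worst_case_latency_minutes(poll_interval_min):
    """Under polling, data that lands just after a poll waits a full
    interval before being noticed: worst-case latency == poll_freq."""
    return poll_interval_min

def directory_listings_per_hour(num_feeds, poll_interval_min):
    """Each feed's poll is a NameNode directory listing, so the load
    scales with the number of feeds times the polling rate."""
    return num_feeds * (60 // poll_interval_min)

# 2,000 GDM feeds polled every 5 minutes:
print(directory_listings_per_hour(2000, 5))  # 24000 listings per hour
print(worst_case_latency_minutes(5))         # 5 minutes
```

Push notifications over JMS remove both the listing load and the poll-interval latency floor, which is the motivation for the Oozie/HCatalog messaging integration.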
  • (1.5 min) – 17.5 min Thiruvel
  • (1 min) – 18.5 min Thiruvel
  • (1.5 min) – 20 min Thiruvel
  • (1 min) – 21 min Thiruvel
  • (1 min) – 22 min Thiruvel
  • (2 min) – 24 min Thiruvel
  • (1.5 min) – 25.5 min Thiruvel
  • (1 min) – 26.5 min Thiruvel
  • (1 min) – 27.5 min
  • (1 min) – 28.5 min
  • (1 min) – 29.5 min Based on expressing a computation as a dataflow graph with reusable primitives (e.g. sort, merge, etc.). Hive SQL can be expressed as a single job (no interruptions, for an efficient pipeline), with no intermediate outputs to HDFS (speed plus network/disk usage savings). Vectorization allows Hive to process a batch of rows together. MapReduce query startup is very expensive: job and task launch latencies can add up to 5 to 30 seconds, which is not good for short queries; container pre-allocation or warm containers (container pre-launch) eliminate task-launch overhead to serve queries. CBO: Hive has table- and column-level statistics, used to determine parallelism and join selection.
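The "single pipeline, no intermediate outputs" idea can be illustrated with a toy dataflow built from Python generators (stage names are hypothetical; this mirrors only the shape of the idea, not Tez itself):

```python
def scan(records):
    """Source vertex: stream records one at a time."""
    for r in records:
        yield r

def filter_stage(rows, predicate):
    """Intermediate vertex: rows flow straight through; nothing is
    materialized to disk between stages."""
    for r in rows:
        if predicate(r):
            yield r

def aggregate(rows):
    """Sink vertex: consume the stream and produce a final value."""
    return sum(rows)

data = [3, 1, 4, 1, 5, 9, 2, 6]
evens = filter_stage(scan(data), lambda x: x % 2 == 0)
print(aggregate(evens))  # 12 (4 + 2 + 6)
```

Contrast this with chained MapReduce jobs, where each stage would write its full output to HDFS before the next stage could start.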

Data Discovery on Hadoop - Realizing the Full Potential of your Data Presentation Transcript

  • 1. Data Discovery on Hadoop - Realizing the Full Potential of Your Data. Presented by Thiruvel Thirumoolan, Sumeet Singh | June 3, 2014. 2014 Hadoop Summit, San Jose, California
  • 2. Introduction 2 2014 Hadoop Summit, San Jose, California Sumeet Singh Senior Director, Product Management Hadoop and Big Data Platforms Cloud Engineering Group Thiruvel Thirumoolan Principal Engineer Hadoop and Big Data Platforms Cloud Engineering Group  Developer in the Hive-HCatalog team, and active contributor to Apache Hive  Responsible for Hive, HiveServer2 and HCatalog across all Hadoop clusters and for ensuring they work at scale for the usage patterns of Yahoo  Loves mining the trove of Hadoop logs for usage patterns and insights  Bachelor's degree from Anna University 701 First Avenue, Sunnyvale, CA 94089 USA @thiruvel  Manages the Hadoop products team at Yahoo!  Responsible for Product Management, Strategy and Customer Engagements  Managed the Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo  M.B.A. from UCLA and M.S. from Rensselaer (RPI) 701 First Avenue, Sunnyvale, CA 94089 USA @sumeetksingh
  • 3. Agenda 3 2014 Hadoop Summit, San Jose, California 1. The Data Management Challenge 2. Apache HCatalog to the Rescue 3. Data Registration and Discovery 4. Opening Up Adhoc Access to Data 5. Summary and Q&A
  • 4. Hadoop Grid as the Source of Truth for Data 4 2014 Hadoop Summit, San Jose, California TV PC Phone Tablet Pushed Data Pulled Data Web Crawl Social Email 3rd Party Content Data Advertising Content User Profiles / No-SQL Serving Stores Serving Data Highway Feeds Hadoop Grid BI, Reporting, Adhoc Analytics ILLUSTRATIVE
  • 5. 5 2014 Hadoop Summit, San Jose, California Growth in HDFS1: 34,000 servers, 478 PB raw HDFS storage, 1.25 billion files & directories (chart: number of servers and raw HDFS storage in PB by year, 2006-2014). 1 Across all Hadoop (16 clusters, 32,500 servers, 455 PB) and HBase (7 clusters, 1,500 servers, 23 PB) clusters, May 23, 2014
  • 6. Processing and Analyzing Data with Hadoop…Then 6 2014 Hadoop Summit, San Jose, California HDFS MapReduce (YARN) Pig Hive Java MR APIs InputFormat/ OutputFormat Load / Store SerDe MetaStore Client Hive MetaStore Hadoop Streaming Oozie
  • 7. Processing and Analyzing Data with HBase…Then 7 2014 Hadoop Summit, San Jose, California HDFS HBase Pig HiveJava MR APIs TableInputFormat/ TableOutputFormat HBaseStorage MetaStore Client Hive MetaStore HBaseStorage Handler Oozie
  • 8. Hadoop Jobs on the Platform Today 8 2014 Hadoop Summit, San Jose, California Job Distribution (May 1 – May 26, 2014), 21.5 M jobs total (100%): Pig 45%, Oozie Launcher 31%, Java MR 10%, Hive 9%, GDM 4%, Streaming/distcp/Spark 1%
  • 9. Challenges in Managing Data on Multi-tenant Platforms 9 2014 Hadoop Summit, San Jose, California Data Producers Platform Services Data Consumers  Data shared across tools such as MR, Pig, and Hive  Schema and semantics knowledge across the company  Support for schema evolution and downstream change communication  Fine-grained access controls (row / column) vs. all or nothing  Clear ownership of data  Data lineage and integrity  Audits and compliance (e.g. SOX)  Retention, duplication, and waste Data Economy Challenges Apache HCatalog & Data Discovery
  • 10. Apache HCatalog in the Technology Stack at Yahoo 10 2014 Hadoop Summit, San Jose, California Compute Services Storage Infrastructure Services HivePig Oozie HDFS ProxyGDM YARN MapReduce HDFS HBase Zookeeper Support Shop Monitoring Starling Messaging Service HCatalog Storm SparkTez
  • 11. HCatalog Facilitates Interoperability…Now 11 2014 Hadoop Summit, San Jose, California HDFS MapReduce (YARN) Pig HiveJava MR APIs InputFormat/ OutputFormat SerDe & Storage Handler MetaStore Client HCatalog MetaStore HCatInputFormat / HCatOutputFormat HCatLoader/ HCatStorer HDFS HBase Notifications Oozie
  • 12. 12 2014 Hadoop Summit, San Jose, California Data Model: Database (namespace) > Table (schema) > Partitions > Buckets (skewed or unskewed; optional per table). Partitions, buckets, and skews facilitate faster, more direct access to data. Note on Buckets  It is hard to guess the right number of buckets, which can also change over time, and hard to coordinate and align for joins  The community is working on dynamic bucketing that would have the same benefit without the need for static bucketing
  • 13. Sample Table Registration 13 2014 Hadoop Summit, San Jose, California Select project database: USE xyz; Create table: CREATE EXTERNAL TABLE search ( bcookie string COMMENT 'Standard browser cookie', time_stamp int COMMENT 'DD-MON-YYYY HH:MI:SS (AM/PM)', uid string COMMENT 'User id', ip string COMMENT '...', pg_spaceid string COMMENT '...', ...) PARTITIONED BY ( locale string COMMENT 'Country of origin', datestamp string COMMENT 'Date in YYYYMMDD format') STORED AS ORC LOCATION '/projects/search/...'; Add partitions manually (if you choose to): ALTER TABLE search ADD PARTITION ( locale='US', datestamp='20130201') LOCATION '/projects/search/...'; All your company's data (metadata) can be registered with HCatalog irrespective of the tool used.
  • 14. Getting Data into HCatalog – DML and DDL 14 2014 Hadoop Summit, San Jose, California LOAD Files into tables Load operations are copy/move operations from HDFS or local filesystem that move datafiles into locations corresponding to HCat tables. File format must agree with the table format. LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]; INSERT data from a query into tables Query results can be inserted into tables of file system directories by using the insert clause. INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement; INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement; HCat also supports multiple inserts in the same statement or dynamic partition inserts. ALTER TABLE ADD PARTITIONS You can use ALTER TABLE ADD PARTITION to add partitions to a table. The location must be a directory inside of which data files reside. If new partitions are directly added to HDFS, HCat will not be aware of these. ALTER TABLE table_name ADD PARTITION (partCol = 'value1') location 'loc1’;
  • 15. Getting Data into HCatalog – HCat APIs 15 2014 Hadoop Summit, San Jose, California Pig HCatLoader is used with Pig scripts to read data from HCatalog-managed tables, and HCatStorer is used with Pig scripts to write data to HCatalog-managed tables. A = load '$DB.$TABLE' using org.apache.hcatalog.pig.HCatLoader(); B = FILTER A BY $FILTER; C = foreach B generate foo, bar; store C into '$OUTPUT_DB.$OUTPUT_TABLE' USING org.apache.hcatalog.pig.HCatStorer ('$OUTPUT_PARTITION'); MapReduce The HCatInputFormat is used with MapReduce jobs to read data from HCatalog-managed tables. HCatOutputFormat is used with MapReduce jobs to write data to HCatalog-managed tables. Map<String, String> partitionValues = new HashMap<String, String>(); partitionValues.put("a", "1"); partitionValues.put("b", "1"); HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues); HCatOutputFormat.setOutput(job, info);
  • 16. HCatalog Integration with Data Mgmt. Platform (GDM) 16 2014 Hadoop Summit, San Jose, California HCatalog MetaStore Cluster 1 - Colo 1 HDFS Cluster 2 – Colo 2 HDFS Grid Data Management Feed Acquisition Feed Replication HCatalog MetaStore Feed datasets as partitioned external tables Growl extracts schema for backfill HCatClient. addPartitions(…) Mark LOAD_DONE HCatClient. addPartitions(…) Mark LOAD_DONE Partitions are dropped with (HCatClient.dropPartitions(…)) after retention expiration with a drop_partition notification add_partition event notification add_partition event notification
  • 17. HCatalog Notification 17 2014 Hadoop Summit, San Jose, California Namespace: E.g. “hcat.thebestcluster” JMS Topic: E.g. “<dbname>.<tablename>” Sample JMS Notification { "timestamp" : 1360272556, "eventType" : "ADD_PARTITION", "server" : "thebestcluster-hcat.dc1.grid.yahoo.com", "servicePrincipal" : "hcat/thebestcluster-hcat.dc1.grid.yahoo.com@GRID.YAHOO.COM", "db" : "xyz", "table" : "search", "partitions": [ { "locale" : "US", "datestamp" : "20140602" }, { "locale" : "UK", "datestamp" : "20140602" }, { "locale" : "IN", "datestamp" : "20140602" } ] }  HCatalog uses JMS (ActiveMQ) notifications that can be sent for add_database, add_table, add_partition, drop_partition, drop_table, and drop_database  Notifications can be extended for schema change notifications (proposed) HCat Client HCat MetaStore ActiveMQ Server Register Channel Publish to listener channels Subscribers
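A subscriber could consume the sample ADD_PARTITION notification above along these lines (a sketch: the payload fields are the ones shown on the slide, but the handling logic is hypothetical):

```python
import json

# Sample JMS payload, abbreviated from the slide above.
message = """{
  "timestamp": 1360272556,
  "eventType": "ADD_PARTITION",
  "db": "xyz",
  "table": "search",
  "partitions": [
    {"locale": "US", "datestamp": "20140602"},
    {"locale": "UK", "datestamp": "20140602"},
    {"locale": "IN", "datestamp": "20140602"}
  ]
}"""

event = json.loads(message)
if event["eventType"] == "ADD_PARTITION":
    for p in event["partitions"]:
        # e.g. trigger a downstream workflow per newly added partition
        print("{}.{} locale={} datestamp={}".format(
            event["db"], event["table"], p["locale"], p["datestamp"]))
```

Because the payload names the database, table, and partition keys, a subscriber can react to new data without polling HDFS at all.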
  • 18. Oozie, HCatalog, and Messaging Integration 18 2014 Hadoop Summit, San Jose, California Oozie Message Bus HCatalog 3. Push notification <New Partition> 2. Register Topic 4. Notify New Partition Data Producer HDFS Produce data (distcp, pig, M/R..) /data/click/2014/06/02 1. Query/Poll Partition Start workflow Update metadata (ALTER TABLE click ADD PARTITION(data=‘2014/06/02’) location ’hdfs://data/click/2014/06/02’)
  • 19. Data Discovery with HCatalog 19 2014 Hadoop Summit, San Jose, California  HCatalog instances become a unifying metastore for all data at Yahoo  Discovery is about o Browsing / inspecting metadata o Searching for datasets  It helps to solve o Schema knowledge across the company o Schema evolution o Lineage o Ownerships o Data type – dev or prod
  • 20. Data Discovery Physical View 20 2014 Hadoop Summit, San Jose, California Global View of All Data in HCatalog: Discovery UI connects to DC1-C1, DC1-C2, ... DCn-Cn (Data Center 1) and DC2-C1, DC2-C2, ... DCm-Cm (Data Center 2), each cluster exposing HCat REST (Templeton) ILLUSTRATIVE
  • 21. Data Discovery Features 21 2014 Hadoop Summit, San Jose, California  Browsing o Tables / Databases o Schema, format, properties o Partitions and metadata about each partition  Searches for tables o Table name (regex) or Comments o Column name or comments o Ownership, File format o Location o Properties (Dev/Prod)
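The "table name (regex)" search could look roughly like this (catalog entries are sample names from the Discovery UI mock; the function and data shape are hypothetical):

```python
import re

# Toy metadata catalog, using table names from the Discovery UI slide.
catalog = [
    {"db": "audience_db", "table": "page_clicks", "comment": "Hourly clickstream table"},
    {"db": "audience_db", "table": "ad_clicks", "comment": "Hourly ad clicks table"},
    {"db": "user_db", "table": "user_info", "comment": "User registration info"},
]

def search_tables(entries, name_pattern):
    """Return catalog entries whose table name matches the regex."""
    pattern = re.compile(name_pattern)
    return [e for e in entries if pattern.search(e["table"])]

print([e["table"] for e in search_tables(catalog, r"clicks$")])
# ['page_clicks', 'ad_clicks']
```

The same approach extends to the other search dimensions listed above (column names, comments, ownership, file format, properties) by matching against additional fields of each entry.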
  • 22. Discovery UI 22 2014 Hadoop Summit, San Jose, California Search Tables Search The Best Cluster audience_db tumblr_db user_db adv_warehouse flickr_db page_clicks Hourly clickstream table ad_clicks Hourly ad clicks table user_info User registration info session_info Session feed info audience_info Primary audience table GLOBAL HCATALOG DASHBOARD Available Databases Available Tables (audience_db) Search the HCat tables Browse the DBs by cluster Search results or browse db results 1 2 Next 1 2 Next ILLUSTRATIVE
  • 23. Table Display UI 23 2014 Hadoop Summit, San Jose, California ILLUSTRATIVE GLOBAL HCATALOG DASHBOARD HCat Instance The Best Cluster Database audience_db Table page_clicks Owner Awesome Yahoo Schema …more table information and properties (e.g. data format etc.) Partitions …list of partitions Column Type Description bcookie string Standard browser cookie timestamp string DD-MON-YYYY HH:MI:SS (AM/PM) uid string User id . . .
  • 24. Data Discovery Design Approach 24 2014 Hadoop Summit, San Jose, California  A single web interface connects to all HCatalog instances (same and cross-colo)  Select an appropriate HCat instance and browse all metadata o Each HCatalog instance runs a webserver (Templeton/WebHCat) to read metadata o All reads audited o ACLs apply  Search functionality will be added to Templeton and HCatalog o New Thrift interface to support search o All searches audited o ACLs apply  Long term design o Read and Write HCatalog instances
  • 25. Data Discovery Going Forward 25 2014 Hadoop Summit, San Jose, California  Lineage o Source datasets o Derived datasets  Data Quality o Statistics help in heuristics instead of running a job Table 1 / Partition 1 HBase ORC Table Partition 1 Dimension Table Statistics/ Agg. Table Daily Stats Table Copied by distcp / external registrar Hourly ILLUSTRATIVE
  • 26. Data Discovery Going Forward (cont’d) 26 2014 Hadoop Summit, San Jose, California ILLUSTRATIVE Schema Column Type Description bcookie string Standard browser cookie timestamp string DD-MON-YYYY HH:MI:SS (AM/PM) uid string User id File Format ORC Table Properties Compression Type zlib External  User ‘awesome_yahoo’ added ‘foo string’ to the table on May 29, 2014 at ‘1:10 AM’  User ‘me_too’ added table properties ‘orc.compress=ZLIB’ on May 30, 2014 at ‘9:00 AM’  User ‘me_too’ changed the file format from ‘RCFile’ to ‘ORC’ on Jun 1, 2014 at ‘10:30 AM’ . . . . . .
  • 27. HCatalog is Part of a Broader Solution Set 27 2014 Hadoop Summit, San Jose, California Hive HiveServer2 HCatalog  Data warehousing software that facilitates querying and managing large datasets in HDFS  Provides a mechanism to project structure onto HDFS data and query the data using a SQL-like language called HiveQL  Server process (Thrift-based RPC interface) to support concurrent clients connecting over ODBC/JDBC  Provides authentication and enforces authorization for ODBC/JDBC clients for metadata access  Table and storage management layer that enables users with different tools (Pig, M/R, and Hive) to more easily share data  Presents a relational view of data in HDFS, abstracts where or in what format data is stored, and enables notifications of data availability Starling  Hadoop log warehouse for analytics on grid usage (job history, tasks, job counters etc.)  1TB of raw logs processed / day, 24 TB of processed data Product Role in the Grid Stack
  • 28. 28 Deployment Layout Tez and MapReduce on YARN + HDFS Oracle DBMS LoadBalancer HCatalog Thrift HS2 ODBC/JDBC Launcher Gateway LoadBalancer Data Out Client Client/ CLI HiveQL M/R Jobs Pig M/R Cloud Messaging ActiveMQ notifications HiveServer2 Hadoop Hive HCatalog 2014 Hadoop Summit, San Jose, California
  • 29. 29 2014 Hadoop Summit, San Jose, California Hive for Both Batch and Interactive Adhoc Analytics Tez  Computation expressed as a dataflow graph with reusable primitives  No intermediate outputs to HDFS  Built on top of YARN  Hive generates Tez plans for lower latency Query Engine Improvements  Cost-based optimizations  In-memory joins  Caching hot tables  Vectorized processing Better Columnar Store  ORCFile with predicate pushdown  Built for both speed and storage efficiency Tez Service  Always-on pool of AMs / container re-use Improved Latency and Throughput Analytics Functions  SQL 2003 Compliant  OVER with PARTITION BY and ORDER BY  Wide variety of windowing functions: o RANK o LEAD/LAG o ROW_NUMBER o FIRST_VALUE o LAST_VALUE o Many more  Aligns well with BI ecosystem Improving SQL Coverage  Non-correlated sub-queries using IN in WHERE  Expanded SQL types including DATETIME, VARCHAR, etc. Extended Analytical Ability
  • 30. HiveServer2 as ODBC / JDBC Endpoint  Gateway that Hive clients can talk to  Supports concurrent clients  User/ global session/configuration information  Support for secure clusters and encryption  DoAs support allows Hive queries to run as the requester 30 2014 Hadoop Summit, San Jose, California
  • 31. 31 2014 Hadoop Summit, San Jose, California Data to Desktop (D2D) – BI and Reporting on ODBC HiveServer2 Hive Hadoop Desktop Web Intelligence Server Metadata Database Grid ODBC driver
  • 32. 32 2014 Hadoop Summit, San Jose, California DataOut – Data to Any Off-Grid Destination on JDBC (diagram: a client executes a query on HiveServer2, which prepares HiveSplits; slaves fetch the splits in parallel to filesystems/databases. Legend: M – Master, S – Slave, FS/DB – Filesystem/Database)  DataOut is an efficient method of moving data off the grid  Advantages: o API based on the well-known JDBC interface o Works with HCatalog / Hive o Agnostic to the underlying storage format o Parts of the whole data can be pulled in parallel
  • 33. SQL-based Authorization for Controlled Access 33 2014 Hadoop Summit, San Jose, California  SQL-compliant authorization model (Users, Roles, Privileges, Objects)  Fine-grain authorization and access control patterns (row and column in conjunction with views)  Can be used in conjunction with storage-based authorization Privileges Access Control  Objects consist of databases, tables, and views  Privileges are GRANTed on objects o SELECT: read access to an object o INSERT: write (insert) access to an object o UPDATE: write (update) access to an object o DELETE: delete access for an object o ALL PRIVILEGES: all privileges  Roles can be associated with objects  Privileges are associated with roles  CREATE, DROP, and SET ROLE statements manipulate roles and membership  SUPERUSER role for databases can grant access control to users or roles (not limited to HDFS permissions)  PUBLIC role includes all users  Prevents undesirable operations on objects by unauthorized users
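The row/column pattern described above (views in conjunction with GRANTs) can be sketched in HiveQL; the role, view, and column names here are hypothetical:

```sql
-- Create a role and add a user to it.
CREATE ROLE analysts;
GRANT ROLE analysts TO USER alice;

-- Fine-grained control: expose only selected columns and rows of the
-- underlying table through a view, then grant SELECT on the view.
CREATE VIEW search_us AS
  SELECT bcookie, time_stamp
  FROM search
  WHERE locale = 'US';

GRANT SELECT ON TABLE search_us TO ROLE analysts;
```

Members of the role can then query `search_us` without holding any privilege on the base `search` table, which is how row- and column-level restrictions are enforced without all-or-nothing HDFS permissions.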
  • 34. Starling (Log Warehouse) for Historical Analysis and Trends 34 2014 Hadoop Summit, San Jose, California Cluster 1 Cluster 2 Cluster 3 Cluster N Oozie HCatalog HDFS Hive Starling Dashboard Discovery Portal Query Server Source Clusters Warehouse Clusters
  • 35. 35 2014 Hadoop Summit, San Jose, California SQL on Hadoop the Fastest Growing Product on Grid (chart: all grid jobs in millions and Hive jobs as % of all jobs, Mar 2013 – May 2014; 2.5 million queries)
  • 36. In Summary 36 2014 Hadoop Summit, San Jose, California Data shared across tools such as MR, Pig, and Hive Apache HCatalog Schema and semantics knowledge across the company Data Discovery Support for schema evolution and downstream change communication Apache HCatalog Fine-grained access controls (row / column) vs. all or nothing SQL-based Authorization Clear ownership of data Data Discovery Data lineage and integrity Data Discovery / Starling Audits and compliance (e.g. SOX) Data Discovery / Starling Retention, duplication, and waste Data Discovery / Starling ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
  • 37. Acknowledgments 37 2014 Hadoop Summit, San Jose, California 1 Apache Hive (and HiveServer2, HCatalog) Community http://hive.apache.org/people.html 2 HCatalog and Hive Development Team at Yahoo Olga Natkovich Annie Lin Fangyue Wang Chris Drome Jin Sun Selina Zhang Mithun Radhakrishnan Viraj Bhat 3 Oozie Development Team Rohini Palaniswamy Ryota Egashira Purshotam Shah Mona Chitnis Michelle Chiang 4 Grid Data Management (GDM) Team Mark Holderbaugh Aaron Gresch Lawrence Prem Kumar Scott Preece Yan Braun 5 Service Engineering and Data Operations Rob Realini David Kuder Chuck Sheldon Rajiv Chittajallu Vineeth Vadrevu Andy Rhee 6 Product Management Sid Shaik Amrit Lal Kimsukh Kundu
  • 38. Thank You @thiruvel @sumeetksingh We are hiring! Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.