May 2013 HUG: HCatalog/Hive Data Out


Yahoo!'s Hadoop grid makes use of a managed service to get data pulled into the clusters. However, when it comes to getting data out of the clusters, the choices are limited to proxies such as HDFSProxy and HTTPProxy. With the introduction of HCatalog services, customers of the grid now have their data represented in a central metadata repository. HCatalog abstracts out file locations and the underlying storage format of data for users, along with several other advantages such as sharing of data among MapReduce, Pig, and Hive. In this talk, we will focus on how the ODBC/JDBC interface of HiveServer2 accomplishes the use case of getting data out of the clusters when HCatalog is in use and users no longer want to worry about files, partitions, and their locations. We will also demo the data out capabilities and go through other nice properties of the data out feature.

Sumeet Singh, Director, Product Management, Yahoo!
Chris Drome, Technical Yahoo!




Usage Rights

© All Rights Reserved


Presentation Transcript

  • HCatalog/Hive DataOut (Bay Area Hadoop User Group Meetup, May 15, 2013)
  • Moving Data Out of Hadoop Clusters Today (Yahoo! Presentation, Confidential)
    [Architecture diagram: clients on a client machine reach multi-tenant Hadoop clusters (HDFS, M/R on YARN) through a launcher/gateway over SSH, through HDFSProxy (1) and HTTPProxy over HTTPS, and through custom proxies; HTTP servers front filers over HTTPS, DistCp moves data between clusters, and managed data-loading handles inbound data over Hadoop RPC]
    (1) Similar to the HttpFS Gateway/Hoop in Hadoop 2.0 (Hadoop HDFS over HTTP)
  • Typical Data Out Scenario
    § Data (to be pulled out) is stored in a predefined directory structure as files
    § Client determines (through a custom interface) if a particular data feed of interest is committed or not
    § If committed, the client gets the list of files first, and then pulls them out (file by file) through HDFSProxy
    [Diagram: delimited files are copied from HDFS through HDFSProxy via cURL to a filer, then loaded into an Oracle DB (SQLLDR or external table), staged in a temp table, and INSERTed into the main table]
  • Pros and Cons of the Data Out Approach
    Pros
    § Security of DB passwords: passwords are not stored in the grid
    § Compression: cross-colo network bandwidth is expensive, and compression is not possible with JDBC drivers
    § Encryption: data leaving the grids has to be encrypted as it may travel cross-colo
    § ACLs: DB hosts are not accessible from grid nodes, hence the proxy
    Cons
    § Directory structure: has to be predefined and known to downstream consumers of data
    § Data discovery: availability of data for consumption requires polling or other hooks
    § Overhead: use of DONE files
    § Maintenance: separate schema files and schema file formats
    The introduction of HCatalog and JMS notifications solves these problems
  • Hadoop – One Platform, Many Tools
    [Diagram: Hive talks to the Metastore through a Metastore Client and reads/writes HDFS through InputFormat/OutputFormat and SerDe; MapReduce and Pig (Load/Store) access HDFS directly with their own InputFormat/OutputFormat]
    § MapReduce/Pig: pipelines, iterative processing, research
    § Hive (data warehouse): BI tools, analysis
    Source: Alan Gates on HCatalog, Hadoop Summit, 2012
  • HCatalog – Opening Up the Hive Metastore
    [Diagram: HCatInputFormat/HCatOutputFormat for MapReduce and HCatLoader/HCatStorer for Pig layer on top of the Metastore Client, InputFormat/OutputFormat, and SerDe that Hive uses, so all tools share the Metastore and HDFS; external systems reach the metastore over REST]
    Source: Alan Gates on HCatalog, Hadoop Summit, 2012
  • HCatalog Value Proposition
    § Centralized metadata service for Hadoop
    § Facilitates interoperability among tools such as Pig, Hive, and M/R; allows for sharing of data
    § Provides DB-like abstractions (databases, tables, and partitions) and supports schema evolution
    § Abstracts out the file storage format and data location
    Source: Alan Gates on HCatalog, Hadoop Summit, 2012
  • HiveServer2 with HCatalog
    [Diagram: a Data Out client connects over ODBC/JDBC to HiveServer2, which runs jobs against HDFS as doAs(user); Hive jobs (CLI) and HCat jobs (Pig, M/R) also run as doAs(user) against the HCatalog server (metastore), which produces JMS notifications to a messaging service (ActiveMQ) for downstream consumers]
  • Issues Solved
    ✔ Directory structure: has to be predefined and known to downstream consumers of data
    ✔ Data discovery: availability of data for consumption requires polling or other hooks
    ✔ Overhead: use of DONE files
    ✔ Maintenance: separate schema files and schema file formats
  • DataOut Motivation
    § Many ways to load and manage data on the grid: HCatalog/Hive, Pig, Hadoop MR, Sqoop, GDM
    § Fewer ways of getting data off the cluster: Sqoop, HDFSProxy, HDFS copy to the local file system, distcp between clusters
    § Challenges: underlying file format, size of data, SLA
  • DataOut Overview
    § What is DataOut?
      – An efficient method of moving data off the grid
      – An API that exposes a programmatic interface
    § What are the advantages of DataOut?
      – API based on the well-known JDBC API
      – Works with HCatalog/Hive
      – Agnostic to the underlying storage format
      – Parts of the whole data can be pulled in parallel
    § What are the limitations of DataOut?
      – Queries must be SELECT * FROM type queries
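    The full-table-scan restriction is simple enough to enforce client-side before submitting a query. The sketch below is illustrative only: the class name, method, and regex are hypothetical and not part of the DataOut API described in this talk.

    ```java
    import java.util.regex.Pattern;

    // Hypothetical guard for DataOut's restriction to SELECT * FROM queries.
    // DataOutQueryCheck and FULL_SCAN are illustrative names, not a real API.
    public class DataOutQueryCheck {
        private static final Pattern FULL_SCAN =
                Pattern.compile("(?i)^\\s*SELECT\\s+\\*\\s+FROM\\s+[\\w.]+\\s*;?\\s*$");

        public static boolean isSupported(String sql) {
            return FULL_SCAN.matcher(sql).matches();
        }

        public static void main(String[] args) {
            System.out.println(isSupported("SELECT * FROM page_views"));  // true
            System.out.println(isSupported("SELECT id FROM page_views")); // false
        }
    }
    ```

    Projections, filters, and joins would fail this check, matching the slide's statement that only full-table reads are supported.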
  • DataOut Deployment
    [Diagram: a DataOut client sends queries to a pool of HiveServer2 instances (HS2 … HS2) and pulls data from HDFS through them]
  • How DataOut Works
    [Diagram: a master (M) executes the query against HiveServer2 and prepares HiveSplits; slaves (S) fetch the splits in parallel and write them to a filesystem or database (FS/DB)]
    Legend: M = Master, S = Slave, FS/DB = Filesystem/Database
  • Code to Prepare the HiveSplits

        DataOut dataout = new DataOut();
        HiveConnection c = dataout.getConnection();

        Statement s = c.createGenerateSplitStatement();
        ResultSet rs = s.executeQuery(sql);

        while (rs.next()) {
            HiveSplit split = (HiveSplit) rs.getObject(1);
            /* Launch job to fetch the split data. */
        }

        /* Synchronize on fetch jobs. */

        rs.close();
        s.close();
        c.close();
  • Code to Retrieve the HiveSplits

        DataOut dataout = new DataOut();
        HiveConnection c = dataout.getConnection();

        PreparedStatement ps = c.prepareFetchSplitStatement(split);
        ResultSet rs = ps.executeQuery();

        while (rs.next()) {
            /* Process row data. */
        }

        rs.close();
        ps.close();
        c.close();

        /* Communicate with master process. */
  • DataOut Demo
  • HS2 Performance – Single Client Connection
    [Performance chart]
  • HS2 Performance – Five Concurrent Clients
    [Performance chart]
  • HS2 Performance Summary
    § Throughput scales linearly
      – Single client: 1 GB: 60 s, 5 GB: 250 s, 10 GB: 500 s
      – Multiple clients: 1 GB: 120 s, 5 GB: 600 s, 10 GB: 1200 s
    § Throughput is affected by fetch size
      – Sweet spot around ~200 rows
      – Average row size may affect this number (pending further testing)
    § HiveServer2 is capable of handling multiple clients
      – Throughput of 10 GB in ~20 minutes with five client connections
      – Drop-off in throughput is expected and reasonable: a 5x increase in concurrent connections costs only a 2x increase in transfer time
    § Goal of 50 GB in 5 min
      – Achievable with ~10 HiveServer2 instances streaming data
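    The "~10 instances" figure follows from the single-client numbers above: one connection moves 1 GB in 60 s, so each HiveServer2 instance can stream about 5 GB in a 5-minute window, and 50 GB therefore needs about ten instances. A minimal sketch of that back-of-the-envelope calculation (class and variable names are illustrative):

    ```java
    // Back-of-the-envelope check of the "50 GB in 5 min with ~10 HS2 instances"
    // goal, using the single-client figure from the slide (1 GB in 60 s).
    public class Hs2Sizing {
        public static void main(String[] args) {
            long goalGB = 50;
            long windowSec = 5 * 60;                   // 5-minute target window
            long secPerGB = 60;                        // single client: 1 GB in 60 s
            long gbPerInstance = windowSec / secPerGB; // 5 GB per instance per window
            long instances = (goalGB + gbPerInstance - 1) / gbPerInstance; // ceiling
            System.out.println(instances);             // 10
        }
    }
    ```

    This assumes each instance sustains the single-client rate independently; with multiple connections per instance the slide's measured 2x slowdown would raise the count accordingly.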