HCatalog/Hive DataOut
Bay Area Hadoop User Group Meetup
May 15, 2013
Moving Data Out of Hadoop Clusters Today
Yahoo! Presentation, Confidential
[Diagram, in three sections: Clients, Multi-tenant Hadoop Clusters, Managed Data-loading. Components shown: client's machine, HTTP client, HTTP server, launcher/gateway, HDFS Proxy¹, HTTP proxy, custom proxy, HDFS, M/R on YARN, DistCp, filers, SQLLDR. Links are labeled Hadoop RPC, SSH, and HTTPS.]
¹ Similar to HttpFS Gateway/Hoop in Hadoop 2.0 (Hadoop HDFS over HTTP)
Typical Data Out Scenario
§  Data (to be pulled out) is stored in a predefined directory structure as files
§  Client determines (through a custom interface) whether a particular data feed of interest is committed
§  If committed, client first gets the list of files, then pulls them out (file by file) through HDFSProxy
[Diagram components: HDFS, HDFSProxy, custom interface, filer, Oracle DB (external table, temp table, main table). Labels: cURL, data copy, delimited files, INSERT.]
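The client-side flow in the bullets above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual Yahoo! tooling: `FeedPuller`, `pullIfCommitted`, and the three callbacks are hypothetical stand-ins for the custom interface, the file listing call, and a per-file HDFSProxy download.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

public class FeedPuller {
    /**
     * Pulls every file of a feed, one at a time, but only if the feed is committed.
     * Returns the paths that were pulled (empty if the feed is not committed yet).
     */
    static List<String> pullIfCommitted(
            String feed,
            Predicate<String> isCommitted,              // hypothetical: the custom interface
            Function<String, List<String>> listFiles,   // hypothetical: file listing for a feed
            Consumer<String> fetchFile) {               // hypothetical: one HDFSProxy download
        List<String> pulled = new ArrayList<>();
        if (!isCommitted.test(feed)) {
            return pulled; // not committed: the caller has to poll again later
        }
        for (String path : listFiles.apply(feed)) {
            fetchFile.accept(path); // file-by-file pull
            pulled.add(path);
        }
        return pulled;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the grid: one committed feed with two files.
        Map<String, List<String>> committed =
                Map.of("clicks/2013-05-15", List.of("part-00000", "part-00001"));
        List<String> pulled = pullIfCommitted(
                "clicks/2013-05-15",
                committed::containsKey,
                committed::get,
                path -> System.out.println("fetching " + path));
        System.out.println("pulled " + pulled.size() + " files");
    }
}
```

The sketch makes the polling cost visible: a feed that is not yet committed returns nothing, so the client must keep asking.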
Pros and Cons of the Data Out Approach
Pros
§  Security of DB passwords – password not stored in the grid
§  Compression – cross-colo network bandwidth is expensive, and compression is not possible with JDBC drivers
§  Encryption – data out of the grids has to be encrypted as it may be cross-colo
§  ACLs – DB hosts are not accessible from grid nodes, and hence the proxy
Cons
§  Directory structure – has to be predefined and known to downstream consumers of data
§  Data discovery – availability of data for consumption requires polling or other hooks
§  Overhead – Use of DONE files
§  Maintenance – Separate schema files and schema file formats
The introduction of HCatalog and JMS notifications solves the problem
Hadoop – One Platform, Many Tools
[Diagram labels: Hive (Metastore Client, SerDe, InputFormat/OutputFormat), MapReduce (InputFormat/OutputFormat), Pig (Load/Store), HCatLoader/HCatStorer, Metastore, HDFS.]
Source: Alan Gates on HCatalog, Hadoop Summit, 2012
MapReduce/Pig
§  Pipelines
§  Iterative Processing
§  Research
Data Warehouse
Hive
§  BI Tools
§  Analysis
HCatalog – Opening Up the Hive Metastore
[Diagram labels: Hive, MapReduce, Pig, External System, REST, HCatInputFormat/HCatOutputFormat, InputFormat/OutputFormat, SerDe, Metastore Client, Metastore, HDFS.]
Source: Alan Gates on HCatalog, Hadoop Summit, 2012
HCatalog Value Proposition
Source: Alan Gates on HCatalog, Hadoop Summit, 2012
§  Centralized metadata service for Hadoop
§  Facilitates interoperability among tools such as Pig, Hive, and M/R, and allows data to be shared
§  Provides DB-like abstractions (databases, tables, and partitions) and supports schema evolution
§  Abstracts out the file storage format and data location
HiveServer2 with HCatalog
[Diagram components: Data Out Client (JDBC), HiveServer2 (ODBC/JDBC), HiveServer2 jobs, Hive jobs (CLI), HCat jobs (Pig, M/R), HCatalog Server (Metastore), Messaging Service (ActiveMQ), HDFS. Link labels: doAs(user), JMS notification (producer), notification (consumer).]
Issues Solved
✔  Directory structure – has to be predefined and known to downstream consumers of data
✔  Data discovery – availability of data for consumption requires polling or other hooks
✔  Overhead – Use of DONE files
✔  Maintenance – Separate schema files and schema file formats
DataOut Motivation
§  Many ways to load and manage data on the grid
    §  HCatalog/Hive
    §  Pig
    §  Hadoop MR
    §  Sqoop
    §  GDM
§  Fewer ways of getting data off the cluster
    §  Sqoop
    §  HDFSProxy
    §  HDFS copy to local file system
    §  distcp between clusters
§  Challenges
    §  Underlying file format
    §  Size of data
    §  SLA
DataOut Overview
§  What is DataOut?
    §  Efficient method of moving data off the grid
    §  API exposes a programmatic interface
§  What are the advantages of DataOut?
    §  API based on the well-known JDBC API
    §  Works with HCatalog/Hive
    §  Agnostic to the underlying storage format
    §  Parts of the whole data can be pulled in parallel
§  What are the limitations of DataOut?
    §  Queries must be SELECT * FROM type queries
DataOut Deployment
[Diagram: a DataOut client and a row of HiveServer2 (HS2) instances on top of HDFS; the links between client and HS2 are labeled Query and Data.]
How DataOut Works
[Diagram: a master (M) attached to HiveServer2, and three slaves (S), each shown with a HiveSplit and an FS/DB target. Steps shown: Execute Query, Prepare Splits, Fetch Splits.]
Legend: M – Master, S – Slave, FS/DB – Filesystem/Database
Code to Prepare the HiveSplits
DataOut dataout = new DataOut();
HiveConnection c = dataout.getConnection();

Statement s = c.createGenerateSplitStatement();
ResultSet rs = s.executeQuery(sql);

while (rs.next()) {
    HiveSplit split = (HiveSplit) rs.getObject(1);
    /* Launch job to fetch the split data. */
}

/* Synchronize on fetch jobs. */

rs.close();
s.close();
c.close();
Code to Retrieve the HiveSplits
DataOut dataout = new DataOut();
HiveConnection c = dataout.getConnection();

PreparedStatement ps = c.prepareFetchSplitStatement(split);
ResultSet rs = ps.executeQuery();

while (rs.next()) {
    /* Process row data. */
}

rs.close();
ps.close();
c.close();

/* Communicate with master process. */
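The two placeholders in the master code ("Launch job to fetch the split data" and "Synchronize on fetch jobs") could be filled in with a thread pool, as sketched below. This is an assumption about one possible design, not the DataOut implementation: `SplitFanOut`, `fetchAll`, and the `fetchSplit` callback are hypothetical, and the slides' fetch jobs may well be separate processes or hosts rather than threads. The DataOut classes are not used, so the sketch stays runnable on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToLongFunction;

public class SplitFanOut {
    /**
     * Fetches every split on a worker thread and returns the total number of
     * rows reported by the fetchers. S stands in for the split type.
     */
    static <S> long fetchAll(List<S> splits, ToLongFunction<S> fetchSplit, int workers)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> jobs = new ArrayList<>();
            for (S split : splits) {
                // Launch one fetch job per split.
                Callable<Long> job = () -> fetchSplit.applyAsLong(split);
                jobs.add(pool.submit(job));
            }
            long rows = 0;
            for (Future<Long> job : jobs) {
                // Synchronize on fetch jobs: get() blocks until that job is done
                // and rethrows any failure as an ExecutionException.
                rows += job.get();
            }
            return rows;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Each "split" here is only a row count; a real fetcher would open its
        // own connection and iterate over the split's ResultSet.
        List<Long> splits = List.of(100L, 250L, 50L);
        long rows = fetchAll(splits, Long::longValue, 3);
        System.out.println("rows fetched: " + rows);
    }
}
```

Giving each job its own connection matches the slave code above, where every fetch opens and closes a separate `HiveConnection`.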
  
DataOut Demo
HS2 Performance – Single Client Connection
HS2 Performance – Five Concurrent Clients
HS2 Performance Summary
§  Throughput scales linearly
    §  Single client: 1GB: 60s, 5GB: 250s, 10GB: 500s
    §  Multiple clients: 1GB: 120s, 5GB: 600s, 10GB: 1200s
§  Throughput is affected by fetch size
    §  Sweet spot at ~200 rows
    §  Average row size may affect this number (pending further testing)
§  HiveServer2 is capable of handling multiple clients
    §  Throughput of 10GB in ~20 minutes with five client connections
§  Drop-off in throughput is expected and reasonable
    §  5x increase in concurrent connections = 2x increase in transfer time
§  Goal of 50GB in 5min
    §  Achievable with ~10 HiveServer2 instances streaming data
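The "~10 HiveServer2 instances" estimate can be checked with a short calculation. The sketch below rests on assumptions the slides do not confirm: 1 GB is taken as 1000 MB, each instance is assumed to sustain the single-client rate measured at 10 GB, and aggregate throughput is assumed to add up across instances. `ThroughputEstimate` and its methods are illustrative names.

```java
public class ThroughputEstimate {
    /** Throughput in MB/s, taking 1 GB = 1000 MB (an assumption for round numbers). */
    static double mbPerSecond(double gigabytes, double seconds) {
        return gigabytes * 1000.0 / seconds;
    }

    /** Smallest number of instances whose combined rate reaches the target rate. */
    static int instancesNeeded(double targetMbPerSecond, double perInstanceMbPerSecond) {
        return (int) Math.ceil(targetMbPerSecond / perInstanceMbPerSecond);
    }

    public static void main(String[] args) {
        double singleClient = mbPerSecond(10, 500); // measured: 10 GB in 500 s
        double goal = mbPerSecond(50, 5 * 60);      // goal: 50 GB in 5 min
        System.out.printf("single client: %.1f MB/s%n", singleClient);
        System.out.printf("goal:          %.1f MB/s%n", goal);
        System.out.println("instances needed: " + instancesNeeded(goal, singleClient));
    }
}
```

Under these assumptions a single client moves about 20 MB/s, the goal needs about 167 MB/s, and nine instances would suffice, which is in line with the slide's figure of roughly ten.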
smart pill dispenser is designed to improve medication adherence and safety f...
 
Data Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason WebinarData Driven Maintenance | UReason Webinar
Data Driven Maintenance | UReason Webinar
 
Gas agency management system project report.pdf
Gas agency management system project report.pdfGas agency management system project report.pdf
Gas agency management system project report.pdf
 
Generative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdfGenerative AI Use cases applications solutions and implementation.pdf
Generative AI Use cases applications solutions and implementation.pdf
 
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
4. Mosca vol I -Fisica-Tipler-5ta-Edicion-Vol-1.pdf
 
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
Prediction of Electrical Energy Efficiency Using Information on Consumer's Ac...
 
5g-5G SA reg. -standalone-access-registration.pdf
5g-5G SA reg. -standalone-access-registration.pdf5g-5G SA reg. -standalone-access-registration.pdf
5g-5G SA reg. -standalone-access-registration.pdf
 

HUG Meetup 2013: HCatalog / Hive Data Out

  • 1. HCatalog/Hive DataOut. Bay Area Hadoop User Group Meetup, May 15, 2013
  • 2. Moving Data Out of Hadoop Clusters Today (Yahoo! Presentation, Confidential)
      • Diagram zones: Clients; Multi-tenant Hadoop Clusters; Managed Data-loading
      • Diagram components: Client's Machine, HTTP Client, HTTP Server, Launcher/Gateway, HDFS Proxy (see footnote 1), HTTP Proxy, Custom Proxy, Filers, HDFS, M/R on YARN, DistCp
      • Connection labels: Hadoop RPC, SSH, HTTPS
      • Footnote 1: Similar to HttpFS Gateway/Hoop in Hadoop 2.0 (Hadoop HDFS over HTTP)
  • 3. Typical Data Out Scenario
      • Data (to be pulled out) is stored as files in a predefined directory structure
      • The client determines (through a custom interface) whether a particular data feed of interest is committed
      • If committed, the client first gets the list of files and then pulls them out (file by file) through HDFSProxy
      • Diagram labels: HDFS, HDFSProxy, Custom Interface, cURL, Filer, delimited files, data copy, SQLLDR, Oracle DB, Ext. Table, Temp Table, Main Table, INSERT
  • 4. Pros and Cons of the Data Out Approach
      • Pros
          • Security of DB passwords: password not stored in the grid
          • Compression: cross-colo network bandwidth is expensive and compression is not possible with JDBC drivers
          • Encryption: data out of the grids has to be encrypted as it may be cross-colo
          • ACLs: DB hosts are not accessible from grid nodes, and hence the proxy
      • Cons
          • Directory structure: has to be predefined and known to downstream consumers of data
          • Data discovery: availability of data for consumption requires polling or other hooks
          • Overhead: use of DONE files
          • Maintenance: separate schema files and schema file formats
      • The introduction of HCatalog and JMS notifications solves the problem
  • 5. Hadoop – One Platform, Many Tools
      • MapReduce/Pig: pipelines, iterative processing, research
      • Data warehouse (Hive): BI tools, analysis
      • Diagram labels: Metastore, HDFS, Hive, Metastore Client, InputFormat/OutputFormat, SerDe, MapReduce, Pig, Load/Store
      • Source: Alan Gates on HCatalog, Hadoop Summit, 2012
  • 6. HCatalog – Opening Up the Hive Metastore
      • Diagram labels: Metastore, HDFS, Metastore Client, InputFormat/OutputFormat, SerDe, HCatInputFormat/HCatOutputFormat, HCatLoader/HCatStorer, MapReduce, Pig, Hive, REST, External System
      • Source: Alan Gates on HCatalog, Hadoop Summit, 2012
  • 7. HCatalog Value Proposition
      • Centralized metadata service for Hadoop
      • Facilitates interoperability among tools such as Pig, Hive, and M/R, and allows for sharing of data
      • Provides DB-like abstractions (databases, tables, and partitions) and supports schema evolution
      • Abstracts out the file storage format and data location
      • Source: Alan Gates on HCatalog, Hadoop Summit, 2012
  • 8. HiveServer2 with HCatalog
      • Diagram components: HDFS, HiveServer2 (ODBC/JDBC), Data Out Client (JDBC), HCatalog Server (Metastore), Messaging Service (ActiveMQ)
      • Job types: HiveServer2 Jobs, Hive Jobs (CLI), HCat Jobs (Pig, M/R)
      • Connection labels: doAs(user), JMS notification (Producer), Notification (Consumer), (ODBC)
  • 9. Issues Solved
      • ✔ Directory structure: has to be predefined and known to downstream consumers of data
      • ✔ Data discovery: availability of data for consumption requires polling or other hooks
      • ✔ Overhead: use of DONE files
      • ✔ Maintenance: separate schema files and schema file formats
  • 10. DataOut Motivation
      • Many ways to load and manage data on the grid: HCatalog/Hive, Pig, Hadoop MR, Sqoop, GDM
      • Fewer ways of getting data off the cluster: Sqoop, HDFSProxy, HDFS copy to local file system, distcp between clusters
      • Challenges: underlying file format, size of data, SLA
  • 11. DataOut Overview
      • What is DataOut?
          • Efficient method of moving data off the grid
          • API exposes a programmatic interface
      • What are the advantages of DataOut?
          • API based on the well-known JDBC API
          • Works with HCatalog/Hive
          • Agnostic to the underlying storage format
          • Parts of the whole data can be pulled in parallel
      • What are the limitations of DataOut?
          • Queries must be of the SELECT * FROM type
  • 12. DataOut Deployment
      • Diagram labels: HDFS; HS2, HS2, …, HS2, HS2; DataOut Client; Query; Data
  • 13. How DataOut Works
      • Steps: Execute Query, Prepare Splits, Fetch Splits
      • Diagram labels: HiveServer2; one M node; three S nodes, each with a HiveSplit and an FS/DB
      • Legend: M = Master, S = Slave, FS/DB = Filesystem/Database
  • 14. Code to Prepare the HiveSplits
        DataOut dataout = new DataOut();
        HiveConnection c = dataout.getConnection();
        Statement s = c.createGenerateSplitStatement();
        ResultSet rs = s.executeQuery(sql);
        while (rs.next()) {
            HiveSplit split = (HiveSplit) rs.getObject(1);
            /* Launch job to fetch the split data. */
        }
        /* Synchronize on fetch jobs. */
        rs.close();
        s.close();
        c.close();
  • 15. Code to Retrieve the HiveSplits
        DataOut dataout = new DataOut();
        HiveConnection c = dataout.getConnection();
        PreparedStatement ps = c.prepareFetchSplitStatement(split);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            /* Process row data. */
        }
        rs.close();
        ps.close();
        c.close();
        /* Communicate with master process. */
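The two code slides leave the "launch job" and "synchronize on fetch jobs" steps as comments. A minimal sketch of that master/worker fan-out, using only the JDK, might look like the following. The `Split` class, `prepareSplits`, and `fetchSplit` are hypothetical stand-ins for `HiveSplit`, the split-generating query, and the per-split fetch; they are not part of the DataOut API, and a thread pool here only approximates separate fetch jobs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SplitFanOut {
    /** Hypothetical stand-in for a HiveSplit: an id plus the number of rows it covers. */
    static final class Split {
        final int id;
        final int rowCount;
        Split(int id, int rowCount) { this.id = id; this.rowCount = rowCount; }
    }

    /** Stand-in for the master's "prepare splits" step: cuts totalRows into chunks. */
    static List<Split> prepareSplits(int totalRows, int rowsPerSplit) {
        List<Split> splits = new ArrayList<>();
        int id = 0;
        for (int start = 0; start < totalRows; start += rowsPerSplit) {
            splits.add(new Split(id++, Math.min(rowsPerSplit, totalRows - start)));
        }
        return splits;
    }

    /** Stand-in for a worker's "fetch split" step: counts rows instead of reading them. */
    static long fetchSplit(Split split) {
        long rows = 0;
        for (int i = 0; i < split.rowCount; i++) {
            rows++;  // a real worker would process one row of the ResultSet here
        }
        return rows;
    }

    /** Master: launch one fetch job per split, then synchronize on all of them. */
    static long fetchAll(List<Split> splits, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> jobs = new ArrayList<>();
            for (Split split : splits) {
                Callable<Long> job = () -> fetchSplit(split);
                jobs.add(pool.submit(job));
            }
            long total = 0;
            for (Future<Long> job : jobs) {
                total += job.get();  // blocks until that fetch job has finished
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for fetch jobs", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a fetch job failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Split> splits = prepareSplits(1000, 300);
        System.out.println(splits.size() + " splits, " + fetchAll(splits, 4) + " rows fetched");
    }
}
```

In a real deployment each fetch job would open its own connection and run the slide 15 code against its split; the sketch only shows the launch-then-join shape of the master.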
  • 17. HS2 Performance – Single Client Connection
  • 18. HS2 Performance – Five Concurrent Clients
  • 19. HS2 Performance Summary
      • Throughput scales linearly
          • Single client: 1GB: 60s, 5GB: 250s, 10GB: 500s
          • Multiple clients: 1GB: 120s, 5GB: 600s, 10GB: 1200s
      • Throughput is affected by fetch size
          • Sweet spot around ~200 rows
          • Average row size may affect this number (pending further testing)
      • HiveServer2 is capable of handling multiple clients
          • Throughput of 10GB in ~20 minutes with five client connections
      • Drop-off in throughput is expected and reasonable
          • 5x increase in concurrent connections = 2x increase in transfer time
      • Goal of 50GB in 5min
          • Achievable with ~10 HiveServer2 instances streaming data
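The step from the single-client figures to "~10 HiveServer2 instances" can be checked with simple arithmetic. The sketch below assumes 1 GB = 1000 MB, that each instance sustains the single-client rate measured for 10GB, and that throughput adds up linearly across instances; none of these assumptions is stated on the slide, and the class and method names are illustrative.

```java
public class ThroughputEstimate {
    /** Throughput in MB/s, taking 1 GB = 1000 MB for round numbers (an assumption). */
    static double mbPerSecond(double gigabytes, double seconds) {
        return gigabytes * 1000.0 / seconds;
    }

    /** Instances needed for a target rate, assuming rates add linearly across instances. */
    static int instancesNeeded(double targetMbPerSec, double perInstanceMbPerSec) {
        return (int) Math.ceil(targetMbPerSec / perInstanceMbPerSec);
    }

    public static void main(String[] args) {
        double single = mbPerSecond(10, 500);  // single client: 10GB in 500s
        double target = mbPerSecond(50, 300);  // goal: 50GB in 5min
        System.out.printf("single client: %.1f MB/s%n", single);
        System.out.printf("target: %.1f MB/s%n", target);
        System.out.println("instances needed: " + instancesNeeded(target, single));
    }
}
```

Under these assumptions a single client moves 20 MB/s, the goal needs about 167 MB/s, and the estimate comes out at 9 instances, which is in line with the slide's "~10".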