DEBUGGING HIVE WITH
HADOOP IN THE CLOUD
Soam Acharya, David Chaiken, Denis Sheahan, Charles Wimmer
Altiscale, Inc.
#LABDUG @ 20150115T19:30-0800
WHO ARE WE?
•  Altiscale: Infrastructure Nerds!
•  Hadoop As A Service
•  Rack and build our own Hadoop clusters
•  Provide a suite of Hadoop tools
o  Hive, Pig, Oozie
o  Others as needed: R, Python, Spark, Mahout, Impala, etc.
•  Monthly billing plan: compute (YARN), storage (HDFS)
•  https://www.altiscale.com
•  @Altiscale #HadoopSherpa
TALK ROADMAP
•  Our Platform and Perspective
•  Hadoop 2 Primer
•  Hadoop Debugging Tools
•  Accessing Logs in Hadoop 2
•  Hive + Hadoop Architecture
•  Hive Logs
•  Hive Issues + Case Studies
o  Hive + Interactive (DRAM Centric) Processing Engines
•  Conclusion: Making Hive Easier to Use
OUR DYNAMIC PLATFORM
•  Hadoop 2.0.5 => Hadoop 2.2.0 => Hadoop 2.4.1 => …
•  Hive 0.10 => Hive 0.12 => Stinger (Hive 0.13 + Tez) => …
•  Hive, Pig and Oozie most commonly used tools
•  Working with customers on:
Spark, H2O, Trifacta, Impala, Flume, Camus/Kafka, …
ALTISCALE PERSPECTIVE
•  What we do as a service provider…
o  Performance + Reliability: Jobs finish faster, fewer failures
o  Instant Access: Always-on access to HDFS and YARN
o  Hadoop Helpdesk: Tools + experts ensure customer success
o  Secure: Networking, SOC 2 Audit, Kerberos
o  Results: Faster Time-to-Value (TTV), Lower TCO
•  Operational approach in this presentation…
o  How to use Hadoop 2 cluster tools and logs
to debug and to tune Hive
o  This talk will not focus on query optimization
 	
  	
QUICK PRIMER – HADOOP 2
[Diagram: Hadoop 2 cluster with a NameNode, a Secondary NameNode, a Resource Manager, and Hadoop slaves, each running a Node Manager and a DataNode]
QUICK PRIMER – HADOOP 2 YARN
•  Resource Manager (per cluster)
o  Manages job scheduling and execution
o  Global resource allocation
•  Application Master (per job)
o  Manages task scheduling and execution
o  Local resource allocation
•  Node Manager (per-machine agent)
o  Manages the lifecycle of task containers
o  Reports to RM on health and resource usage
HADOOP 1 VS HADOOP 2
•  No more JobTrackers, TaskTrackers
•  YARN ~ Operating System for Clusters
o  MapReduce is implemented as a YARN application
o  Bring on the applications! (Spark is just the start…)
•  Should be Transparent to Hive users
HADOOP 2 DEBUGGING TOOLS
•  Monitoring
o  System state of cluster:
§  CPU, Memory, Network, Disk
§  Nagios, Ganglia, Sensu!
§  Collectd, statsd, Graphite
o  Hadoop level
§  HDFS usage
§  Resource usage:
•  Container memory allocated vs used
•  # of jobs running at the same time
•  Long running tasks
HADOOP 2 DEBUGGING TOOLS
•  Hadoop logs
o  Daemon logs: Resource Manager, NameNode, DataNode
o  Application logs: Application Master, MapReduce tasks
o  Job history file: resources allocated during job lifetime
o  Application configuration files: store all Hadoop application
parameters
•  Source code instrumentation
ACCESSING LOGS IN HADOOP 2
•  To view the logs for a job, click on the link under the ID
column in Resource Manager UI.
ACCESSING LOGS IN HADOOP 2
•  To view application top level logs, click on logs.
•  To view individual logs for the mappers and reducers,
click on History.
ACCESSING LOGS IN HADOOP 2
•  Log output for the entire application.
ACCESSING LOGS IN HADOOP 2
•  Click on the Map link for mapper logs and the Reduce
link for reducer logs.
ACCESSING LOGS IN HADOOP 2
•  Clicking on a single link under Name provides an
overview for that particular map job.
ACCESSING LOGS IN HADOOP 2
•  Finally, clicking on the logs link will take you to the log
output for that map job.
ACCESSING LOGS IN HADOOP 2
•  Fun, fun, donuts, and more fun…
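When log aggregation is enabled on the cluster, the same application logs can usually be fetched without the click path above by running `yarn logs -applicationId <id>`. A minimal sketch of wrapping that command follows; the function names are hypothetical, and it assumes the `yarn` client is on the PATH of the machine running the script.

```python
import re
import subprocess

# YARN application IDs look like application_<cluster timestamp>_<sequence>
APP_ID_PATTERN = re.compile(r"^application_\d+_\d+$")

def yarn_logs_command(application_id):
    """Build the `yarn logs` command line for one application ID."""
    if not APP_ID_PATTERN.match(application_id):
        raise ValueError(f"not a YARN application ID: {application_id!r}")
    return ["yarn", "logs", "-applicationId", application_id]

def fetch_application_logs(application_id):
    """Run `yarn logs` and return its output (needs a configured client)."""
    result = subprocess.run(
        yarn_logs_command(application_id),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

if __name__ == "__main__":
    # Only prints the command; call fetch_application_logs() on a real cluster.
    print(" ".join(yarn_logs_command("application_1382973574141_4536")))
```

Validating the ID first keeps a mistyped `job_...` ID from being passed where an `application_...` ID is expected.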
HIVE + HADOOP 2 ARCHITECTURE
•  Hive 0.10+
	
  	
  	
[Diagram: Hive CLI, Hive Metastore, and HiveServer2 (JDBC/ODBC clients such as Tableau, Kettle, …) connected to the Hadoop 2 cluster]
HIVE LOGS
•  Query Log location
•  From /etc/hive/hive-site.xml:
<property>
<name>hive.querylog.location</name>
<value>/home/hive/log/${user.name}</value>
</property>
SessionStart SESSION_ID="soam_201402032341" TIME="1391470900594"
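Records in the query log are an event name followed by key="value" pairs, so they are easy to pull apart with a script. The sketch below is one way to do it; the function names are hypothetical, and reading TIME as milliseconds since the Unix epoch is an assumption (it is consistent with the session ID in the sample record).

```python
import re
from datetime import datetime, timedelta, timezone

# key="value" pairs as they appear in Hive query log records
PAIR_PATTERN = re.compile(r'(\w+)="([^"]*)"')
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def parse_history_record(line):
    """Split one record into its event name and a dict of its fields."""
    event, _, rest = line.strip().partition(" ")
    return event, dict(PAIR_PATTERN.findall(rest))

def record_time_utc(fields):
    """Read TIME as milliseconds since the epoch (an assumption)."""
    return EPOCH + timedelta(milliseconds=int(fields["TIME"]))

event, fields = parse_history_record(
    'SessionStart SESSION_ID="soam_201402032341" TIME="1391470900594"'
)
print(event, fields["SESSION_ID"], record_time_utc(fields).isoformat())
```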
HIVE CLIENT LOGS
•  /etc/hive/hive-log4j.properties:
o  hive.log.dir=/var/log/hive/${user.name}
2014-05-29 19:51:09,830 INFO parse.ParseDriver (ParseDriver.java:parse(179)) - Parsing command: select count(*) from dogfood_job_data
2014-05-29 19:51:09,852 INFO parse.ParseDriver (ParseDriver.java:parse(197)) - Parse Completed
2014-05-29 19:51:09,852 INFO ql.Driver (PerfLogger.java:PerfLogEnd(124)) - </PERFLOG method=parse start=1401393069830 end=1401393069852 duration=22>
2014-05-29 19:51:09,853 INFO ql.Driver (PerfLogger.java:PerfLogBegin(97)) - <PERFLOG method=semanticAnalyze>
2014-05-29 19:51:09,890 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:analyzeInternal(8305)) - Starting Semantic Analysis
2014-05-29 19:51:09,892 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:analyzeInternal(8340)) - Completed phase 1 of Semantic Analysis
2014-05-29 19:51:09,892 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1060)) - Get metadata for source tables
2014-05-29 19:51:09,906 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1167)) - Get metadata for subqueries
2014-05-29 19:51:09,909 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1187)) - Get metadata for destination tables
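The closing PERFLOG entries in the client log carry a duration in milliseconds for each compile phase, which makes it possible to see where client-side time goes. A minimal sketch of summing them per method, with a hypothetical function name and the sample taken from the log above:

```python
import re

PERFLOG_END = re.compile(
    r"</PERFLOG method=(?P<method>\w+) start=(?P<start>\d+) "
    r"end=(?P<end>\d+) duration=(?P<duration>\d+)>"
)

def perflog_durations(log_text):
    """Return {method: total duration in ms} from closing PERFLOG entries."""
    totals = {}
    # Copied logs are often re-wrapped, so collapse whitespace before matching.
    flat = " ".join(log_text.split())
    for match in PERFLOG_END.finditer(flat):
        method = match.group("method")
        totals[method] = totals.get(method, 0) + int(match.group("duration"))
    return totals

sample = """2014-05-29 19:51:09,852 INFO ql.Driver (PerfLogger.java:PerfLogEnd(124)) - </PERFLOG
method=parse start=1401393069830 end=1401393069852 duration=22>"""
print(perflog_durations(sample))
```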
HIVE METASTORE LOGS
•  /etc/hive-metastore/hive-log4j.properties:
o  hive.log.dir=/service/log/hive-metastore/${user.name}
2014-05-29 19:50:50,179 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,180 INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=chaiken ip=/10.252.18.94 cmd=source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,236 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,236 INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=chaiken ip=/10.252.18.94 cmd=source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
2014-05-29 19:50:50,261 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
HIVE ISSUES + CASE STUDIES
•  Hive Issues
o  Hive client out of memory
o  Hive map/reduce task out of memory
o  Hive metastore out of memory
o  Hive launches too many tasks
•  Case Studies:
o  Hive “stuck” job
o  Hive “missing directories”
o  Analyze Hive Query Execution
o  Hive + Interactive (DRAM Centric) Processing Engines
HIVE CLIENT OUT OF MEMORY
•  Memory intensive client side hive query (map-side join)
Number of reduce tasks not specified. Estimated from input data size: 999
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
java.lang.OutOfMemoryError: Java heap space
at java.nio.CharBuffer.wrap(CharBuffer.java:350)
at java.nio.CharBuffer.wrap(CharBuffer.java:373)
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:138)
HIVE CLIENT OUT OF MEMORY
•  Use HADOOP_HEAPSIZE prior to launching Hive client
•  HADOOP_HEAPSIZE=<new heapsize> hive <fileName>
•  Watch out for HADOOP_CLIENT_OPTS issue in hive-env.sh!
•  Important to know the amount of memory available on
machine running client… Do not exceed or use
disproportionate amount.
$ free -m
             total       used       free     shared    buffers     cached
Mem:          1695       1388        306          0         60        424
-/+ buffers/cache:        903        791
Swap:          895        101        794
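To keep the client heap proportionate to the machine, the memory check can be scripted. This is a minimal sketch that reads the "-/+ buffers/cache" row of `free -m`; the function names and the 50% cap are assumptions for illustration, and newer versions of `free` print an "available" column instead of that row.

```python
def available_mb(free_output):
    """Return free memory in MB, not counting buffers/cache, from `free -m`.

    Assumes the older layout with a "-/+ buffers/cache:" row.
    """
    for line in free_output.splitlines():
        if line.startswith("-/+ buffers/cache:"):
            _, free = line.split(":", 1)[1].split()
            return int(free)
    raise ValueError("no '-/+ buffers/cache:' row found")

def suggest_heap_mb(free_output, percent=50):
    """Hypothetical rule of thumb: cap the heap at `percent` of free RAM."""
    return available_mb(free_output) * percent // 100

sample = """\
             total       used       free     shared    buffers     cached
Mem:          1695       1388        306          0         60        424
-/+ buffers/cache:        903        791
Swap:          895        101        794
"""
print(available_mb(sample), suggest_heap_mb(sample))
```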
	
  
	
  
HIVE TASK OUT OF MEMORY
•  Query spawns MapReduce jobs that run out of memory
•  How to find this issue?
o  Hive diagnostic message
o  Hadoop MapReduce logs
HIVE TASK OUT OF MEMORY
•  Fix is to increase task RAM allocation…
set mapreduce.map.memory.mb=<new RAM allocation>;
set mapreduce.reduce.memory.mb=<new RAM allocation>;
•  Also watch out for…
set mapreduce.map.java.opts=-Xmx<heap size>m;
set mapreduce.reduce.java.opts=-Xmx<heap size>m;
•  Not a magic bullet – requires manual tuning
•  Increase in individual container memory size:
o  Decrease in overall containers that can be run
o  Decrease in overall parallelism
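The trade-off can be made concrete with a little arithmetic: the JVM heap has to fit inside the container, and larger containers mean fewer of them per node. A minimal sketch, assuming an 80% heap-to-container ratio and a node with 48 GB given to YARN; both numbers are illustrative assumptions, not values from this deck.

```python
def heap_opts(container_mb, heap_percent=80):
    """Build a -Xmx value that leaves headroom inside the container.

    The 80% ratio is an assumed rule of thumb.
    """
    return f"-Xmx{container_mb * heap_percent // 100}m"

def containers_per_node(node_mb, container_mb):
    """How many containers of this size fit in a node's YARN memory."""
    return node_mb // container_mb

node_mb = 49152  # hypothetical: 48 GB per node available to YARN
for container_mb in (2048, 4096, 5120):
    print(container_mb, heap_opts(container_mb),
          containers_per_node(node_mb, container_mb))
```

Doubling the container size halves the number of tasks a node can run at once, which is the loss of parallelism noted above.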
HIVE METASTORE OUT OF MEMORY
•  Out of memory issues not necessarily dumped to logs
•  Metastore can become unresponsive
•  Can’t submit queries
•  Restart with a higher heap size:
export HADOOP_HEAPSIZE in hcat_server.sh
•  After notifying hive users about downtime:
service hcat restart
HIVE LAUNCHES TOO MANY TASKS
•  Typically a function of the input data set
•  Lots of little files
HIVE LAUNCHES TOO MANY TASKS
•  Set mapred.max.split.size to appropriate fraction of data size
•  Also verify that
hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
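A rough way to pick a split size is to work backwards from the number of map tasks you can tolerate. The sketch below estimates a lower bound on map tasks for a given input size; the function name is hypothetical, and real split counts also depend on file boundaries, block placement, and the minimum split settings.

```python
def estimate_map_tasks(total_input_bytes, max_split_bytes):
    """Rough lower bound on map tasks when small files are combined."""
    # Integer ceiling division avoids floating-point rounding.
    return (total_input_bytes + max_split_bytes - 1) // max_split_bytes

five_tib = 5 * 1024**4
for split in (256 * 1024**2, 5 * 1024**3):
    print(split, estimate_map_tasks(five_tib, split))
```

The second split size, 5 * 1024**3, equals 5368709120 bytes.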
CASE STUDY: HIVE STUCK JOB
From an Altiscale customer:
“This job [jobid] has been running now for
41 hours. Is it still progressing or has
something hung up the map/reduce so it’s
just spinning? Do you have any insight?”
HIVE STUCK JOB
1.  Received jobId,
application_1382973574141_4536, from client
2.  Logged into client cluster.
3.  Pulled up Resource Manager
4.  Entered part of jobId (4536) in the search box.
5.  Clicked on the link that says:
application_1382973574141_4536
6.  On resulting Application Overview page, clicked on link
next to “Tracking URL” that said Application Master
HIVE STUCK JOB
7.  On resulting MapReduce Application page, we clicked on the
Job Id (job_1382973574141_4536).
8.  The resulting MapReduce Job page displayed detailed status
of the mappers, including 4 failed mappers
9.  We then clicked on the 4 link on the Maps row in the Failed
column.
10. Title of the next page was “FAILED Map attempts in
job_1382973574141_4536.”
11.  Each failed mapper generated an error message.
12. Buried in the 16th line:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://opaque_hostname:8020/HiveTableDir/FileName.log.date.seq
HIVE STUCK JOB
•  Job was stuck for a day or so, retrying a mapper that
would never finish successfully.
•  During the job, our customer’s colleague realized the input
file was corrupted and deleted it.
•  Colleague did not anticipate the effect of removing
corrupted data on a running job
•  Hadoop didn’t make it easy to find out:
o  RM => search => application link => AM overview page => MR
Application Page => MR Job Page => Failed jobs page =>
parse long logs
o  Task retry without hope of success
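Since the useful line was buried deep in each error message, one shortcut is to pull only the "Caused by:" lines out of the saved diagnostics text. A minimal sketch with a hypothetical function name; the first line of the sample is a placeholder for the elided part of the message.

```python
def root_causes(diagnostics):
    """Return the distinct 'Caused by:' lines from task diagnostics text."""
    seen = []
    for line in diagnostics.splitlines():
        line = line.strip()
        if line.startswith("Caused by:") and line not in seen:
            seen.append(line)
    return seen

sample = """\
(first lines of the task error message, elided)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://opaque_hostname:8020/HiveTableDir/FileName.log.date.seq
"""
for cause in root_causes(sample):
    print(cause)
```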
HIVE “MISSING DIRECTORIES”
From an Altiscale customer:
“One problem we are seeing after the
[Hive Metastore] restart is that we lost
quite a few directories in [HDFS]. Is there
a way to recover these?”
HIVE “MISSING DIRECTORIES”
•  Obtained list of “missing” directories from customer:
o  /hive/biz/prod/*
•  Confirmed they were missing from HDFS
•  Searched through NameNode audit log to get block IDs that
belonged to missing directories.
13/07/24 21:10:08 INFO hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hive/biz/prod/incremental/carryoverstore/postdepuis/lmt_unmapped_pggroup_schema._COPYING_. BP-798113632-10.251.255.251-1370812162472 blk_3560522076897293424_2448396{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.251.255.177:50010|RBW], ReplicaUnderConstruction[10.251.255.174:50010|RBW], ReplicaUnderConstruction[10.251.255.169:50010|RBW]]}
HIVE “MISSING DIRECTORIES”
•  Used blockID to locate exact time of file deletion from
NameNode logs:
13/07/31 08:10:33 INFO hdfs.StateChange: BLOCK* addToInvalidates: blk_3560522076897293424_2448396 to 10.251.255.177:50010 10.251.255.169:50010 10.251.255.174:50010
•  Used time of deletion to inspect hive logs
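The two-step search (path to block ID, then block ID to deletion time) can be sketched as a small script. Function names are hypothetical, each log record is assumed to sit on one line, and the first sample record is abridged from the entry shown earlier.

```python
import re

BLOCK_ID = re.compile(r"blk_-?\d+_\d+")

def blocks_allocated_under(log_lines, path_prefix):
    """Block IDs from allocateBlock entries whose path starts with the prefix."""
    blocks = set()
    for line in log_lines:
        if "NameSystem.allocateBlock: " + path_prefix in line:
            blocks.update(BLOCK_ID.findall(line))
    return blocks

def invalidation_entries(log_lines, blocks):
    """Log entries that schedule any of the given blocks for deletion."""
    return [
        line for line in log_lines
        if "addToInvalidates" in line
        and any(block in line for block in blocks)
    ]

log = [
    # Abridged: the replica list is omitted from this record.
    "13/07/24 21:10:08 INFO hdfs.StateChange: BLOCK* NameSystem.allocateBlock: "
    "/hive/biz/prod/incremental/carryoverstore/postdepuis/"
    "lmt_unmapped_pggroup_schema._COPYING_. "
    "BP-798113632-10.251.255.251-1370812162472 "
    "blk_3560522076897293424_2448396{blockUCState=UNDER_CONSTRUCTION}",
    "13/07/31 08:10:33 INFO hdfs.StateChange: BLOCK* addToInvalidates: "
    "blk_3560522076897293424_2448396 to 10.251.255.177:50010 "
    "10.251.255.169:50010 10.251.255.174:50010",
]
blocks = blocks_allocated_under(log, "/hive/biz/prod/")
for entry in invalidation_entries(log, blocks):
    print(entry.split(" INFO ")[0], sorted(blocks))
```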
HIVE “MISSING DIRECTORIES”
QueryStart QUERY_STRING="create database biz_weekly location '/hive/biz/prod'" QUERY_ID="usrprod_20130731043232_0a40fd32-8c8a-479c-ba7d-3bd8a2698f4b" TIME="1375245164667"
:
QueryEnd QUERY_STRING="create database biz_weekly location '/hive/biz/prod'" QUERY_ID="usrprod_20130731043232_0a40fd32-8c8a-479c-ba7d-3bd8a2698f4b" QUERY_RET_CODE="0" QUERY_NUM_TASKS="0" TIME="1375245166203"
:
QueryStart QUERY_STRING="drop database biz_weekly" QUERY_ID="usrprod_20130731073333_e9acf35c-4f07-4f12-bd9d-bae137ae0733" TIME="1375256014799"
:
QueryEnd QUERY_STRING="drop database biz_weekly" QUERY_ID="usrprod_20130731073333_e9acf35c-4f07-4f12-bd9d-bae137ae0733" QUERY_NUM_TASKS="0" TIME="1375256014838"
HIVE “MISSING DIRECTORIES”
•  In effect, user “usrprod” issued:
At 2013-07-31 04:32:44: create database biz_weekly
location '/hive/biz/prod'
At 2013-07-31 07:33:34: drop database biz_weekly
•  This is functionally equivalent to:
hdfs dfs -rm -r /hive/biz/prod
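The wall-clock times come from the TIME fields of the QueryStart records. A minimal sketch of the conversion, assuming TIME is milliseconds since the Unix epoch and that the result is wanted in UTC; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def hive_time_to_utc(time_field):
    """Convert a TIME value, read as milliseconds since the epoch, to UTC."""
    return EPOCH + timedelta(milliseconds=int(time_field))

created = hive_time_to_utc("1375245164667")  # QueryStart of create database
dropped = hive_time_to_utc("1375256014799")  # QueryStart of drop database
print(created.strftime("%Y-%m-%d %H:%M:%S"))
print(dropped.strftime("%Y-%m-%d %H:%M:%S"))
print("elapsed:", dropped - created)
```

Using integer milliseconds with `timedelta` avoids the rounding that a float timestamp could introduce.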
HIVE “MISSING DIRECTORIES”
•  Customer manually placed their own data in /hive –
the warehouse directory managed and controlled by hive
•  Customer used CREATE and DROP db commands in
their code
o  Hive deletes database and table locations in /hive with
impunity
•  Why didn’t deleted data end up in .Trash?
o  Trash collection not turned on in configuration settings
o  It is now, but need a -skipTrash option (HIVE-6469)
HIVE “MISSING DIRECTORIES”
•  Hadoop forensics: piece together disparate sources…
o  Hadoop daemon logs (NameNode)
o  Hive query and metastore logs
o  Hadoop config files
•  Need better tools to correlate the different layers of the
system: hive client, hive metastore, MapReduce job,
YARN, HDFS, operating system metrics, …
By the way… Operating any distributed system would be
totally insane without NTP and a standard time zone (UTC).
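Correlating layers starts with getting every timestamp into one representation. The sketch below normalizes the three formats seen in this deck (NameNode log, Hive log4j log, Hive query log TIME field) and sorts the events; the function names are hypothetical, and treating the log timestamps as UTC is an assumption that only holds if the cluster follows the advice above.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

def parse_namenode_time(stamp):
    """NameNode log style: 'yy/MM/dd HH:mm:ss' (assumed to be UTC)."""
    return datetime.strptime(stamp, "%y/%m/%d %H:%M:%S").replace(tzinfo=UTC)

def parse_log4j_time(stamp):
    """Hive log4j style: 'yyyy-MM-dd HH:mm:ss,SSS' (assumed to be UTC)."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f").replace(tzinfo=UTC)

def parse_epoch_millis(value):
    """Hive query log TIME field, read as milliseconds since the epoch."""
    return datetime(1970, 1, 1, tzinfo=UTC) + timedelta(milliseconds=int(value))

events = [
    (parse_namenode_time("13/07/31 08:10:33"), "NameNode: addToInvalidates"),
    (parse_epoch_millis("1375256014799"), "Hive: drop database biz_weekly"),
    (parse_epoch_millis("1375245164667"), "Hive: create database biz_weekly"),
]
for when, what in sorted(events):
    print(when.isoformat(), what)
```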
CASE STUDY – ANALYZE QUERY
•  Customer provided Hive query + data sets
(100GBs to ~5 TBs)
•  Needed help optimizing the query
•  Didn’t rewrite query immediately
•  Wanted to characterize query performance and isolate
bottlenecks first
ANALYZE AND TUNE EXECUTION
•  Ran original query on the datasets in our environment:
o  Two M/R Stages: Stage-1, Stage-2
•  Long running reducers run out of memory
o  set mapreduce.reduce.memory.mb=5120
o  Reduces slots and extends reduce time
•  Query fails to launch Stage-2 with out of memory
o  set HADOOP_HEAPSIZE=1024 on client machine
•  Query has 250,000 Mappers in Stage-2 which causes
failure
o  set mapred.max.split.size=5368709120 to reduce Mappers
ANALYSIS: HOW TO VISUALIZE?
•  Next challenge - how to visualize job execution?
•  Existing hadoop/hive logs not sufficient for this task
•  Wrote internal tools
o  parse job history files
o  plot mapper and reducer execution
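The internal tools are not shown in this deck. As an illustration of the idea only, here is a minimal sketch that draws a text timeline from task start and finish times; the record shape, names, and timings are all made up, and parsing real job history files would need a separate step.

```python
from collections import namedtuple

# Hypothetical record shape; real job history files need their own parser.
Task = namedtuple("Task", "task_id kind start_s finish_s")

def text_timeline(tasks, width=40):
    """Render one bar per task, scaled to the span of the whole job."""
    job_start = min(t.start_s for t in tasks)
    job_end = max(t.finish_s for t in tasks)
    scale = width / max(job_end - job_start, 1)
    rows = []
    for t in sorted(tasks, key=lambda t: (t.kind, t.start_s)):
        offset = int((t.start_s - job_start) * scale)
        length = max(int((t.finish_s - t.start_s) * scale), 1)
        rows.append(f"{t.kind:<6} {t.task_id:<6} " + " " * offset + "#" * length)
    return rows

def longest_task(tasks, kind):
    """The slowest task of a given kind, e.g. a lone long-running reducer."""
    return max((t for t in tasks if t.kind == kind),
               key=lambda t: t.finish_s - t.start_s)

tasks = [  # made-up timings for illustration only
    Task("m_0", "map", 0, 40), Task("m_1", "map", 5, 50),
    Task("r_0", "reduce", 50, 400), Task("r_1", "reduce", 50, 90),
]
print("\n".join(text_timeline(tasks)))
print("slowest reducer:", longest_task(tasks, "reduce").task_id)
```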
ANALYSIS: MAP STAGE-1
[Plot annotation: single reduce task]
ANALYSIS: REDUCE STAGE-1
ANALYSIS: MAP STAGE-2
ANALYSIS: REDUCE STAGE-2
ANALYZE EXECUTION: FINDINGS
•  Lone, long running reducer in first stage of query
•  Analyzed input data:
o  Query split input data by userId
o  Bucketizing input data by userId
o  One very large bucket: “invalid” userId
o  Discussed “invalid” userid with customer
•  An error value is a common pattern!
o  Need to differentiate between “don’t know and don’t care”
and “don’t know and do care.”
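A quick frequency count over the grouping key is often enough to spot this kind of skew before running the full query. A minimal sketch, using made-up data in which "-1" stands in for an "invalid" userId; the function name is hypothetical.

```python
from collections import Counter

def key_skew(keys, top_n=3):
    """Return the most frequent keys with their share of all rows."""
    counts = Counter(keys)
    total = sum(counts.values())
    return [(key, n, n / total) for key, n in counts.most_common(top_n)]

# Made-up sample: "-1" stands in for an "invalid" userId error value.
user_ids = ["-1"] * 70 + ["u1"] * 10 + ["u2"] * 10 + ["u3"] * 10
for key, n, share in key_skew(user_ids):
    print(f"{key}: {n} rows ({share:.0%})")
```

A single key holding most of the rows means a single reducer receives most of the work, which matches the lone long-running reducer seen in Stage-1.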
INTERACTIVE (DRAM CENTRIC)
PROCESSING SYSTEMS
•  Loading data into DRAM makes processing fast!
•  Examples: Spark, Impala, 0xdata, …, [SAP HANA], …
•  Streaming systems (Storm, DataTorrent) may be similar
HIVE + INTERACTIVE:
WATCH OUT FOR CONTAINER SIZE
•  Need to increase YARN container memory size
•  Caution: larger YARN container settings for interactive
jobs may not be right for batch systems like Hive
•  Container size: needs to combine vcores and memory:
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores ...
HIVE + INTERACTIVE:
WATCH OUT FOR FRAGMENTATION
•  Attempting to schedule interactive systems and batch
systems like Hive may result in fragmentation
•  Interactive systems may require all-or-nothing scheduling
•  Batch jobs with little tasks may starve interactive jobs
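Fragmentation means the cluster has enough free memory in aggregate but no single node with room for a large container. A minimal sketch of that check, using a made-up cluster state and hypothetical function names; it models memory only and ignores vcores and scheduler queues.

```python
def can_place(free_mb_per_node, container_mb):
    """True if at least one node has room for a container of this size."""
    return any(free >= container_mb for free in free_mb_per_node)

def can_place_all(free_mb_per_node, container_mb, count):
    """All-or-nothing check: can `count` containers be placed at once?"""
    slots = sum(free // container_mb for free in free_mb_per_node)
    return slots >= count

# Made-up cluster: 16 GB free in total, spread evenly over four busy nodes.
free_mb = [4096, 4096, 4096, 4096]
print(sum(free_mb) >= 8192)             # is there enough memory in aggregate?
print(can_place(free_mb, 8192))         # does any single node fit 8 GB?
print(can_place_all(free_mb, 2048, 8))  # do eight small batch tasks fit?
```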
HIVE + INTERACTIVE:
WATCH OUT FOR FRAGMENTATION
Solutions for fragmentation…
•  Reserve interactive nodes before starting batch jobs
•  Reduce interactive container size (if the algorithm permits)
•  Node labels (YARN-2492) and gang scheduling (YARN-624)
CONCLUSIONS
•  Hive + Hadoop debugging can get very complex
o  Sifting through many logs and screens
o  Automatic transmission versus manual transmission
•  Static partitioning induced by the Java Virtual Machine has
benefits but also induces challenges.
•  Where there are difficulties, there’s opportunity:
o  Better tooling, instrumentation, integration of logs/metrics
•  YARN still evolving into an operating system
•  Hadoop as a Service: aggregate and share expertise
•  Need to learn from the traditional database community!
QUESTIONS? COMMENTS?

 
Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA 2022- Embedding medical journeys with machine learning to improve...
Data Con LA 2022- Embedding medical journeys with machine learning to improve...Data Con LA
 
Data Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA 2022 - Data Streaming with Kafka
Data Con LA 2022 - Data Streaming with KafkaData Con LA
 

Debugging Hive with Hadoop-in-the-Cloud by David Chaiken of Altiscale

  • 1. DEBUGGING HIVE WITH HADOOP IN THE CLOUD Soam Acharya, David Chaiken, Denis Sheahan, Charles Wimmer Altiscale, Inc. #LABDUG @ 20150115T19:30-0800
  • 2. WHO ARE WE? •  Altiscale: Infrastructure Nerds! •  Hadoop As A Service •  Rack and build our own Hadoop clusters •  Provide a suite of Hadoop tools o  Hive, Pig, Oozie o  Others as needed: R, Python, Spark, Mahout, Impala, etc. •  Monthly billing plan: compute (YARN), storage (HDFS) •  https://www.altiscale.com •  @Altiscale #HadoopSherpa
  • 3. TALK ROADMAP •  Our Platform and Perspective •  Hadoop 2 Primer •  Hadoop Debugging Tools •  Accessing Logs in Hadoop 2 •  Hive + Hadoop Architecture •  Hive Logs •  Hive Issues + Case Studies o  Hive + Interactive (DRAM Centric) Processing Engines •  Conclusion: Making Hive Easier to Use
  • 4. OUR DYNAMIC PLATFORM •  Hadoop 2.0.5 => Hadoop 2.2.0 => Hadoop 2.4.1 => … •  Hive 0.10 => Hive 0.12 => Stinger (Hive 0.13 + Tez) => … •  Hive, Pig and Oozie most commonly used tools •  Working with customers on: Spark, H2O, Trifacta, Impala, Flume, Camus/Kafka, …
  • 5. ALTISCALE PERSPECTIVE •  What we do as a service provider… o  Performance + Reliability: Jobs finish faster, fewer failures o  Instant Access: Always-on access to HDFS and YARN o  Hadoop Helpdesk: Tools + experts ensure customer success o  Secure: Networking, SOC 2 Audit, Kerberos o  Results: Faster Time-to-Value (TTV), Lower TCO •  Operational approach in this presentation… o  How to use Hadoop 2 cluster tools and logs to debug and to tune Hive o  This talk will not focus on query optimization
  • 6. QUICK PRIMER – HADOOP 2 [Diagram: Hadoop 2 cluster with a Name Node, a Secondary NameNode, and a Resource Manager, plus Hadoop slaves running Node Managers + Data Nodes]
  • 7. QUICK PRIMER – HADOOP 2 YARN •  Resource Manager (per cluster) o  Manages job scheduling and execution o  Global resource allocation •  Application Master (per job) o  Manages task scheduling and execution o  Local resource allocation •  Node Manager (per-machine agent) o  Manages the lifecycle of task containers o  Reports to RM on health and resource usage
  • 8. HADOOP 1 VS HADOOP 2 •  No more JobTrackers, TaskTrackers •  YARN ~ Operating System for Clusters o  MapReduce is implemented as a YARN application o  Bring on the applications! (Spark is just the start…) •  Should be Transparent to Hive users
  • 9. HADOOP 2 DEBUGGING TOOLS •  Monitoring o  System state of cluster: §  CPU, Memory, Network, Disk §  Nagios, Ganglia, Sensu! §  Collectd, statsd, Graphite o  Hadoop level §  HDFS usage §  Resource usage: •  Container memory allocated vs used •  # of jobs running at the same time •  Long running tasks
  • 10. HADOOP 2 DEBUGGING TOOLS •  Hadoop logs o  Daemon logs: Resource Manager, NameNode, DataNode o  Application logs: Application Master, MapReduce tasks o  Job history file: resources allocated during job lifetime o  Application configuration files: store all Hadoop application parameters •  Source code instrumentation
  • 11.
  • 12. ACCESSING LOGS IN HADOOP 2 •  To view the logs for a job, click on the link under the ID column in Resource Manager UI.
  • 13. ACCESSING LOGS IN HADOOP 2 •  To view application top level logs, click on logs. •  To view individual logs for the mappers and reducers, click on History.
  • 14. ACCESSING LOGS IN HADOOP 2 •  Log output for the entire application.
  • 15. ACCESSING LOGS IN HADOOP 2 •  Click on the Map link for mapper logs and the Reduce link for reducer logs.
  • 16. ACCESSING LOGS IN HADOOP 2 •  Clicking on a single link under Name provides an overview for that particular map task.
  • 17. ACCESSING LOGS IN HADOOP 2 •  Finally, clicking on the logs link will take you to the log output for that map task.
  • 18. ACCESSING LOGS IN HADOOP 2 •  Fun, fun, donuts, and more fun…
  • 19. HIVE + HADOOP 2 ARCHITECTURE •  Hive 0.10+ [Diagram: Hive CLI, Hive Metastore, and Hiveserver2 (JDBC/ODBC clients such as Tableau, Kettle, …) in front of the Hadoop 2 cluster]
  • 20. HIVE LOGS •  Query Log location •  From /etc/hive/hive-site.xml:
 <property> <name>hive.querylog.location</name> <value>/home/hive/log/${user.name}</value> </property>
 SessionStart SESSION_ID="soam_201402032341" TIME="1391470900594"
  • 21. HIVE CLIENT LOGS •  /etc/hive/hive-log4j.properties: o  hive.log.dir=/var/log/hive/${user.name}
 2014-05-29 19:51:09,830 INFO parse.ParseDriver (ParseDriver.java:parse(179)) - Parsing command: select count(*) from dogfood_job_data
 2014-05-29 19:51:09,852 INFO parse.ParseDriver (ParseDriver.java:parse(197)) - Parse Completed
 2014-05-29 19:51:09,852 INFO ql.Driver (PerfLogger.java:PerfLogEnd(124)) - </PERFLOG method=parse start=1401393069830 end=1401393069852 duration=22>
 2014-05-29 19:51:09,853 INFO ql.Driver (PerfLogger.java:PerfLogBegin(97)) - <PERFLOG method=semanticAnalyze>
 2014-05-29 19:51:09,890 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:analyzeInternal(8305)) - Starting Semantic Analysis
 2014-05-29 19:51:09,892 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:analyzeInternal(8340)) - Completed phase 1 of Semantic Analysis
 2014-05-29 19:51:09,892 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1060)) - Get metadata for source tables
 2014-05-29 19:51:09,906 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1167)) - Get metadata for subqueries
 2014-05-29 19:51:09,909 INFO parse.SemanticAnalyzer (SemanticAnalyzer.java:getMetaData(1187)) - Get metadata for destination tables
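The PERFLOG records in the client log above carry per-phase timings, so they can be summarized with a few lines of scripting. The following is a minimal sketch, assuming the closing-record layout shown on the slide (`method=`, `start=`, `end=`, `duration=`); the function name `perflog_durations` is hypothetical, and other Hive versions may format these records differently.

```python
import re

# Matches closing PERFLOG records in the layout shown on the slide, e.g.
#   </PERFLOG method=parse start=1401393069830 end=1401393069852 duration=22>
PERFLOG_END = re.compile(
    r"</PERFLOG method=(?P<method>\w+) start=(?P<start>\d+) "
    r"end=(?P<end>\d+) duration=(?P<duration>\d+)>"
)


def perflog_durations(log_text):
    """Return {method: total duration in ms} over all closing PERFLOG records."""
    totals = {}
    for match in PERFLOG_END.finditer(log_text):
        method = match.group("method")
        totals[method] = totals.get(method, 0) + int(match.group("duration"))
    return totals


# Sample line copied from the slide.
sample = (
    "2014-05-29 19:51:09,852 INFO ql.Driver (PerfLogger.java:PerfLogEnd(124)) - "
    "</PERFLOG method=parse start=1401393069830 end=1401393069852 duration=22>"
)
print(perflog_durations(sample))
```

Summing by method makes it easier to see whether time is going to parsing, semantic analysis, or later phases before looking at the cluster at all.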
  • 22. HIVE METASTORE LOGS •  /etc/hive-metastore/hive-log4j.properties: o  hive.log.dir=/service/log/hive-metastore/${user.name}
 2014-05-29 19:50:50,179 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
 2014-05-29 19:50:50,180 INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=chaiken ip=/10.252.18.94 cmd=source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
 2014-05-29 19:50:50,236 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
 2014-05-29 19:50:50,236 INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - ugi=chaiken ip=/10.252.18.94 cmd=source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
 2014-05-29 19:50:50,261 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - 200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data
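The metastore audit lines record who called which operation on which object, which is useful when hunting for a noisy client. A rough sketch of tallying them follows, assuming the `ugi=… ip=… cmd=…` layout shown on the slide; the regular expression and the name `summarize_audit` are illustrative, not part of Hive.

```python
import re
from collections import Counter

# Assumes the audit layout shown on the slide:
#   ... HiveMetaStore.audit (...) - ugi=<user> ip=<addr> cmd=source:<addr> <op> : <args>
AUDIT = re.compile(
    r"HiveMetaStore\.audit .*?ugi=(?P<user>\S+)\s+ip=(?P<ip>\S+)\s+"
    r"cmd=(?:source:\S+\s+)?(?P<op>\w+)\s*:\s*(?P<args>.*)$"
)


def summarize_audit(lines):
    """Count audit events per (user, operation, arguments)."""
    counts = Counter()
    for line in lines:
        match = AUDIT.search(line.strip())
        if match:
            key = (match.group("user"), match.group("op"), match.group("args").strip())
            counts[key] += 1
    return counts


# Sample lines copied from the slide: two audit events and one non-audit line.
sample_lines = [
    "2014-05-29 19:50:50,179 INFO metastore.HiveMetaStore (HiveMetaStore.java:logInfo(454)) - "
    "200: source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data",
    "2014-05-29 19:50:50,180 INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - "
    "ugi=chaiken ip=/10.252.18.94 cmd=source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data ",
    "2014-05-29 19:50:50,236 INFO HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(239)) - "
    "ugi=chaiken ip=/10.252.18.94 cmd=source:/10.252.18.94 get_table : db=default tbl=dogfood_job_data ",
]
for key, count in summarize_audit(sample_lines).most_common():
    print(count, key)
```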
  • 23. HIVE ISSUES + CASE STUDIES •  Hive Issues o  Hive client out of memory o  Hive map/reduce task out of memory o  Hive metastore out of memory o  Hive launches too many tasks •  Case Studies: o  Hive “stuck” job o  Hive “missing directories” o  Analyze Hive Query Execution o  Hive + Interactive (DRAM Centric) Processing Engines
  • 24. HIVE CLIENT OUT OF MEMORY •  Memory intensive client side hive query (map-side join)
 Number of reduce tasks not specified. Estimated from input data size: 999
 In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=<number>
 In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=<number>
 In order to set a constant number of reducers:
   set mapred.reduce.tasks=<number>
 java.lang.OutOfMemoryError: Java heap space
   at java.nio.CharBuffer.wrap(CharBuffer.java:350)
   at java.nio.CharBuffer.wrap(CharBuffer.java:373)
   at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:138)
  • 25. HIVE CLIENT OUT OF MEMORY •  Use HADOOP_HEAPSIZE prior to launching Hive client •  HADOOP_HEAPSIZE=<new heapsize> hive <fileName> •  Watch out for HADOOP_CLIENT_OPTS issue in hive-env.sh! •  Important to know the amount of memory available on machine running client… Do not exceed or use disproportionate amount.
 $ free -m
              total       used       free     shared    buffers     cached
 Mem:          1695       1388        306          0         60        424
 -/+ buffers/cache:        903        791
 Swap:          895        101        794
  • 26. HIVE TASK OUT OF MEMORY •  Query spawns MapReduce jobs that run out of memory •  How to find this issue? o  Hive diagnostic message o  Hadoop MapReduce logs
  • 27. HIVE TASK OUT OF MEMORY •  Fix is to increase task RAM allocation… set mapreduce.map.memory.mb=<new RAM allocation>; set mapreduce.reduce.memory.mb=<new RAM allocation>; •  Also watch out for… set mapreduce.map.java.opts=-Xmx<heap size>m; set mapreduce.reduce.java.opts=-Xmx<heap size>m; •  Not a magic bullet – requires manual tuning •  Increase in individual container memory size: o  Decrease in overall containers that can be run o  Decrease in overall parallelism
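The container size and the JVM heap have to move together, and a bigger container means fewer of them per node. The sketch below illustrates both effects. It assumes a rule of thumb of giving the heap about 80% of the container, which is a common convention and not something the slides state; the helper names and the node size in the usage lines are hypothetical.

```python
def hive_memory_settings(container_mb, heap_percent=80):
    """Build Hive 'set' statements for a given container size.

    heap_percent is an assumed rule of thumb: the JVM heap must stay below
    the container size to leave room for non-heap memory.
    """
    heap_mb = container_mb * heap_percent // 100
    return [
        f"set mapreduce.map.memory.mb={container_mb};",
        f"set mapreduce.reduce.memory.mb={container_mb};",
        f"set mapreduce.map.java.opts=-Xmx{heap_mb}m;",
        f"set mapreduce.reduce.java.opts=-Xmx{heap_mb}m;",
    ]


def containers_per_node(node_memory_mb, container_mb):
    """Upper bound on concurrent containers when memory is the only limit."""
    return node_memory_mb // container_mb


for statement in hive_memory_settings(5120):
    print(statement)

# Hypothetical node offering 40 GB to YARN: larger containers, less parallelism.
print(containers_per_node(40960, 2048), "containers at 2 GB each")
print(containers_per_node(40960, 5120), "containers at 5 GB each")
```

The last two lines make the trade-off concrete: on the assumed node, moving from 2 GB to 5 GB containers cuts the number of tasks that can run at once by more than half.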
  • 28. HIVE METASTORE OUT OF MEMORY •  Out of memory issues not necessarily dumped to logs •  Metastore can become unresponsive •  Can’t submit queries •  Restart with a higher heap size: export HADOOP_HEAPSIZE in hcat_server.sh •  After notifying hive users about downtime: service hcat restart
  • 29. HIVE LAUNCHES TOO MANY TASKS •  Typically a function of the input data set •  Lots of little files
  • 30. HIVE LAUNCHES TOO MANY TASKS •  Set mapred.max.split.size to appropriate fraction of data size •  Also verify that hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
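To pick a split size, it helps to estimate how many map tasks a given value would produce. The following back-of-the-envelope sketch assumes small files are combined into splits (as with CombineHiveInputFormat) and ignores locality and minimum split sizes, so the real count can differ; `estimated_map_tasks` is a hypothetical helper, and the input sizes are illustrative.

```python
def estimated_map_tasks(total_input_bytes, max_split_bytes):
    """Rough lower bound on map tasks: ceil(input size / max split size)."""
    return -(-total_input_bytes // max_split_bytes)  # integer ceiling division


GIB = 1024 ** 3
TIB = 1024 ** 4

# Illustrative inputs: 5 TiB of data with 128 MiB splits versus 5 GiB splits.
print(estimated_map_tasks(5 * TIB, 128 * 1024 ** 2))
print(estimated_map_tasks(5 * TIB, 5 * GIB))
```

Working backwards from a target task count (for example, a small multiple of the cluster's available containers) gives a starting value for mapred.max.split.size, which can then be tuned.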
  • 31. CASE STUDY: HIVE STUCK JOB From an Altiscale customer: “This job [jobid] has been running now for 41 hours. Is it still progressing or has something hung up the map/reduce so it’s just spinning? Do you have any insight?”
  • 32. HIVE STUCK JOB 1.  Received jobId, application_1382973574141_4536, from client 2.  Logged into client cluster. 3.  Pulled up Resource Manager 4.  Entered part of jobId (4536) in the search box. 5.  Clicked on the link that says: application_1382973574141_4536 6.  On resulting Application Overview page, clicked on link next to “Tracking URL” that said Application Master
  • 33. HIVE STUCK JOB 7.  On resulting MapReduce Application page, we clicked on the Job Id (job_1382973574141_4536). 8.  The resulting MapReduce Job page displayed detailed status of the mappers, including 4 failed mappers 9.  We then clicked on the 4 link on the Maps row in the Failed column. 10. Title of the next page was “FAILED Map attempts in job_1382973574141_4536.” 11.  Each failed mapper generated an error message. 12. Buried in the 16th line: Caused by: java.io.FileNotFoundException: File does not exist: hdfs://opaque_hostname:8020/HiveTableDir/FileName.log.date.seq
  • 34. HIVE STUCK JOB •  Job was stuck for a day or so, retrying a mapper that would never finish successfully. •  During the job, our customer’s colleague realized an input file was corrupted and deleted it. •  Colleague did not anticipate the effect of removing corrupted data on a running job •  Hadoop didn’t make it easy to find out: o  RM => search => application link => AM overview page => MR Application Page => MR Job Page => Failed jobs page => parse long logs o  Task retry without hope of success
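The "parse long logs" step at the end of that chain can be scripted once the diagnostics of the failed attempts have been saved as text. Here is a minimal sketch that pulls the innermost "Caused by:" line out of each attempt and counts repeats; the function name `root_causes` and the wrapper exception in the sample are hypothetical, while the final "Caused by" line is the one quoted on the slide.

```python
from collections import Counter


def root_causes(attempt_logs):
    """Count the last 'Caused by:' line found in each attempt's diagnostics."""
    counts = Counter()
    for text in attempt_logs:
        caused = [
            line.strip()
            for line in text.splitlines()
            if line.strip().startswith("Caused by:")
        ]
        if caused:
            counts[caused[-1]] += 1  # the last one is the innermost cause
    return counts


# Hypothetical diagnostics; only the final 'Caused by' line comes from the slide.
one_attempt = (
    "Error: java.lang.RuntimeException: (hypothetical wrapper message)\n"
    "\tat (stack frames omitted)\n"
    "Caused by: java.io.FileNotFoundException: File does not exist: "
    "hdfs://opaque_hostname:8020/HiveTableDir/FileName.log.date.seq\n"
)
failed_attempts = [one_attempt] * 4

for cause, count in root_causes(failed_attempts).most_common():
    print(count, cause)
```

When every failed attempt reports the same missing file, as here, retries cannot succeed and the job can be killed instead of left to spin.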
  • 35. HIVE “MISSING DIRECTORIES” From an Altiscale customer: “One problem we are seeing after the [Hive Metastore] restart is that we lost quite a few directories in [HDFS]. Is there a way to recover these?”
  • 36. HIVE “MISSING DIRECTORIES” •  Obtained list of “missing” directories from customer: o  /hive/biz/prod/* •  Confirmed they were missing from HDFS •  Searched through NameNode audit log to get block IDs that belonged to missing directories.
 13/07/24 21:10:08 INFO hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hive/biz/prod/incremental/carryoverstore/postdepuis/lmt_unmapped_pggroup_schema._COPYING_. BP-798113632-10.251.255.251-1370812162472 blk_3560522076897293424_2448396{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[10.251.255.177:50010|RBW], ReplicaUnderConstruction[10.251.255.174:50010|RBW], ReplicaUnderConstruction[10.251.255.169:50010|RBW]]}
  • 37. HIVE “MISSING DIRECTORIES” •  Used blockID to locate exact time of file deletion from NameNode logs:
 13/07/31 08:10:33 INFO hdfs.StateChange: BLOCK* addToInvalidates: blk_3560522076897293424_2448396 to 10.251.255.177:50010 10.251.255.169:50010 10.251.255.174:50010
 •  Used time of deletion to inspect hive logs
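Tracing a block through the NameNode log amounts to filtering by block ID and reading the leading timestamp. The sketch below does that for the two record types quoted on these slides. It assumes each line starts with a `yy/MM/dd HH:mm:ss` timestamp as in the excerpts; `block_events` is a hypothetical helper, and the first sample line is shortened with "..." where the slide's full record was longer.

```python
from datetime import datetime


def block_events(log_lines, block_id):
    """Return sorted (timestamp, event kind) pairs for one block ID."""
    events = []
    for raw in log_lines:
        line = raw.strip()
        if block_id not in line:
            continue
        # Assumes the line starts with a timestamp such as '13/07/31 08:10:33'.
        stamp = datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")
        if "allocateBlock" in line:
            kind = "allocated"
        elif "addToInvalidates" in line:
            kind = "invalidated"
        else:
            kind = "other"
        events.append((stamp, kind))
    return sorted(events)


namenode_log = [
    "13/07/24 21:10:08 INFO hdfs.StateChange: BLOCK* NameSystem.allocateBlock: "
    "/hive/biz/prod/... blk_3560522076897293424_2448396{...}",
    "13/07/31 08:10:33 INFO hdfs.StateChange: BLOCK* addToInvalidates: "
    "blk_3560522076897293424_2448396 to 10.251.255.177:50010 "
    "10.251.255.169:50010 10.251.255.174:50010",
]
for stamp, kind in block_events(namenode_log, "blk_3560522076897293424"):
    print(stamp.isoformat(sep=" "), kind)
```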
  • 38. HIVE “MISSING DIRECTORIES”
 QueryStart QUERY_STRING="create database biz_weekly location '/hive/biz/prod'" QUERY_ID="usrprod_20130731043232_0a40fd32-8c8a-479c-ba7d-3bd8a2698f4b" TIME="1375245164667"
 :
 QueryEnd QUERY_STRING="create database biz_weekly location '/hive/biz/prod'" QUERY_ID="usrprod_20130731043232_0a40fd32-8c8a-479c-ba7d-3bd8a2698f4b" QUERY_RET_CODE="0" QUERY_NUM_TASKS="0" TIME="1375245166203"
 :
 QueryStart QUERY_STRING="drop database biz_weekly" QUERY_ID="usrprod_20130731073333_e9acf35c-4f07-4f12-bd9d-bae137ae0733" TIME="1375256014799"
 :
 QueryEnd QUERY_STRING="drop database biz_weekly" QUERY_ID="usrprod_20130731073333_e9acf35c-4f07-4f12-bd9d-bae137ae0733" QUERY_NUM_TASKS="0" TIME="1375256014838"
  • 39. HIVE “MISSING DIRECTORIES” •  In effect, user “usrprod” issued: At 2013-07-31 04:32:44: create database biz_weekly location '/hive/biz/prod' At 2013-07-31 07:33:34: drop database biz_weekly •  This is functionally equivalent to: hdfs dfs -rm -r /hive/biz/prod
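The TIME fields in the Hive query log are epoch milliseconds, so building the timeline is a matter of parsing the key="value" pairs and converting to UTC. A minimal sketch follows, assuming the entry layout quoted from the query log (an event name followed by quoted fields); `parse_entry` is a hypothetical helper.

```python
import re
from datetime import datetime, timezone

PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_entry(line):
    """Split a query-log entry into (event, fields, UTC time to the second)."""
    event, _, rest = line.strip().partition(" ")
    fields = dict(PAIR.findall(rest))
    seconds = int(fields["TIME"]) // 1000  # TIME is epoch milliseconds
    return event, fields, datetime.fromtimestamp(seconds, tz=timezone.utc)


# Entries taken from the query log excerpt for user "usrprod".
SAMPLE = '''\
QueryStart QUERY_STRING="create database biz_weekly location '/hive/biz/prod'" QUERY_ID="usrprod_20130731043232_0a40fd32-8c8a-479c-ba7d-3bd8a2698f4b" TIME="1375245164667"
QueryStart QUERY_STRING="drop database biz_weekly" QUERY_ID="usrprod_20130731073333_e9acf35c-4f07-4f12-bd9d-bae137ae0733" TIME="1375256014799"
'''

timeline = [parse_entry(line) for line in SAMPLE.splitlines()]
for event, fields, when in timeline:
    print(when.strftime("%Y-%m-%d %H:%M:%S"), event, fields["QUERY_STRING"])
```

Printing everything in UTC keeps the result directly comparable with NameNode and metastore timestamps, provided those hosts also log in UTC.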
  • 40. HIVE “MISSING DIRECTORIES” •  Customer manually placed their own data in /hive – the warehouse directory managed and controlled by hive •  Customer used CREATE and DROP db commands in their code o  Hive deletes database and table locations in /hive with impunity •  Why didn’t deleted data end up in .Trash? o  Trash collection not turned on in configuration settings o  It is now, but need a -skipTrash option (HIVE-6469)
  • 41. HIVE “MISSING DIRECTORIES” •  Hadoop forensics: piece together disparate sources… o  Hadoop daemon logs (NameNode) o  Hive query and metastore logs o  Hadoop config files •  Need better tools to correlate the different layers of the system: hive client, hive metastore, MapReduce job, YARN, HDFS, operating system metrics, … By the way… Operating any distributed system would be totally insane without NTP and a standard time zone (UTC).
  • 42. CASE STUDY – ANALYZE QUERY •  Customer provided Hive query + data sets (100GBs to ~5 TBs) •  Needed help optimizing the query •  Didn’t rewrite query immediately •  Wanted to characterize query performance and isolate bottlenecks first
  • 43. ANALYZE AND TUNE EXECUTION •  Ran original query on the datasets in our environment: o  Two M/R Stages: Stage-1, Stage-2 •  Long-running reducers run out of memory o  set mapreduce.reduce.memory.mb=5120 o  Reduces slots and extends reduce time •  Query fails to launch Stage-2 with an out-of-memory error o  set HADOOP_HEAPSIZE=1024 on client machine •  Query has 250,000 Mappers in Stage-2 which causes failure o  set mapred.max.split.size=5368709120 to reduce Mappers
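The effect of a larger maximum split size on the mapper count can be estimated with simple arithmetic. A rough sketch only: the real count also depends on the input format, file sizes, and block boundaries, and the 5 TiB input size below is a hypothetical figure, not the measured Stage-2 input.

```python
def estimated_mappers(total_input_bytes: int, max_split_bytes: int) -> int:
    """Lower-bound estimate: one mapper per split, splits capped at max_split_bytes."""
    # Integer ceiling division avoids floating-point rounding on large byte counts.
    return -(-total_input_bytes // max_split_bytes)


GIB = 1024 ** 3
TIB = 1024 ** 4

max_split = 5368709120            # value from the slide, equal to 5 GiB
hypothetical_input = 5 * TIB      # assumption for illustration only

print("split size in GiB:", max_split / GIB)
print("estimated mappers:", estimated_mappers(hypothetical_input, max_split))
```

Under these assumptions the estimate drops to roughly a thousand mappers, which is the direction of the change the slide describes.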
  • 44. ANALYSIS: HOW TO VISUALIZE? •  Next challenge - how to visualize job execution? •  Existing hadoop/hive logs not sufficient for this task •  Wrote internal tools o  parse job history files o  plot mapper and reducer execution
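The internal tools are not shown in the deck. As an illustration of the idea only, the sketch below draws a text timeline from task start and finish times. The `TaskRun` record and the sample values are hypothetical; extracting such records from real job history files is not covered here.

```python
from typing import List, NamedTuple


class TaskRun(NamedTuple):
    """Hypothetical record of one task attempt (times in seconds)."""
    task_id: str
    kind: str      # "map" or "reduce"
    start_s: int
    finish_s: int


def render_timeline(tasks: List[TaskRun], width: int = 40) -> List[str]:
    """Return one text line per task, with a bar scaled to the whole job's duration."""
    t0 = min(t.start_s for t in tasks)
    t1 = max(t.finish_s for t in tasks)
    span = max(t1 - t0, 1)  # guard against a zero-length job
    lines = []
    for t in sorted(tasks, key=lambda t: t.start_s):
        begin = min((t.start_s - t0) * width // span, width - 1)
        end = max((t.finish_s - t0) * width // span, begin + 1)
        bar = " " * begin + "#" * (end - begin)
        lines.append(f"{t.task_id:<6} {t.kind:<6} |{bar:<{width}}|")
    return lines


# Made-up sample: two short mappers, one short reducer, one long reducer
sample = [
    TaskRun("m_0", "map", 0, 60),
    TaskRun("m_1", "map", 0, 50),
    TaskRun("r_0", "reduce", 60, 100),
    TaskRun("r_1", "reduce", 60, 400),
]
timeline = render_timeline(sample)
print("\n".join(timeline))
```

Even in this crude form, a single task that dominates the job's running time stands out immediately.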
  • 46. ANALYSIS: REDUCE STAGE-1 [Figure: single reduce task]
  • 49. ANALYZE EXECUTION: FINDINGS •  Lone, long-running reducer in first stage of query •  Analyzed input data: o  Query split input data by userId o  Bucketed input data by userId o  One very large bucket: “invalid” userId o  Discussed “invalid” userId with customer •  An error value is a common pattern! o  Need to differentiate between “don’t know and don’t care” and “don’t know and do care.”
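A key distribution check of the kind described above can be sketched in a few lines. This is an illustration under assumptions, not the analysis actually run: the function name, the 50% threshold, and the sample values are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_skewed_keys(keys: Iterable[str], threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return (key, share of rows) for every key holding at least `threshold` of all rows."""
    counts = Counter(keys)
    total = sum(counts.values())
    return [
        (key, n / total)
        for key, n in counts.most_common()
        if n / total >= threshold
    ]


# Made-up sample: most rows carry the error value instead of a real userId
sample_user_ids = ["invalid"] * 8 + ["u1", "u2"]
skewed = find_skewed_keys(sample_user_ids)
print(skewed)
```

When rows are distributed to reducers by key, a key with this large a share of the data ends up on a single reducer, which is consistent with the lone long-running reducer seen in Stage-1.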
  • 50. INTERACTIVE (DRAM CENTRIC) PROCESSING SYSTEMS •  Loading data into DRAM makes processing fast! •  Examples: Spark, Impala, 0xdata, …, [SAP HANA], … •  Streaming systems (Storm, DataTorrent) may be similar •  Need to increase YARN container memory size
  • 51. HIVE + INTERACTIVE: WATCH OUT FOR CONTAINER SIZE •  Caution: larger YARN container settings for interactive jobs may not be right for batch systems like Hive •  Container size: needs to combine vcores and memory: yarn.scheduler.maximum-allocation-vcores yarn.nodemanager.resource.cpu-vcores ...
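Why container size has to be considered in both dimensions can be shown with a small calculation. A sketch under assumptions: the node and container sizes are hypothetical, and whether a given scheduler actually accounts for vcores depends on how it is configured.

```python
def containers_per_node(node_mem_mb: int, node_vcores: int,
                        container_mem_mb: int, container_vcores: int) -> int:
    """Containers that fit on one node when both memory and vcores are limiting."""
    by_memory = node_mem_mb // container_mem_mb
    by_vcores = node_vcores // container_vcores
    return min(by_memory, by_vcores)


# Hypothetical node: 64 GiB of memory and 16 vcores offered to YARN
NODE_MEM_MB, NODE_VCORES = 65536, 16

# Batch-style container (5120 MB, as in the reducer setting earlier in the deck)
batch_fit = containers_per_node(NODE_MEM_MB, NODE_VCORES, 5120, 1)
# Hypothetical interactive container: 24 GiB and 8 vcores
interactive_fit = containers_per_node(NODE_MEM_MB, NODE_VCORES, 24576, 8)

print("batch containers per node:      ", batch_fit)
print("interactive containers per node:", interactive_fit)
```

Settings sized for the large interactive containers leave far fewer slots per node, which is why they may not suit batch workloads such as Hive.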
  • 52. HIVE + INTERACTIVE: WATCH OUT FOR FRAGMENTATION •  Attempting to schedule interactive systems and batch systems like Hive on the same cluster may result in fragmentation •  Interactive systems may require all-or-nothing scheduling •  Batch jobs with many small tasks may starve interactive jobs
  • 53. HIVE + INTERACTIVE: WATCH OUT FOR FRAGMENTATION Solutions for fragmentation… •  Reserve interactive nodes before starting batch jobs •  Reduce interactive container size (if the algorithm permits) •  Node labels (YARN-2492) and gang scheduling (YARN-624)
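The fragmentation problem, and the effect of shrinking the interactive container, can be illustrated with a toy model. This is a sketch with hypothetical numbers that considers memory only; it does not model any real YARN scheduler.

```python
from typing import List


def placeable_containers(free_mb_per_node: List[int], container_mb: int) -> int:
    """Count containers of the given size that fit into the free memory on each node."""
    return sum(free // container_mb for free in free_mb_per_node)


def gang_fits(free_mb_per_node: List[int], container_mb: int, gang_size: int) -> bool:
    """All-or-nothing check: every container of the gang must be placeable at once."""
    return placeable_containers(free_mb_per_node, container_mb) >= gang_size


# Hypothetical cluster after small batch tasks have spread across all nodes:
# 20 GiB is free in total, but no single node has more than 6 GiB free.
free_mb = [4096, 6144, 4096, 6144]

one_large = gang_fits(free_mb, container_mb=16384, gang_size=1)
four_small = gang_fits(free_mb, container_mb=4096, gang_size=4)

print("total free MB:", sum(free_mb))
print("one 16 GiB container fits:  ", one_large)
print("four 4 GiB containers fit:  ", four_small)
```

In this model the large container cannot start even though the cluster has enough free memory overall, while the same 16 GiB split into smaller containers can be placed.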
  • 54. CONCLUSIONS •  Hive + Hadoop debugging can get very complex o  Sifting through many logs and screens o  Automatic transmission versus manual transmission •  Static partitioning induced by the Java Virtual Machine has benefits but also creates challenges •  Where there are difficulties, there’s opportunity: o  Better tooling, instrumentation, integration of logs/metrics •  YARN still evolving into an operating system •  Hadoop as a Service: aggregate and share expertise •  Need to learn from the traditional database community!