This is the presentation given by David Chaiken of Altiscale at the Bangalore Big Data November Meetup.
technology.inmobi.com/events/bigdata-meetup
Talk Outline:
- Altiscale Company Introduction and Perspective
- Altiscale Architecture
- Use Cases: Performance, Job Analysis, Scheduling
- Infinite Hadoop
- Challenges to the Hadoop Community
25. Customer Case Study: Analyze Query
Customer provided Hive query + data sets
(100GBs to ~5 TBs)
Needed help optimizing the query
Didn’t rewrite query immediately
Wanted to characterize query performance and
isolate bottlenecks first
26. Analyze and Tune Execution
Ran original query on the datasets in our environment:
• Two M/R Stages: Stage-1, Stage-2
Long running reducers run out of memory
• set mapreduce.reduce.memory.mb=5120
• Larger reduce containers mean fewer reduce slots and a longer reduce phase
Query fails to launch Stage-2 with an out-of-memory error
• set HADOOP_HEAPSIZE=1024 on client machine
Query has 250,000 Mappers in Stage-2 which causes
failure
• set mapred.max.split.size=5368709120
to reduce Mappers
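
The arithmetic behind two of the settings above can be sketched as follows. This is a rough illustration, not the presenter's tooling: the 5 TiB input size, the 48 GiB of node memory available to YARN, and the 2048 MB "before" reducer size are assumptions chosen for the example, and the real mapper count also depends on file layout and input format.

```python
def estimate_mappers(input_bytes: int, max_split_bytes: int) -> int:
    """Lower bound on map tasks: one per full or partial split."""
    return -(-input_bytes // max_split_bytes)  # integer ceiling division


def reduce_slots_per_node(node_memory_mb: int, reduce_memory_mb: int) -> int:
    """How many reduce containers of a given size fit on one node."""
    return node_memory_mb // reduce_memory_mb


MAX_SPLIT_BYTES = 5368709120   # from the slide: 5 * 1024**3, i.e. 5 GiB
INPUT_BYTES = 5 * 1024**4      # assumption: 5 TiB of input
NODE_MEMORY_MB = 49152         # assumption: 48 GiB per node for YARN

if __name__ == "__main__":
    print(estimate_mappers(INPUT_BYTES, MAX_SPLIT_BYTES))
    print(reduce_slots_per_node(NODE_MEMORY_MB, 2048))  # hypothetical smaller setting
    print(reduce_slots_per_node(NODE_MEMORY_MB, 5120))  # setting from the slide
```

Under these assumptions the split size caps Stage-2 at roughly a thousand mappers instead of 250,000, and raising reducer memory to 5120 MB cuts the reduce slots per node from 24 to 9, which is the trade-off noted on the slide.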
27. Analysis: Job Execution Characteristics
Next challenge - how to visualize job execution?
Existing Hadoop/Hive logs not sufficient for this task
Wrote internal tools
• parse job history files
• plot mapper and reducer execution
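
A minimal sketch of the plotting step, assuming task records have already been extracted from the job history files. The `Task` record, the sample data, and the straggler rule (duration above three times the median for that task type) are hypothetical and are not the internal tools described in the talk; parsing the history format itself is not shown.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Task:
    task_id: str
    kind: str    # "MAP" or "REDUCE"
    start: int   # seconds since job start
    finish: int  # seconds since job start


def render_timeline(tasks, width=40):
    """Return one text bar per task, scaled to the job's total duration."""
    end = max(t.finish for t in tasks)
    lines = []
    for t in sorted(tasks, key=lambda t: (t.kind, t.start)):
        begin_col = t.start * width // end
        end_col = max(begin_col + 1, t.finish * width // end)
        mark = "m" if t.kind == "MAP" else "R"
        bar = " " * begin_col + mark * (end_col - begin_col)
        lines.append(f"{t.task_id:<6}|{bar:<{width}}|")
    return lines


def stragglers(tasks, factor=3.0):
    """Tasks whose duration exceeds factor x the median of their kind."""
    result = []
    for kind in ("MAP", "REDUCE"):
        group = [t for t in tasks if t.kind == kind]
        if not group:
            continue
        typical = median(t.finish - t.start for t in group)
        result.extend(t for t in group if t.finish - t.start > factor * typical)
    return result


# Hypothetical sample: four short mappers, two short reducers, one long reducer.
SAMPLE = [
    Task("m0", "MAP", 0, 10), Task("m1", "MAP", 0, 10),
    Task("m2", "MAP", 10, 20), Task("m3", "MAP", 10, 20),
    Task("r0", "REDUCE", 20, 40), Task("r1", "REDUCE", 20, 45),
    Task("r2", "REDUCE", 20, 420),
]

if __name__ == "__main__":
    print("\n".join(render_timeline(SAMPLE)))
    print("stragglers:", [t.task_id for t in stragglers(SAMPLE)])
```

Even a text timeline like this makes a single long-running reducer stand out against the rest of the job.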
32. Analysis Execution: Findings
Lone, long-running reducer in the first stage of the query
Analyzed input data:
• Query split input data by userId
• Bucketized input data by userId
• One very large bucket: “invalid” userId
• Discussed “invalid” userid with customer
An error value is a common pattern!
• Need to differentiate between “don’t know and don’t care”
and “don’t know and do care.”
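
The bucket analysis above can be sketched as a simple key-frequency count. The literal sentinel string `"invalid"`, the sample data, and the 50% threshold are assumptions for illustration; the slides do not say how the error value was encoded or what cutoff was used.

```python
from collections import Counter


def bucket_shares(user_ids):
    """Return (userId, count, share of all records), largest bucket first."""
    counts = Counter(user_ids)
    total = sum(counts.values())
    return [(uid, n, n / total) for uid, n in counts.most_common()]


def skewed_keys(user_ids, threshold=0.5):
    """Keys holding at least `threshold` of all records."""
    return [uid for uid, _, share in bucket_shares(user_ids) if share >= threshold]


# Hypothetical sample: one error value dominates the key space.
SAMPLE_IDS = ["invalid"] * 70 + ["u1"] * 10 + ["u2"] * 10 + ["u3"] * 10

if __name__ == "__main__":
    for uid, n, share in bucket_shares(SAMPLE_IDS):
        print(f"{uid:<8} {n:>4} {share:.0%}")
    print("skewed:", skewed_keys(SAMPLE_IDS))
```

Because every record with the same key goes to the same reducer, a key that holds most of the records produces exactly the lone, long-running reducer seen in the first stage.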
33. Interactive (DRAM-centric) Processing Systems
Loading data into DRAM makes processing fast!
Examples: Spark, Impala, 0xdata, …, [SAP HANA], …
Streaming systems (Storm, DataTorrent) may be similar
Need to increase YARN container memory size
34. Hive + Interactive: Watch Out for Container Size
Caution: larger YARN container settings for interactive
jobs may not be right for batch systems like Hive
Container size: needs to combine vcores and memory:
yarn.scheduler.maximum-allocation-vcores
yarn.nodemanager.resource.cpu-vcores ...
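
Why container size has to consider vcores and memory together can be illustrated with a short calculation. The node capacity and the two container shapes below are assumed values, and the sketch presumes a scheduler configuration that accounts for both resources; it is not a description of any particular cluster.

```python
def containers_per_node(node_mem_mb, node_vcores, container_mem_mb, container_vcores):
    """Containers that fit on a node, limited by the scarcer resource."""
    by_memory = node_mem_mb // container_mem_mb
    by_vcores = node_vcores // container_vcores
    return min(by_memory, by_vcores)


# Assumed node capacity (what the NodeManager advertises to YARN).
NODE_MEM_MB = 65536
NODE_VCORES = 16

if __name__ == "__main__":
    # Hypothetical batch shape: small Hive task containers.
    print(containers_per_node(NODE_MEM_MB, NODE_VCORES, 2048, 1))
    # Hypothetical interactive shape: large DRAM-centric containers.
    print(containers_per_node(NODE_MEM_MB, NODE_VCORES, 24576, 8))
```

With these numbers the batch shape is limited by vcores (16 containers, although memory alone would allow 32), while the interactive shape fits only 2 containers per node, so settings tuned for one workload can waste capacity for the other.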
35. Hive + Interactive: Watch Out for Fragmentation
Attempting to schedule interactive systems and batch
systems like Hive may result in fragmentation
Interactive systems may require all-or-nothing
scheduling
Batch jobs with small tasks may starve interactive jobs
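
The fragmentation problem can be shown with a toy model. The round-robin placement, the node sizes, and the container sizes are simplifying assumptions for illustration and do not reproduce YARN's actual scheduling logic.

```python
def spread_tasks(free_mb, task_mb, count):
    """Place up to `count` tasks round-robin; return free memory per node."""
    free = list(free_mb)
    placed = 0
    cursor = 0
    while placed < count:
        for step in range(len(free)):
            node = (cursor + step) % len(free)
            if free[node] >= task_mb:
                free[node] -= task_mb
                placed += 1
                cursor = node + 1
                break
        else:
            break  # no node can hold another task
    return free


def can_gang_schedule(free_mb, container_mb, needed):
    """All-or-nothing check: are `needed` containers available right now?"""
    slots = sum(f // container_mb for f in free_mb)
    return slots >= needed


CLUSTER = [32768] * 4  # assumption: four nodes with 32 GiB free each

if __name__ == "__main__":
    print(can_gang_schedule(CLUSTER, 24576, 2))    # empty cluster
    after_batch = spread_tasks(CLUSTER, 2048, 20)  # 20 small batch tasks
    print(after_batch, sum(after_batch))
    print(can_gang_schedule(after_batch, 24576, 2))
```

After the batch tasks land, the cluster still has 90112 MB free in total, well above the 49152 MB the interactive job needs, yet no single node can hold a 24576 MB container, so the all-or-nothing request cannot be satisfied.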
36. Hive + Interactive: Watch Out for Fragmentation
Solutions for fragmentation…
Reserve interactive nodes before starting batch jobs
Reduce interactive container size (if the algorithm permits)
Node labels (YARN-796) and gang scheduling (YARN-624)
48. Challenges to the Hadoop Community
Hive + Hadoop debugging can get very complex
• Sifting through many logs and screens
• Automatic transmission versus manual transmission
Static partitioning induced by the Java Virtual Machine has
benefits but also creates challenges.
Where there are difficulties, there’s opportunity:
• Better tooling, instrumentation, integration of logs/metrics
YARN still evolving into an operating system
Just starting to build real multitenancy into Hadoop.
Hadoop as a Service: aggregate and share expertise