Wei's notes on Hadoop resource awareness


Published in: Business, Technology
  • Source:http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/

    1. Wei’s Notes on Resource Awareness
       March 2011
    2. Example workloads
       IO-bound:
         • Indexing
         • Searching
         • Grouping
         • Decoding/decompressing
         • Data importing and exporting
       CPU-bound:
         • Machine learning
         • Complex text mining
         • Natural language processing
         • Feature extraction
    3. IO- or CPU-intensive?
       How do we judge whether a job is IO- or CPU-intensive?
         • Simplest option: let the user specify.
         • Otherwise:
             • Does it make more sense to look for the pattern at the job level or at the task level?
             • Could a job be CPU-intensive overall while its reduce tasks are IO-intensive?
    4. Goal
       Make task/job placement resource-aware.
       Proposal: provide a profiling mechanism that periodically quantifies demand and supply per job, per task type, and per machine, like a 3D score sheet. Any scheduler could adopt the score sheet generically and assign slots/tasks based on the weighted task/slot.
       Score sheet axes: Job_TaskType, time, machine
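The 3D score sheet proposed above could be held in a structure like the following. This is only a minimal sketch: the class name `ScoreSheet`, its methods, and the convention that a higher score means a better placement are all assumptions made for illustration, not part of the slides.

```python
class ScoreSheet:
    """Hypothetical 3D score sheet indexed by (job, task type), machine, and time."""

    def __init__(self):
        # (job_id, task_type, machine) -> {timestamp: score}
        self._scores = {}

    def record(self, job_id, task_type, machine, timestamp, score):
        """Store the score computed for one placement at one profiling tick."""
        key = (job_id, task_type, machine)
        self._scores.setdefault(key, {})[timestamp] = score

    def latest(self, job_id, task_type, machine):
        """Return the most recent score for a placement, or None if unprofiled."""
        history = self._scores.get((job_id, task_type, machine))
        if not history:
            return None
        return history[max(history)]

    def best_machine(self, job_id, task_type):
        """Return the machine with the highest latest score (assumed: higher is better)."""
        candidates = {
            machine: self.latest(job, ttype, machine)
            for (job, ttype, machine) in self._scores
            if (job, ttype) == (job_id, task_type)
        }
        return max(candidates, key=candidates.get) if candidates else None
```

A scheduler would call `record` on every profiling period and consult `best_machine` when a slot frees up; how the scores themselves are computed is left open here, as it is in the slides.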
    5. Proposed scheme
         1) Quantify resource capacities at cluster start.
         2) Quantify machine/network variables periodically.
         3) Profile the resource demand of tasks/jobs whenever a job is submitted, the first mapper task finishes, the mappers are done, or the first reducer task finishes.
         4) Periodically assign a score per job, per task type, and per possible machine placement (all slots on a given machine are homogeneous), based on the profiles obtained in steps 1, 2 and 3.
    6. Variables
       * Traffic on the link from which a given node has to transfer data.
    7. Idle Cluster: 1 Task - M Slots
       Policy (ignores network IO; picks a node only, no scoring; ONLY for brainstorming):
         List<Node> nodes = all nodes s.t.
             availability_io > demand_io && availability_cpu > demand_cpu
         if nodes.size() == 1
             DONE!
         else if nodes.size() > 1
             for each node in nodes  // try to balance IO usage and CPU usage on a machine
                 io_cpu_dist = dist(availability_io - demand_io, availability_cpu - demand_cpu)
             pick the node with min(io_cpu_dist)
             DONE!
         else if nodes.size() == 0
             for each node in the cluster  // no node satisfies both demands
                 shortage = dist(availability_io, demand_io) + dist(availability_cpu, demand_cpu)
             pick the node with min(shortage)
             DONE!
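The picking policy above can be sketched in Python as follows. The slides do not define `dist`; this sketch assumes it is the absolute difference between two numbers, and it assumes IO and CPU figures are already on a comparable scale. The `Node` class and `pick_node` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    availability_io: float
    availability_cpu: float


def pick_node(nodes, demand_io, demand_cpu):
    """Pick a node for one task on an idle cluster (no network IO, no scoring)."""
    if not nodes:
        raise ValueError("no nodes to choose from")

    # Nodes that can satisfy both the IO and the CPU demand.
    fitting = [
        n for n in nodes
        if n.availability_io > demand_io and n.availability_cpu > demand_cpu
    ]
    if len(fitting) == 1:
        return fitting[0]
    if fitting:
        # Several candidates: prefer the one whose leftover IO and CPU are
        # closest to each other, to keep usage balanced on the machine.
        return min(
            fitting,
            key=lambda n: abs(
                (n.availability_io - demand_io) - (n.availability_cpu - demand_cpu)
            ),
        )
    # No node fits: fall back to the node with the smallest total distance
    # between availability and demand (dist assumed to be |a - b|).
    return min(
        nodes,
        key=lambda n: abs(n.availability_io - demand_io)
        + abs(n.availability_cpu - demand_cpu),
    )
```

Note that with `dist` taken as an absolute difference, the fallback branch also penalizes a node that has a large surplus in one resource; whether that is intended is not stated in the slides.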
    8. Busy Cluster: 1 Slot - M Tasks
       Closer to the usage pattern of production clusters.
       The algorithm is similar to the idle-cluster case, and the same algorithm can be extended to assign scores.
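One way to read "similar algo as idle" is to mirror the idle-cluster policy: fix the free slot's availability and rank the pending tasks instead of the nodes. The sketch below does that. The names `Task` and `rank_tasks`, the use of absolute difference for `dist`, and the rule that fitting tasks always rank ahead of non-fitting ones are assumptions, not something the slides specify.

```python
from typing import NamedTuple


class Task(NamedTuple):
    name: str
    demand_io: float
    demand_cpu: float


def rank_tasks(tasks, availability_io, availability_cpu):
    """Order pending tasks for one free slot, best candidate first."""

    def key(task):
        slack_io = availability_io - task.demand_io
        slack_cpu = availability_cpu - task.demand_cpu
        if slack_io > 0 and slack_cpu > 0:
            # Task fits: a smaller IO/CPU imbalance after placement ranks higher.
            return (0, abs(slack_io - slack_cpu))
        # Task does not fit: a smaller total shortage ranks higher.
        return (1, abs(slack_io) + abs(slack_cpu))

    return sorted(tasks, key=key)
```

The sort key doubles as a crude score per task, which is one possible way to extend the picking policy into a scoring one.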
    9. Limitations
         • The score sheet only has scores for running tasks (extended to other tasks of the same job and the same task type).
         • It doesn’t benefit the very first mapper task or the very first reducer task.
    10. Measurement & Quantification
        Profile a task type of a job by sampling.
        How do we measure the IO and CPU of a given machine at a given time?
          • Availability = Capacity - (sum of resource consumption of running tasks). But how do we obtain Capacity?
          • Or better: Availability = (sum of resource consumption of running tasks) * (1 / usage_fraction - 1), where usage_fraction is the machine's usage percentage expressed as a fraction.
            * This availability is based on the average demand of the currently running tasks, so step 1 in the proposed scheme could potentially be skipped. Well… that step could still come in handy when placing the very first task.
        How do we normalize IO and CPU against each other?
          • Use percentages? Then demands have to be normalized with the same multipliers, for IO and CPU respectively.
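The two availability formulas agree when the usage fraction equals consumption divided by capacity: then Capacity = consumption / usage_fraction, and Capacity - consumption = consumption * (1 / usage_fraction - 1). A minimal sketch, with hypothetical function names and made-up numbers used only to check the arithmetic:

```python
def availability_from_capacity(capacity, consumptions):
    """Availability = Capacity - sum of consumption of running tasks."""
    return capacity - sum(consumptions)


def availability_from_usage(consumptions, usage_fraction):
    """Availability = sum of consumption * (1 / usage_fraction - 1).

    usage_fraction is the machine's current usage as a fraction in (0, 1];
    no capacity figure is needed.
    """
    if not 0 < usage_fraction <= 1:
        raise ValueError("usage_fraction must be in (0, 1]")
    return sum(consumptions) * (1.0 / usage_fraction - 1.0)


def normalize(value, multiplier):
    """Scale a raw IO or CPU figure by its resource's multiplier.

    Assumption: the multiplier is 1 / capacity, so results are fractions.
    The same multiplier must be applied to demand and to availability.
    """
    return value * multiplier
```

For example, two running tasks consuming 20 and 30 units on a machine at 25% usage leave 150 units available under either formula, given a capacity of 200 units. The second formula breaks down when nothing is running (usage of zero), which is the very-first-task case the slide mentions.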