A few months ago, one of our projects was in this exact situation. What was worse was that we had multiple users needing to execute relatively simple and short running jobs in order to meet mini-milestones on their part of the project. We had to find a solution.
Requirements and Validation
We came up with the following requirements and validation criteria concerning the behavior of how jobs should be scheduled. In particular, how to prevent a small job from being starved while a very large job is already in execution, as well as, giving these small jobs 100% of the cluster resources when the resources are free because we definitely don't want to now impose some new limit on a small job.
1) There should be at most one high resource intensive job running, with enough capacity to allow less intensive jobs to run along side it.
2) Should be able to configure separate queues so that jobs executed on one queue have a higher allocation of resources.
3) Should be able to identify a specific queue that is used when a job is executed.
4) If one user launches a job, they should be able to use 100% of the cluster, but if a different user launches a job the first job's utilization of the cluster should be reset.
5) Ability to configure a queue so that all users of the queue get equal access to cluster resources. Multiple users can execute jobs simultaneously and will always be given an equal share of resources until the point at which users would get less than some percentage of the cluster. At that point, jobs will have to wait before execution.
1) Run a ("large") job that uses 100% of the cluster (or would use 100% of the cluster under the current, default config).
2) Next, run a ("small") job that only requires e.g. ~20% of the cluster.
3) Ensure the second job can be run and is not stuck waiting for the first job to finish.
4) Attempt to run another ("large") job.
5) Ensure this second ("large") job does not run until the first ("large") job finishes.
Capacity Scheduler configurations
Here is the default configuration for capacity-scheduler:
yarn.scheduler.capacity.maximum-am-resource-percent=0.2 yarn.scheduler.capacity.maximum-applications=10000 yarn.scheduler.capacity.node-locality-delay=40 yarn.scheduler.capacity.root.acl_administer_queues=* yarn.scheduler.capacity.root.capacity=100 yarn.scheduler.capacity.root.default.acl_administer_jobs=* yarn.scheduler.capacity.root.default.acl_submit_jobs=* yarn.scheduler.capacity.root.default.capacity=100 yarn.scheduler.capacity.root.default.maximum-capacity=100 yarn.scheduler.capacity.root.default.state=RUNNING yarn.scheduler.capacity.root.default.user-limit-factor=1 yarn.scheduler.capacity.root.queues=default yarn.scheduler.capacity.root.unfunded.capacity=50
and here is our 70-30, dual-queue Capacity Scheduler configuration for capacity-scheduler:
yarn.scheduler.capacity.maximum-am-resource-percent=0.2 yarn.scheduler.capacity.maximum-applications=10000 yarn.scheduler.capacity.node-locality-delay=40 yarn.scheduler.capacity.root.acl_administer_queues=* yarn.scheduler.capacity.root.big.acl_administer_jobs=* yarn.scheduler.capacity.root.big.acl_submit_jobs=* yarn.scheduler.capacity.root.big.capacity=70 yarn.scheduler.capacity.root.big.maximum-applications=1 yarn.scheduler.capacity.root.big.maximum-capacity=100 yarn.scheduler.capacity.root.big.state=RUNNING yarn.scheduler.capacity.root.big.user-limit-factor=1.25 yarn.scheduler.capacity.root.capacity=100 yarn.scheduler.capacity.root.default.acl_administer_jobs=* yarn.scheduler.capacity.root.default.acl_submit_jobs=* yarn.scheduler.capacity.root.default.capacity=30 yarn.scheduler.capacity.root.default.maximum-capacity=100 yarn.scheduler.capacity.root.default.minimum-user-limit-percent=50 yarn.scheduler.capacity.root.default.state=RUNNING yarn.scheduler.capacity.root.default.user-limit-factor=2 yarn.scheduler.capacity.root.queues=default,big
The first four requirements are satisfied. For requirement three, if a user needs to submit a long running, highly intense job, they can use the queue named "big". To do this, include -Dmapred.job.queue.name=big along with whatever is the job execution command. For example, this will run a job in the queue named "big":
If this additional java argument is not specified, a users job just goes to the queue named "default" and everything just works as it did before for them. For example, this will run a job in the queue named "default":yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-220.127.116.11.0.6.0-101.jar pi -Dmapred.job.queue.name=big 20 1000000009
Finally, the last requirement describes how the smaller queue (or the "30 in "70-30") behaves. Therefore, requirements are satisfied.yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-18.104.22.168.0.6.0-101.jar pi 20 1000000009
The dual-queue configuration can be validated using the steps defined in an earlier section. When the first large job is submitted, 100% of the cluster will be used. After submitting the small job (in step 2), when a portion of the big job completes and releases a slot back to the cluster, the small job will get that slot and will begin execution. In other words, the small job does not preempt the large job in order to immediately obtain a slot to use. The small job must wait until the next slot frees up. Validation succeeds.
Notes on Capacity Scheduler Configuration Directives
I found it difficult to find documentation about the Capacity Scheduler configuration entries... or at least more resources than just a short explanation or that were sometimes ambiguous or hard to understand for our team.
Here are the docs for Capacity Scheduler configuration entries...
I discovered that the Capacity Scheduler configuration entry containing "unfunded.capacity" (see the last line in the default Capacity Scheduler box, above) means nothing. There was actually a ticket to clean this up (https://issues.apache.org/jira/browse/AMBARI-6269). So you can safely remove this entry. In short, if you're wondering what it means-it says that the capacity of the queue named "unfunded" should be set to 50 (percent of the parent queue's capacity). But there is no queue defined to be named "unfunded". So it means and does nothing.
Here are the docs for Capacity Scheduler configuration entries...
Here are a couple of important Capacity Scheduler configurations and their descriptions...
- The multiple of the queue capacity which can be configured to allow a single user to acquire more resources. By default this is set to 1 which ensures that a single user can never take more than the queue's configured capacity irrespective of how idle the cluster is. Value is specified as a float. 
- If you're running all the job from the same user, by default, you can't take more than the value of the queue. This behavior can be modified by this setting in
- Each queue enforces a limit on the percentage of resources allocated to a user at any given time, if there is demand for resources. The user limit can vary between a minimum and maximum value. The the former (the minimum value) is set to this property value and the latter (the maximum value) depends on the number of users who have submitted applications. For e.g., suppose the value of this property is 25. If two users have submitted applications to a queue, no single user can use more than 50% of the queue resources. If a third user submits an application, no single user can use more than 33% of the queue resources. With 4 or more users, no user can use more than 25% of the queues resources. A value of 100 implies no user limits are imposed. The default is 100. Value is specified as a integer. 
Notes on Fair Scheduler
YARN can be configured to use a different scheduling module altogether. If it turns out that Capacity Scheduler doesn't work for you or maybe you want to try something else, have a look at the Fair Scheduler.