To maximize concurrency, all threads should be programmed to complete their work at the same time. Balancing the load among threads requires the programmers to predict the latency of each task, which is often impossible due to unpredictable OS/hardware effects. Consequently, programmers split the work into small tasks and use work-queues to distribute the work dynamically. Work-queues exist in many programming paradigms like Grand Central Dispatch, Intel TBB, Cilk, Open MP, etc. While work-queues improve load-balancing, they introduce the overhead of adding and removing tasks to and from the work-queue. Thus, if each individual task is too small, the work-queue overhead becomes prohibitive and if its too long then there is risk of imbalance. This post (1) analyzes these trade-offs, (2) provides a method for choosing the best task size at run-time, and (3) explains some recent advances in work-queues that minimize their overhead.