Parallel programming consists of four distinct phases: finding the parallelism, writing the code, debugging the code, and optimizing the code. In my opinion, frameworks like Apple’s Grand Central Dispatch and Intel’s TBB only help with writing the code; they do not help with finding parallelism, debugging, or optimization (e.g., diagnosing false sharing, thread waiting, etc.). I think that the difficulty of finding the parallelism, which can be an insurmountable barrier for many inexperienced parallel programmers, is often underestimated. In this post, I try to explain this challenge using a couple of parallel programming puzzles.
Whenever I discuss the difficulty of parallel programming, I am always informed about new programming languages and frameworks that are supposed to make parallel programming a no-brainer. While these paradigms ease the work of specifying the parallelism, I feel that they do not solve the other half of the problem: how a programmer finds the parallel tasks in a program in the first place. This can often be very challenging. My two challenge kernels are as follows:
A simple photo filter
My first example is an image-processing kernel used to smooth pixels in an image. Similar filters exist in many photo-editing programs. Graphics engines also require similar filters to smooth the light around the edge of a light source. The code and a graphical illustration of the kernel are as follows:
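Since the original listing is not reproduced here, the following is a representative stand-in of my own: an exponential (IIR) smoothing pass in which every output pixel depends on the previously smoothed value. The function name and the `alpha` parameter are my own illustration, not the original code, but the loop-carried dependence is the same.

```python
def smooth(pixels, alpha=0.5):
    # Illustrative stand-in for the kernel: an exponential smoothing pass.
    # Each output depends on the previously smoothed value, so iteration i
    # cannot start before iteration i-1 finishes: a loop-carried dependence.
    out = [0.0] * len(pixels)
    prev = pixels[0]
    for i, p in enumerate(pixels):
        prev = alpha * p + (1.0 - alpha) * prev
        out[i] = prev
    return out
```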
This kernel is very simple, but it is impossible to parallelize as-is because every single operation in the kernel depends on the previous operation (as shown by the red line). For completeness, I must add that domain experts often parallelize this kernel by allowing it to produce “incorrect” results. While this solution is acceptable in some domains, the real challenge is to develop a parallel version that produces correct results. Any suggestions?
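To make the “incorrect” workaround concrete, here is a hedged sketch of the block-parallel approach, assuming an exponential smoothing filter as a stand-in for the kernel (all names here are my own illustration). Each block restarts the dependence chain at its own first pixel, so values near the block boundaries differ from the sequential result.

```python
from concurrent.futures import ThreadPoolExecutor

def smooth_block(pixels, alpha=0.5):
    # Sequentially smooth one block (exponential filter as an example).
    out = []
    prev = pixels[0]
    for p in pixels:
        prev = alpha * p + (1.0 - alpha) * prev
        out.append(prev)
    return out

def smooth_parallel(pixels, alpha=0.5, blocks=4):
    # Break the serial dependence at block boundaries: each block is
    # smoothed independently, so the output near the seams is "incorrect"
    # relative to the sequential version -- the trade the experts accept.
    n = max(1, len(pixels) // blocks)
    chunks = [pixels[i:i + n] for i in range(0, len(pixels), n)]
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda c: smooth_block(c, alpha), chunks)
    return [v for block in results for v in block]
```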
Cache of Database Tables
Some SQL databases keep track of the tables that are currently open in a centralized cache. For every open table, the cache contains the table’s metadata and a list of accessors, i.e., a list of transactions (synonymous with threads in most databases) that are currently accessing the table. Conceptually, the cache looks as follows:
In this picture, Thread 0 is accessing Tables A and D. Thread 1 is accessing only Table C, …
The operation of this cache can be described using the following pseudocode.
Begin Database Transaction:
    For every table needed by the transaction:
        If the table is in the cache:
            Add the thread ID to the accessors list of the table
        If the table is not in the cache:
            Open the table
            Add the table to the cache
            Add the thread ID to the accessors list of the table
End Database Transaction:
    For every table needed by the transaction:
        Delete the transaction from the accessors list of the table
        If the accessors list is empty after the deletion:
            Remove the table from the cache
            Close the table
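Translated into straight-line code (single-threaded for now; the dictionary-of-sets representation and the function names are my own), the two operations look like this:

```python
# cache maps table name -> set of accessor thread IDs (my own representation
# of the accessors list; opening/closing the table is left as a comment).

def begin_transaction(cache, thread_id, table_names):
    for name in table_names:
        if name not in cache:
            # "Open the table" would happen here.
            cache[name] = set()
        cache[name].add(thread_id)

def end_transaction(cache, thread_id, table_names):
    for name in table_names:
        accessors = cache[name]
        accessors.discard(thread_id)
        if not accessors:
            # "Close the table" would happen here.
            del cache[name]
```

Mirroring the picture above: after Thread 0 opens Tables A and D and Thread 1 opens Table C, the cache holds three tables, and Table D disappears as soon as Thread 0 finishes with it.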
This data structure requires updates along the tables dimension as well as the accessors dimension. Hence, it is implemented as a two-dimensional linked data structure like the following:
Since this data structure is shared by multiple threads, it must be made thread-safe. One option is to protect all access to the structure with a single coarse-grain lock. This solution is functionally correct but performs terribly when the database has a large number of threads. The obvious alternative is fine-grain locking. However, since the structure uses links in both dimensions, multiple locks must be acquired and released in a nested fashion on every access. While this has the potential to expose some parallelism, the overhead of the lock operations outweighs the benefit. Any suggestions on how to parallelize this?
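For reference, the coarse-grain option can be sketched as follows (the class and method names, and the dictionary-of-sets representation, are my own). Every operation funnels through one lock, which is exactly why it scales poorly: transactions touching unrelated tables still serialize on it.

```python
import threading

class TableCache:
    # Coarse-grained sketch: a single lock serializes every cache operation,
    # so transactions on unrelated tables still contend with each other.
    def __init__(self):
        self._lock = threading.Lock()
        self._tables = {}  # table name -> set of accessor thread IDs

    def begin_transaction(self, thread_id, table_names):
        with self._lock:
            for name in table_names:
                if name not in self._tables:
                    # "Open the table" would happen here.
                    self._tables[name] = set()
                self._tables[name].add(thread_id)

    def end_transaction(self, thread_id, table_names):
        with self._lock:
            for name in table_names:
                accessors = self._tables[name]
                accessors.discard(thread_id)
                if not accessors:
                    # "Close the table" would happen here.
                    del self._tables[name]
```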
Looking forward …
The point I am making here is that finding parallelism can sometimes be much harder than specifying it. What I would really like is feedback from those supporting the new paradigms. Perhaps I need to be corrected… Note: if you have examples of other kernels that are painfully hard to parallelize, please share them with us so awareness can be raised.