Jun 14, 2011
 

Parallel programming consists of four distinct phases: finding the parallelism, writing the code, debugging the code, and optimizing the code. In my opinion, frameworks like Apple’s Grand Central Dispatch and Intel’s TBB only help with writing the code; they do not help with finding the parallelism, debugging, or optimization (e.g., fixing false sharing, thread waiting, etc.). I think that the difficulty of finding the parallelism, which can be an insurmountable barrier for many inexperienced parallel programmers, is often underestimated. In this post, I try to explain this challenge using a couple of parallel programming puzzles.

Whenever I discuss the difficulty of parallel programming, I am always told about new programming languages and frameworks that are supposed to make parallel programming a no-brainer. While these paradigms ease the work of specifying the parallelism, I feel that they do not solve the other half of the problem: how a programmer finds the parallel tasks in a program in the first place. This can often be very challenging. My two challenge kernels are as follows:

A simple photo filter

My first example is an image processing kernel used to smooth pixels in an image. Similar filters exist in many photo editing programs. Graphics engines also use similar filters to smooth the light around the edges of a light source. The code and a graphical illustration of the kernel are as follows:

 

[Figure: code for the smoothing kernel and a graphical illustration of it; a red line marks how each operation depends on the previous one]

This kernel is very simple, yet it is impossible to parallelize as-is because every operation in it depends on the previous operation (as shown by the red line). For completeness, I must add that domain experts often parallelize this kernel by allowing it to produce “incorrect” results. While this solution is acceptable in some domains, the real challenge is to develop a parallel version that produces correct results. Any suggestions?
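Roughly, the kernel is a recurrence of the following shape (a minimal C sketch; the function and array names are illustrative, and the exact code is the one shown in the figure above):

    /* Sequential smoothing pass: each output is the average of the current
     * pixel and the previous output, so iteration j cannot start until
     * iteration j-1 has finished. */
    void smooth(const float *Pixel, float *Result, int N)
    {
        Result[0] = Pixel[0];
        for (int j = 1; j < N; j++)
            Result[j] = (Result[j - 1] + Pixel[j]) / 2.0f;
    }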

Cache of Database Tables

Some SQL databases keep track of the tables that are currently open in a centralized cache. For every open table, the cache contains the table’s metadata and a list of accessors, i.e., a list of the transactions (synonymous with threads in most DBs) that are currently accessing the table. Conceptually, the cache looks as follows:

[Figure: the table cache, showing the open tables and the threads currently accessing each one]

In this picture, Thread 0 is accessing Tables A and D. Thread 1 is accessing only Table C, …

The working of this cache can be described using the following pseudocode:

Begin Database Transaction
  For every table needed by the transaction:
    If the table is in the cache:
      Add the thread ID to the accessors list of the table

    If the table is not in the cache:
      Open the table
      Add the table to the cache
      Add the thread ID to the accessors list of the table

End Database Transaction
  For every table needed by the transaction:
    Delete the transaction from the accessors list of the table

    If the accessors list is empty after the deletion:
      Remove the table from the cache
      Close the table

This data structure must support updates along the tables dimension as well as the accessors dimension. Hence, it is implemented as a two-dimensional linked data structure like the following:

[Figure: the two-dimensional linked structure, with links along both the open-table dimension and the accessors dimension]
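A minimal C sketch of such a structure (the type and field names are illustrative, not taken from any particular database) is:

    /* One node per (table, transaction) pair.  Each node is linked both along
     * its table's accessors list and along its thread's list of open tables,
     * which gives the two-dimensional structure shown above. */
    struct accessor {
        int              thread_id;        /* transaction using the table        */
        struct accessor *next_for_table;   /* next accessor of the same table    */
        struct accessor *next_for_thread;  /* next table used by the same thread */
    };

    struct table_entry {
        int                 table_id;
        /* ... table metadata ... */
        struct accessor    *accessors;     /* transactions currently using the table */
        struct table_entry *next_table;    /* next open table in the cache           */
    };

    struct table_cache {
        struct table_entry *open_tables;   /* head of the list of open tables */
    };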

Since this data structure is shared by multiple threads, it must be made thread safe. One option is to protect all accesses to the structure with a single coarse-grain lock. This solution is functionally correct but performs terribly when the database has a large number of threads. The obvious alternative is fine-grain locking. However, since the structure uses links in both dimensions, multiple locks must be acquired and released in a nested fashion on every access. While this has the potential to expose some parallelism, the overhead of the lock operations outweighs the benefit. Any suggestions on how to parallelize this?
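For reference, the coarse-grain option discussed above boils down to something like this (a minimal pthreads sketch; the function bodies are elided, and the types are the illustrative ones from the sketch above):

    #include <pthread.h>

    struct table_cache;                    /* as in the structure sketch above */

    static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Every open and close serializes on the single cache_lock: correct, but
     * the lock becomes the bottleneck as the number of threads grows. */
    void cache_open_table(struct table_cache *cache, int table_id, int thread_id)
    {
        pthread_mutex_lock(&cache_lock);
        /* find table_id in the cache; open and insert it if absent;
         * add thread_id to its accessors list */
        pthread_mutex_unlock(&cache_lock);
    }

    void cache_close_table(struct table_cache *cache, int table_id, int thread_id)
    {
        pthread_mutex_lock(&cache_lock);
        /* remove thread_id from the accessors list; if the list becomes empty,
         * unlink the table from the cache and close it */
        pthread_mutex_unlock(&cache_lock);
    }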

Looking forward …

The point I am making here is that finding parallelism can sometimes be much harder than specifying parallelism. What I would really like is feedback from those supporting the new paradigms. Perhaps I need to be corrected… Note: if you have examples of other kernels that are painfully hard to parallelize, please share them with us so awareness can be raised.

  4 Responses to “Parallel Programming: Why new frameworks only solve a part of the problem?”

  1. Is that first example really what you want? I.e., pixel 0 affects pixel j with strength 2^-j, and pixel j+1 does not affect pixel j at all.

    Removing the dependencies gives:

    Result[j] = Pixel[0] * 2^-j + SUM over i=1 to j (Pixel[i] * 2^(i-j-1))

    This looks like the classic carry chain from adder design. So, one could try a block-based approach. Give each processor a block of pixels to process.

    Assuming a given processor starts at pixel S, the result of a pixel in its block can be given by:

    Result[j] = Result[S-1] * 2^(S-j-1) + SUM over i=S to j (Pixel[i] * 2^(i-j-1))

    For each pixel, the second term (the SUM) is independent of any pixels outside the block, and can thus be calculated in parallel by the processors. Then the values of the boundary pixels (such as S-1) ripple across the processors. Finally, the processors calculate, in parallel, the final result for each pixel they own. Of course, suitable cache-aligned data structures are required.

    Since the producer-consumer relationships are very simple (each processor produces exactly one result to share and each result is consumed by exactly one processor) there should be no locking required.

    In fact, if instead of sharing Result[S-1], one shares Result[S-1]/2, one could initialize the communication variables to maxint so that a consumer can tell whether the value is valid without needing any synchronization at all. (Of course, FP values could be initialized to NaN.)
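    A minimal C sketch of this block-based scheme (using OpenMP for the two parallel phases; it assumes N is a multiple of the number of blocks, and all names are illustrative) might look like:

    #include <math.h>
    #include <stdlib.h>

    /* Block-based smoothing.  Phases 1 and 3 run in parallel across blocks;
     * phase 2 ripples one boundary value per block.  Assumes N % nblocks == 0. */
    void smooth_blocked(const float *Pixel, float *Result, int N, int nblocks)
    {
        int B = N / nblocks;                               /* pixels per block */
        float *boundary = calloc(nblocks, sizeof *boundary);

        /* Phase 1: each block computes its local contribution
         *   Local[j] = SUM over i=S to j (Pixel[i] * 2^(i-j-1))
         * by running the recurrence with the incoming value taken as 0.
         * Block 0 has no incoming value, so its results are already exact. */
        #pragma omp parallel for
        for (int b = 0; b < nblocks; b++) {
            int S = b * B, E = S + B, j = S;
            float local = 0.0f;
            if (b == 0) {                  /* first pixel is copied, not averaged */
                local = Pixel[0];
                Result[0] = local;
                j = 1;
            }
            for (; j < E; j++) {
                local = (local + Pixel[j]) / 2.0f;
                Result[j] = local;         /* holds the block-local value for now */
            }
        }

        /* Phase 2: ripple the exact result of each block's last pixel across
         * blocks; only one value per block crosses a block boundary. */
        boundary[0] = Result[B - 1];
        for (int b = 1; b < nblocks; b++)
            boundary[b] = ldexpf(boundary[b - 1], -B) + Result[(b + 1) * B - 1];

        /* Phase 3: add the scaled boundary term Result[S-1] * 2^(S-j-1)
         * to every pixel of every block except the first. */
        #pragma omp parallel for
        for (int b = 1; b < nblocks; b++) {
            int S = b * B;
            for (int j = S; j < S + B; j++)
                Result[j] += ldexpf(boundary[b - 1], -(j - S + 1));
        }

        free(boundary);
    }

    As described above, the only serial part is the per-block ripple in phase 2, so its cost grows with the number of blocks rather than with the number of pixels.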

  2. For the DB cache one, why a linked list of all open tables? Seems you could at least make a simple hash based on table ID, with a linked list and a lock for each hash bucket…

    Efforts to find something better than that have so far proven fruitless.
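    A minimal sketch of this per-bucket locking idea (pthreads; the bucket count is illustrative, and table_entry is as in the structure sketch above) might look like:

    #include <pthread.h>

    #define NBUCKETS 64                      /* illustrative; tune to the workload */

    struct table_entry;                      /* as in the structure sketch above */

    struct cache_bucket {
        pthread_mutex_t     lock;            /* protects only this bucket's chain */
        struct table_entry *chain;           /* open tables whose IDs hash here   */
    };

    static struct cache_bucket buckets[NBUCKETS];

    void cache_init(void)
    {
        for (int i = 0; i < NBUCKETS; i++)
            pthread_mutex_init(&buckets[i].lock, NULL);
    }

    /* Transactions whose tables hash to different buckets no longer contend:
     * each open (and, symmetrically, each close) locks exactly one bucket. */
    void cache_open_table(int table_id, int thread_id)
    {
        struct cache_bucket *b = &buckets[(unsigned)table_id % NBUCKETS];
        pthread_mutex_lock(&b->lock);
        /* search b->chain for table_id; open and insert the table if absent;
         * then add thread_id to its accessors list */
        pthread_mutex_unlock(&b->lock);
    }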

  3. If you can have lock-free queues surely you can have lock-free lists?
