Jun 03, 2011
 

I have been busy constructing another parallel programming example this week. My target was to find a problem that can be explained in one sentence, coded in fewer than 10 lines of C, uses a work-queue, and is simple to parallelize but hard to optimize. The one I finally picked is a maze. I am presenting my first stab at it today, but it is still a work in progress. Feedback and suggestions are greatly appreciated.

Acknowledgement: I was taught this example by Prof. Yale Patt at UT Austin, who used it to teach how to write recursion in assembly language.

Problem: Find a path from the mouse to the cheese in a 2D maze.

Rules: The mouse cannot jump. The mouse cannot enter a black box. The mouse cannot go outside the maze.

[Figure: screenshot of an example maze showing the mouse and the cheese (Screen shot 2011-06-03 at 10.19.30 AM)]

 

Input: a text file which describes the maze. Black and white squares are represented by ‘1’ and ‘0’ respectively.

Output: None as such. We will only measure the time it takes to find the cheese.
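To make the input format concrete, here is a sketch of how such a file might be parsed. The function name, the flat-array layout (position = row * n + col), and the string-based interface are my own assumptions for illustration, not part of the original code:

```cpp
#include <string>
#include <vector>

// Hypothetical parser: converts the text of a maze file ('1' = black,
// '0' = white, one row per line) into a flat vector indexed as row * n + col.
std::vector<int> parse_maze(const std::string &text, int &n) {
    std::vector<int> cells;
    n = 0;
    bool row_has_cells = false;
    for (char c : text) {
        if (c == '0' || c == '1') {
            cells.push_back(c - '0');
            row_has_cells = true;
        } else if (c == '\n' && row_has_cells) {
            ++n;                  // count a completed row at each newline
            row_has_cells = false;
        }
    }
    if (row_has_cells)
        ++n;                      // final row without a trailing newline
    return cells;
}
```

With this layout, a box's integer id is exactly the kind of position encoding the code below stores in the work-queue.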

(Artificial) constraints to keep the code simple:

1. The maze is always a square
2. There is always at least one solution to the maze (hence the program terminates as soon as it finds the cheese)
3. The mouse is not allowed to cross into a box diagonally
4. Important: the input maze has all perimeter boxes black, which eliminates the need for bounds checking on every iteration
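For illustration, a hypothetical 6×6 input satisfying these constraints might look like this (the mouse and cheese positions are supplied separately; note the all-black perimeter):

```text
111111
100001
101111
101111
100001
111111
```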

Algorithm

 

There are different algorithms to solve this problem. I will use the following:

1: Solve(BOX):
2:   Terminate if BOX has the cheese
3:   For each neighbor in the North, West, East, and South directions:
4:     if the neighbor is NOT black and does NOT have a bread crumb:
5:       Drop a bread crumb in the neighbor
6:       Repeat the above steps for that neighbor

Consider the following example. We will start at box A1.

[Figure: screenshot of an example maze with boxes labeled A1, A2, B2, A3, B3, C2, etc. (Screen shot 2011-06-03 at 10.34.24 AM)]

 

Step 1: From A1, the algorithm will move to A2 because South is the only legal direction.

Step 2: The algorithm reaches B2 and A3. Now the algorithm has a choice: it can either process B2 or A3. Say it chooses B2.

Step 3: It will reach B3 and C2. Next, the algorithm can process A3 (left over from last time), B3, or C2. Say it chooses A3.

The algorithm continues to explore the maze in this fashion until it finds the box with the cheese.

Recursive Solution

Recursion is a natural way to solve such a problem. The idea is to call the function recursively for each of the neighbors of the current box. This code is a simple implementation of the above algorithm.

Note: I know this is more than ten lines of code, but I will blame that on C and its braces.



  1: void find_recursive(int position){ //position is current box’s position
  2:   if(solved) //solved is a global variable for termination
  3:     return;
  4:   if(position == end) // end is the box with the cheese
  5:     solved = true;
  6:   for(int dir = 0; dir < 4; dir++){ // exploring North, South, East, West
  7:     if(!maze[position+direction_offsets[dir]].val){
  8:         maze[position+direction_offsets[dir]].val = 1;
  9:         find_recursive(position+direction_offsets[dir]);
 10:     }
 11:   }
 12: }

Supplement code: The maze is declared as follows:

struct maze_type {
  int val;
};

maze_type *maze;

For reference: I have declared each element of the maze as a struct because I know I will later want to make each element cache-line aligned to eliminate false sharing.
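To make that reason concrete: wrapping the value in a struct makes it easy to later pad or align each element to a cache line, so two threads writing bread crumbs into neighboring boxes do not ping-pong the same line. Here is a sketch of what that later change might look like, assuming a 64-byte cache line and using C++11's alignas (manual padding with a char array would achieve the same effect); this is not yet part of the code above:

```cpp
// Hypothetical future version of maze_type: alignas(64) places each element
// on its own 64-byte cache line, so writes to neighboring elements by
// different threads cannot falsely share a line (assumes 64-byte lines).
struct alignas(64) maze_type_padded {
    int val;
};

static_assert(alignof(maze_type_padded) == 64,
              "each element starts on its own cache line");
```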

Finding parallelism in this maze

In my example above, the algorithm often had a choice of which box to explore next. These boxes can be explored concurrently. However, the exploration cannot be completely independent, because threads need to communicate: they must share both the work to be done and the bread crumbs that prevent redundant work.

 

Converting to a loop

I always find recursion harder to convert into parallel code, for two reasons. First, in recursion the work accumulates on the stack, which by definition is private to each thread; if we want the work to be distributed, we need a shared work-queue. Second, it is often easier to think of parallelism in terms of loops, since iterations can run in parallel. Let's re-write the above recursion using a loop and a work-queue.

For reference: Any recursion can be converted into a loop and vice-versa.


  1: void find_loop_slow(){
  2:   while(1){
  3:     if(solved)
  4:       break;
  5:
  6:     int position;
  7:     if(!WorkQ.empty()){
  8:       position = WorkQ.front();
  9:       WorkQ.pop_front();
 10:     } else {
 11:       assert(0);
 12:     }
 13:
 14:     if(position == end)
 15:       solved = true;
 16:
 17:     for(int dir = 0; dir < 4; dir++){
 18:       if(!maze[position+direction_offsets[dir]].val){
 19:         maze[position+direction_offsets[dir]].val = 1;
 20:         WorkQ.push_back(position+direction_offsets[dir]);
 21:       }
 22:     }
 23:   }
 24: }

Supplement information: For the WorkQ, I am using an STL deque<int>. Each entry contains the id of a maze position encoded in a 32-bit integer. The loop is kick-started by pushing the initial position onto the work-queue (just as the recursion is kick-started by calling the function with the mouse's initial position as its argument).

This is the same algorithm, except that instead of recursing, the thread loops until it finds the cheese. In every iteration, it pushes the positions of the legal neighboring boxes onto the WorkQ.

Note: This algorithm has become breadth-first instead of depth-first. I am "ignoring" this fact as it is not central to the example. If you prefer depth-first, you can use a work-stack instead of a work-queue.
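The change is tiny because std::deque supports both disciplines: popping from the back instead of the front turns the work-queue into a work-stack, while push_back stays the same. A minimal sketch of the difference (the helper function names are mine, for illustration):

```cpp
#include <deque>

// FIFO (breadth-first): take the oldest entry, as find_loop_slow does.
int pop_fifo(std::deque<int> &q) {
    int p = q.front();
    q.pop_front();
    return p;
}

// LIFO (depth-first): take the newest entry instead.
int pop_lifo(std::deque<int> &q) {
    int p = q.back();
    q.pop_back();
    return p;
}
```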

First draft of Parallel code

Converting this loop to parallel code is actually very simple. There are only two changes:

1. Make OpenMP (or another framework) spawn threads that run the find function. The code is as follows:

WorkQ.push_back(start);

#pragma omp parallel
find_parallel_slow_queue();

2. Make each iteration thread-safe by protecting shared data structures. There are two shared data structures here: the WorkQ and the maze elements themselves. Since there is no harm in accessing the maze and the WorkQ concurrently with each other, I will use two different named critical sections, QUEUE and MAZE, to protect the corresponding structures. Here is the code:


  1: void find_parallel_slow_queue_slow_maze(){
  2:   int tid = omp_get_thread_num();
  3:
  4:   while(1){
  5:     if(solved)
  6:       break;
  7:
  8:     int position;
  9:     bool empty;
 10:
 11: #pragma omp critical (QUEUE) // Protecting the WorkQ
 12:     {
 13:       empty = WorkQ.empty();
 14:       if(!empty){
 15:         position = WorkQ.front();
 16:         WorkQ.pop_front();
 17:       }
 18:     }
 19:
 20:     if(empty) {
 21:       continue;
 22:     }
 23:
 24:     if(position == end)
 25:       solved = true;
 26:
 27:     for(int dir = 0; dir < 4; dir++){
 28:       bool legal = false;
 29:       int maze_index = position+direction_offsets[dir];
 30:
 31:       #pragma omp critical (MAZE) // protecting the maze
 32:       if(!maze[maze_index].val){
 33:         maze[maze_index].val = 1;
 34:         legal = true;
 35:       }
 36:
 37:       if(legal)
 38: #pragma omp critical (QUEUE)
 39:         WorkQ.push_back(position+direction_offsets[dir]);
 40:
 41:     }
 42:   }
 43: }

This is a working parallel implementation of the maze. Now let's do some measurements.

Threads | Execution time (secs)
------- | ---------------------
1       | 4.98
2       | 5.20
4       | 8.24

 

The parallel version gets worse and worse with more threads. I blame the critical sections.

Side-note: Critical sections serialize execution and gain NO benefit from multiple threads. Worse, they incur cache misses on the lock variable and the protected data, which lengthens their execution as threads are added (see the first half of this paper for more details). The result: critical sections take longer to execute with more threads, which is why they are my number one suspect every time a program slows down as threads are added. They are particularly damaging when the code spends a large fraction of its time inside them, which is the case in our code.
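As a preview of the kind of fix atomic operations enable (a sketch of one candidate direction, not the code's current behavior): the MAZE critical section only protects a test-and-set of a single integer, which an atomic exchange can do without any lock. For example, using GCC's __sync_lock_test_and_set builtin:

```cpp
// Hypothetical replacement for the MAZE critical section.
// __sync_lock_test_and_set atomically writes 1 and returns the previous
// value, so a return of 0 means this thread was first to visit the box
// and may claim it.
bool try_drop_crumb(int *val) {
    return __sync_lock_test_and_set(val, 1) == 0;
}
```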

I will hold off on my optimizations and let people give me suggestions on how to improve this code…

Conclusions for this part

In my opinion, it is a good idea to start from code that you know how to write (e.g., starting with recursion for the maze) and generate some test cases before you enter unknown territory. Preparing your single-threaded code for parallelization can make the parallelization itself easy. Keep later optimizations in mind as you write the code (e.g., I used a struct for the maze array elements) because doing so reduces the chance of errors later.

Coming attractions…

I plan to take this example in two directions.

Some optimizations: I will explain atomic operations, lock-free queues, fine-grain locking, and more using this example.

Higher-level frameworks: This type of task parallelism is perfectly suited to frameworks like Cilk, Intel TBB, and Apple's Grand Central Dispatch. I will extend this program to compare and contrast those frameworks.

  7 Responses to “Parallel Programming: Parallelizing (somewhat) Complex Code (Part 1)”

  1. Hi Aater,

The last section of your article is in a scrolled area, which means the longer paragraphs are hard to get at. Can you remove the scrolling, please?

    I also noticed a minor typo in the comment of line 11 of the first draft parallel code: “Protenting” should be “Protecting”.

    Regards,

    Billy

  2. Since the work being done per job is so short, I’m guessing that putting the jobs into and removing them from the queue is taking most of the time here! Since that work has to be locked (unless the queue implementation is particularly clever) it’s not too surprising that adding threads slows everything down. (Incidentally, how big was the input maze that produced those timings?)

    The first thing I would try is to choose one of the valid search directions and continue the search from there, after adding any other valid directions to the work queue. Only go to the work queue for new work when you hit a dead end. That should reduce contention on the work queue considerably. Only if that failed to provide adequate performance would I think about other improvements.

    Of course (as you imply) you also need to use a more efficient mechanism (atomic operations, probably) to place and check for the breadcrumbs to prevent contention there.

    Incidentally, you may find this interesting, if you haven’t already seen it. It talks about non-blocking algorithms, a bit over my head IIRC but I found it interesting reading.

    • Hi Harry,

The maze was 4096×4096. I have done some measurements (not written up yet) that show there is very high contention for the queue, so I think I will need to make the queue faster and also reduce queue accesses.

The algorithm you suggest for reducing queue accesses sounds really good. I haven't tried anything like this yet, so I will make sure I incorporate it in my next post (and credit you).

Thanks for the pointer. I had not read that particular post and it's interesting. I think my queue will eventually have to be non-blocking (perhaps in Part 3 of this tutorial).

Thanks for reading and for the great suggestion!

      Aater

  3. The problem is: you haven’t actually added any parallelism at all.

The only useful work that the task does is to check "have I been here before" and "can I move in that direction", and then move there, and you have completely serialised this by putting it all into critical sections.

Making the atomic operations more efficient won't escape the fact that the work is serialised to start with.

    It seems like quite a challenging problem to get decent parallelisation from.

  4. I am still new to parallel programming, so please forgive me.

    You said “I have declared each element of the maze to be a struct because I know that I would later want to make each element cache line aligned to eliminate false-sharing.”

    I…don’t get it. How does making it a struct make each element cache line aligned?

For curiosity's sake, what was the execution time of your original algorithm?
