May 25, 2011
 

I confess. I have an ulterior motive behind this post. Eventually I want to write a parallel programming tutorial which demonstrates the performance trade-offs in parallel programs. Since the focus of that tutorial is on performance, I prefer using the parallel programming framework with the least syntactic distractions. I think I will choose Open MP because it seems to be the cleanest alternative for parallelizing regular for-loops. This post is to familiarize readers with Open MP before the tutorial so that the tutorial can focus solely on performance optimizations.

I compare pthreads and Open MP because they both produce more or less the same assembly code but their programmer interfaces are very different. Their similarity allows my comparison of the programmer interfaces to be an apples to apples comparison. An example is worth a thousand words …

Problem

Say our program consists of the following loop:

void sum_st(int *A, int *B, int *C){
  int end = 10000000;
  for(int i = 0; i < end; i++)
    A[i] = B[i] + C[i];
}

Executing sum_st 1000 times takes 12.61 seconds to run on my Intel Core2Quad. /*Update 5/29/11: Special thanks to Anonymous reader for catching a typo*/
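If you want to reproduce the measurement, a driver along the following lines is all it takes. This is just a sketch: the initialization values and the clock_gettime-based timer are my choices here, so adjust them to taste.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void sum_st(int *A, int *B, int *C);  /* the loop shown above */

int main(void){
  int size = 10000000;
  int *A = (int*) malloc(size * sizeof(int));
  int *B = (int*) malloc(size * sizeof(int));
  int *C = (int*) malloc(size * sizeof(int));
  for(int i = 0; i < size; i++){ B[i] = i; C[i] = 2 * i; }

  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for(int rep = 0; rep < 1000; rep++)   /* run the kernel 1000 times */
    sum_st(A, B, C);
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  printf("%.2f seconds\n", secs);

  free(A); free(B); free(C);
  return 0;
}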

Pthread solution

We can parallelize this loop such that different iterations can run in parallel. First, the pthread version:

#include <pthread.h>
#include <stdlib.h>

struct params {
  int *A;
  int *B;
  int *C;
  int tid;
  int size;
  int nthreads;
};

void *compute_parallel(void *_p){
  params *p      = (params*) _p;
  int tid        = p->tid;
  int chunk_size = (p->size / p->nthreads);
  int start      = tid * chunk_size;
  int end        = start + chunk_size;
  for(int i = start; i < end; i++)
    p->A[i] = p->B[i] + p->C[i];
  return 0;
}

void sum_mt(int *A, int *B, int *C){
  int nthreads = 4;
  int size = 10000000;
  pthread_t threads[nthreads]; //array to hold thread information
  params *thread_params = (params*) malloc(nthreads * sizeof(params));

  for(int i = 0; i < nthreads; i++){
    thread_params[i].A        = A;
    thread_params[i].B        = B;
    thread_params[i].C        = C;
    thread_params[i].tid      = i;
    thread_params[i].size     = size;
    thread_params[i].nthreads = nthreads;
    pthread_create(&threads[i], NULL, compute_parallel, (void*) &thread_params[i]);
  }

  for(int i = 0; i < nthreads; i++){
    pthread_join(threads[i], NULL);
  }
  free(thread_params);

}

There are four important concepts here:

pthread_create: This is the function that spawns new threads. It takes the function the new thread will start executing as its third argument and the input to that function as its fourth and last argument. The new thread is spawned as soon as pthread_create is called.

pthread_join: This function makes the master thread (which spawned the threads) wait for the spawned child threads. It is called once for each spawned thread.

struct params: Since pthread_create allows only one input argument to be passed to the thread, the input parameters are bundled into a struct called params and a pointer to the bundle is passed as the thread’s input argument.

The function compute_parallel: The pthread library requires the thread code to live in a separate function; it cannot be written inline, which is why compute_parallel exists. compute_parallel takes its six input parameters bundled in a single struct. Each thread first determines which portion of the array it needs to work on and then performs the operations. Note: to keep the code clean, I assume that the number of iterations is always a multiple of the number of threads (see the sketch below for a simple way to relax this).
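For example, the last thread can simply pick up the leftover iterations. This is not part of the code above, just one way to handle the remainder:

  int chunk_size = p->size / p->nthreads;
  int start      = tid * chunk_size;
  /* the last thread also takes the remainder when size % nthreads != 0 */
  int end        = (tid == p->nthreads - 1) ? p->size : start + chunk_size;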

Just for reference, the above pthread code runs in 3.20 seconds with four threads on my Intel Core2Quad.

 

Open MP

Open MP performs code-to-code transformations and also links libraries optimized for common parallel programming constructs. The programmer annotates the code using pragmas and the compiler converts the code into parallel code. To parallelize the same loop with Open MP, you can simply add a pragma before the loop as follows:

#pragma omp parallel for
for(int i = 0; i < 10000000; i++)
  A[i] = B[i] + C[i];

Upon encountering the pragma, the OpenMP compiler does the code transformation we did manually when using pthreads. The run time is 3.30 seconds, essentially the same as the pthread version.
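For completeness, here is the loop wrapped in a function (sum_omp is just an illustrative name; you do not even need to include omp.h unless you call OpenMP run-time functions such as omp_get_num_threads). With gcc, all it takes to enable the pragma is the -fopenmp flag:

void sum_omp(int *A, int *B, int *C){
  int end = 10000000;
  #pragma omp parallel for   /* the compiler splits the iterations across the threads */
  for(int i = 0; i < end; i++)
    A[i] = B[i] + C[i];
}

/* compile with: gcc -O2 -fopenmp sum.c */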

Conclusion

When handling simple loops, OpenMP and pthreads achieve similar speedups, but Open MP is many times easier to program.

Note 1: Open MP just makes it easier to write parallel code by hiding the ugly syntax. It does NOT help you in finding parallelism or figuring out the dependencies. Both of these are still the programmer’s burden.

Note 2: Open MP has several advanced features which can be used for more complex loops.
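For example, the reduction clause lets a loop safely accumulate into a shared variable, and the schedule clause controls how iterations are handed out to threads. The snippet below is a generic illustration, not part of the code above:

/* Illustration only: each thread accumulates a private partial sum and
   OpenMP combines them when the loop finishes. The dynamic schedule hands
   out chunks of 1024 iterations to whichever thread is free. */
long sum_reduce(int *A, int n){
  long total = 0;
  #pragma omp parallel for reduction(+:total) schedule(dynamic, 1024)
  for(int i = 0; i < n; i++)
    total += A[i];
  return total;
}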

Note 3: Open MP is not general enough to be used for all kinds of parallelism. It is best equipped with pragmas for the loop-level parallelism often found in compute-intensive workloads. pthreads is universal and can be used for any type of parallelism, e.g., transactional workloads like the MySQL database. Thus, pthreads will continue to prevail because of such workloads.

Stay tuned for the performance optimization tutorial. I will try to post it within a couple of days. Update: The optimization tutorial went online here.

Update 5/30/2011: @Thrund added an excellent point to this post via twitter: “OpenMP is portable, POSIX pthreads not always. We need open standards, not necessarily its implementations.”

  11 Responses to “Open MP vs pthreads”

  1. You’re comparing a high-level data-parallel programming model with a low-level hardware model, using as comparison a data-parallel example. The outcome is not a surprise. In order to do a meaningful comparison, you ought to compare two high-level models, such as OpenMP vs. MPI, and use at least two different examples, such as a data-parallel and a distributed code. By the way, if you’re looking for a unique framework in which to express your tutorial, Ateji PX has both data-parallelism, task parallelism, distribution and message-passing at the language level.

    • Hi Patrick, nice to hear from you again:) good points again!

      Re: comparison. I think it’s a very important comparison in fact. The intent here is to show how Open MP is easier to read/write for data-parallel workloads compared to vanilla pthreads. My argument is analogous to comparing C and assembly, and making a point that C is easier to read and write. I agree that the results are not surprising, but only for those who have already used them both. It can be useful for someone who hasn’t, and these are the people I targeted. I do see a lot of people using pthreads for data-parallel kernels, which is a waste of effort IMO.

      For other types of workloads, I would recommend different frameworks too and the articles didn’t draw any conclusions about those workloads (see my disclaimer in Note #3). I will try to cover a different style of workload next week.

      I agree with you that Open MP and MPI is an important comparison as well. I would also like to see a comparison like that.

      Re: Ateji PX, I pointed out in one of my earlier comments to you that I am reading up on it.

      Btw, the tutorial went up this morning (http://www.futurechips.org/tips-for-power-coders/writing-optimizing-parallel-programs-complete.html) feedback?

    • I’d like to add a couple of things here
      - MPI, IMHO, is at the same level of abstraction as pthreads, and therefore, a comparison between OpenMP and MPI would be the same as OpenMP vs. pthreads.
      - Aater is right in saying that OpenMP just hides the syntax. Similarly, you could cook up your own functions that hide the process of creating threads, and then the resulting code between pthreads and OpenMP would look pretty much the same. I imagine there would be libraries out there that already do this. However, OpenMP allows other facilities for how loops are scheduled etc.

      • Hey Amber,

        Welcome to the discussion! You have a good point. MPI does have high level constructs but I would say that MPI is mid-way between pthreads and Open MP because while Open MP hides some ugly syntax via pragmas, MPI requires you to write the ugly syntax similar to pthreads. Is my description of MPI correct? (I haven’t used it in a few years.)

        I completely agree with your second point though. Actually, I think GOMP used to do a source-to-source transformation and convert Open MP to pthreads so the code was indeed the same. Other libraries do exist but Open MP has an advantage that it uses the compiler pragmas, which makes its syntax super clean.

  2. I hope that’s a typo and you actually meant milliseconds and not seconds in your performance reports. Else there is something wrong in the number of iterations / size of the array, or you’re running these benchmarks on a quite different computer…

    • Hi,

      You have a very valid concern. As I point out under the ST code, the loop was run 1000 times to get these numbers. The point is to rule out random OS effects.

      Thanks.

      • 1000 x 10000 is only 10M integer additions. I would be surprised that this takes as much as ~10s on a modern 2+ GHz processor.

  3. Hi, all

    I have posted a question at
    http://www.futurechips.org/thoughts-for-researchers/parallel-programming-gene-amdahl-said.html#comment-1234

    that deals with OMP vs. pthreads performance. I am seeing negative scaling with OMP, good scaling w/pthreads for the same application.

    Comments very welcome. I am very anxious to find the root cause, something I can prove via perf monitors but have not yet found the right ones.
