May 26, 2011
 

This post is a follow-up to my post titled why parallel programming is hard. To demonstrate parallel programming, I present a case study of parallelizing a kernel that computes a histogram. I use OpenMP for parallelization (see the reason here). The post first introduces some basic parallel programming concepts and then dives deep into performance optimizations.

Update 5/30/2011: I have written my next post on parallel programming. It discusses the motivation, implementation, and trade-offs of dynamic work scheduling.

Histogram Problem

I am using the histogram kernel because it is simple and clearly demonstrates several important concepts in parallel programming: thread spawning, critical sections, atomic operations, barriers, false sharing, and thread join. Here is our problem statement:

Problem: Count the number of times each ASCII character occurs on a page of text.

Input: ASCII text stored as an array of characters.

Output: A histogram with 128 buckets, one for each ASCII character, where each entry stores the number of occurrences of the corresponding character on the page.

[Figure: histogram]

Single-thread solution

Here is a single-threaded implementation of the histogram kernel:

1: void compute_histogram_st(char *page, int page_size, int *histogram){
2:  for(int i = 0; i < page_size; i++){
3:    char read_character = page[i];
4:    histogram[read_character]++;
5:  }
6: }

This code takes 10.36 seconds to run on my Core2Quad with 4GB of memory. Throughout this tutorial, I set the page size to 10M characters and executed the loop 100 times to rule out any random effects from the OS. I use the Intel C Compiler version 9.1 with -O3 optimizations.
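
For reference, here is a minimal sketch of the kind of measurement harness implied above. Only the 10M page size, the 100 repetitions, and the kernel signature come from the post; the main function, the buffer initialization, and the omp_get_wtime timing are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define PAGE_SIZE   (10 * 1024 * 1024)   /* 10M characters, as in the text */
#define NUM_BUCKETS 128
#define REPEAT      100                  /* repeat to average out OS noise */

void compute_histogram_st(char *page, int page_size, int *histogram);

int main(void){
  char *page = malloc(PAGE_SIZE);
  int histogram[NUM_BUCKETS] = {0};

  /* fill the page with arbitrary ASCII characters */
  for(int i = 0; i < PAGE_SIZE; i++)
    page[i] = 'a' + (i % 26);

  double start = omp_get_wtime();
  for(int r = 0; r < REPEAT; r++)
    compute_histogram_st(page, PAGE_SIZE, histogram);
  double elapsed = omp_get_wtime() - start;

  printf("time: %.2f seconds\n", elapsed);
  free(page);
  return 0;
}

Compile it together with the kernel above using, e.g., gcc -O3 -fopenmp.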

Finding the parallelism…

Parallelism is relatively obvious in the histogram kernel. We can evenly split the iterations of this loop among threads such that each thread processes only its own portion of the input page. We can use the OpenMP parallel for to achieve this (learn about the inner workings of OpenMP parallel for here).

[Figure: histogram-split, the input page divided evenly among threads]

1: void compute_histogram_mt1(char *page, int page_size, int *histogram){
2:  #pragma omp parallel for
3:  for(int i = 0; i < page_size; i++){
4:    char read_character = page[i];
5:    histogram[read_character]++;
6:  }
7: }

The above code does not produce correct results. Why? Because the increment on line 5 is translated into the following sub-operations:

 

   1:   R0 = MEMORY[histogram + read_character]
   2:   R0 = R0 + 1
   3:   MEMORY[histogram + read_character] = R0

When two threads run these operations concurrently, there can be data races. For example, if T0 is executing line 2 when T1 executes line 1, both threads read the same old value and the histogram bucket misses an increment. (See this other post for more details.)

To prevent threads from accessing shared data concurrently, accesses to shared data are encapsulated inside synchronization primitives like locks. The code protected by a lock is called a critical section. A critical section guarantees mutual exclusion, i.e., if one thread is executing a critical section, then any other thread wanting to execute the same critical section must wait. Today's processors provide hardware support for running some commonly occurring critical sections quickly as single atomic instructions. Fortunately, incrementing a single memory location (the code on line 5) is one of those atomic instructions in x86. OpenMP can be directed to convert the increment into an atomic instruction using #pragma omp atomic as follows:

1: void compute_histogram_mt2(char *page, int page_size, int *histogram){
2:   #pragma omp parallel for
3:   for(int i = 0; i < page_size; i++){
4:     char read_character = page[i];
5:     #pragma omp atomic
6:     histogram[read_character]++;
7:   }
8: }

This code produces correct results, but it takes 114.89 seconds to run on four cores, which is more than 10x slower than our single-thread code. Not a good idea :)
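
For comparison with the lock-based critical sections described above, the same loop can also be written with an explicit #pragma omp critical around the increment. This is a sketch for illustration only (the function name is mine); a general-purpose critical section acquires a lock and is typically even slower than the atomic version:

void compute_histogram_mt2_critical(char *page, int page_size, int *histogram){
  #pragma omp parallel for
  for(int i = 0; i < page_size; i++){
    char read_character = page[i];
    /* a full critical section (a lock) instead of a single atomic instruction */
    #pragma omp critical
    histogram[read_character]++;
  }
}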

The problem is that the atomic increment (lines 5-6 of compute_histogram_mt2) has a longer latency than the regular non-atomic load, increment, and store operations combined. This offsets the benefit of parallelism. Reducing the cost of such critical sections is a major challenge in parallel programming. In the histogram's case, we can use reduction. In reduction, each thread first updates a thread-private local histogram and then "reduces" (adds) it into the global histogram only once, at the end of the loop. Updates to the local histograms can execute in parallel, but additions to the global histogram still require atomic operations. The new code is as follows:

1: void compute_histogram_mt3(char *page,
                                   int page_size,
                                   int *histogram,
                                   int num_buckets){
2:  #pragma omp parallel
3:  {
4:    int local_histogram[num_buckets];
5:    for(int i = 0; i < num_buckets; i++) local_histogram[i] = 0;
6:    #pragma omp for nowait
7:    for(int i = 0; i < page_size; i++){
8:      char read_character = page[i];
9:      local_histogram[read_character]++;
10:   }
11:   for(int i = 0; i < num_buckets; i++){
12:     #pragma omp atomic
13:     histogram[i] += local_histogram[i];
14:   }
15: }
16:}

This code runs in 3.8 seconds on my 4-core system, but its speedup still isn't 4x. Why? Because the actual amount of work done by the kernel has increased. Each thread not only has to process its portion of the input array but must also execute the reduction code in lines 11-14, and the atomic operations in that code make it slower still. On a side note: by default, OpenMP ends a parallel for loop with a barrier, enforcing that no thread can proceed until all threads have finished their work (more on barriers below). Adding the nowait clause on line 6 removes this restriction, since in this example it is fine for a thread to proceed to the second loop (lines 11-14) without waiting for other threads to finish the first loop (lines 7-10).

As I pointed out, we should try to reduce the number of atomic adds. The operations need to be atomic only because we allow multiple threads to update the global histogram. Instead, we can put one thread in charge of adding all local histograms to the global histogram. While this approach increases the non-parallel portion of the code, its benefit may outweigh its cost. Let's try. The code looks as follows:

  void compute_histogram_mt4(char *page,
                                  int page_size,
                                  int *histogram,
                                  int num_buckets){
1:   int num_threads = omp_get_max_threads();
2:   __declspec (align(64)) int local_histogram[num_threads+1][num_buckets];
3:   memset(local_histogram, 0, sizeof(local_histogram));
4: #pragma omp parallel
5:  {
6:    int tid = omp_get_thread_num();

7:    #pragma omp for
8:    for(int i = 0; i < page_size; i++){
9:      char read_character = page[i];
10:     local_histogram[tid][read_character]++;
11:   }
12:   #pragma omp barrier
13:   #pragma omp single
14:   for(int t = 0; t < num_threads; t++){
15:     for(int i = 0; i < num_buckets; i++)
16:       histogram[i] += local_histogram[t][i];
17:   }
18: }
19:}

There are three new concepts introduced in this code. First, the barrier at line 12. The semantics of a barrier dictate that no thread can leave the barrier until all threads have reached it. In our code, the barrier ensures that the reduction happens only after all threads have finished their work. (Note: an explicit barrier isn't strictly needed here because OpenMP already inserts an implicit one at the end of the parallel for; I put it there for explanation purposes.) Second, the omp single pragma, which enforces that only one thread runs the reduction code. Third, the __declspec at line 2, which prevents false sharing among threads. False sharing occurs when two threads concurrently modify data that happen to sit on the same cache line. Cache coherence does not allow two cores to write the same line at the same time, so the threads' execution gets serialized. By making each thread's portion of the array cache-line aligned, you can avoid false sharing (the solution here is the simplest, but not the most robust). Since only one thread performs the reduction, it needs no atomic operations. This code takes 4.42 seconds to execute, which is slower than the previous version.
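
As an aside, __declspec(align(64)) is Intel/Microsoft-specific syntax. A more portable sketch of the same idea (the type and constant names below are mine, not from the original code) is to give each thread its own padded row so that no two threads' counters ever share a cache line, and to let calloc zero the storage:

#include <stdlib.h>
#include <omp.h>

#define CACHE_LINE_SIZE 64
#define NUM_BUCKETS     128

/* One row per thread; the trailing pad keeps neighbouring rows' counters
   off the same cache line regardless of how the block happens to be aligned. */
typedef struct {
  int  buckets[NUM_BUCKETS];
  char pad[CACHE_LINE_SIZE];
} padded_histogram;

void compute_histogram_padded(char *page, int page_size, int *histogram){
  int num_threads = omp_get_max_threads();
  padded_histogram *local = calloc(num_threads, sizeof(padded_histogram));

  #pragma omp parallel
  {
    int tid = omp_get_thread_num();

    #pragma omp for
    for(int i = 0; i < page_size; i++)
      local[tid].buckets[(unsigned char)page[i]]++;

    /* single-threaded reduction, as in compute_histogram_mt4 */
    #pragma omp single
    for(int t = 0; t < num_threads; t++)
      for(int i = 0; i < NUM_BUCKETS; i++)
        histogram[i] += local[t].buckets[i];
  }
  free(local);
}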

Either way, the speedup does not improve, because the reduction is purely sequential. Question: can I parallelize the reduction itself without inserting atomic operations? The answer is yes. I can parallelize the reduction such that different threads are in charge of reducing different buckets. Since no two threads touch the same bucket, no atomics are needed. The code is as follows:

   void compute_histogram_mt5(char *page,
                                   int page_size,
                                   int *histogram,
                                   int num_buckets){
1:  int num_threads = omp_get_max_threads();
2:  __declspec (align(64)) int local_histograms[num_threads+1][num_buckets];
3:  memset(local_histograms, 0, sizeof(local_histograms));
4: #pragma omp parallel
5:  {
6:    int tid = omp_get_thread_num();

7:    #pragma omp for
8:    for(int i = 0; i < page_size; i++){
9:      char read_character = page[i];
10:     local_histograms[tid][read_character]++;
11:  }

12:   #pragma omp for
13:   for(int i = 0; i < num_buckets; i++)
14:     for(int t = 0; t < num_threads; t++){
15:       histogram[i] += local_histograms[t][i];
16:     }
17: }
18:}

This code runs in 3.60 seconds, which makes it the fastest implementation of the histogram so far. Although not shown here, I also made the global histogram array cache-line aligned to eliminate any false sharing that may occur among threads during the reduction. For this tutorial, I will deem this performance acceptable. My goal was to give readers a flavor of parallel programming. As you can see, parallel program optimization can be an endless loop of trial and error. Moreover, the trade-offs can change drastically if the compiler, the hardware, or the input set changes, often requiring the code to be re-tuned.

If you are interested in learning more about parallel programming, stay tuned as I plan to present a more complex case study of parallelizing a branch-and-bound algorithm next week.

 

A tip on the side: To check if the code is spawning the correct number of threads in Linux (with tcsh) you can do:

strace (program-name) |& grep clone | wc -l

The printed number is the total number of times clone (the Linux system call that creates threads) was called, which corresponds to how many threads were spawned.
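
Alternatively, you can ask OpenMP directly from inside the program. This is a small sketch (not part of the original tip) that prints the team size:

#include <stdio.h>
#include <omp.h>

int main(void){
  #pragma omp parallel
  {
    /* one thread reports how many threads are in the team */
    #pragma omp single
    printf("OpenMP team size: %d\n", omp_get_num_threads());
  }
  return 0;
}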

  73 Responses to “Writing and Optimizing Parallel Programs — A complete example”

  1. This is a reduction, so it’s best expressed with reduction syntax. OpenMP, as far as I remember, has a reduction syntax only for predefined operators. With Ateji PX, you can define your own monoid (binary operator) and use it together with the parallel reduction syntax.

    Call histSum the monoid that takes two histograms and sums them; then the reduction is `histSum for||{ hist(i) | int i: 4 }`.

    Read this as “the sum, in parallel, of all hist(i) for i between 0 and 3”. hist(i) computes the i-th histogram and the work is split in 4 blocks.

    There is also an advantage over your suggested implementation: there is no need to wait until all local histograms are computed before starting the computation of the local sum, you can sum a local result as soon as it is ready, and even do the sum in parallel using a binary tree. The parallel sum syntax takes this into account, it makes a difference when there are many blocks.

    • I agree that there is a reduction syntax, but I would not use it for this post because I wanted to explain performance optimizations in the reduction phase. Letting OpenMP do the reduction would not have brought out the points I made. IMO, such tools are great to use once you understand what's underneath them, but they are a bad choice for teaching (you may enjoy this: http://www.futurechips.org/thoughts-for-researchers/csee-professors-graduates-understand-computers.html).

      On a side note, I do not think I could use OpenMP reduction as it only works with single variable reductions (please correct me as my information may be outdated).

      OK. You have got me excited about Ateji PX:) Are you willing to write us all a small introduction to Ateji PX?

  2. I've kept reading your article, and I would very much like to be able to try out your code with Objective-C and libdispatch.

  3. Whoa, OpenMP makes it look super easy to code some parallel programs in C: http://is.gd/w12980 #dataparallel

  4. Your histograms appear to be of 8 bit values. Combining 256-long vectors of integers takes on the order of a microsecond on a single core. So if you’re not getting close to 4x speedup, either your numbers are off, or the bottleneck isn’t the CPU (which is parallel), or you are doing very very little work and repeating it (plus thread creation) very many times. Maybe you’re bottlenecked on the memory bus or disk access or somesuch?

    • Rex,

      Thanks for reading.

      I agree that 256-long vectors will be fast and not good for parallelization, but that's not what I had. The input page is 10M characters and I ran the loop 100 times to get my numbers. I point this out in the section labeled "single thread".

        • When Rex is talking about “combining 256-long vectors of integers”, he means your reduction loop. It iterates over only num_threads*num_buckets values, which is such a small number that the execution time should be insignificant. And indeed, when I tried out a variation of compute_histogram_mt4 locally, I saw 426924 us spent in the first loop and 278 us spent in the second loop (total of 100 iterations). So it's hard to blame the second loop for any performance problems. Even if I take the second loop out entirely and write into local_histograms that are never read, compute_histogram_mt4's first loop is not 4X faster than compute_histogram_st. Bottlenecked on the memory bus seems like a decent guess as to why, given that it's reading data at about 1.2 GBytes per second.

        btw, nice article. I’d never tried out OpenMP before, and you got me to play with it. It’s much easier than doing everything with pthreads by hand.

        • Scott, Thanks a lot for this post. First, it explains what Rex is getting at. Second, it forces me to peel another layer of the onion.

          First, I highly doubt it's memory bandwidth:

          The available bandwidth on Nehalem = 1.33GHz (DDR-3) x 8 bytes x 3 channels = 32 GB/s which is 26x more than 1.2 GB/s. In fact, an even more memory intensive kernel showed linear speedup on this machine (http://www.futurechips.org/tips-for-power-coders/open-mp-pthreads.html)

          Then what's causing it, in my opinion?

          It's the overhead of parallelizing stuff using OpenMP. Here is the data:

          Single Thread: 9.47 secs
          mt4 code, num_threads = 1: 12.74 secs
          mt4 code, num_threads = 4: 3.29 secs
          mt4 code w/o loop-2 @ 4 threads = 3.19 secs.

          So, mt4 loop-1 is indeed showing near-linear speedup except that the baseline with Open MP is 30% slower. Why?

          OpenMP introduces this overhead each time you start a parallel region. Since I put the parallel region in the inner loop, that overhead was paid again and again, which slowed things down overall. Lesson: don't make OpenMP spawn threads in the inner loop.

          I hope this explanation helps. Thanks.
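
          To make the lesson concrete, here is a sketch only (the function name and the repeat loop are illustrative, not my timed code): hoist the parallel region out of the repetition loop so that the team is created once and only the work-sharing construct runs per repetition.

          void run_many_times(char *page, int page_size, int *histogram, int repeat){
            #pragma omp parallel                 /* threads are spawned once, here */
            {
              for(int r = 0; r < repeat; r++){   /* replicated control loop */
                #pragma omp for                  /* work-sharing only, no new threads */
                for(int i = 0; i < page_size; i++){
                  #pragma omp atomic             /* kept short; use the mt5 kernel in practice */
                  histogram[(unsigned char)page[i]]++;
                }                                /* implicit barrier keeps repetitions in step */
              }
            }
          }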

          • Yeah, that explains it, thanks. A little disappointing – one would hope OpenMP would be smart enough to leave the worker threads around for the next block. Maybe there’s something to be said for doing things by hand with pthreads after all… :-/

          • You could try to declare the omp parallel statement outside the function-call.
            Calling 'parallel' in an OMP pragma only tells OpenMP to create the threads.
            Even worse, calling #pragma omp for without first calling #pragma omp parallel will cause the for-loop to be executed single-threaded.
            Only make sure you don’t call ‘parallel’ within a ‘parallel region’ (e.g. nested)

            Just make sure you don’t mess up the ‘shared’ and (local) private variables.

            Within the parallel region the threads will stay dormant when not in use.

  5. So your speed-up for going 4 way parallel is to go from 3.83 seconds single-threaded to 3.60 on four threads? Hardly worth the effort. You’ve missed that your workload appears to be memory-subsystem bound, and so you’ll gain essentially no benefit from parallelism.

    • Hi Paul,

      Excellent catch. Actually I had written the wrong number for single thread. The typo has been fixed. The actual time for single thread was closer to 10 seconds hence the speedup is much higher than it initially looked.

      I would point out that the histogram is not memory bound at all. It's a very regular workload that does not saturate memory bandwidth. It is not easy on the cache, but the hardware prefetcher is likely to do a great job of prefetching this array, so the cache misses are very few. I ran this on a simulator to verify the above.

      Thanks again.

      • 10.36 to 3.60 is substantially better. I’m still curious as to the last missing 4 seconds of time – I don’t see four seconds of work in the histogram reduction, so something is still throttling the computation. Some of it should be overhead from OpenMP thread launches, etc, but a second per core of such overhead is still too high.

  6. Unless I’m missing something (which is possible), I think the “reduction” step in your last example has a bug.

    12: #pragma omp for
    13: for(int i = 0; i < num_buckets; i++)
    14: for(int t = 0; t < num_threads; t++){
    15: histogram[i] += local_histograms[tid][i]; //< Should be 't', not 'tid'
    16: }

    • Stuart, I fixed the typo. It wasn't a bug in my C file; I copy-pasted code around while trying to get it to post in WordPress and copied the code from snippet #2. You are most welcome to try it out; I would be very interested in seeing the results on a different machine.

  7. All that mental work.. all that time spent on tweaking… to allow an improvement of single core to multi core, to go from 3.8 seconds, to 3.6 seconds?!?!!?

    I think this PARTICULAR issue, would be better meta-optimized as, “just use single core, and leave the other cores to do more useful stuff for the user of the machine” !!

    To abuse Knuth(?), “one of the greatest programming evils, is premature optimization”.
    This would be one of those evil times.

    • Philip, the speedup is about 3x, so it's definitely worth the effort. There was a typo that I fixed.

      In general, there are very few kernels where parallelism is so hopeless. One example is the following averaging filter.


      for (int i = 9; i < N; i++){
      A[i] = (A[i] + A[i-1])/2;
      }

      • I am with the impression that the previous 2-liner has an inconsistency: by my understanding of what averaging is, “A[i]=” should be modified to “A[i-1]=” to avoid iterative corruptions.

        Also, I recommend in the original article to spot the number 128 in the last reference of __declspec and replace it with “num_buckets”, for clarity reasons.

        Interesting read, btw. If the example is available under a Creative Commons Attribution-ShareAlike (CC BY-SA) license, we could base some training material & comments on it.

  8. I am always interested in learning about parallel programming but I confess that I don’t understand why the iterative approaches taken here are proposed. Why, if the goal is to parallel-ize this problem isn’t the approach made parallel in the first place?

    My experience with parallel programming is in higher-level languages than this (Java / .NET / SQL) but it seems to me that conceptually, if you know the ideal number of threads desired, correctly splitting the problems (count the characters, reduce the results) across the threads should give ideal performance.

    (Again, I don’t know the language you’re using but in pseudocode:)

    -Prepare a 2-D array for the histogram results, one row per thread, one column per character.
    -Split the input across the target number of threads, assign each thread a range of the input, and a row in the results array.
    -Wait for each thread to finish.
    -Create an array for the totals, one position per character.
    -Split the results array by columns, assign each thread a set of columns to sum up and store in the totals array.
    -Wait for each thread to finish.

    At the end you have the same results, and no threads were waiting on shared memory at all.

    I recognize that some of the code you’re sharing has ‘built in’ parallelism but my experience so far in a corporate setting with procedural programmers is that you have to approach the problem in a parallel way in the first place or it never works out. You have to begin with a gameplan to keep every thread distinct as long as possible with their own unique work to do. I think your readers would be better served with an explanation of the thought process that breaks a problem across threads from the get-go, instead of trying to maintain the ‘look’ of a procedural ST loop.

    • Corey, I agree that the parallelism algorithm you propose should come to an experienced parallel programmer naturally. However, there are very few such people out there.

      The target audience for this post is people who do not have such experience. I felt that it's important to show a bridge between serial and parallel code so they can understand the challenges. I chose the histogram because it takes very little time to change it from serial to parallel, and most of the work is in optimizing the parallel code, which, by the way, will still be required even if we take your approach. Your code will run into false sharing (and you will have to make the output array cache-line aligned).

      Thanks for reading and I look forward to your comments.

    • Just wanted to join in here and add a ‘lesson’ –
      When designing a parallel algorithm you are best off choosing a divide & conquer, functional approach. The more you can find a functional way to perform the reduce, process-reduced, and merge steps, the less you will be writing locks, semaphores, and synchronization objects, which can create bottlenecks (if not outright defects and deadlocks) that are difficult to detect (and, worse, difficult to predict).
      [BTW - when I say functional, I simply mean the functional style, not a particular language/platform. Though I have to say that writing this algorithm in lisp would be more natural than in C/C++, but that's just this algorithm.]
      Of course – this is not always the correct paradigm and so on, but the given example of a histogram — as well as many other text processing algorithms — is a classic example for a divide & conquer algorithm.
      I would suggest that anyone who is investing time in _understanding_ parallel programming to do some research in functional programming which makes heavy usage of divide & conquer algorithms. You may end up writing very different code for serial mode and for parallel mode, but it will be using the best strategy for each case.

  9. I’m unfamiliar with Open MP (I do my own threading in VC++), but I’d like to point something out: It seems to me that the threaded compute_histogram_st() example would work flawlessly on all x86 hardware in spite of the author’s assertions.

    I’ve been programming x86 assembler for some 15 years and from the statements made I can tell the author doesn’t know x86 assembler as well as he should. According to Intel documentation, the INCrement instruction is atomic — it has no load/increment/save phases that the programmer (or even the CPU) can interrupt and is therefore not vulnerable to being corrupted.. at all.

    Fortunately for us, Intel made life easier for programmers of its architecture by guaranteeing that any single instruction operates atomically, so a more complex example than a histogram might have illustrated the author's point more accurately by manually reading the value, modifying it in a 2-step operation, and writing it again. The fundamental threading principles are still accurate, but that one example just happens to be a special case and will work perfectly (when the #pragma is removed).

    To my mind the only performance issue is that each of the CPUs will be fighting for access to the cache line containing the histogram output. However, the overhead of spinlocks or critical sections can also be severe (as the author demonstrated). The best solution is as the author described: Individual threads using their own cache-aligned storage and merging their results at the end.

  10. Very cool article. Minor question: is there a reason you did num_threads+1 for the size of the tid array?

    • foon, excellent question. dead on target:) I was hoping someone asks me that.

      I left the description out of the article to keep it simple. The reason: I used a power-of-2 number of threads, i.e., 4. A problem with power-of-two array sizes and thread counts is that multiple threads can map to the same cache or memory bank and start contending for it. By making the size a non-power-of-two, you can avoid such unexpected behavior.

      Aater

      • Interesting. Is that purely a performance issue (i.e. thread A gets mapped to cache[X], thread B gets mapped to same spot, meaning next time A runs, it has to flush the cache and pull from memory so things run slower the same way something compiled with some optimizations may run slower because of the relative slowness of memory) or is there also potential correctness issues (i.e. thread A gets mapped to cache[X], writes to something there, B gets mapped to cache[X], writes there and now A’s values are incorrect). I would hope just the former. If I didn’t know the number of threads a priori, would I have to do some grungy math (I guess numthreads & 1 + 1 would work for numthreads > 1) to force it not to be a power of 2?

        (And just to check, did you mean cache in the L1/L2/LN sense or some other sense?)

        • Yes, purely performance. I would use some math to make it the next non-power-of-2. I generally use static sizes but didn't do it for the tutorial so as not to confuse people.

          I meant caches in L1/L2 sense.

  11. A minor issue: there is no need for #pragma omp atomic in the _mt4 version since the additions are in a #pragma omp single region.

  12. The true problem with parallel programming is the tools available.

    I think C/C++ is a great language for ST, but adding OpenMP just does not cut it.
    New programming languages are missing the code base, and head count.

    C++0x added threading support is adding a functional threading paradigm to an object oriented language. I would say pthreads are of about the same value.

    Threading support is ADDED instead of integrated, and I have to come to the conclusion that I will have to come up with my own C/C++ extension.

    I believe that once threading is integral to C/C++ it will solve a lot of existing problems in MT code.

    Please check out http://www.ParallelAxiom.com for my first approach.

  13. It seems like OpenMP runs very well. I’m surprised, since I thought the large chunk of time to clone a process would outweigh benefits of small bits of code chunks. Perhaps it’s dependent on the size of the dataset, as larger amounts of data would allow each row to do more, take longer and outweigh the clone cost.
    I wondered if there was some way to have an even lighter weight thread that could be initialized in about 1/6 to 1/4th the time of a clone, allowing it to be used with smaller code chunks.
    Have you any experience or advice on the smallest code snippets before the clone overhead becomes burdensome, or is that just a matter of trial and error?

    • Linda, the clone overhead is high and can be prohibitive if individual chunks are small. I did some experiments in 2007 on Fedora Core III with dual Xeons and found that a new thread costs about 10,000 cycles. Thus a new thread can only be justified if it can offload at least 10K cycles' worth of work from the current threads. The 10K number is machine- and OS-dependent, so you will have to experiment with your platform.

      BTW, a common approach to practically eliminating the clone overhead is to use a pre-allocated thread pool. You can create threads once at the beginning of the program and keep them around in the thread-pool until the program finishes.

  14. In critical sections where the code needs to run as fast as possible, it is also possible to use the processor functions of the chip to parallelize even the single threaded portion of the code. For example in your earlier example where the code finishes the parellel processing and then adds up the historgram values using a single thread, you can use MMX/SSE/SSE2. By reading in multiple values (integers) at the same time (single instruction) and perform multiple adds in a single instruction and writing the values back you get high parallelism with limited context/memory delays.

    If done correctly can improve execution time dramatically.

    • Peter, thanks for reading. I am sorry for my late response. That's a great point and I 100% agree. In fact, SIMD is one of those topics I want to raise awareness about among programmers. However, for this article, I disabled auto-vectorization of the loop by the Intel C Compiler to ensure it doesn't use SSE. The only reason I did so was to avoid adding another dimension to this article. SIMD is a different topic and I will cover it in a post.

      Btw, just for sanity checking, SSE turns out to be about 5x faster for me. Do you see similar numbers?

      I will remember to notify you when I write the SIMD article.

      Aater

      • Yes, often it is in the range of 5 to 6 times, however, depending on how I pack the data, I can achieve much higher results. For example I can add 8 16 bit numbers or 16 8 bit numbers and as long as your histogram is guaranteed to not exceed 65535 letters per bucket you can get much higher performance. Of course that depends on what you are trying to build and what rules/assumptions you can make use of.

  15. [...] Writing and Optimizing Parallel Programs — A complete example (tags: programming toread parallel computerscience) [...]

  16. Congrats, as a programmer you just wrote and demonstrated what took a professor of mine a few weeks to describe.

    Parallel programming is very tough. Race conditions, dependencies, hardware changes, programmer error, legacy code, atomic actions, lack of debugging support, and critical sections [btw I'm sure I'm missing some] all make it very difficult to properly write parallel applications, as well as slowing down the application you are trying to make faster.

    This demo clearly shows that something that looks so simple becomes, when re-written for parallel execution, difficult to program, and that it is hard to figure out where and why we have to take certain actions (like the race condition that must be fixed by an atomic operation).

    Congrats for a well thought-out example and thanks for the read.

    – Rob

    • Hi Rob,

      Thank you for reading. Your comments are indeed very encouraging. I am currently working on another tutorial explaining task parallelism using a simple example. I hope you will enjoy that. Stay in touch.

      Aater

  17. The code is not entirely correct. First of all, C/C++ do not guarantee that char is unsigned. Therefore, it should be unsigned char at all times.

    Furthermore, I don’t get it why the compiler would make local_histograms a shared object among threads. It is declared inside the parallel region and as such, each thread should have its own. Maybe icc 9.1 is broken in this part.

    Finally, local_histograms should be zeroed, otherwise it is incorrect.

    By slight modifications and using newer compilers (gcc 4.4.5 and intel 12.1) I get decent speedups with the following (I’m sorry for the C++, but it was faster to write it):

    inline void compute_histogram_mt(char* page, const std::size_t page_size, int* histogram)
    {
      unsigned int nth = omp_get_max_threads();
      std::vector<std::vector<int>*> lhistograms(nth, static_cast<std::vector<int>*>(0));

      #pragma omp parallel default(none) shared(page, histogram, lhistograms, nth)
      {
        const std::size_t N = 256;
        std::vector<int> lh(N, 0);
        lhistograms[omp_get_thread_num()] = &lh;

        // compute local histogram
        #pragma omp for
        for (std::size_t i=0; i<page_size; ++i) {
          unsigned char idx = page[i];
          ++lh[idx];
        }

        // aggregate local histograms to global
        #pragma omp for nowait
        for (std::size_t i=0; i<N; ++i) {
          for (std::size_t j=0; j<nth; ++j) {
            histogram[i] += (*lhistograms[j])[i];
          }
        }
      } // omp parallel
    }

    Declaring the local histogram truly local and then sharing its address among the threads helps a big time, especially if the OpenMP implementation supports thread local heap allocators.

    Now the only problem one will encounter is that there will not be enough parallelism in the aggregation loop as it can only go up to 256 threads. It can be solved by using a task-parallel paradigm though throughout the code.

    • The code assumes that all values in char *page are 0 to 127 otherwise there is a buffer overflow regardless of signed or unsigned.

      I don’t get how “int local_histograms[num_threads+1][128];” compiles unless this is something to do with Open MP.

      You caught the other two but missed this one:
      #pragma omp for
      for(int i = tid; i < num_buckets; i +=num_threads)
      {
      for(int t = 0; t < num_threads; t++)
      {
      histogram[i] += local_histograms[t][i];
      }
      }

      **** Note to all coders DO NOT DO THIS ****
      for (...) {
      if (...) {
      ... someFunction(...) {
      ...

      It makes you lose track of braces just look at compute_histogram_st and compute_histogram_mt5 are missing “}” at the end.

  18. [...] the different buckets on the histogram in parallel. Histogram parallelization is described here: http://www.futurechips.org/tips-…. Many scientific kernels also have reduction. Any time you can do a reduction using multiple [...]

  19. [...] I can't stop thinking about is a parallel histogram. I did do some analysis of it here: http://www.futurechips.org/tips-… . Once you get past the histogram 9which can consume months), you can look at some task-parallel [...]

  20. [...] You might remember I wrote previously about Aater Suleman’s blog post that kick started an online debate about why parallel programming is so hard. He’s now followed it up with a step-by-step tutorial on parallelising a program. [...]

  21. I wonder – what are you timing here, and just how fast is that cpu?

    I have what is now quite an old laptop that has a core-duo T7300 @ 2.0Ghz, and as far as I can tell they don’t make core2-quad’s that slow – the cpu is 4 years old.

    On this machine: a serial single-threaded Java implementation (openjdk6 – 32 bit), 256 entry histogram, 10x1024x1024 bytes of random 8-bit data, and totalling up 100 runs takes under 2.5 seconds.

    (For comparison, a more up-to-date i7-980 desktop w/ the oracle jvm 64-bit takes only 0.72s)

    This is only timing the histogram generation – the data is in memory already. Whether I do a few ‘warm up’ iterations outside of the timing first – to prime the hotspot compiler – seems to make no noticeable difference.

    But maybe you’re timing more/something else.

    The code in all its complexity:

    static void genhistogram(byte [] data, int [] histogram) {
    for (int i=0;i<data.length;i++) {
    histogram[data[i] & 0xff] += 1;
    }
    }

    Incidentally, any attempts at optimisation (e.g. reading 4 bytes at once using an int, array and then shifting it out) slows it down on intel hardware, although I can get another 25% performance out of a phenomii by doing that.

    Also, for such a relatively big yet simple problem you should get almost exactly 1:1 scaling with the number of cores until you hit memory bandwidth issues – which is essentially impossible when reading bytes at a time. The problem is big enough that thread overhead should be minuscule, and simple enough that dividing it evenly is trivial. The final reduction is so small as to be totally insignificant.

  22. Aater, I was wondering if you would be kind enough to help me turn a sequential program into a parallel one (C with OpenMP); I'm having trouble making it work correctly. I'm new to parallel programming, and even though it's a simple program I can't figure it out. Contact me by e-mail if you can be of some help. Thank you. Razvan

  23. [...] … I was unaware … that a major challenge in multi-threaded programming lies in optimizing parallel programs not just getting them to run. His analysis is insightful and the case study is very enlightening if [...]

