May 30, 2011
 

In my computer architecture course, I was told that a processor is like a car: the steering and the pedals are the ISA, the engine is the microarchitecture, and the software is the driver. I will extend that analogy and say that using a compiler is like operating the car with a remote control. Remote controls are great, but it is also important to understand their inner workings. I see a lot of code, even in professional software, that can confuse even the smartest compilers. In this post I cover three common tricks that confuse a compiler.

Before I fully dive into compilers, I want to add a quick disclaimer: while I do agree that our compilers need improvement, I also believe that compilers will always have their breaking points and can always benefit from a little help from the programmer.

Compilers

A compiler takes your high-level code and converts it into machine code. I view the compiler as the following:

[Figure: compiler flow diagram, from high-level source code to optimized machine code]

Optimizing the code requires the compiler to perform some extensive analysis. A simple example is dead code elimination (DCE), in which the compiler eliminates code whose results are never used. For example, the loop on lines 3 and 4 below will be eliminated by the compiler because the loop's result is never used.

1: void foo(){
2:   int x;
3:   for(int i = 0; i < N; i++)
4:     x += i;
5: }

In DCE, the compiler performs a backward data-flow analysis (over def-use chains) to determine which variables are never used. It then eliminates the statements that only write those unused variables. For the compiler to perform DCE and other similar optimizations, it must be able to decide, with 100% confidence, whether or not a variable is going to be accessed. Herein lies our opportunity to confuse the compiler: we can exploit some well-known weak points.
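To make the def-use reasoning concrete, here is a self-contained variant of the example above (a sketch; N is #defined only to keep it compilable, and the commented-out printf is a hypothetical single use):

#include <stdio.h>

#define N 1000  /* loop bound, #defined here only to keep the sketch self-contained */

void foo(void) {
    int x = 0;
    for (int i = 0; i < N; i++)
        x += i;            /* x is defined (written) on every iteration...           */
                           /* ...but never read afterwards, so the writes and the
                              whole loop are dead and an optimizer at -O2 drops them */
    /* printf("%d\n", x);     a single use like this would keep the loop alive */
}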

Function calls

Compiler analysis is a very compute-intensive task. For reference, the code of the compiler itself is perhaps the most irregular and complex code in existence. Moreover, the optimization problems grow in complexity very quickly as the scope of the analysis grows. To keep compile time under control, compilers often limit the scope of analysis to individual functions and perform only limited, or even no, inter-procedural analysis (IPA). As an added benefit, treating functions as independent entities also allows compilers to compile files in parallel.

Adding multiple layers of function calls, ideally split across files, is often a good method to confuse the compiler.

Let's look at this example:

void foo(){
  int x;
  for(int i = 0; i < N; i++)
    x += bar(i);
}

A different file contains:


int bar(int i){
  return i;
}

A programmer may think that the only overhead is the function call to bar. However, when compiling foo, the compiler is unable to see that bar has no side effects, such as writing to a global variable or performing I/O. For this reason, it cannot apply DCE, and the entire useless loop now has to execute. This can result in a substantial slowdown.
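A sketch of how to give the compiler its visibility back (my own example, not from the post; it assumes GCC or ICC): either expose bar's body in a shared header, or promise the compiler that bar has no side effects.

/* bar.h (hypothetical): with the body visible, the compiler can inline bar,
   see that it has no side effects, and apply DCE to the useless loop again. */
static inline int bar(int i) { return i; }

/* Alternative: keep bar in its own file, but annotate the prototype with the
   GCC/ICC extension below so callers know the result depends only on i and
   that a call whose result is unused can be dropped.                         */
/* int bar(int i) __attribute__((const)); */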

Pointers

I have seen a lot of programmers use pointers unnecessarily. Most programmers, if they worry about pointer overhead at all, think of the dereference itself as the main cost. In reality, the bigger overhead of pointers is that they confuse the compiler.

A pointer, in theory, is allowed to point anywhere in memory. As soon as the compiler sees a pointer, it can no longer be sure which variables are being accessed. Consequently, compilers make the pessimistic assumption that any of several variables in scope might be accessed through the pointer. This often creates many false dependencies in the eyes of the compiler. My classic example is the following loop, which I found in an image processing application:

for(int i = 0; i < N; i++)
  *a++ = *b++ + *c++;

After careful inspection, a human can decipher that the code is really just doing:

for(int i = 0; i < N; i++)
  a[i] = b[i] + c[i];
a += N;
b += N;
c += N;

The first version confuses the compiler. Even though the loop iterations are independent, the compiler assumes that they are not; thus, it does not vectorize the code. In contrast, the compiler is able to vectorize the second version.
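For completeness (the comments below make the same point), C99's restrict qualifier is the standard way to state the no-overlap guarantee explicitly. A minimal sketch, with my own function name:

/* restrict promises the compiler that a, b, and c never overlap, which
   removes the false dependence between iterations and allows vectorization. */
void add_arrays(float * restrict a,
                const float * restrict b,
                const float * restrict c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}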

For reference: the vectorized code runs 5x faster. Recall that multi-threading this kernel gave us 4x on 4 cores with some programmer effort; this auto-vectorized version gives a 5x speedup on a single core with no programmer effort. I tried both the Intel C Compiler (ICC) 9.1 and GCC 4.1. I quote results for ICC only because it outperformed GCC in all my examples.

If all else fails, use global variables

Global variables are considered evil for many reasons; I am about to provide one more. Since they have global scope, code involving global variables cannot be fully optimized by the compiler. For example, I noticed that the following code ran 30% faster when N was a local variable than when it was a global variable.

for(int i = 0; i < N; i++)
  a[i] = b[i] + c[i];


When I declared N as a global variable, the compiler left it in memory and added a load instruction inside the loop. When I made N a local variable, the compiler kept it in a register. While I do blame the compiler for this particular behavior (because I did not declare N as volatile), we have to work with what we have.
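A simple workaround (a sketch with my own names, not the code I measured) is to copy the global into a local before the loop:

int N;                        /* global loop bound */

void add_with_local_bound(int *a, const int *b, const int *c)
{
    int n = N;                /* snapshot the global into a local; the compiler can
                                 keep n in a register instead of reloading N from
                                 memory on every iteration                          */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}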

Conclusion

It is important to keep IPA and pointer aliasing in mind when writing C/C++ code. This simple consideration can help the compiler generate code that runs much faster.

A related tip: I have observed that both GCC and ICC are sensitive to link order, i.e., the generated code is a function of the order in which the input files are specified in the linker command. This can have a major impact (up to 10%) on performance due to cache and branch prediction effects. If you truly value performance, it is worth experimenting with the link order.

  23 Responses to “How to trick C/C++ compilers into generating terrible code?”

  1. Shouldn’t dead code usually be eliminated by the programmer? :-)

    I understand that you're just trying to keep things simple in these examples, but it would be great to see a real-world example where the compiler can eliminate dead code that wouldn't have been blatantly obvious to the programmer.

    It would seem that dead-code optimization would be much more useful if it worked on libraries, so that code paths in a library function would be removed if they were never used by the actual program being compiled. To do so you’d need to link the libraries to the main code either at the source or the intermediate representation level. OK, that would be a whole lot slower, but does it really matter if the final build takes 48 hours?

    • Hi Harry,

      Thanks for the comment. You are right:) I try to keep things simple for illustration purposes. I really like questions like yours as they provide me with the opportunity to explain things further for the curious readers, without overloading the casual ones.

      Here is a simple use case where the programmer will not see it easily but DCE will. It is inspired by code I found in the Linux kernel.

      const int i386 = 0; /* in a different file */
      ——————————————-
      /* somewhere in another file */
      .
      a = some compute;
      x = a;
      y = a + some other computation ;
      .
      . /* some code that doesn’t access a, x, or y */
      .
      if(i386){
          if(y)
              foo(x);
      } else {
          bar();
      }
      /* end of code */

      Now, this entire code can be replaced by just bar(), but the programmer is unlikely to realize that.

      I agree with your comment that libraries could benefit from extensive optimization. DCE is just one example; there are a dozen others. I have not even touched on instruction scheduling, which can really benefit from seeing more of the code.

      For completeness, I must mention that there are compilers that do exhaustive searches to optimize code. The first such compiler I know of was written by DEC for the VAX. (It is possible that it wasn't the first and I am mistaken.)

      • Hmm. Unless the compiler is looking at both files at once it won’t be able to do that optimization, although I guess if the constant is defined in an include file it would be a valid example. In fact, I suppose in cases like this the programmer typically *can’t* remove the dead code because it’s only dead for certain builds – although I have to wonder why i386 would be a constant variable rather than a #define. :-)

        Are you by any chance talking about the compiler that was released when DEC moved to a RISC processor for the new VAX models? I wasn’t doing much programming at the time, but I was using a VAX system and I remember reading about the changes. (VMS is still my favorite OS in many ways.)

          • I was actually thinking of include files, so DCE will work. In fact, you have constructed a great example there. This is yet another way to trick the compiler and deny optimizations :-)

          Most (non-stupid) use cases of DCE involve different builds. In fact, one can argue that DCE is what allows different builds to co-exist efficiently in the first place. Otherwise, co-existing builds will have higher performance costs.

          I agree that a #define would be the more conventional way of declaring i386. However, there is a good reason not to use #define here: unlike a #define, a const does not eliminate the code before compilation, so even the dead code still gets checked for syntax errors. This helps when maintaining code with different builds. My example would still work if i386 were a #define: the programmer still would not remove the dead code, for the reasons you describe, and DCE would.
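          To illustrate the difference with a fragment (the flag and function names here are just placeholders):

          /* flag as a macro: with #if, the dead branch is stripped by the
             preprocessor and is never even parsed or type-checked          */
          #if defined(BUILD_I386)
              foo(x);
          #endif

          /* flag as a const: the dead branch is parsed and type-checked on
             every build (catching bit-rot), and DCE removes it from the
             generated code                                                  */
          if (i386)
              foo(x);
          else
              bar();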

          I have never worked on a VAX myself. I do think that VAX’s virtual memory system is the most elegant VM system I have seen.

    • A lot of dead/redundant code can also appear from the compiler’s code generator.

      • And from generic code generated by things like macros or templates. It’s common to rely on compiler optimization in order to treat code constructs as compile-time determinable when the meta-language itself doesn’t offer such facilities.

        A good example of dead code elimination being important in compile-time generated code is when functions are inlined. Constant arguments can vastly simplify the inlined code after liveness analysis and propagation passes, which is a big reason why inlining is about more than just removing function call overhead.
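        A rough sketch of that effect (the names are just illustrative):

        static inline int scale(int x, int shift, int use_shift)
        {
            if (use_shift)
                return x << shift;
            return x * 3;
        }

        /* With the flag passed as a constant, constant propagation decides the
           branch at compile time and DCE deletes the untaken side, so neither
           the runtime test nor the shift survives in the generated code.       */
        int triple(int x) { return scale(x, 0, 0); }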

  2. pointer example:
    isn’t the ‘restrict’ key word designed to solve this problem? if a,b and c where marked are restricted then the compiler can assume that they do not overlap. does this help the vectorisation?

    • ssam, excellent comment! Yes, restrict does help with the pointer aliasing issue. I was really hoping someone would say that :) You have just made the point stronger: a programmer who understands the compiler will know and remember to use restrict. Unfortunately, many of us do not. We need more awareness…

  3. Also, could you comment on link time optimisation? It has recently been added to GCC (-flto). Maybe it helps in the function call example.

  4. I have trouble with your pointer example. Without restrict in the pointer declarations, the compiler is surely unable to vectorise either loop. They both suffer from the same problem i.e. that the ‘a’ array may overlap either or both of the ‘b’ and ‘c’ arrays. I have tested both with the ARM C compiler and this seems to be the case. It will happily vectorise both loops with restrict’d pointers but neither without this. Interestingly, without the restrict, it will compile two separate versions of the loop, one vectorised and one not, plus a preceding runtime test to determine whether the arrays overlap. The result of the test determines which version of the code is actually executed.

    Am I missing something?

    Chris

    • Hi Chris,

      Sorry my response is a bit late. I agree with you that I need a restrict. However, I omitted the restrict intentionally to make the point that most programmers do not think about such compiler constraints by default. We need more compiler-aware code, and it should be second nature to all programmers.

      The Intel compiler vectorized A[i] = B[i] + C[i] even without restrict. It's a dangerous optimization, but I guess it trusts the programmer to stay within array bounds.

      The ARM compiler seems really smart to generate both versions plus a dynamic test; it's impressive. Actually, it also brings out a great point about the dollar being the only metric in our field: ARM has more financial incentive than Intel to make its compiler better, because ARM's cores are leaner and a good compiler helps ARM more than it helps Intel. Hence the smarter compiler. As Intel builds leaner cores like Knights, I would expect its compiler to get smarter.

      On that note, I also have a question for you. I really want to play with an ARM core; where can I get a development board? I have been benchmarking the A8 using my iPhone 3GS, but iOS gets in the way.

      Thanks,
      Aater

      • BeagleBoard, PandaBoard and Gumstix are quite cheap.

      • Thanks for the response, Aater.
        I have to say that I believe the Intel compiler to be seriously in error for vectorizing the case without restrict. It's not just dangerous! I would be interested in what other compilers do in these cases, but sadly I don't have the time to investigate…

        ssam’s responses about development boards are all good. Pandaboard and beagleboard are both excellent and very reasonably priced.

        Chris

          • Maybe ICC doesn't do it in general, but was able to determine via static analysis in this case that the operands can't overlap; for instance, if the parameters were tied exclusively to static arrays and their indexing was limited to valid ranges known from the loop variable. Arrays do seem a little easier to analyze this way than raw pointers.

  5. These are just some of the reasons why C/C++ irritates me more and more these days. Working on a compiler for a language without these flaws will do that :)

  6. 1) void bar(int i){
    return i;
    }

    should be

    int bar(int i){
    return i;
    }

    so it compiles.
    2) Microsoft C/C++ compilers since (at least) Visual Studio 2005 can do Link Time Code Generation and Whole Program Optimization, which changes your flow diagram and eliminates the issues you mentioned related to functions in separate source files.

  7. Nice info …..

    Thanks ..

  8. I’m sorry I didn’t catch this thread until now. Is there a language out there that takes advantage of all of the latest multi-threaded, multicore processors with SIMD hardware?

    I have been looking at OpenCL C and C++ for general purpose, highly parallel computing. However, is this the ideal? I mean C for dealing with vast arrays?


