Jun 29, 2011
 
Discussion: Aren’t loops the only “interesting” code?

I got into a debate with a computer science professor a few months ago when I made the controversial blanket statement that “the code inside loop bodies is the only code that matters for performance.” Some context: I was discussing how multi-threading is about speeding up loops, and I don’t care about straight-line code that is only going to execute once (or just a few times). My argument is that programmers do not write billions of lines of straight-line code; it is the repetition of code (via loops or recursion) that makes code “slow.” In fact, I can argue that any time we wait on a computer program to do something useful, we are in fact waiting on a loop (e.g., grep, loading emails, spell checking, photo editing, database transactions, HTML rendering, you name it). It is a rather silly argument, but I would like to see some counterarguments/examples. Question: Is parallel programming all about loops/recursion, or are there cases where code that executes only once is worth optimizing?

Please note that if code that executes once makes a function call that contains a loop, then it counts as a loop, not straight-line code. Comments would be great, but at least take the time to cast your vote below.

Are loops the only code that matters for performance?


  29 Responses to “Discussion: Aren’t loops the only “interesting” code?”

  1. I agree if you include heavy recursion in your definition of loops.

    • I think of recursion and loops as the same since we can always convert recursion to loops.

      Side note: Finally, someone agrees with me on something! Adam, you get the futurechips badge of “awesomeness” for this comment:-)

  2. Some alternate conjectures:

    From a performance perspective, the only interesting code is code that does nonsequential memory reads, because cache misses are one to two orders of magnitude more expensive than most other events in a modern processor.

    From a performance perspective, the only interesting code is code that does I/O, because disk response times and network round-trip times are many orders of magnitude more expensive than any CPU or memory operation.

    • Very first order comment indeed. Both cache misses and I/O dominate everything else.

      However, I can also argue that cache misses or I/O become interesting only if they are in a loop.

      For example,

      foo = 3;

      isn’t very interesting: it is a single cache miss, and no programmer will optimize that statement. However,

      for (i = 0; i < 1000000000; i++)
          bar[i] = 3;

      is interesting, as there are millions of cache misses involved. People will work on this one.

      I may be misinterpreting your comment. What do you think?

      • Here’s an example where I/O outside a loop has a bigger impact on performance than computation inside a loop:

        for (i = 0; i < 1000000; i++) {
            foo += foo;
        }
        write(socket, message_header, header_len);
        write(socket, message_body, body_len);

        That loop is an obvious candidate for optimization. However, based on a quick test I just ran, it takes slightly under two clock cycles per iteration on a Nehalem-vintage x86_64 processor. That works out to about 0.7 msec at 3GHz. (Tangential observation: branch predictors have gotten really good.) In contrast, the two lines that follow it may, depending on the state of the connection and whether Nagle's algorithm is enabled, incur a network round-trip delay for an ack between the header and the body of the message. If the network RTT is 10ms, the fact that the write is split into two pieces may have an order-of-magnitude bigger impact than the loop in determining how fast the recipient gets the message.
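
        (A rough sketch of such a microbenchmark, assuming POSIX clock_gettime; this is not the exact harness, and the numbers will vary with CPU and compiler flags. The volatile keeps the compiler from deleting the loop.)

        #include <stdio.h>
        #include <time.h>

        int main(void)
        {
            volatile unsigned long foo = 1;  /* volatile: keep the loop alive */
            struct timespec t0, t1;

            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < 1000000; i++)
                foo += foo;
            clock_gettime(CLOCK_MONOTONIC, &t1);

            printf("loop: %.3f ms\n",
                   (t1.tv_sec - t0.tv_sec) * 1e3 +
                   (t1.tv_nsec - t0.tv_nsec) / 1e6);
            return 0;
        }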

        In the case of cache misses, I agree: the ones that matter the most tend to be those in loops, since cache misses are only 5x to 100x slower than a typical instruction.

        It's also worth noting that there are applications where an operation is executed a lot, and thus needs performance attention, without being in a loop. Consider a server that processes requests based on interrupts: even if the request processing itself isn't explicitly in a loop from the perspective of the code, it may be the most frequently executed code path in the application. In this programming model, the interrupt-driven code has some of the same properties as a loop (a high number of executions of the same code path) but also differs in some ways (worse memory access locality).

        • Hey Brian.

          Thanks for taking the time to run the study. Very powerful.

          I generally think of a server as a loop because, at a high level, it runs the same code over and over again for every transaction. It is in fact a loop, just with a wait_for_work somewhere in it. But I do not disagree with you.

          Thanks again.

          • I believe most modern OSes try to reduce the overhead you described by buffering the data before sending it. If the header didn’t fill the buffer (which is unlikely, since headers are relatively small), the OS will wait a set period of time for more data before sending the packet.
            Also, not all network sends are blocking in the first place. I agree that the time until the packet actually arrives is orders of magnitude longer, but that is mostly due to physical limitations and mostly won’t have to be handled by the application.
            So it can (and should) be avoided by higher-level infrastructure (namely, the OS you’re running under).

          • Nirlzr,

            I agree with your statement, but it applies to writes. I think Brian’s insight includes reads as well, which cannot leverage buffering: when data is needed, it is needed.

            Let’s see what Brian has to say …

          • Even with writes, it’s nontrivial for the OS to buffer and gather efficiently. In my simple example,

            write(socket, message_header, header_len);
            write(socket, message_body, body_len);

            there’s no way for the OS to know that the first write will be followed immediately by another write. It could try to deal with this case by delaying the output to the network by a very short time to see if there’s another write on the same connection, but that adds latency and complexity.

            The solutions I’ve seen for this use case are things like Linux’s TCP_CORK socket option: a way for the application to give the OS some hints about its I/O usage pattern. Now that I think about it, I suppose the OS could attempt to gather statistics about each process’s reads and writes to do prediction in cases like this – similar to how a CPU predicts branches and memory fetches.
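
            For illustration, a minimal sketch of the TCP_CORK pattern (Linux-specific; error handling omitted): cork the socket, issue both writes, then uncork so the kernel coalesces the header and body into as few packets as possible.

            #include <netinet/in.h>
            #include <netinet/tcp.h>
            #include <sys/socket.h>
            #include <unistd.h>

            static void send_message(int sock,
                                     const void *hdr, size_t hdr_len,
                                     const void *body, size_t body_len)
            {
                int on = 1, off = 0;

                /* Cork: hold partial frames in the kernel for now. */
                setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
                write(sock, hdr, hdr_len);
                write(sock, body, body_len);
                /* Uncork: flush the coalesced data immediately. */
                setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
            }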

  3. I believe that if you examine the problem in terms of Black Box Theory, then loops are not the only “interesting” code.

    You suggest that we should treat a single line of code that contains a loop as a loop, and not a single line. I disagree. In many cases, we do not know how a library function is processed. There are modern ways to go about inspecting that function, but in light of Black Box, one must ignore what lies inside; that is one of the main premises behind Black Box theory. We must only consider what goes in and what comes out.

    Loops that may or may not occur inside a given function are irrelevant; therefore, it is very possible to encounter a low performing function that is best placed on a separate thread.

    • I see your point a lot better now. Twitter doesn’t replace comments, I guess :-) It is a very interesting way of looking at it (assuming I got it :-)

      So is this a fair summary of your argument:

      1. We should not make assumptions about library functions: they may or may not have loops. Thus, code containing function calls is impossible to classify as a loop or otherwise.

      2. If there exists a long loop in a function, it can use threading.

      ..?

      • I think the first part of your summary of my argument is fair to say; I’m not so sure about the second, lol.

        I was mostly just arguing against your final point that “if the code executing once has a function call which has a loop in it then that counts as a loop, not straight line code.” Although it is very likely that a function’s latency is due to a loop, I feel that one cannot assume that; again, I’m basing this on the idea of the Black Box.

  4. Latency matters a lot for user-perceived performance, and that often happens outside of loops:
    network round-trips, sequential vs. parallel acquisition of data from web sources or databases, cache misses, first-time loading of a huge library or data file.

    For sustained performance as in ‘transactions per second that we can support’, sure, think of loops. For user-facing apps it’s a whole different world.

  5. Waiting on external events also makes code slow. For instance, blocking the user interface while waiting on network I/O is slow; putting the network I/O on a background thread would fix the performance problem, as in the sketch below.
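
    A minimal pthreads sketch of that pattern (fetch_messages and start_background_fetch are hypothetical placeholder names; error handling omitted):

    #include <pthread.h>

    extern void fetch_messages(void *request);   /* hypothetical blocking network I/O */

    static void *io_worker(void *arg)
    {
        fetch_messages(arg);      /* the blocking call runs off the UI thread */
        return NULL;
    }

    void start_background_fetch(void *request)
    {
        pthread_t worker;
        pthread_create(&worker, NULL, io_worker, request);
        pthread_detach(worker);   /* the UI event loop keeps running meanwhile */
    }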

  6. I think that it’s true, but unless you’re programming close to the metal then essentially all the code you’ll ever write is either being executed within a loop (probably some sort of message dispatcher), calling code which contains loops (library routines), or both. In which case the observation reduces to, “The only interesting code is all of the code.”
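
    A minimal sketch of the kind of message-dispatcher loop this refers to (next_message and handle are hypothetical placeholders): even handler code that looks like straight-line logic executes inside this outer loop, once per message.

    #include <stdbool.h>

    extern bool next_message(int *msg);   /* hypothetical: blocks until a message arrives */
    extern void handle(int msg);          /* hypothetical "straight-line" handler */

    void dispatcher(void)
    {
        int msg;
        while (next_message(&msg))        /* the outer loop everything runs under */
            handle(msg);
    }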

  7. I find it interesting that for performance,

    code is only important if it is heavily re-used (i.e., loops, recursion);
    data is only important if it is not re-used.

    grep/email/spellcheck aren’t slow because of the loop; they are slow because the data fetched from disk (or wherever) is not re-used. Data access time dominates.

    In the end, program execution is slow because of the slow stuff. The trick is to identify what is slow in this particular case & figure out how to speed it up without slowing anything else down.

    Loops are an attractive starting point because any improvement is multiplied by the execution count. But sometimes the biggest improvements can come from straight-line code if it contains expensive operations.

    Dumb example: perhaps your big looped operation is preceded by a network-intensive permission-getting operation (“Am I allowed to update this object?”). Perhaps it is better to ignore the loop internals and instead perform the asking for permission in parallel, letting it kill the update thread if permission is refused; see the sketch below.
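
    A hedged sketch of that idea (ask_permission and do_update are hypothetical placeholders; do_update must hit cancellation points for pthread_cancel to take effect, and partial changes would need rollback):

    #include <pthread.h>
    #include <stdbool.h>

    extern bool ask_permission(void);    /* hypothetical network round-trip */
    extern void *do_update(void *obj);   /* hypothetical big looped operation */

    void update_object(void *obj)
    {
        pthread_t worker;
        pthread_create(&worker, NULL, do_update, obj);   /* start speculatively */

        if (!ask_permission()) {
            pthread_cancel(worker);      /* refused: abandon the speculative work */
            pthread_join(worker, NULL);
            return;
        }
        pthread_join(worker, NULL);      /* granted: keep the result */
    }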

  8. It looks like your definition of loop is too fuzzy. Any line of code can potentially contain an inner loop (via a function call) or be part of an outer loop.

  9. Parallel programming is all about control flow and data dependencies. Multithreaded programming is all about lowering the I/O latency by overlapping communication and computation.

    Not everything is a loop, and not all loops are worth parallelizing.

    • Yiannis,

      Perhaps it is just a matter of terminology but I am unable to see the distinction that you are drawing between parallel programming and multithreaded programming.

      Do you mean to say that multithreading is about having multiple threads on a single core while parallel programming is when there is actual concurrent execution?

      It sounds interesting and I have never thought of it this way, so I am really looking forward to hearing a clarification from you.

      Thanks.

      • I always try to make a distinction; I do not know whether it is formalized or not, though.

        In multithreaded programming, one has multiple instruction streams going on concurrently (i.e., independent of each other). You can have one or more threads per core. Thread migration, preemption, and context switches are not that relevant.

        Parallel programming is having multiple instruction streams that work synergistically towards solving a problem. It can be realized using multithreaded programming, but this is not necessary (MPI, PVM, etc.). You tend to avoid thread migration and preemption.

        At the end of the day, you are always trying to do two things: 1) increase processing power to pump more data in a shorter amount of time and 2) decrease I/O latency to avoid putting a limiter to your data pump (see 1).

        Thus, I always see multithreaded programming as a means to decrease I/O latency (multiple threads that handle files, sockets, GUI, RDMA, and whatever else means high latency for getting data) rather than as a specific aspect of parallel programming.

        In hardware terms, see it as SMT vs. multi-core: both are capable of multiple simultaneous hardware instruction streams, but the fine line between them makes one more appealing for I/O latency hiding and the other for parallel programming.

  10. I think code inside the loop is a very broad definition. If something is not in a loop then the computation time for it is going to be in the seconds at most anyway, and that makes it something not worth optimizing for. However, when you say loops are the only thing that matters, it brings up an image of optimizing the innermost loop to be as fast as possible, the rest of the code be damned. But there are classes of code that are not bound by the execution time of the innermost loop: a large outer loop runs and branches off in all kinds of weird directions, doing all kinds of things depending on the incoming data, instead of doing the same thing many times. There is definitely code that has many small function calls all over the place and, instead of being bound by executing the same operation again and again, is bound by the number of i-cache misses and the TLB size. There are definitely consumer applications that behave this way, but I’m not sure how well the benchmarks express that. I think gcc has a very high i-cache miss rate.

    • Ali,

      Welcome back and thanks for another great comment.

      I meant outer loops but didn’t make it clear, so that’s my bad. You are right about the inner loop; I kind of count the inner loop as a part of the outer loop, so speeding up the inner loop helps both of them.

      You are right about the microarchitectural effects, but I would also argue that most microarchitectural effects become important because they occur in hot kernels (i.e., loops). If a piece of code runs only once, an instruction TLB miss for that code is less of a problem; if the TLB miss occurs in a loop and continues to occur, it becomes a problem. As I have learned in the last three days, I/O is a great counterexample, and perhaps you are also hinting in that direction.

      A minor side note on “If something is not in a loop then the computation time for it is going to be in the seconds at most anyway, and that makes it something not worth optimizing for.” I think you meant to say nanoseconds and not seconds. Seconds is a very long time and I would optimize anything that takes even milliseconds.

      Thanks.

  12. I largely agree. When profiling applications, hotspots are always code that is executed millions of times (the only exception being I/O, as noted above). Hence by definition it has to be a loop.

    Fortunately, we’ll soon get mainstream CPUs with gather instruction support. This will make it possible to parallelize a lot more loops with SIMD. It’s particularly interesting for auto-vectorizing compilers.

    AVX2 can run up to 8 loop iterations in parallel, so this could become nothing short of revolutionary.
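
    For a sense of what that looks like, a sketch using the AVX2 gather intrinsic Intel has documented (gather8 is a hypothetical function name; requires AVX2 hardware and compiling with -mavx2, and n is assumed to be a multiple of 8 to keep it short):

    #include <immintrin.h>

    /* out[i] = table[idx[i]]: eight indexed loads per gather instruction */
    void gather8(float *out, const float *table, const int *idx, int n)
    {
        for (int i = 0; i < n; i += 8) {
            __m256i vidx = _mm256_loadu_si256((const __m256i *)&idx[i]);
            __m256  vals = _mm256_i32gather_ps(table, vidx, 4);  /* scale = 4-byte floats */
            _mm256_storeu_ps(&out[i], vals);
        }
    }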

  13. I mostly agree in this case. My email program takes the longest to compute the summary view (which can contain thousands of emails), so that is the part that I parallelized (it was also the easiest to parallelize, since I was recursing over an incredibly large list). After I optimized it, the program became a lot faster overall. Additionally, anything that keeps the user waiting UI-wise makes the program seem extremely slow (e.g., a program that does periodic updates to the screen will seem faster than one that takes the same amount of time but only updates the screen when it finishes).
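
    A minimal sketch of that kind of summary-view parallelization (summarize is a hypothetical placeholder for the per-email work; the real program recursed over a list, but chunking an array keeps the sketch short): split the items among a few threads and combine the partial sums.

    #include <pthread.h>

    #define NTHREADS 4

    extern long summarize(int item);     /* hypothetical per-email work */

    struct chunk { const int *items; int lo, hi; long sum; };

    static void *worker(void *arg)
    {
        struct chunk *c = arg;
        for (int i = c->lo; i < c->hi; i++)   /* each thread handles a slice */
            c->sum += summarize(c->items[i]);
        return NULL;
    }

    long parallel_summary(const int *items, int n)
    {
        pthread_t tid[NTHREADS];
        struct chunk ch[NTHREADS];
        long total = 0;

        for (int t = 0; t < NTHREADS; t++) {
            ch[t] = (struct chunk){ items, t * n / NTHREADS,
                                    (t + 1) * n / NTHREADS, 0 };
            pthread_create(&tid[t], NULL, worker, &ch[t]);
        }
        for (int t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);       /* combine the partial sums */
            total += ch[t].sum;
        }
        return total;
    }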

