Big-O gives us a tool to compare algorithms’ efficiency and execution times without having to write the code and run the experiments. However, I feel that many people misuse the analysis unknowingly. Don’t get me wrong: I am a big fan of analytic models, and I understand that they provide insights that cannot be found empirically. However, in the case of Big-O, I feel that it has underlying assumptions that were true when it was formulated but have since become false. Thus, we need a revision of how Big-O is taught. This post explains why…

At its heart, Big-O assumes that “more work” means longer execution. This assumption was valid up until the early 1990s, when computers did most of their work sequentially: they executed instructions one at a time, and if a particular operation took longer (because it had to access DRAM or disk), the computer stalled and waited. Thus, the contributions to execution time from compute, memory, and I/O were additive:

Total execution time = Time in compute + Time in memory + Time in I/O

However, in the last two decades, Moore’s law has allowed us to build machines that are far more complex. A major focus of computer architects has been “latency hiding,” i.e., designing machines that minimize stalls on long-latency operations. For example, out-of-order cores overlap computation with DRAM accesses, thus reducing the impact of DRAM misses. Similarly, OSes overlap the processing of other threads with I/O, thus reducing the effective cost of I/O. In this context, quantifying the impact of a particular factor on overall performance has become next to impossible, and combining contributions using addition is incorrect. Unfortunately, I still see professors teaching Big-O and additive performance equations to CS graduates without explaining the dangers of relying on them.
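The overlap described above can be sketched in a few lines. This is a toy model, not a measurement of real hardware: both the “I/O” and the “compute” are simulated with `time.sleep` (real CPU-bound Python work would also be constrained by the GIL), but it shows why the overlapped total tracks the MAX of the two latencies rather than their sum:

```python
import threading
import time

def simulated_io():
    # Stand-in for a blocking I/O operation (e.g. a disk read).
    time.sleep(0.2)

def simulated_compute():
    # Stand-in for CPU work that does not depend on the I/O result.
    # sleep keeps the demo deterministic and GIL-independent.
    time.sleep(0.2)

start = time.perf_counter()
io_thread = threading.Thread(target=simulated_io)
io_thread.start()       # the "I/O" proceeds in the background...
simulated_compute()     # ...while we "compute" concurrently
io_thread.join()
elapsed = time.perf_counter() - start

# With overlap, elapsed is close to MAX(0.2, 0.2) = 0.2s,
# not the additive 0.2 + 0.2 = 0.4s.
print(f"elapsed: {elapsed:.2f}s")
```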

The reason I started this article: A recent analytic model I built for a multi-core chip looked something like (details removed):

Total execution time = MAX(execution time of factor 1, execution time of factor 2, …)

It correlated with my simulations surprisingly well, while the Big-O of the same algorithm was way off.

I’m somewhat confused. How does latency hiding fundamentally change algorithmic complexity? If you have one algorithm that is O((A+B)^2) and another that is O(max(A,B)^2), there’s a speedup, but we’re still talking about what is essentially an O(N^2) operation… isn’t that still the fundamental observation that needs to be made about an algorithm?

Maybe my confusion is because I’ve never delved deeply enough into asymptotic notation. I’ve never had cause to look beyond the broad strokes of “this algorithm is O(n), and this one is O(n log n), so now I know how they’ll scale with larger systems.” My undergrad-level algorithms courses were more concerned with understanding and devising algorithms than with being precise in measuring their theoretical runtime. Is there a context in which these distinctions (which seem subtle to me) make a big difference?

Hi Ian,

We agree that O((A+B)^2) is the same as O(N^2) but the point I am making is that just saying an algorithm is O(N^2) is not very useful. For example, an O(N^2) algorithm which uses an array is very likely going to be faster than an O(N) algorithm which uses a linked list. The reason is that each operation in an array is more than 100x faster than the same operation in a linked list (worst case assumptions). The way I am thinking is that Big-O is fine as-is but the way we use it needs to get more precise. Would you agree?
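The array-vs-list point can be illustrated with a toy sketch. The 100x figure comes from cache behavior on real hardware and will not reproduce in pure Python (Python objects are heap-allocated either way), so this is only a sketch of the two access patterns being compared: sequential traversal of contiguous storage versus dependent pointer chasing:

```python
from array import array

# Contiguous storage: elements sit next to each other in memory.
contiguous = array("q", range(100_000))

class Node:
    # Pointer-chasing storage: each element is reached via a reference.
    __slots__ = ("value", "next")
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

head = None
for v in reversed(range(100_000)):
    head = Node(v, head)

def sum_array(a):
    total = 0
    for v in a:                 # sequential access; on real hardware this
        total += v              # pattern is friendly to caches/prefetchers
    return total

def sum_list(node):
    total = 0
    while node is not None:     # each step is a dependent pointer load
        total += node.value
        node = node.next
    return total

assert sum_array(contiguous) == sum_list(head)
```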

Aater

I would agree, yes, though I’d also argue that the difference is a constant-factor change (albeit a massive one). The fundamental observation of asymptotic notation, at least to me, is that an algorithm can be classified based on how its runtime or memory consumption grows as the data set changes. Yeah, a 100x increase in speed is a massive improvement that practical applications need to take into account, but even that large constant factor will be obliterated with a sufficiently large dataset increase.

In other words, I think you’re making an important observation, but I think it’s a bit orthogonal to the question that asymptotic notation is trying to answer. To answer your question to Aaron Kelley, I’d be of the opinion that asymptotic notation isn’t a good way to compare the performance of similar algorithms; a better tool is empirical data, which gives a clearer picture of how performance is affected by different algorithms… especially as the latency hiding you mention changes wildly depending on the microprocessor architecture being used, or even details like cache sizes, memory speeds, and the speed of the CPU. I’m not sure that a theoretical framework can really accommodate the variety in hardware, or that it’d be anything but unwieldy if it could. O(n) notation is, in my opinion, a more general tool.

I agree that it’s a constant change. I guess what we are concluding is that O notation was never meant to correspond with performance and should not be used in that context. By the way, I will give you a funny example from real life which demonstrates how Big-O is used poorly.

I had two theoretical guys at work fight for two hours trying to find an O(n^2) solution instead of an exponential solution to a problem. What they didn’t realize was that N was so small that we could solve the actual problem faster than the time it took for them to argue.

Aater

That’s hilarious! That said, I can see where they’re coming from, because I’ve noticed a similar pattern in my own problems. For me, there’s a natural inclination to solve a problem in an “ideal” fashion (and spending three times as long considering my options) rather than hack it out, see what works, and use a “good enough” solution. Throw a bunch of theory-heavy computing scientists at the same problem and I can only imagine the problem is compounded!

So that hilarious anecdote isn’t necessarily about Big-O: it’s the longstanding tradition of making mountains out of molehills for the sake of perfectionism!

Going to have to agree with Ian. Isn’t Big-O about seeing how the running time of an algorithm will change based on the size of the input? It doesn’t tell us much about actual runtime relative to other algorithms, or even relative to itself, since so much is tossed aside in the “computing” of the Big-O value.

Example: If we have one algorithm that is O(n^2) and one that is O(n log n), the second one may actually be slower for the input we are using, but it will scale better as the size of the input increases (i.e. there exists some c where if n > c the second algorithm is faster).

If Big-O is taught properly, students will get what it is good for and what it is not, and “latency hiding” in modern hardware + OSes doesn’t really change anything here.
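The crossover point is easy to see with two textbook sorts. A hedged sketch: `insertion_sort` is O(n^2) but has tiny constant factors, which is exactly why production sorts (e.g. CPython’s Timsort) fall back to insertion sort on small runs even though merge sort wins asymptotically:

```python
import random

def insertion_sort(xs):          # O(n^2), but tiny constant factors
    xs = list(xs)
    for i in range(1, len(xs)):
        key = xs[i]
        j = i - 1
        while j >= 0 and xs[j] > key:
            xs[j + 1] = xs[j]    # shift larger elements right
            j -= 1
        xs[j + 1] = key
    return xs

def merge_sort(xs):              # O(n log n), but more bookkeeping per element
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

data = [random.randrange(1000) for _ in range(50)]
assert insertion_sort(data) == merge_sort(data) == sorted(data)
```

For inputs this small, the O(n^2) sort frequently wins on wall-clock time; only past some input size c does the asymptotic advantage take over.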

Hi Aaron,

I would agree with you if Big-O were only used for characterizing the scaling of a single algorithm, but that is not the case. For example, when comparing sorting algorithms, we conclude that quicksort is better primarily based on Big-O, because quicksort requires less work. However, one can envision hardware on which it actually takes longer.

I guess I am asking for one of two things: either do not use Big-O as a proxy for performance, or extend the analysis to make it realistic. What do you think?

Aater

Quicksort has O(n^2) worst-case performance; merge sort has O(n log n) worst-case performance. Quicksort is not used based on its Big-O notation, it’s used for its empirical performance.
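The worst-case claim can be demonstrated by counting pivot comparisons in a naive quicksort that always picks the first element (counting one comparison per remaining element, as an in-place partition would). On already-sorted input the partition is maximally unbalanced and the count is exactly n(n-1)/2:

```python
def quicksort(xs, counter):
    # Naive quicksort: first element as pivot, comparisons tallied in counter[0].
    if len(xs) <= 1:
        return list(xs)
    pivot, rest = xs[0], xs[1:]
    counter[0] += len(rest)      # one pivot comparison per remaining element
    left  = [v for v in rest if v <  pivot]
    right = [v for v in rest if v >= pivot]
    return quicksort(left, counter) + [pivot] + quicksort(right, counter)

n = 100
comparisons = [0]
quicksort(list(range(n)), comparisons)      # already-sorted input: worst case
assert comparisons[0] == n * (n - 1) // 2   # 4950 comparisons: quadratic behaviour
```

On random input the same code does roughly n log n comparisons, which is why the empirical performance, not the worst-case bound, is what justifies quicksort in practice.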

The value of O notation comes from the ability to make comparisons with other algorithms, and when hardware optimizations can make an O(n^2) algorithm take less time than an O(n) algorithm, the notation has failed to serve its (main) purpose. I think that programmers need to rely less on the Big-O metric when making algorithmic decisions and ensure that the capabilities of the underlying hardware are being fully utilized by the algorithm of choice.

I think when the big-O notation “fails” to account for speed-ups due to parallelism, it is in fact the analysis of the problem that is incorrect, insufficient, or otherwise lacking, not the big-O notation.

When considering parallel algorithms, you must be careful in your analysis to assess correctly the complexity of message passing, shared memory, and other synchronization mechanisms in order to get an accurate and useful big-O.

The analysis of parallel algorithms is not an easy task, and just saying that one O(n²) parallel algorithm is better than another O(n²) one doesn’t say much unless you compare the complexities carefully: the first may be O(n(n+1)) while the other is O(n(n+log n)); both can be considered O(n²) (one is n²+n, the other n²+n log n), but the first one is better, and the difference is likely to show in actual run time as well.
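The point about lower-order terms can be made concrete with a quick back-of-the-envelope calculation (base-2 logs are an arbitrary choice here; any base gives the same shape):

```python
import math

def cost_a(n):   # n(n + 1) = n^2 + n
    return n * (n + 1)

def cost_b(n):   # n(n + log2 n) = n^2 + n*log2(n)
    return n * (n + math.log2(n))

for n in (16, 1024, 1_048_576):
    ratio = cost_b(n) / cost_a(n)
    # Both are Theta(n^2): the ratio tends to 1 as n grows...
    print(n, cost_a(n), cost_b(n), round(ratio, 6))
# ...but cost_a(n) < cost_b(n) for every n > 2, so the lower-order
# term can still show up in measured run time.
```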

Steven,

Thanks for an insightful comment. I agree. I would extend your argument beyond parallel algorithms and say that parallelism exists in all systems today (even single-threaded ones): the commonly used out-of-order cores expose instruction-level parallelism, memory scheduling exposes DRAM bank-level parallelism, RAID (replication) exposes disk-level parallelism, hardware prefetching increases memory-level parallelism, and so on. Thus, Big-O should always be used very carefully today, in ALL cases. What do you think?

Aater

Yes. Big-O is to be used carefully for parallelism. Shared memory, locks, pipes, messages, all have a cost, not necessarily constant, and the most efficient algorithms will be those which take these costs into account in exactly the right way, and maybe in a counter-intuitive way. When I say counter-intuitive, I mean, for example, that in some cases it is faster to recompute stuff than to fetch it from wherever it is stored; which certainly goes against what we would usually do, that is, use storage rather than computation to speed things up.

Another example, not exactly related to parallelism, is memory-hierarchy-friendly algorithms: algorithms that explicitly exploit spatial and temporal locality for (sometimes major) speed-ups. The model they use ceases to assume memory access is O(1) (for a “very small value of 1”), replacing it with memory accesses of different costs, not always constant, depending on which level you read from or write to.
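A classic instance of such memory-hierarchy-friendly restructuring is loop blocking (tiling). Pure Python will not show the cache benefit, so this is just a sketch of the structure as it would be written in a language with real 2-D arrays; the `block` size of 4 is an arbitrary stand-in for a cache-line-derived tile size:

```python
def transpose_naive(m):
    # Column-major reads: strided access that thrashes caches on real arrays.
    n = len(m)
    return [[m[j][i] for j in range(n)] for i in range(n)]

def transpose_blocked(m, block=4):
    # Loop blocking: work on block x block tiles so that, with real 2-D
    # arrays, both the reads and the writes stay within a few cache
    # lines at a time.
    n = len(m)
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for i in range(ii, min(ii + block, n)):
                for j in range(jj, min(jj + block, n)):
                    out[j][i] = m[i][j]
    return out

mat = [[i * 8 + j for j in range(8)] for i in range(8)]
assert transpose_blocked(mat) == transpose_naive(mat)
```

Both versions do exactly n² element moves, so they are indistinguishable to Big-O; the blocked one can still be several times faster on large matrices in C because of the memory hierarchy.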

It’s always very stimulating to figure these things out :p

Steven,

I agree. You hit the nail on the head with the compute-vs-storage example. I guess we agree that Big-O shouldn’t be used as a predictor of performance and hence can’t be used as the basis of design decisions.

Aater

Hi Austin,

Thanks for reading. You summarized my post better than I did. :-)

“Big-O should not be used (or used very cautiously) when making algorithmic decisions!”

Aater

Aater: Yes I guess a simpler comment would have just been “I concur.”

Aater, you said “For example, an O(N^2) algorithm which uses an array is very likely going to be faster than an O(N) algorithm which uses a linked list.” Er, that indicates you don’t understand Big-O at a fairly fundamental level. Direct element access is O(1) on an array and O(n) on a list. That has to factor into the algorithmic complexity, so your problem is that, by not understanding the Big-O of the algorithm in full, you are drawing a bogus conclusion: that Big-O is flawed.

As Steven points out, the problem is in the analysis, not the Big-O notation. Yes, parallelism makes that analysis harder, but it doesn’t invalidate the basic concept.

The mathematics of “big O” are not the problem. The problem is that all the pedagogical material on it assumes that what you are counting (the left-hand side of an O “equation”) is visible on the face of the program. It is perfectly valid and reasonable to say that some algorithm requires O(n log n) cache-line replacements, but you can’t see a cache-line replacement in the text of the program — nor in the assembly listing.

To complement what you have said, it may be helpful to read:

http://en.wikipedia.org/wiki/Abuse_of_notation#Big_O_notation