May 242011
 

I had a discussion with a colleague an hour ago about how much slower is python compared to C. His guess was 10x and I was expecting it to be less than 10x. It turns out that python is more than 10x slower. Before doing my own comparison, I first googled for a comparison but most of the existing comparisons had I/O operations which makes the comparison hazy because I/O takes the same amount of execution time in both languages, which indirectly favors the slower language. I decided I want my analysis to be solely compute-bound. I coded my favorite computer-bound kernel, a histogram, in both C and python and tested on an Intel Core2Quad system with 4GB of memory. The code and results follow.

The C-code

There was a typo in C-code. thanks to user xman for pointing it out. It has now been fixed.

#include <stdio.h>                                                              

int histogram[128];
int main(){
  for(int i = 0; i < 1000000000; i++){
    char read_character = i%128;
    histogram[read_character] += 1;
  }
  printf("%d\n", histogram[0]);
}

The above program, compiled using ICC with -o3 optimization, took 2.24 seconds.

Python code

#!/bin/python                                                                   

histogram = [0]*128                                                             

for i in range(0,100000000):
    read_character = i%128
    histogram[read_character] += 1                                              

print histogram[0]

The above code produces the same result as the C-code but 52.68 seconds (wall clock) on my machine with Python 2.4. That is a slow down of 23x.

Analysis

Python is indeed very slow. I then set forth to understand why python is so much slower. First, I decided to see what instructions were actually running. I used objdump as follows to look at the assembly code for C:

objdump -d a.out

The c-code had compiled into the following:

 80484ca:       99                      cltd
 80484cb:       8b c8                   mov    %eax,%ecx
 80484cd:       83 c0 01                add    $0x1,%eax
 80484d0:       33 ca                   xor    %edx,%ecx
 80484d2:       2b ca                   sub    %edx,%ecx
 80484d4:       83 e1 7f                and    $0x7f,%ecx
 80484d7:       33 ca                   xor    %edx,%ecx
 80484d9:       2b ca                   sub    %edx,%ecx
 80484db:       0f b6 c9                movzbl %cl,%ecx
 80484de:       83 04 8d e0 9d 04 08    addl   $0x1,0x8049de0(,%ecx,4)
 80484e5:       01
 80484e6:       3d 00 ca 9a 3b          cmp    $0x3b9aca00,%eax
 80484eb:       7c dd                   jl     80484ca 

Frankly, I do not understand the point behind the following two instructions (yet) but I will explore that some other day.

 80484d0:       33 ca                   xor    %edx,%ecx
 80484d2:       2b ca                   sub    %edx,%ecx

These were the only instructions running per iteration.

To check python, I attach gdb to my python code during its execution and used:

x/50ni $pc

… to print an instruction trace.

I found that python was executing thousands of instructions per iteration. It was continuously interpreting the loop every iteration which was wasteful work. Another observation was that python code had a larger code footprint which code lead to thrashing of the I-cache, further reducing performance. After looking at the difference among the executing instructions, I was actually surprised that the slow down is only 23x.

Conclusion

I don’t draw any general conclusions about whether C or Python is faster. It just shows that on similar compute-intensive code, Python is substantially slower. There are several caveats in this study which you should be aware of:

1. There are better implementations of the histogram in python which may reduce its performance.

2. To be fair, there are several optimizations in C that I have not used, e.g., loop-unrolling, better instruction scheduling etc so C performance also be improved.

3. The code does not have substantial I/O or memory operations. This can make python look worse than it will be if it was compared on a more realistic app with some I/O and memory operations.

I have written this post just hoping that it will save someone else time and that others can help explain or extend the analysis.

  37 Responses to “Quick post: Python vs C in compute-bound workloads”

  1. @FutureChips There are some very inventive attempts to speed up Python code to compiled speeds. Psyco is one: http://psyco.sourceforge.net/

  2. @njrabit thanks for the reference. Will definitely check it out.

  3. Thanks for your post. Out of curiosity, i tried it out with 2.6.1 and the run time was 42 seconds (my machine’s config might be slightly better than yours) but i do confirm its really slow.

    • Raymond, thanks for double checking this. It adds more confidence to my results. Just out of curiosity, what processor do you have?

      • I have a Intel Core i5 running at 2.54 Ghz. Hope that helps.

        • Ah! the frequency difference between our computers can define part of the performance difference in pyton. Another secondary effect can the cache and/or the fact that the i3-i7 chips have a better indirect jump predictor. Indirect jumps are common in interpretive code thus a better predictor can make interpretive languages go faster.

  4. Interesting writeup. Interpreted languages would always be 10x-20x worse unless they’re pre-compiled to binary.

    Nowadays there’s pretty good op-code caches or precompilers for most interpreted languages – including php / c# / even Javascript to a degree within chrome V8.
    .

  5. The C program runs 1000M loops while the python runs 100M loops. The C version is much faster if you make it running only 100M loops.

  6. Didn’t know that Ruby can be quite fast too compared to python.

    For 100M loops -
    GCC: 0.057s
    Python 2.7: 28.9s
    Ruby 1.9: 15.6s

    • Very interesting. We still need perl, php, and javascript to complete the loop.

      The following perl code took 54.23 seconds as well.

      #!/bin/perl
      
      my @histogram; 
      
      for(my $i = 0; $i < 100000000; $i++){
        my $read_character = $i%128;
        $histogram[$read_character] += 1;
      }
      
      print $histogram[0];
      
      • In fairness to Perl: $histogram[$_ % 128] += 1 for (0 .. 100000000); is about twice as fast.

        On my system (i5) I get: 33-35 secs for your perl, about 16 secs for mine, and about .4 secs for c. So, at best still 30X slower, which is pretty appalling.

        I similarly optimized the python code to get rid of the unnecessary variable creation/destruction in each loop which resulted in about 22 secs runtime.

        As you point out, the actual number of instructions in assembly for this is tiny.

        In a language that does that does not force the programmer to distinguish between different numeric and string data types there are all of these other instructions in every iteration to check what type the variable actually is, convert it internally into the correct type to perform the requested operation, etc, which is a big mess performance wise.

        I had never heard of cython before, but the example below demonstrates the situation quite clearly.

        • Ersun, thats a simple perl trick that I have added to my toolbox. I am surprised it made such a big difference.

          You are absolutely right about the number of instructions. cpython has me excited too!

  7. Try Cython.
    If you need to glue a lot of different code together I can recommend Sage math:
    http://www.sagemath.org/
    Cython is very easy to use from the Sage Notebook, some examples:
    http://sagenb.org/home/pub/2046/
    Some examples I ran on a imac (2010 model) sage 4.6 (python 2.6.4):
    def f(N):
    histogram = [0]*128
    for i in xrange(0,N):
    read_character = i%128
    histogram[read_character] += 1
    return histogram
    time r=f(100000000)
    Time: CPU 58.21 s, Wall: 58.33 s

    %cython
    def f1(int N):
    histogram = [0]*128
    cdef int i, read_character
    for i in xrange(0,N):
    read_character = i%128
    histogram[read_character] += 1
    return histogram

    time r=f1(100000000)
    Time: CPU 2.91 s, Wall: 2.91 s

    As you see in this example we get approximately the same speedup with just a small adjustment (cython syntax) of the python code. You could load your c-code as a cdef extern function too.

  8. You should use xrange instead range, it will took a lot ram and range is more slower in this case

    phenon x4 955 overclocked at 4,0ghz linux 64bits

    C with no optimizations: 5.504s

    python with xrange: 17.142s

    there is some good linux distro that uses -O3? NO, so you should compare it without this insane optimizations.

    in my machine, this code in C is 3,4x faster than python, not 23x.

    Also, you should consider that you are using an old python version, are you using a old gcc version too?

    So my conclusion is: “this comparison is very wrong”.

  9. Wow, don’t use “range(0,100000000)” – It will create a list with 100000000 elements (!!!) before the loop. Use xrange instead.

    In my core2duo I had

    - 7.8 seconds in C program
    - 37 seconds in Python program with xrange
    - 234 seconds(!!!) in Python program with range – It freeze my notebook

    • Thanks for sharing this analysis. xrange it is! I will update my post whenever I get a chance and creidt both you and renatopp:)

      Core2Duo seems slow .. btw, which C compiler and what optimization level?

  10. #python version
    histogram = [0]*128

    for i in xrange(0,100000000):
    histogram[i%128] += 1
    print histogram[0]

    time: 0m14.335s

    #pypy
    ./pypy aaa.py

    time: 0m2.432s

    /*C version*/
    gcc -std=c99 -O2 -o aaa aaa.c

    time: 0m0.515s


    • histogram = [0]*128

      for i in xrange(0,100000000/128):
      for i in xrange(0,128):
      histogram[i] +=1
      print histogram[0]

      python time: 0m10.963s
      pypy time: 0m10.963s

      • ops pypy time is 0m1.328s

        the “" does not gives syntax highlight, please fix it :D
        histogram = [0]*128

        for i in xrange(0,100000000/128):
        ....for i in xrange(0,128):
        .........histogram[i] +=1
        print histogram[0]

        • Hey mate, sorry about that. I will look into it tonight:)

        • ohh sorry, please remove my last reply comments, its very boring that I can’t remove my comments, I mean “the html tag code doesn’t show the code identation”

          for i in xrange(0,100000000/128):
          ….for i in xrange(0,128):
          ………histogram[i] +=1
          print histogram[0]

  11. I think the xor trick you see might be because i%128 needs to consider what happens if i is negative. If i compile with “gcc -O0″ with i signed i get:

    movl -12(%ebp), %edx
    movl %edx, %eax
    sarl $31, %eax
    movl %eax, %ecx
    shrl $25, %ecx
    leal (%edx,%ecx), %eax
    andl $127, %eax
    subl %ecx, %eax
    movb %al, -5(%ebp)
    movsbl -5(%ebp),%edx
    movsbl -5(%ebp),%eax
    movl _histogram(,%eax,4), %eax
    addl $1, %eax
    movl %eax, _histogram(,%edx,4)
    addl $1, -12(%ebp)

    but unsigned i get:

    movl -12(%ebp), %eax
    andl $127, %eax
    movb %al, -5(%ebp)
    movsbl -5(%ebp),%edx
    movsbl -5(%ebp),%eax
    movl _histogram(,%eax,4), %eax
    addl $1, %eax
    movl %eax, _histogram(,%edx,4)
    addl $1, -12(%ebp)

    I had to use -O0 as on -O3 gcc must be able to spot that i is really always positive and so remove all the unnecessary additional instructions. As you are using icc it could be that they haven’t got this optimisation

  12. Ah, just took a look at this. Now I see “Cython”, when you told me “CPython”–totally different things :)

    One important note with this is that Python cannot optimize the modulo to an and. Python’s flexibility makes it almost impossible to do any sort of meaningful optimization on a program. Basically every iteration of the loop the interpreter has to look up “i” in the locals dictionary, look up “__mod__” in its methods, call that method with an int object (128) as an argument, etc. It’s just really messy, and I would agree that it’s amazing how fast Python is despite all this.

  13. Today I decided to start learn Go and as a exercise I type your histogram example to check its performance in loops.

    Here you go the code:


    package main

    import "fmt"

    func main() {
    var histogram [128]int
    for i := 0; i < 100000000; i ++ {
    var mod int = i%128
    histogram[mod] += 1
    }
    fmt.Printf("%d\n", histogram[0])
    }

    here you go the results:

    time ./6.out
    781250

    real 0m0.407s
    user 0m0.403s
    sys 0m0.003s

    Pleasant surprise!

  14. [...] faster than python. ~3 seconds to do things that can be done in python in ~50 seconds. Here – Quick post: Python vs C in compute-bound workloads | Future Chips Anyways I have a life – bye bye!!! __________________ Sweet mother of all that is good & [...]

 Leave a Reply

(required)

(required)

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>