I had a discussion with a colleague an hour ago about how much slower is python compared to C. His guess was 10x and I was expecting it to be less than 10x. It turns out that python is more than 10x slower. Before doing my own comparison, I first googled for a comparison but most of the existing comparisons had I/O operations which makes the comparison hazy because I/O takes the same amount of execution time in both languages, which indirectly favors the slower language. I decided I want my analysis to be solely compute-bound. I coded my favorite computer-bound kernel, a histogram, in both C and python and tested on an Intel Core2Quad system with 4GB of memory. The code and results follow.
The C-code
There was a typo in C-code. thanks to user xman for pointing it out. It has now been fixed.
#include<stdio.h> int histogram[128]; int main(){ for(int i = 0; i < 1000000000; i++){ char read_character = i%128; histogram[read_character] += 1; } printf("%d\n", histogram[0]); }
The above program, compiled using ICC with -o3 optimization, took 2.24 seconds.
Python code
#!/bin/python
histogram = [0]*128
for i in range(0,100000000):
read_character = i%128
histogram[read_character] += 1
print histogram[0]
The above code produces the same result as the C-code but 52.68 seconds (wall clock) on my machine with Python 2.4. That is a slow down of 23x.
Analysis
Python is indeed very slow. I then set forth to understand why python is so much slower. First, I decided to see what instructions were actually running. I used objdump as follows to look at the assembly code for C:
objdump -d a.out
The c-code had compiled into the following:
80484ca: 99 cltd 80484cb: 8b c8 mov %eax,%ecx 80484cd: 83 c0 01 add $0x1,%eax 80484d0: 33 ca xor %edx,%ecx 80484d2: 2b ca sub %edx,%ecx 80484d4: 83 e1 7f and $0x7f,%ecx 80484d7: 33 ca xor %edx,%ecx 80484d9: 2b ca sub %edx,%ecx 80484db: 0f b6 c9 movzbl %cl,%ecx 80484de: 83 04 8d e0 9d 04 08 addl $0x1,0x8049de0(,%ecx,4) 80484e5: 01 80484e6: 3d 00 ca 9a 3b cmp $0x3b9aca00,%eax 80484eb: 7c dd jl 80484ca
Frankly, I do not understand the point behind the following two instructions (yet) but I will explore that some other day.
80484d0: 33 ca xor %edx,%ecx 80484d2: 2b ca sub %edx,%ecx
These were the only instructions running per iteration.
To check python, I attach gdb to my python code during its execution and used:
x/50ni $pc
… to print an instruction trace.
I found that python was executing thousands of instructions per iteration. It was continuously interpreting the loop every iteration which was wasteful work. Another observation was that python code had a larger code footprint which code lead to thrashing of the I-cache, further reducing performance. After looking at the difference among the executing instructions, I was actually surprised that the slow down is only 23x.
Conclusion
I don’t draw any general conclusions about whether C or Python is faster. It just shows that on similar compute-intensive code, Python is substantially slower. There are several caveats in this study which you should be aware of:
1. There are better implementations of the histogram in python which may reduce its performance.
2. To be fair, there are several optimizations in C that I have not used, e.g., loop-unrolling, better instruction scheduling etc so C performance also be improved.
3. The code does not have substantial I/O or memory operations. This can make python look worse than it will be if it was compared on a more realistic app with some I/O and memory operations.
I have written this post just hoping that it will save someone else time and that others can help explain or extend the analysis.
@FutureChips There are some very inventive attempts to speed up Python code to compiled speeds. Psyco is one: http://psyco.sourceforge.net/
@njrabit thanks for the reference. Will definitely check it out.
Thanks for your post. Out of curiosity, i tried it out with 2.6.1 and the run time was 42 seconds (my machine’s config might be slightly better than yours) but i do confirm its really slow.
Raymond, thanks for double checking this. It adds more confidence to my results. Just out of curiosity, what processor do you have?
I have a Intel Core i5 running at 2.54 Ghz. Hope that helps.
Ah! the frequency difference between our computers can define part of the performance difference in pyton. Another secondary effect can the cache and/or the fact that the i3-i7 chips have a better indirect jump predictor. Indirect jumps are common in interpretive code thus a better predictor can make interpretive languages go faster.
Interesting writeup. Interpreted languages would always be 10x-20x worse unless they’re pre-compiled to binary.
Nowadays there’s pretty good op-code caches or precompilers for most interpreted languages – including php / c# / even Javascript to a degree within chrome V8.
.
There are efforts in speeding up python as well. Perhaps python version 3 has already improved.
The C program runs 1000M loops while the python runs 100M loops. The C version is much faster if you make it running only 100M loops.
Thanks for pointing it out. I have fixed it.
Didn’t know that Ruby can be quite fast too compared to python.
For 100M loops -
GCC: 0.057s
Python 2.7: 28.9s
Ruby 1.9: 15.6s
Very interesting. We still need perl, php, and javascript to complete the loop.
The following perl code took 54.23 seconds as well.
#!/bin/perl my @histogram; for(my $i = 0; $i < 100000000; $i++){ my $read_character = $i%128; $histogram[$read_character] += 1; } print $histogram[0];In fairness to Perl: $histogram[$_ % 128] += 1 for (0 .. 100000000); is about twice as fast.
On my system (i5) I get: 33-35 secs for your perl, about 16 secs for mine, and about .4 secs for c. So, at best still 30X slower, which is pretty appalling.
I similarly optimized the python code to get rid of the unnecessary variable creation/destruction in each loop which resulted in about 22 secs runtime.
As you point out, the actual number of instructions in assembly for this is tiny.
In a language that does that does not force the programmer to distinguish between different numeric and string data types there are all of these other instructions in every iteration to check what type the variable actually is, convert it internally into the correct type to perform the requested operation, etc, which is a big mess performance wise.
I had never heard of cython before, but the example below demonstrates the situation quite clearly.
Ersun, thats a simple perl trick that I have added to my toolbox. I am surprised it made such a big difference.
You are absolutely right about the number of instructions. cpython has me excited too!
Try Cython.
If you need to glue a lot of different code together I can recommend Sage math:
http://www.sagemath.org/
Cython is very easy to use from the Sage Notebook, some examples:
http://sagenb.org/home/pub/2046/
Some examples I ran on a imac (2010 model) sage 4.6 (python 2.6.4):
def f(N):
histogram = [0]*128
for i in xrange(0,N):
read_character = i%128
histogram[read_character] += 1
return histogram
time r=f(100000000)
Time: CPU 58.21 s, Wall: 58.33 s
%cython
def f1(int N):
histogram = [0]*128
cdef int i, read_character
for i in xrange(0,N):
read_character = i%128
histogram[read_character] += 1
return histogram
time r=f1(100000000)
Time: CPU 2.91 s, Wall: 2.91 s
As you see in this example we get approximately the same speedup with just a small adjustment (cython syntax) of the python code. You could load your c-code as a cdef extern function too.
Very cool. Thanks for sharing. I will look into cython for sure.
You should use xrange instead range, it will took a lot ram and range is more slower in this case
phenon x4 955 overclocked at 4,0ghz linux 64bits
C with no optimizations: 5.504s
python with xrange: 17.142s
there is some good linux distro that uses -O3? NO, so you should compare it without this insane optimizations.
in my machine, this code in C is 3,4x faster than python, not 23x.
Also, you should consider that you are using an old python version, are you using a old gcc version too?
So my conclusion is: “this comparison is very wrong”.
range vs xrange.
range creates a list of values (100MBs worth here), and think iterates through it. xrange is a generator, so creates the integers on the fly.
that said loops in python are slow. for high performance its best to use numpy, numexpr, cython etc. or call out to a C/Fortran function. also there is the very interesting pypy.
also, xrange is renamed to range in python 3
+1
In some occasions, pypy can be more fast than C. (vide http://morepypy.blogspot.com/2011/02/pypy-faster-than-c-on-carefully-crafted.html)
But, these simple comparisons have no meaning at all.
Often people confuse the speed of python the language with the speed of CPython the implementation. The same code with no modifications took 2.63 s on my macbook pro running PyPy (Python 2.71).
Hey thanks for teaching me xrange.
As I clearly said in my post, I used a naive python code and was hoping someone will teach me more python.
Just curious, what was your execution time with -O3 in C-code?
Hello Aater,
gcc -std=c99 -O3 -o aaa aaa.c
./aaa
time: 0.527s
what’s the difference?
http://www.network-theory.co.uk/docs/gccintro/gccintro_49.html
Wow, don’t use “range(0,100000000)” – It will create a list with 100000000 elements (!!!) before the loop. Use xrange instead.
In my core2duo I had
- 7.8 seconds in C program
- 37 seconds in Python program with xrange
- 234 seconds(!!!) in Python program with range – It freeze my notebook
Thanks for sharing this analysis. xrange it is! I will update my post whenever I get a chance and creidt both you and renatopp:)
Core2Duo seems slow .. btw, which C compiler and what optimization level?
Hello Aater,
I’m using gcc 4.5.0×64 in OpenSuse 11.3.
For compile I used the same option as you “gcc -o3″
#python version
histogram = [0]*128
for i in xrange(0,100000000):
histogram[i%128] += 1
print histogram[0]
time: 0m14.335s
#pypy
./pypy aaa.py
time: 0m2.432s
/*C version*/
gcc -std=c99 -O2 -o aaa aaa.c
time: 0m0.515s
histogram = [0]*128
for i in xrange(0,100000000/128):
for i in xrange(0,128):
histogram[i] +=1
print histogram[0]
python time: 0m10.963s
pypy time: 0m10.963s
ops pypy time is 0m1.328s
the “
" does not gives syntax highlight, please fix it
histogram = [0]*128
for i in xrange(0,100000000/128):
....for i in xrange(0,128):
.........histogram[i] +=1
print histogram[0]
Hey mate, sorry about that. I will look into it tonight:)
ohh sorry, please remove my last reply comments, its very boring that I can’t remove my comments, I mean “the html tag code doesn’t show the code identation”
for i in xrange(0,100000000/128):
….for i in xrange(0,128):
………histogram[i] +=1
print histogram[0]
I think the xor trick you see might be because i%128 needs to consider what happens if i is negative. If i compile with “gcc -O0″ with i signed i get:
movl -12(%ebp), %edx
movl %edx, %eax
sarl $31, %eax
movl %eax, %ecx
shrl $25, %ecx
leal (%edx,%ecx), %eax
andl $127, %eax
subl %ecx, %eax
movb %al, -5(%ebp)
movsbl -5(%ebp),%edx
movsbl -5(%ebp),%eax
movl _histogram(,%eax,4), %eax
addl $1, %eax
movl %eax, _histogram(,%edx,4)
addl $1, -12(%ebp)
but unsigned i get:
movl -12(%ebp), %eax
andl $127, %eax
movb %al, -5(%ebp)
movsbl -5(%ebp),%edx
movsbl -5(%ebp),%eax
movl _histogram(,%eax,4), %eax
addl $1, %eax
movl %eax, _histogram(,%edx,4)
addl $1, -12(%ebp)
I had to use -O0 as on -O3 gcc must be able to spot that i is really always positive and so remove all the unnecessary additional instructions. As you are using icc it could be that they haven’t got this optimisation
Ah, just took a look at this. Now I see “Cython”, when you told me “CPython”–totally different things
One important note with this is that Python cannot optimize the modulo to an and. Python’s flexibility makes it almost impossible to do any sort of meaningful optimization on a program. Basically every iteration of the loop the interpreter has to look up “i” in the locals dictionary, look up “__mod__” in its methods, call that method with an int object (128) as an argument, etc. It’s just really messy, and I would agree that it’s amazing how fast Python is despite all this.
Today I decided to start learn Go and as a exercise I type your histogram example to check its performance in loops.
Here you go the code:
package main
import "fmt"
func main() {
var histogram [128]int
for i := 0; i < 100000000; i ++ {
var mod int = i%128
histogram[mod] += 1
}
fmt.Printf("%d\n", histogram[0])
}
here you go the results:
time ./6.out
781250
real 0m0.407s
user 0m0.403s
sys 0m0.003s
Pleasant surprise!
[...] faster than python. ~3 seconds to do things that can be done in python in ~50 seconds. Here – Quick post: Python vs C in compute-bound workloads | Future Chips Anyways I have a life – bye bye!!! __________________ Sweet mother of all that is good & [...]