In my last post on “Understanding the Cortex A8 architecture,” I promised that I would build a web-based front end to my Cortex A8 test harness so that readers can run experiments themselves. The tool is alive and kicking. This post shows how a developer can use it to learn the A8 architecture. As far as I know, this is the first setup of its kind.
This web application reports the number of cycles a code snippet takes to execute on an iPhone 3GS. It takes a C code snippet as its only input and outputs the assembly generated for the snippet along with the average number of cycles taken to execute it. Since it is difficult to measure the cycle count accurately in a single run, the test harness runs the snippet 100 times and repeats the measurement for a range of loop sizes. I vary the loop size to allow convenient analysis of the memory system. (See my previous post for details on memory system analysis.)
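To make the measurement idea concrete, here is a minimal sketch of the kind of loop structure described above. It is not the actual harness code: the function name `avg_ns_per_iteration` is hypothetical, the snippet under test is assumed to be a one-line array traversal, and the standard C `clock()` call stands in for the iPhone timer, so it reports nanoseconds rather than cycles.

```c
#include <stdlib.h>
#include <time.h>

#define REPEATS 100 /* outer-loop count described in the post */

/* Hypothetical helper: average time in nanoseconds per inner-loop iteration
   when the inner loop runs 2^log2_n times.
   Returns -1.0 if the array cannot be allocated. */
double avg_ns_per_iteration(unsigned log2_n)
{
    size_t n = (size_t)1 << log2_n;

    /* volatile keeps the compiler from optimizing the loads away */
    volatile int *a = calloc(n, sizeof *a);
    if (a == NULL)
        return -1.0;

    int sum = 0;
    clock_t start = clock();            /* assumption: stand-in for the real timer */
    for (int j = 0; j < REPEATS; j++)   /* outer loop: repeat the measurement */
        for (size_t i = 0; i < n; i++)  /* inner loop: the snippet under test */
            sum += a[i];
    clock_t end = clock();

    free((void *)a);
    (void)sum;

    double total_ns = (double)(end - start) * 1e9 / CLOCKS_PER_SEC;
    return total_ns / ((double)REPEATS * (double)n);
}
```

Calling this for a range of `log2_n` values (say 10 through 22) would give a table shaped like the tool's output, with one averaged number per loop size.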
The tool is extremely simple to use. Its home page looks as follows:
The code shown on the page is the test harness itself. It is there so you can see exactly what the harness does. A power user will take the few minutes required to read it.
The white text box is where you enter your test code. I have pre-filled it with a one-line array traversal for this tutorial. You can replace it with (almost) any legal C code.
When you hit the submit button, the code is compiled and run on my Cortex A8, which takes a few seconds (generally 10-20), and then the following screen appears:
There are three sections of this screen:
Input code: This is the code you entered on the previous screen and is provided solely for your reference.
Assembly: This section shows the assembly code generated by the compiler for the code between the two mach_absolute_time calls. My compiler is the GCC 4.2 cross-compiler for ARM that ships with Apple's Xcode 4.2.1.
Note: this includes the assembly of the inner loop containing your snippet as well as the outer loop (iteration variable j), which runs the inner loop 100 times. I had to leave the outer loop in because its assembly is tangled with that of the inner loop.
Output: This is a two-column table.
Column 1: Log base 2 of the number of iterations of the inner loop. 10 implies 1024 iterations and 22 implies about 4 million iterations (4,194,304).
Column 2: The average number of cycles taken by the inner loop instructions.
For reference: the sample output shows the latency per iteration increasing as the array size grows. This is because the larger arrays no longer fit in the cache, so their accesses have to go to main memory.
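When reading the table, it helps to turn Column 1 back into an iteration count and a rough memory footprint. The sketch below does that under stated assumptions: each iteration is assumed to touch one distinct array element, and the cache sizes (32 KB L1 data, 256 KB L2) are illustrative values I am assuming here, not numbers reported by the tool. All function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache sizes for illustration only; the tool does not report them. */
#define ASSUMED_L1D_BYTES (32u * 1024u)
#define ASSUMED_L2_BYTES  (256u * 1024u)

/* Column 1 of the output is log2 of the inner-loop iteration count. */
uint64_t iterations_from_log2(unsigned log2_n)
{
    return (uint64_t)1 << log2_n;
}

/* Bytes touched if each iteration reads one distinct element of elem_size bytes. */
uint64_t footprint_bytes(unsigned log2_n, size_t elem_size)
{
    return iterations_from_log2(log2_n) * (uint64_t)elem_size;
}

/* 1 = fits in the assumed L1, 2 = fits in the assumed L2, 3 = larger than both. */
int smallest_level_that_fits(uint64_t bytes)
{
    if (bytes <= ASSUMED_L1D_BYTES)
        return 1;
    if (bytes <= ASSUMED_L2_BYTES)
        return 2;
    return 3;
}
```

With 4-byte elements, a Column 1 value of 10 corresponds to a 4 KB array, while 22 corresponds to 16 MB, far larger than either assumed cache. That is the region where you would expect the per-iteration latency to climb.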
The output will contain “Compile Error” if your code did not compile and “Run Error” if the code exited with an error code.
How to access it?
For security reasons, I am only releasing the beta version with restricted access. If you need access, leave a comment with a valid email address and I will create an account for you.
As always, your feedback is very valuable. It can really help me refine this tool into something we can all use to understand what’s under the hood (without having to set up a whole environment). Side note: I have built a similar web tool for x86 as well, so let me know if anyone is interested in toying with it.
Coming attractions …
A key insight we developed in the previous post was that a memory access can be 80-200x slower than an L1 cache access. Thus, it can be worth doing 80-200 computations (serviced by the L1 cache or the register file) to save one memory access. This lesson should clearly affect our choice of data structures, if nothing else. In the next post, we will optimize a simple program with this trade-off in mind.
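As a back-of-the-envelope illustration of that trade-off, the sketch below expresses cost in units of one L1 access. It makes a deliberately crude assumption that every computation costs the same as one L1 access, and the function name `cost_in_l1_units` is hypothetical; treat it as a way to reason about the ratio, not as a performance model.

```c
/* Estimated cost, in units of one L1 access, of a design that performs
   l1_ops operations serviced by L1/registers plus memory_accesses trips
   to main memory, each assumed to cost slowdown_ratio L1 accesses. */
double cost_in_l1_units(double l1_ops, double memory_accesses,
                        double slowdown_ratio)
{
    return l1_ops + memory_accesses * slowdown_ratio;
}
```

For example, recomputing a value with 50 cheap operations costs 50 units, while one cheap operation plus one memory access costs 81 units at the low end of the range (ratio 80) and 201 units at the high end (ratio 200). A 150-operation recomputation only wins when the ratio sits toward the high end, which is why it pays to measure where your own access pattern falls.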