Jun 05, 2011
 

In my first post on “Tips for iPhone App Developers,” I showed how to set up a development board using an iPhone 3GS. In this post, I will use that setup and a micro-benchmark to explore the ARM Cortex A8 architecture.

Update 6/9/2011:  The web-based sandbox for understanding Cortex A8 is ready

Brainstorming the first benchmark

Since the A8 is an in-order core, its performance should be very sensitive to the number of memory accesses. Thus, I will tinker with the memory system first. Let's try to find out the access latencies of the L1, L2, and DRAM on the iPhone 3GS.

I always start memory-system studies with my classic loop:

for(int j = 0; j < size; j+= cache_line_size){
  sum += array[j];
}

It simply strides through an array, accessing a new cache line every iteration, which generates a lot of cache and memory accesses. I have found that this simple loop (and its minor variations) can usually tell you a lot about a memory system.

Setting up a test harness

Before I start testing, I want a test harness to streamline the process. I want to:

1. Repeat each experiment many times (say 100) and average the results to eliminate random effects (easy to do).

2. Run the test multiple times with different array sizes for the test loop above (another for loop it is!).

3. Report the execution time per iteration in processor cycles, to minimize the processing I have to do in my head.

The only complicated step is the last one. I cannot use the time utility in iOS because it measures the time of the entire program, and only at a low resolution. I need a way to read the cycle count register. Five minutes of googling revealed that reading the cycle counter register on ARM requires kernel patches, which I do not want to mess with (request to ARM architects: please fix this). Instead, I decided to use the mach_absolute_time() facility from Apple. It is described here.
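(If one did take the kernel-patch route, the user-space read itself is a single coprocessor instruction. Here is a rough sketch; it assumes the kernel has already set PMUSERENR[0] and enabled the cycle counter, and I have not actually run it on the phone.)

// Sketch only: read the ARMv7 cycle counter (PMCCNTR) from user space.
// It works ONLY if the kernel has set PMUSERENR[0] (user-mode enable) and has
// turned the counter on (PMCR.E = 1, PMCNTENSET[31] = 1); otherwise it faults.
static inline unsigned int read_cycle_counter(void)
{
  unsigned int cycles;
  asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r" (cycles));
  return cycles;
}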

When called, the mach_absolute_time() function returns the number of “ticks” from an absolute base. The mach_timebase_info() helper provides the conversion factors needed to turn those ticks into nanoseconds. Here is the barebones instrumentation code for reference:

double subtractTimes( uint64_t endTime, uint64_t startTime ){
  uint64_t difference = endTime - startTime;
  double conversion = 0.0;
  mach_timebase_info_data_t info;
  kern_return_t err = mach_timebase_info( &info );
  // Convert the timebase into nanoseconds
  if( err == 0 )
    conversion = 1e-9 * (double) info.numer / (double) info.denom;
  return conversion * (double) difference;
}

 

double run_test(){
  uint64_t start, end;
  start = mach_absolute_time();
  // the actual work to be measured goes here
  end = mach_absolute_time();
  double elapsed = subtractTimes(end, start);  // elapsed time in seconds
  return elapsed * cpu_frequency;              // cpu_frequency is 600e6 for the 3GS (from Part 1)
}

The function run_test() returns the approximate number of CPU cycles taken by the code in between the two calls to mach_absolute_time(). The function subtractTimes() returns the elapsed time in seconds; I multiply it by the CPU frequency to get processor cycles. (Recall that we got the CPU frequency in the previous post.)
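To make the conversion concrete, here is the arithmetic with made-up numbers (only the 600 MHz frequency is real, from the previous post):

// Hypothetical example: turning a measured time into cycles per iteration.
double elapsed     = 0.011;            // seconds, as returned by subtractTimes()
double per_rep     = elapsed / 100;    // each experiment is repeated 100 times
double per_iter    = per_rep / 65536;  // say each repetition runs 65536 loop iterations
double cycles_iter = per_iter * 600e6; // roughly 1.0 cycles/iteration in this made-up case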

As mentioned earlier, the test harness needs several other features. The actual test harness I used is below.

  1: #include <assert.h>
  2: #include<stdio.h>
  3: #include <mach/mach.h>
  4: #include <mach/mach_time.h>
  5: #include <unistd.h>
  6:
  7: typedef unsigned long long uint64;
  8:
  9: const int rows = 23;
 10: const int cache_size = 256 * 1024;
 11: const int log2_cache_size = 18;
 12: int A[1<<rows];
 13: int B[cache_size];
 14:
 15: // http://www.macresearch.org/tutorial_performance_and_time
 16: //Raw mach_absolute_times going in, difference in seconds out
 17: double subtractTimes( uint64_t endTime, uint64_t startTime )
 18: {
 19:     uint64_t difference = endTime - startTime;
 20:     static double conversion = 0.0;
 21:
 22:     if( conversion == 0.0 )
 23:     {
 24:         mach_timebase_info_data_t info;
 25:         kern_return_t err = mach_timebase_info( &info );
 26:
 27:   //Convert the timebase into nano seconds
 28:         if( err == 0  )
 29:     conversion = 1e-9 * (double) info.numer / (double) info.denom;
 30:     }
 31:
 32:     return conversion * (double) difference;
 33: }
 34:
 35:
 36:
 37: double get_array(int *array, int log2_size){
 38:   uint64 start, end;
 39:   int size = 1 << log2_size;
 40:   int sum = 0;
 41:
 42:   start = mach_absolute_time();
 43:   for(int i = 0; i < 100; i++)
 44:     for(int j = 0; j < size; j+= 1*16){ // MAIN test loop
 45:       int x = rand()%size;
 46:       sum += array[j] + x;
 47:
 48:       //int x = j & 0xFFFF;
 49:       //if(x)
 50:       //int x = rand();
 51:       //if(x & 0x37){
 52:       //sum += x;
 53:       //} else {
 54:       //sum -= x;
 55:       //}
 56:     }
 57:   if(sum == 1)
 58:     printf("%10d ", sum);
 59:   end = mach_absolute_time();
 60:   double elapsed = subtractTimes(end, start);
 61:   elapsed = elapsed/100; // remove the effect of 100 repetitions
 62:   elapsed = elapsed/(size/(1*64)); // get execution time per iteration
 63:   double elapsed_cycles = elapsed * 600e6; // multiply by CPU frequency to get cycles
 64:   return elapsed_cycles;
 65: }
 66:
 67: int main(int argc, char **argv){
 68:   printf("Testing ARM A8 in iPhone 3GS\n");
 69:
 70:   get_array(A, rows);
 71:
 72:
 73:   int count = 0;
 74:   //while(1){
 75:     for(int i = rows-1; i > 9; i--){
 76:       printf("%4d ", i);
 77:       double elapsed = get_array(A, i);
 78:       printf("%5.3f\n", elapsed);
 79:   //  }
 80:     printf("Done %d\n", count++);
 81:   }
 82: }
 83:
 84:

At the heart of the test harness is what I have called my MAIN loop (line 44). This is the code that I will be changing for each test.

Now let's try some memory-system tests. Notice that in my code above, I call the get_array function on line 70 over all of array A. Without this call, the results were distorted by the page faults incurred while bringing the data from disk into memory. I do not want that disturbance from the disk because I want to study the memory system in isolation today. Calling get_array once brings the entire array from disk into memory, and notice that I discard the measurements from that warm-up run.
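For reference, the warm-up is conceptually just this (a sketch; in the harness the same effect comes from the discarded get_array(A, rows) call on line 70):

// Sketch: touch every 4KB page of A once so later measurements do not include page-fault time.
volatile int sink = 0;
for (int j = 0; j < (1 << rows); j += 4096 / sizeof(int))
  sink += A[j];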

Running the experiments

1. Get the latency of an almost empty loop:

for(int i = 0; i < size; i += 16){
  sum += sum;
}

This loop runs at 3 cycles/iteration. This makes sense because the add and the branch will both cost cycles.

2. Just out of curiosity, I want to know the overhead of adding a new operand.

for(int i = 0; i < size; i += 16){
  sum += 3;
}

This loop runs at 4 cycles/iteration. I was expecting it to also take 3 cycles/iteration. I will analyze it next week, unless one of the ARM engineers can tell us why.

3. I also want to know the overhead of the rand() function because I want to use it later. Let's try:

for(int i = 0; i < size; i += 16){
  sum += rand();
}

This loop runs at 55 cycles/iteration. Thus rand() takes roughly 50 cycles per call.
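(If those ~50 cycles ever become a problem, a tiny linear congruential generator could stand in for rand(). A sketch; the constants are the widely used Numerical Recipes values, which is my choice, not what the iOS libc uses:)

// Sketch: a minimal LCG as a cheaper stand-in for rand().
static unsigned int lcg_state = 1;
static inline unsigned int lcg_rand(void)
{
  lcg_state = lcg_state * 1664525u + 1013904223u;
  return lcg_state;
}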
4. Now let's introduce memory traffic.

for(int i = 0; i < size; i += 16){
  sum += array[i];
}

Now the results are getting exciting. My test harness prints the following table (I have manually converted the sizes to a human-readable format).

The first column is the size of the array being processed by the loop.

The second column is the average cycles per iteration. Since each iteration brings in a new cache line, loops over arrays that do not fit in a cache will show a higher latency.

Size  Cycles/iteration (avg)
4MB   85.134
2MB   84.620
1MB   83.480
512KB 81.538
256KB 78.411
128KB 69.222
 64KB 28.579
 32KB 13.238
 16KB 11.110
  8KB  4.438
  4KB  4.53
  2KB  4.78

The rows in green (8KB and smaller) are small arrays that fit in the L1 cache. Consequently, even though there is a memory load involved, the speed is the same as the loop with no memory access. It is clear that the L1 D-cache in the Cortex A8 has a throughput of one access per cycle.

The rows in blue (16KB to 64KB) hit primarily in the L2. My guess based on this data is that an L1 miss that hits in the L2 costs roughly 10 cycles (one cycle to check the L1 first and about 10 additional cycles to go to the L2). This is again consistent with the published specs, which put the L2 at 256KB.

Lastly, the rows in red (128KB and larger) indicate sizes that fit in neither the L1 nor the L2. Consequently, the core has to go out to memory, which seems to cost roughly 80 cycles per access.
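Here is my back-of-the-envelope decomposition of the table above (my estimates from the measured data, not published figures):

// Rough decomposition of the measured cycles/iteration (estimates, not specs).
double loop_overhead = 4.4;                  // 8KB row: loop + add, data sits in L1
double l2_latency    = 13.2 - loop_overhead; // 32KB row: ~9 extra cycles for an L1 miss that hits in L2
double dram_latency  = 84.6 - loop_overhead; // 2MB row: ~80 extra cycles when both caches miss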

5. Modern DRAM memories use page-mode addressing, which makes them friendlier to sequential accesses. As a rule of thumb, a sequential access incurs 1/3rd the cost of a random access (I will explain why in another post). I want to know if the iPhone uses page mode and if the rule of thumb applies to it.

I first ran the following loop:

for(int j = 0; j < size; j+= 1*64){
 int x = rand()%size;
 sum += array[j] + x;
}

and got these results:

4MB   84.919
2MB   84.625
1MB   83.280
512KB 81.499
256KB 78.378
128KB 69.835
 64KB 25.845
 32KB 13.454
 16KB 11.115
  8KB  4.416
  4KB  4.441

Notice that the results are not much different from what we saw before. My guess is that the core is employing latency hiding; otherwise, the roughly 50-cycle overhead of the rand() function would show up in these numbers.

Now let's try random accesses to memory.

for(int j = 0; j < size; j+= 1*64){
  int x = rand()%size;
  sum += array[x] + x;
}

I have kept the computation per iteration constant and replaced the array index with x, which is a random number. I get the following results:

22 (4MB)   229.183
21 (2MB)   225.846
20 (1MB)   224.475
19 (512KB) 213.549
18 (256KB) 199.968
17 (128KB) 176.183
16 (64KB)  131.654
15 (32KB)   89.928
14 (16KB)   78.397
13 (8KB)    73.951
12 (4KB)    73.508
11 (2KB)    73.805
10 (1KB)    77.906

The most relevant results are the large, memory-bound sizes: for those, the latency per iteration has roughly increased by 3x, which is what I was expecting.
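A cleaner way to measure the raw miss latency would be a dependent pointer chase: it takes rand() and the modulo out of the timed loop and gives the core nothing to overlap. A sketch I have not run on the phone yet (the chain is built outside the timed region):

// Sketch: dependent pointer chase. Each load's address comes from the previous
// load's result, so the core cannot overlap or prefetch the misses.
void build_chain(int *next, int n)      // one random cycle over n elements (Sattolo's algorithm)
{
  for (int i = 0; i < n; i++) next[i] = i;
  for (int i = n - 1; i > 0; i--) {
    int j = rand() % i;                 // 0 <= j < i
    int tmp = next[i]; next[i] = next[j]; next[j] = tmp;
  }
}

int chase(const int *next, int n)       // time this loop: one dependent load per iteration
{
  int idx = 0;
  for (int i = 0; i < n; i++)
    idx = next[idx];
  return idx;                           // returned so the compiler cannot delete the loop
}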

Energy/Battery life measurements

Memory has become a big chunk of the power in today's systems. Let's try to see this in action on an iPhone. I fully charged my iPhone to 100% and ran the code with no memory accesses for 30 minutes. The battery loss was only 4% (I could not turn off the Wi-Fi, but I turned off most features and the display). I recharged it back to 100% and then ran the code with regular sequential memory accesses for the same time. The battery loss was 9%. Lastly, I recharged it and ran the random memory access loop for the same time. The battery loss was 15%.

Disclaimer: This battery-life experiment is not very controlled, so do not base decisions on it :) I will design a better one. I just wanted to give an idea of what can be done.

Conclusion

I think the important result of this analysis is the understanding that it is sometimes worthwhile to let the computation double or even triple if the extra computation can save memory or L2 cache accesses.
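As a contrived illustration of that trade-off (hypothetical code, nothing I measured): a table lookup that misses in the caches costs on the order of 80 cycles here, while recomputing a simple value costs only a few.

// Hypothetical illustration: trading a memory access for a little extra computation.
static int squares[4096];                               // lookup table: a miss can cost ~10 (L2) or ~80 (DRAM) cycles
static inline int square_lut(int x) { return squares[x]; }
static inline int square_alu(int x) { return x * x; }   // one multiply, no memory traffic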

Coming attractions … a web-based front-end for my ARM setup

Inspired by Google Go, I want to set up a web-based test harness so all of you can play with my ARM setup. I have started working on it already and it is 75% working. Once complete, it will save me the effort of doing these experiments, as you will all be able to run them yourselves :)

Update 6/9/2011: The web-based sandbox for understanding Cortex A8 is ready

  3 Responses to “Tips for iPhone App Developers: Understanding Cortex A8 core (Part 2)”

  1. I do think you should be brave and make the kernel patch. Just make sure all the kernel patch does is provide user-mode access to the PMU counters. It seems PMUSERENR[0] is what needs to be set to 1. Go to infocenter.arm.com; they should have the codes for all the PMU events for each core there somewhere. Can't really blame the architecture for that though, it is just giving the kernel more control over what the user can see.

    That part about the sum vs. 3 is pretty weird. Can you publish the two disassemblies? It doesn't make any sense for the code without a dependency to be slower. Actually, all the disassembly would be a cool thing to see.

    • Hi,

      Why do you think I should do the patch, do you not think the mach measurements are accurate? I think they are pretty close and hence found no motivation.

      I am working on getting the disassembly.

      Aater

      • I agree, the numbers seem to be consistent. I was just thinking that tracking the other events would be interesting and would confirm some of the behavior. To be honest though I am surprised at where the big jumps in average time are occurring.

        I am curious what you think would be a better implementation for accessing the PMU counters. Thinking about it some more, directly modifying the counters from the user side seems like a bad idea because of context switches, so don't listen to my bit about enabling user access to those counters. I guess the only way to do it without help from the kernel is if the counters get virtualized. That would either still require some help from the kernel in swapping out the state (though agreed, it gives the user more independent control), or it would mean creating backups in hardware (which sounds horrible, especially on a RISC architecture). Otherwise it just sounds better to leave it to the kernel to implement with appropriate drivers. I guess I don't know what your patched kernel was going to do.

        Look forward to seeing the assembly.
