Here are the answers to the self-assessment. Please share with us how you did.

1. (a) How many things can you tell apart using 5 bits?

**32**

(b) Convert the 6-bit value 101110 to decimal, assuming it is an unsigned number.

**32+8+4+2=46 /*Thanks Nick*/**

(c) Convert the 6-bit value 101110 to decimal, assuming it is a 2's complement number.

**Flip: 010001
+ 1: 010010
Answer: -18**
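The flip-and-add-one rule can be sanity-checked with a few lines of C. A sketch only: `decode6` is a made-up helper name, and subtracting 2^6 when the sign bit is set is arithmetically the same as flip-and-add-one.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret the low 6 bits of x as a 6-bit two's complement value.
   Subtracting 2^6 when bit 5 (the sign bit) is set is equivalent
   to the flip-and-add-one procedure above. */
int decode6(uint8_t x)
{
    x &= 0x3F;                            /* keep only the 6 low bits */
    return (x & 0x20) ? (int)x - 64 : (int)x;
}
```

For example, `decode6(0x2E)` (0x2E is 101110 in binary) gives -18, matching the answer above.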

(d) If I define a floating point format as follows:

1 sign bit
3 exponent bits
3 mantissa bits
bias = 3

What is the value of 1001111 in decimal?

-1 x 1.111 x 2^(1-3) = -0.01111 = -1 x (1/4 + 1/8 + 1/16 + 1/32) = -1 x (8+4+2+1)/32 = **-15/32**
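The same decode can be done mechanically in C. This is a sketch for the quiz's toy 1-3-3 format only (implicit leading 1, bias 3, no denormals or special values); `decode_times32` is a made-up name, and it returns the value scaled by 32 so integer arithmetic suffices.

```c
#include <assert.h>

/* Decode a 7-bit value in the quiz's format: 1 sign bit, 3 exponent
   bits, 3 mantissa bits, bias 3, implicit leading 1.
   Returns value * 32 (truncates for exp == 0 with an odd mantissa). */
int decode_times32(unsigned bits)
{
    int sign = (bits >> 6) & 1;
    int exp  = (bits >> 3) & 7;
    int man  =  bits       & 7;
    /* value = (8 + man)/8 * 2^(exp-3), so value * 32 = (8+man) * 2^(exp-1) */
    int shift = exp - 1;
    int v = (shift >= 0) ? (8 + man) << shift : (8 + man) >> -shift;
    return sign ? -v : v;
}
```

For the quiz value, `decode_times32(0x4F)` (0x4F is 1001111) gives -15, i.e., -15/32.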

2. (a) Convert the unsigned decimal number 143 into hex

143 % 16 = 15 = 0xF

143 / 16 = 8 = 0x8

**0x8F**

(b) How will you implement the ‘==’ C operator in hardware?

**Unsigned and 2's complement integers: use an XNOR for each pair of bits, then AND all the per-bit results to get the final answer (every bit must match).**

**Sign-magnitude notation: this is tricky when the value is 0, because -0 and +0 must be marked equal. /*Thanks to Bob for this fix*/**
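The gate-level idea above can be modeled in C. A sketch for 32-bit integers only: `equal32` is a made-up name, and the loop plays the role of the AND-reduction tree a real comparator would use.

```c
#include <assert.h>
#include <stdint.h>

/* '==' the way hardware would do it: XNOR every bit pair, then
   AND all the per-bit results down to a single equality signal. */
int equal32(uint32_t a, uint32_t b)
{
    uint32_t xnor = ~(a ^ b);       /* bit i is 1 iff a and b agree in bit i */
    int eq = 1;
    for (int i = 0; i < 32; i++)
        eq &= (xnor >> i) & 1;      /* AND-reduction over all 32 bits */
    return eq;
}
```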

3. (a) For a 2MB 8-way associative cache with a block size of 64B and perfect LRU replacement, calculate the number of sets in the cache.

Block size = 2^6

Each set’s size = 8 x 2^6 = 2^9

Number of sets = 2^21 / 2^9 = 2^12

**There are 4K sets**
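The arithmetic generalizes to any cache: sets = cache size / (associativity x block size). A quick sketch (`num_sets` is a made-up helper):

```c
#include <assert.h>

/* Number of sets = cache size / (ways * block size). */
unsigned long num_sets(unsigned long cache_bytes,
                       unsigned long ways,
                       unsigned long block_bytes)
{
    return cache_bytes / (ways * block_bytes);
}
```

For the cache in (a): `num_sets(2UL * 1024 * 1024, 8, 64)` gives 4096, i.e., 4K sets.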

(b) How many cache misses will the following code incur if the processor had the cache described in (a)? Assume sizeof(int) == 4 and that both arrays start at cache-line boundaries.

int A[128];
int B[128];
for i = 0 to 127:
    sum += A[i] * B[i]

Both A and B will be brought into the cache. Array A is 4 x 128 = 512B = 8 cache blocks, and so is B. Total cache misses: one per block of A and of B, i.e., **16 cache blocks**.

(c) How can I reduce the number of cache misses?

A. Blocking

B. Increasing the cache size

C. Multi-threading

**D. Not possible.** There is no reuse in this kernel, so a cache cannot help.

4. I ran an experiment and found that my code runs marginally faster on a machine with MSI cache coherence compared to a machine with MESI cache coherence. What property of the code can make MESI protocol slower than MSI?

**MESI pays off over MSI only when a line actually spends time in the Exclusive state. If every line is modified as soon as it is read, the cache must immediately transition the line from E to M, so MESI just adds an extra state transition, and that extra transition introduces a small performance overhead.**

5. Suppose we have a virtual address space of 10-bits and a physical address space of 7-bits. Each page is 32-bytes. Assume that frame 0 in physical memory is reserved for the page table. The remaining pages are managed by the OS for demand paging. A random replacement policy is used for paging. How many page table entries are required? How many bits are required in each page table entry?

**32** page table entries

Each entry holds a valid bit plus a 2-bit page frame number: 3 bits, plus any permission bits.
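The sizing arithmetic can be sketched in a couple of lines of C (the helper names are made up; 32-byte pages mean a 5-bit page offset):

```c
#include <assert.h>

/* One page table entry per virtual page. */
int pte_count(int va_bits, int page_offset_bits)
{
    return 1 << (va_bits - page_offset_bits);
}

/* Width of the page frame number stored in each entry. */
int pfn_bits(int pa_bits, int page_offset_bits)
{
    return pa_bits - page_offset_bits;
}
```

With the quiz's parameters, `pte_count(10, 5)` gives 32 entries and `pfn_bits(7, 5)` gives 2 bits.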

6. A 2-level branch predictor has a 10-bit global history register. How much storage (in bytes) is required to implement this branch predictor?

2^10 entries of 2 bits each: **256 Bytes (2 Kbits).** /*Thanks to JoachimS for this correction*/

7. (a) If I use a hash table with chaining to design a telephone directory, how many bytes of memory will be needed to save this telephone directory?

You can assume that:

All names are unique 20 byte strings

Phone numbers are 10 decimal digits

Each pointer is 4-bytes

Linked lists are used to resolve collisions in the hash table

The hash table has 128 buckets

Each bucket only stores a pointer to a linked list of all elements in that bucket.

The directory contains 1 million entries

The hash table is a 128-entry array of 4-byte pointers. Thus, 512 bytes.

Each data node consists of a next pointer (4 bytes), a name (20 bytes), and a phone number (5 bytes: 10 decimal digits packed two per byte).

**Total = 512 + 29 x 1,000,000 = 29,000,512 bytes, i.e., roughly 29 MB (about 27.7 MiB).**
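As a sanity check, the byte count can be computed directly. A sketch using the quiz's assumptions; `hash_dir_bytes` is a made-up helper.

```c
#include <assert.h>

/* 128 bucket pointers plus one list node per entry:
   next pointer (4) + name (20) + packed phone number (5) = 29 bytes. */
unsigned long hash_dir_bytes(unsigned long entries)
{
    unsigned long table = 128UL * 4;     /* the bucket array  */
    unsigned long node  = 4 + 20 + 5;    /* bytes per element */
    return table + node * entries;
}
```

`hash_dir_bytes(1000000UL)` gives 29,000,512 bytes.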

(b) Suppose I replace the hash table with binary tree, how much space will then be required? (do not count the pointer to the root).

Each node consists of the same 25 bytes of data plus two child pointers: 20 + 5 + 4 + 4 = 33 bytes per node. Times 1 million nodes = **33 MB**.

8. Compile the following line of C-code:

A[B[i++]]++;

You can assume that:

Elements of A and B, and the variable i, are all 4-byte unsigned integer values.

The machine has 8 32-bit registers named r0-r7

The addresses of A and B are in registers r0 and r1, respectively.

Value of i is in r2.

**LOAD r3, MEMORY[r1 + r2*4] ; r3 = B[i]**

**ADD r2, r2, #1 ; i++**

**LOAD r4, MEMORY[r0 + r3*4] ; r4 = A[r3], i.e., A[B[i]] with the old i**

**ADD r4, r4, #1 ; A[B[i]]++**

**STORE r4, MEMORY[r0 + r3*4] ; write the incremented value back /* Thanks to uarch for this correction */**

9. Optimize the following assembly code to use the fewest possible instructions and registers:

LOOP: Load r1, [r0] ; load r1 from memory <--- can be moved out of the loop
add r2, r1, r3
sub r4, r2, r2 <--- can be eliminated, since r4 is never used
add r5, r2, r3 <--- can't be eliminated, as it sets the branch condition
brn LOOP
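Putting those observations together, a sketch of the optimized loop, keeping the original register names and assuming brn tests the condition set by the final add:

```
      Load r1, [r0]     ; hoisted out: [r0] is never written inside the loop
LOOP: add  r2, r1, r3
      add  r5, r2, r3   ; kept: it sets the branch condition
      brn  LOOP
```

As a commenter notes below, r2 and r5 are then computed from loop-invariant values, so this loop either exits after one iteration or spins forever; the exercise is about the optimization mechanics rather than useful work.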

10. When executing the following code on a 128-core multicore chip, I see a speedup of only 40x. What could be the reason?

for i = 0 to 1000000000:
    sum += A[i] + B[i] // A and B are arrays of 4-byte integers

**The loop is bandwidth-bound: its compute-to-memory ratio is very low (a single add per two loads), so memory bandwidth saturates long before all 128 cores can be kept busy.**

Please leave a comment if you have any questions. Thanks!

In question 3b, how can you be sure that both arrays start at cache line boundaries?

In question 8, your indices into array A are not correct. Replace both uses of ‘r3′ with ‘r0 + r3*4′

Hey,

Thanks a lot for pointing it out. I have fixed it. Sorry the response was late.

Aater

For question 9, I would say a better answer would involve moving the sub out of the loop (it sets r4 to 0, so the value of r2 doesn’t matter). IMO you can’t justify the sub (and nothing else) being dead code, since nothing is read after the loop–I assumed every register must have the same value after the loop, otherwise the entire loop could be eliminated. That’s just nitpicking though

But there seems to be a much bigger problem–if I understand your assembly notation correctly (dst, src, src), r5 will have the exact same value every iteration of the loop (r1+2*r3), so it will either go through the loop exactly once or it will never break out of the loop.

Aloha!

Small nitpick: Question 6 actually asks for number of Bytes, not bits. That is, the correct answer should be 256 Bytes, not 2 kbits. In reality, the branch predictor would be implemented as a 1024×2 bit memory macro.

1d) You make many assumptions about the floating point number system in use. For instance, not all hardware uses base 2. Not all systems have an assumed 1. etc.

2b) an XNOR gate ?!? The domain of the == operator includes much more than just single-bit quantities. Comparison of IEEE floating point numbers, for example, is quite complex given that +0 must equal -0, special values must be considered, etc.

(oops, this seems to have gotten the wrong nesting level, sorry)

Hi Bob,

Thanks for reading and great suggestions.

1d) I had IEEE floating point arithmetic in mind when I made the question. My feeling is that most systems follow more or less the same specification, but I may be wrong. Can you point me to some examples of non-base-2 FPs? I am intrigued.

2b) I see your point. I was thinking about integers and not FP. By the way, I did not mean single-bit quantities: an XNOR for each bit followed by an AND of all the results gives you the answer. I thought the reduction was implicit, but I have fixed my answer to clarify.

Aater

example of non base 2 fp ? : see IBM z/OS Principle Of Operation.

“the value of a floating point number is the product of its fraction and the number 16 raised to the power of the exponent which is represented by its characteristic.”

iow base 16 FP.

Wow. Very interesting. Thanks for the pointer. I always assume IEEE FP standards when I say FP but I guess I ought to be more careful.

Hi JoachimS,

Thanks for the “nitpick”. It was very important to fix it though. I have made the change. Please take a look.

Aater

In 3b you should mention in the question that sizeof(int)==4.

In 7a you have “pointer is 4-digits” in the question. Your rounding is also a bit sloppy, since e.g. 29000512 bytes = ~27.7MB in memory.

In the answer to 10, I think you should call the ratio low, or reword the sentence.

Thanks a lot. I agree with your points. I will fix this later today.

Thanks! I had a lot of fun with this low-level programming quiz.

My only confusions are:

Question ten seems ambiguous. My answer was that shared access to the ‘sum’ variable would prevent a greater parallelization. Can you elaborate on your answer?

Question 2b seems bizarre. Isn’t ‘==’ typically implemented as a cmp/jnz sequence on x86?

Thanks again for taking the time to put this together, I feel like I improved as a programmer by taking this (if only infinitesimally).

You are welcome. I am glad it was fun for you.

You are right about the ambiguity. Contention for sum would be a great answer as well.

I was assuming that when writing parallel code, every thread will compute its own local partial sum and there will be a reduction at the end to add the partial sums in to global sum. This will eliminate the contention for sum. The reduction will not be that big since the loop runs for a large number of iterations, making the parallel part significantly longer.
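The partial-sum scheme described above can be sketched in C. Shown serially for clarity; `reduce_sum` and the striped chunking are made up, and a real version would give each stripe to its own thread before the final reduction.

```c
#include <assert.h>

/* Each of 'nthreads' workers (at most 64 here) accumulates a private
   partial sum over its own stripe of the arrays; a final reduction
   adds them up, so no worker ever contends for the shared sum. */
long reduce_sum(const int *a, const int *b, long n, int nthreads)
{
    long partial[64] = {0};              /* one private slot per worker */
    for (int t = 0; t < nthreads; t++)
        for (long i = t; i < n; i += nthreads)
            partial[t] += (long)a[i] + (long)b[i];

    long sum = 0;                        /* the cheap final reduction */
    for (int t = 0; t < nthreads; t++)
        sum += partial[t];
    return sum;
}
```

The reduction cost is proportional to the thread count, which is tiny next to the billion-iteration parallel part.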

‘==’ is generally the cmp instruction, but I was really trying to get a lower-level, gate-level answer. The cmp instruction, in its execute phase, actually passes through a comparator. One of the functions of a comparator is to check whether the two values being compared are equal, and the XNOR is the logic for that equality check. (I am lying here a bit: a real comparator is a lot more complicated.)

I should make it more explicit. Thanks for pointing it out. I hope you will find the Parallel Programming posts useful; I have covered the above in some previous posts.

Regards.