Here are the answers to the self-assessment. Please share with us how you did.
1. (a) How many things can you tell apart using 5 bits?
2^5 = 32 different things.
(b) Convert the 6-bit value 101110 to decimal, assuming it is an unsigned number?
32+8+4+2=46 /*Thanks Nick*/
(c) Convert the 6-bit value 101110 to decimal, assuming it is a 2's complement number?
Invert the bits (010001) and add 1: 010010 = 18, so the value is -18.
(d) If I define a floating point format as follows:
1 sign bit, 3 exponent bits, 3 mantissa bits, bias = 3
What is the value of 1001111 in decimal?
-1 x 1.111 x 2^(1-3) = -0.01111 = -(1/4 + 1/8 + 1/16 + 1/32) = -(8+4+2+1)/32 = -15/32
2. (a) Convert the unsigned decimal number 143 into hex
143%16 = 15 = 0xF
143/16 = 8 = 0x8
Reading the digits from most to least significant: 143 = 0x8F.
(b) How will you implement the ‘==’ C operator in hardware?
Unsigned and 2's complement integers: XNOR each pair of bits, then AND the XNOR outputs to get the final answer (equivalently, XOR each pair, OR the outputs, and invert).
Sign-magnitude notation: it is tricky when the value is 0, since -0 and +0 must be marked equal. /*Thanks to Bob for this fix*/
3. (a) For a 2MB 8-way associative cache with a block size of 64B and perfect LRU replacement, calculate the number of sets in the cache.
Block size = 2^6
Each set’s size = 8 x 2^6 = 2^9
Number of sets = 2^21 / 2^9 = 2^12
There are 4K sets
(b) How many cache misses will the following code incur if the processor had the cache described in (a)?
int A[128]; int B[128]; for i = 0 to 127: sum += A[i] * B[i]
Both A and B will be brought into the cache. Array A is of size 4 x 128 = 512B = 8 cache blocks, and likewise for B. Total cache misses will be for A and B together, i.e., 16 cache blocks.
(c) How can I reduce the number of cache misses?
B. Increasing the cache size
D. Not possible. There is no re-use in this kernel, so a cache cannot help.
4. I ran an experiment and found that my code runs marginally faster on a machine with MSI cache coherence compared to a machine with MESI cache coherence. What property of the code can make MESI protocol slower than MSI?
MESI is better than MSI only if the line is actually held in the Exclusive state for some time. If the line is modified as soon as it is read, the state still has to transition from E to M. Thus, MESI adds an extra state transition, which introduces a small performance overhead.
5. Suppose we have a virtual address space of 10-bits and a physical address space of 7-bits. Each page is 32-bytes. Assume that frame 0 in physical memory is reserved for the page table. The remaining pages are managed by the OS for demand paging. A random replacement policy is used for paging. How many page table entries are required? How many bits are required in each page table entry?
32 page table entries (2^10-byte virtual space / 2^5-byte pages = 2^5 virtual pages)
Each entry needs a valid bit and a page frame number, which is 2 bits (2^7 / 2^5 = 4 physical frames). That is 3 bits, plus any permission bits.
6. A 2-level branch predictor has a 10-bit global history register. How much storage (in bytes) is required to implement this branch predictor?
2^10 entries of 2-bits each. 256 Bytes (2 Kbits). /*Thanks to JoachimS for this correction*/
7. (a) If I use a hash table with chaining to design a telephone directory, how many bytes of memory will be needed to save this telephone directory?
You can assume that:
All names are unique 20 byte strings
Phone numbers are 10 decimal digits
Each pointer is 4-bytes
Linked lists are used to resolve collisions in the hash table
The hash table has 128 buckets
Each bucket only stores a pointer to a linked-list of all elements in that bucket.
The directory contains 1 million entries
The hash table is a 128-entry array of 4-byte pointers. Thus, 512 bytes.
Each data node consists of a next pointer (4 bytes), a name (20 bytes), and a phone number (5 bytes is enough for 10 decimal digits).
Total = 512 + 29 x 1M ≈ 29 MBytes.
(b) Suppose I replace the hash table with binary tree, how much space will then be required? (do not count the pointer to the root).
Each node consists of the same 25 bytes of payload plus two child pointers instead of one next pointer, i.e., one additional pointer. Thus the total is 33 bytes per node times 1 million nodes = 33 MB.
8. Compile the following line of C-code:
You can assume that:
Elements of A and B, and variable i, are all 4-byte unsigned integer values.
The machine has 8 32-bit registers named r0-r7
Address of A and B are in registers r0 and r1 respectively.
Value of i is in r2.
LOAD r3, MEMORY[r1 + r2*4] ; r3 gets B[i]
ADD r2, r2, #1 ; i++
LOAD r4, MEMORY[r0 + r3*4] ; r4 = A[B[i]] (r3 holds the pre-increment B[i])
ADD r4, r4, #1 ; A[B[i]]++
STORE r4, MEMORY[r0 + r3*4] ; /* Thanks to uarch for this correction */
9. Optimize the following assembly code to use the fewest possible instructions and registers:
LOOP: LOAD r1, [r0]   ; load r1 from memory <--- can be moved out of the loop (loop-invariant)
      ADD r2, r1, r3
      SUB r4, r2, r2  ; <--- can be eliminated since r4 is never used
      ADD r5, r2, r3  ; <--- can't be eliminated, as it sets the branch condition
      BRN LOOP
10. When executing the following code on a 128-core multicore chip, I see a speedup of only 40x. What could be the reason?
for i = 0 to 1000000000: sum += A[i] + B[i] // A and B are arrays of 4-byte integers
The loop is memory-bandwidth bound, as its compute-to-memory ratio is very low: the cores stall waiting on memory, so the speedup saturates well below 128x.
Please leave a comment if you have any questions. Thanks!