Jun 14, 2011
 

Traditionally, RAM, or Random Access Memory, described a memory that offered the same access latency for all of its locations. This is barely the case with modern DRAM systems. In this post, I give a ten-thousand-foot view of how modern DRAMs work, with the hope that it helps programmers choose their algorithms and data structures wisely.

In the traditional view, memory was a flat, monolithic structure with a fixed access latency. Today's computer systems use page-mode addressing and smart memory scheduling, which make this flat view look rather naive. In my opinion, there are four concepts every programmer must be familiar with: pages, banks, row conflicts, and FR-FCFS.

Pages

Data in the DRAM is stored at page granularity. A typical page is between 4KB and 16KB in size. In principle, this size is independent of the OS page size, which is typically 4KB. In practice, most DRAMs use pages bigger than OS pages in order to improve locality in the row buffers (described below).

Banks

To reduce access latency, memory is split into multiple equal-sized units called banks. Most DRAM chips today have 8 to 16 banks. Each bank stores tens of thousands of pages.

The following is a practical example of how a 32-bit physical address (4 GB of addressable memory) is split by the DRAM (credit to my friend Khubaib). Bits 12-0 identify the byte within the 8KB page. Bits 16-13 identify which of the 16 memory channels the system should use for this address. Bits 20-17 identify which of the 16 banks the address falls in, and bits 31-21 identify which row within the bank is accessed. The channel bits are used to split banks across DRAM modules in order to increase the bandwidth to DRAM.

[Figure: bit-level split of the example 32-bit physical address into byte (12-0), channel (16-13), bank (20-17), and row fields]
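As a sketch, the split above can be written as a few shift-and-mask helpers. The field boundaries come straight from the example mapping; real memory controllers choose their own bit assignments, so the helper names and constants here are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Field boundaries from the example mapping above. Real controllers
 * choose their own bit assignments, so treat these as illustrative. */
static uint32_t dram_byte(uint32_t addr)    { return addr & 0x1FFFu; }       /* bits 12-0  */
static uint32_t dram_channel(uint32_t addr) { return (addr >> 13) & 0xFu; }  /* bits 16-13 */
static uint32_t dram_bank(uint32_t addr)    { return (addr >> 17) & 0xFu; }  /* bits 20-17 */
static uint32_t dram_row(uint32_t addr)     { return addr >> 21; }           /* bits 31-21 */
```

Two addresses that differ only in the bank bits can be serviced in parallel; two that share bank and channel bits but differ in the row bits will collide on the same bank.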

A memory bank can only service one request at a time. Any other access to the same bank must wait for the previous access to complete, a situation known as a bank conflict. In contrast, accesses to different banks can proceed in parallel (known as bank-level parallelism).

Row-Buffer

Each DRAM bank has one row buffer, a structure which provides access to the page that is currently open at the bank. Before a memory location can be read, the entire page containing that location is opened and read into the row buffer. The page stays in the row buffer until it is explicitly closed. If an access to the open page arrives at the bank, it can be serviced immediately from the row buffer within a single memory cycle. This scenario is called a row-buffer hit (typically less than ten processor cycles). However, if an access to another row arrives, the current row must be closed and the new row opened before the request can be serviced. This is called a row-buffer conflict, and it incurs substantial delay in DRAM (typically 70+ processor cycles).
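A toy model makes the hit/conflict distinction concrete. The sketch below assumes 8KB rows and a single bank, and the function name is invented for illustration; it counts how many accesses in a sequence would close the open row and activate a new one.

```c
#include <assert.h>
#include <stddef.h>

#define ROW_BYTES 8192UL   /* assumed 8KB DRAM page/row */

/* Toy single-bank model: count the accesses that must open a new row
 * (row-buffer conflicts, counting the very first row opening too). */
static size_t count_row_conflicts(const unsigned long *addrs, size_t n) {
    size_t conflicts = 0;
    long open_row = -1;                       /* -1: no row open yet */
    for (size_t i = 0; i < n; i++) {
        long row = (long)(addrs[i] / ROW_BYTES);
        if (row != open_row) {                /* miss: close old row, open new */
            conflicts++;
            open_row = row;
        }                                     /* else: row-buffer hit */
    }
    return conflicts;
}
```

A run of accesses that stays inside one 8KB page opens a row once and then hits repeatedly, while a pattern that ping-pongs between two rows pays the open/close penalty on every single access.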

FR-FCFS

Memory accesses are not sent to the DRAM in the order the core issues them. Instead, they are buffered in on-chip memory request buffers and dispatched to the DRAM in the order chosen by the memory scheduler. To maximize throughput of the memory system, memory schedulers today employ a policy called First-Ready, First-Come First-Serve (FR-FCFS). The idea is to prioritize accesses to the page which is currently open in order to minimize the number of row conflicts. While this helps memory throughput, it further increases the penalty of a row-buffer conflict. As a rule of thumb, a row-buffer conflict takes at least 3x the time of a row-buffer hit. It is important to note that all known memory scheduling algorithms schedule requests for each bank independently and thus have no impact on bank-level parallelism.
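The prioritization can be sketched in a few lines. This toy single-bank picker is invented for illustration (a real scheduler also tracks DRAM timing constraints when deciding what is "ready"): it scans the queue oldest-first for a request that hits the open row, and falls back to plain FCFS.

```c
#include <assert.h>
#include <stddef.h>

/* rows[0..n-1] is one bank's request queue, oldest first; open_row is
 * the row currently in the bank's row buffer. Returns the index of the
 * request to service next under a simplified FR-FCFS. */
static size_t frfcfs_pick(const long *rows, size_t n, long open_row) {
    for (size_t i = 0; i < n; i++)
        if (rows[i] == open_row)
            return i;      /* oldest row-buffer hit wins ("first ready") */
    return 0;              /* no hit queued: take the oldest ("FCFS") */
}
```

Note how a steady stream of hits to the open row can keep bypassing an older request to a different row, which is exactly why a row-buffer conflict gets even more expensive under this policy.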

What can programmers do?

Avoid memory accesses: I cannot stress this enough. Memory may seem free in terms of storage, but it costs a lot in performance once the working set grows bigger than the cache or the available DRAM. It is ultimately a good idea to reduce the memory footprint as much as reasonably possible, especially on embedded devices.

Bank conflicts: In general, application programmers work with virtual addresses while the DRAM operates on physical addresses. Thus, programmers usually cannot reduce hazards like bank conflicts because they have no control over the placement of data in physical memory. However, there are some special scenarios where programmers can help. For example, most ISAs now support virtual pages as big as 1GB. Programmers are allowed to do their own data placement within these large pages, which gives them a lot more control over which bank each piece of data goes to, and they can use this control to increase bank-level parallelism.
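On Linux, for example, such a large page can be requested with mmap. The sketch below assumes a kernel with 1GB huge pages reserved (e.g., booted with hugepagesz=1G hugepages=N); without that reservation the call fails and the function returns NULL, so real code needs a fallback path.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT                       /* older headers may lack these */
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)  /* log2(1GB) encoded in the flags */
#endif

/* Try to map one 1GB huge page; returns NULL on failure. */
static void *alloc_huge_1gb(void) {
    void *p = mmap(NULL, 1UL << 30, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Within such a page, consecutive virtual offsets map to consecutive physical addresses, so the programmer's own placement decisions directly control which channel, bank, and row each object lands in.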

Row-Buffer Conflicts: Programmers can also help by increasing spatial locality in the row buffer, i.e., by making consecutive memory accesses as close to each other as possible. This often means packing data that is likely to be accessed together into consecutive memory locations; e.g., building a struct of arrays often performs better. It also implies that linked data structures, which promote random accesses, should be avoided as much as possible because each row-buffer conflict costs on the order of 100 processor cycles.
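As a sketch of the struct-of-arrays idea (the type and function names here are invented for illustration), compare the two layouts when only one field is scanned:

```c
#include <assert.h>
#include <stddef.h>

enum { NPART = 1024 };

/* Array-of-structs: each particle's x, y, z sit together, so a pass
 * that reads only x skips through memory 24 bytes at a time. */
struct particle_aos { double x, y, z; };

/* Struct-of-arrays: all x values are consecutive, so the same pass is
 * a purely sequential walk that maximizes row-buffer and cache hits. */
struct particles_soa { double x[NPART], y[NPART], z[NPART]; };

static double sum_x(const struct particles_soa *p) {
    double s = 0.0;
    for (size_t i = 0; i < NPART; i++)
        s += p->x[i];                /* consecutive addresses */
    return s;
}
```

Which layout wins depends on the access pattern: if the code always touches all three fields of one particle together, the array-of-structs form is the one that keeps accesses consecutive.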

Experimental Data

I did an experiment, described in this post, on an ARM Cortex-A8 which showed that sequential accesses cost only about 20 processor cycles each while random accesses cost close to 180 cycles. The same experiment on a Core i7 showed about a 2x speedup from sequential accesses.
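That kind of experiment can be approximated with a pointer-chasing microbenchmark. The sketch below is my own simplified reconstruction, not the original code: it links every slot of a large array into one cycle, either in address order or in shuffled order, and times a traversal of dependent loads through it.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

#define CHAIN_LEN (1UL << 20)   /* 1M slots (8 MB); grow well past the cache */

/* Link next[] into one cycle covering every slot, visited either in
 * address order (sequential) or in shuffled order (random). */
static void build_chain(size_t *next, int shuffled) {
    size_t *order = malloc(CHAIN_LEN * sizeof *order);
    for (size_t i = 0; i < CHAIN_LEN; i++) order[i] = i;
    if (shuffled)
        for (size_t i = CHAIN_LEN - 1; i > 0; i--) {   /* Fisher-Yates */
            size_t j = (size_t)rand() % (i + 1);
            size_t t = order[i]; order[i] = order[j]; order[j] = t;
        }
    for (size_t i = 0; i < CHAIN_LEN; i++)
        next[order[i]] = order[(i + 1) % CHAIN_LEN];
    free(order);
}

/* Walk the whole cycle once with dependent loads; return elapsed time. */
static double chase_seconds(const size_t *next) {
    clock_t t0 = clock();
    size_t p = 0;
    for (size_t i = 0; i < CHAIN_LEN; i++)
        p = next[p];                 /* each load depends on the previous */
    if (p != 0) abort();             /* sanity: a full cycle returns home */
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

On a DRAM-bound working set, the shuffled chain should run several times slower than the sequential one; the exact ratio depends on cache sizes and the hardware prefetcher.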

Conclusion

Programmers can save battery and increase performance via DRAM-aware programming. Such optimizations may not be practical in all scenarios, but understanding the basic DRAM structure can help programmers identify existing or potential performance bottlenecks, especially on more constrained platforms like the iPhone.

  13 Responses to “What every Programmer should know about the memory system”

  1. Fascinating.

    Just for clarity – in a standard PC, does the memory scheduler live in the CPU, the Northbridge, or on the RAM stick?

    • It depends on the CPU. Older CPUs had the memory scheduler (memory controller) on the north bridge. AMD made the first x86 processors with the memory controller on the CPU die and Intel followed suit later on.

So basically, all current-generation (and several past-generation) x86 processors have the memory controller on the CPU die.

      Also, one important aspect of all current DRAM sticks is that they’re purposely designed to be “dumb” devices — i.e. they are basically just the storage circuits with enough circuitry to retrieve data and send it on the bus. They have no logic to prevent timing violations — the memory controller is wholly responsible for ensuring that the timing is valid. This is what makes designing memory controllers rather difficult.

• Exactly, and it's great for performance: having the memory controller on chip allows better and more informed scheduling, which increases performance substantially.

  2. Isn’t this a shorter version of Ulrich Drepper’s What every programmer should know about memory –
    http://lwn.net/Articles/250967/ ?

• Interesting. It does seem that way, and it's amusing how Ulrich and I chose the same title. I had not seen Ulrich's article when I wrote mine, but thank you for pointing me to it. It is very well done (a little too detailed for most software guys, frankly) and a good reference for anyone who wants to learn more. Another good starting point is Dr. Bruce Jacob's DRAMsim, which simulates a DRAM in good detail.

  3. The 32 bit address pictured under ‘banks’ has 33 bits.


  6. Hi Aater, there’s a little typo in the article: FR-FCFS is First-Ready FCFS, meaning that rows in row buffers are ready to be accessed and accesses to them should have priority. Nice article!

