CS 111: Lecture 15

VM and processes; distributed systems

Date: November 18, 2010

Scribe Notes: Paul Leung, Shannon Lin

 

Virtual memory continued:

 

Last lecture, we learned about virtual memory and focused on obtaining correctness and robustness as well as the mechanisms behind virtual memory. Today, the focus will be on increasing performance. We will explore different policies, or how to pick a victim to be removed from a higher level of memory.

 

1. Victim policy

 

A victim is a data block in a higher level of memory that will be replaced with another data block. A victim policy is a method for choosing which data block in that level of memory to evict. In order to use data, it must first be brought into RAM; when RAM is full, something already resident has to be removed to make room. Thus, we must decide how to pick a victim: data isn't evicted from RAM at random, but according to a victim policy.

 

1.1 Locality of Reference

 

First, we take into account locality of reference. There are 2 types of locality: temporal locality and spatial locality. Temporal locality means that accessing a page makes it likely we will access the same page again soon: if we access page P at time T, we are likely to access page P again at time (T + δ), where δ is small. Spatial locality means that accessing a page makes it likely we will soon access a nearby page: if we access page P at time T, we are likely to access page (P + δ) at time (T + δ'), where δ and δ' are small. A good victim policy takes advantage of both temporal and spatial locality as much as possible.
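As an illustration (this example is not from the lecture), the two loops below touch the same array with very different spatial locality: the row-major loop walks memory sequentially, while the column-major loop jumps a whole row between consecutive accesses.

#include <stdio.h>

#define N 1024

static int a[N][N];            /* C stores this array row by row */

int main(void)
{
    long sum = 0;

    /* Good spatial locality: consecutive accesses touch adjacent words,
       so each page (or cache line) brought in is used many times. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor spatial locality: consecutive accesses are a whole row
       (N * sizeof(int) bytes) apart, so neighboring data brought in
       with each access is mostly wasted. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%ld\n", sum);
    return 0;
}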

 

1.2 Testing Policies

 

So given different policies, how do we know which one to use? We must test the different policies and compare their performance. One possible method is to implement each policy in Linux, run it on a real machine for a year on actual workloads, and compare the results. This is a very expensive approach, however. Is there a cheaper way to test these policies? A smarter alternative is to take data from past runs and replay it against the policies we want to test. Specifically, we take the reference strings from those past runs.

 

A reference string is a sequence of page numbers, specifically virtual page numbers, which records the order in which pages were accessed during some previous run (e.g. 0 1 2 3 0 1 4 0 1 2 3 4). These can be used as our sample test cases: to compare different policies, we can replay a multitude of different reference strings and see how each policy performs.

 

1.3 Policy: FIFO (First in First Out)

 

The “first in first out”, or FIFO, policy replaces pages in the order they were loaded into RAM: the first pages brought into RAM are the first pages to be evicted. This policy is relatively simple to implement. We simply record when each page is loaded from disk into RAM and pick victims in that order (note: how many times the page has been touched since then does not affect this ordering).

 

To test this policy according to the reference string mentioned earlier, first we assume that it is a small machine and there are 3 physical pages of space on RAM, named “A, B, C” (although in reality they are labeled “0 1 2” but in order to not confuse ourselves with the reference string, we’ll use A B C). We use our example sequence to test our FIFO policy:

 

An asterisk (*) marks a page fault.

Requested page:  0*  1*  2*  3*  0*  1*  4*  0   1   2*  3*  4
Frame A:         0   0   0   3   3   3   4   4   4   4   4   4
Frame B:             1   1   1   0   0   0   0   0   2   2   2
Frame C:                 2   2   2   1   1   1   1   1   3   3

    Cost = 9 page faults

 

From this test case, we can see that the cost, counted as the number of page faults (i.e. the number of times a requested page had to be brought into one of our physical pages of RAM), is 9 page faults. Thus, given 3 physical pages, the cost is 9 page faults. If we had 5 physical pages, the same FIFO policy would cost only 5 page faults; that is the best we can do, because all 5 distinct pages (0, 1, 2, 3, 4) fit into RAM at once. What about 4 physical pages, though? 4 physical pages would incur a cost of 10 page faults. This is counterintuitive: one would think that adding more RAM leads to better performance, whereas in this case adding more RAM actually leads to poorer performance. This is known as Belady's Anomaly, meaning that adding more cache can slow you down (note: doubling cache size is highly unlikely to slow you down). Belady's Anomaly rarely happens in practice; this reference string was specifically constructed to demonstrate it.
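To make the comparison concrete, here is a small sketch (not from the lecture; count_fifo_faults is a made-up name for illustration) that replays the reference string against a FIFO policy and counts page faults. Running it with 3, 4, and 5 frames reproduces the 9, 10, and 5 fault counts above.

#include <stdio.h>

/* Replay a reference string against FIFO replacement and count faults. */
int count_fifo_faults(const int *refs, int nrefs, int nframes)
{
    int frames[16];
    int next_victim = 0;               /* round-robin pointer = FIFO order */
    int used = 0, faults = 0;

    for (int i = 0; i < nrefs; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (frames[j] == refs[i]) { hit = 1; break; }
        if (hit)
            continue;
        faults++;                      /* page not resident: page fault    */
        if (used < nframes)
            frames[used++] = refs[i];  /* a free physical page remains     */
        else {
            frames[next_victim] = refs[i];            /* evict the oldest  */
            next_victim = (next_victim + 1) % nframes;
        }
    }
    return faults;
}

int main(void)
{
    int refs[] = {0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4};
    int n = sizeof refs / sizeof refs[0];

    for (int f = 3; f <= 5; f++)       /* 3 -> 9, 4 -> 10, 5 -> 5 faults   */
        printf("%d frames: %d faults\n", f, count_fifo_faults(refs, n, f));
    return 0;
}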

 

 

1.4 Policy: Optimal Page Replacement

The best page replacement policy, or optimal page replacement, is achieved by looking into the future to decide which pages we will need and which pages we should replace. This requires an oracle, a subroutine that can look into the future (i.e. scan all the way to the tail of the reference string) and tell us which page to choose. With this, the optimal replacement policy is to pick as the victim the page that is not mentioned again for the longest time. We use our example sequence to test the optimal page replacement policy:

 

An asterisk (*) marks a page fault.

Requested page:  0*  1*  2*  3*  0   1   4*  0   1   2*  3*  4
Frame A:         0   0   0   0   0   0   0   0   0   2   2   2
Frame B:             1   1   1   1   1   1   1   1   1   3   3
Frame C:                 2   3   3   3   4   4   4   4   4   4

    Cost = 7 page faults

 

From this test case, we can see that the cost given 3 physical pages is 7 page faults. This is an improvement over the cost of 9 page faults with the FIFO policy, and given this reference string, it is the lowest cost achievable with 3 physical pages. We have a bigger issue, however: most systems are not equipped with an oracle that can look into the future for us. So how can we implement an oracle? One option is to implement it in hardware; this approach, however, is unlikely. Another option is to run the program once, collect the reference string, and use that reference string as our oracle when we run the program again. Yet, while an oracle is sometimes available this way, most often it is not. Replaying the reference string also adds a bit of delay, but compared to the savings, it is worth it.
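The oracle-based policy is easy to express when the whole reference string is known in advance. Here is a small sketch (not from the lecture; count_optimal_faults and next_use are made-up names for illustration) in which the tail of the reference string itself plays the role of the oracle.

#include <stdio.h>

/* Index of the next use of `page` at or after index `from`;
   returns nrefs if the page is never used again. */
static int next_use(const int *refs, int nrefs, int from, int page)
{
    for (int i = from; i < nrefs; i++)
        if (refs[i] == page)
            return i;
    return nrefs;                      /* never used again: ideal victim */
}

int count_optimal_faults(const int *refs, int nrefs, int nframes)
{
    int frames[16];
    int used = 0, faults = 0;

    for (int i = 0; i < nrefs; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (frames[j] == refs[i]) { hit = 1; break; }
        if (hit)
            continue;
        faults++;
        if (used < nframes)
            frames[used++] = refs[i];
        else {
            /* Consult the "oracle": evict the resident page whose next
               use lies farthest in the future. */
            int victim = 0, farthest = -1;
            for (int j = 0; j < used; j++) {
                int nu = next_use(refs, nrefs, i + 1, frames[j]);
                if (nu > farthest) { farthest = nu; victim = j; }
            }
            frames[victim] = refs[i];
        }
    }
    return faults;
}

int main(void)
{
    int refs[] = {0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4};
    int n = sizeof refs / sizeof refs[0];

    printf("%d faults\n", count_optimal_faults(refs, n, 3));   /* 7 */
    return 0;
}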

 

 

1.5 Policy: LRU (Least Recently Used)

 

The least recently used policy, or LRU, picks as the victim the page that has gone unused for the longest time.

 

An asterisk (*) marks a page fault.

Requested page:  0*  1*  2*  3*  0*  1*  4*  0   1   2*  3*  4*
Frame A:         0   0   0   3   3   3   4   4   4   2   2   2
Frame B:             1   1   1   0   0   0   0   0   0   3   3
Frame C:                 2   2   2   1   1   1   1   1   1   4

    Cost = 10 page faults

 

In this case, the cost is 10 page faults, which on this reference string is extremely bad, even worse than FIFO. This reference string, however, is a special case picked to illustrate Belady's Anomaly; on typical workloads LRU usually works better.

But how do we implement LRU? One option is to use a one-level page table, indexed by VPN, with the physical page number stored in each entry. In the page table, we use the extra bits at the end of each entry to hold a timestamp or page-fault counter indicating when the page was last used. To prevent problems when a counter overflows, we should renormalize the values once a counter wraps around. The overhead of maintaining and renormalizing such a table slows us down, however, so we cannot implement exact LRU efficiently. As a result, most systems cheat by implementing an approximation to a real LRU policy, where the counter is not updated on every reference but only when a page fault occurs. Program loops also tend to have spatial locality, with memory typically traversed in the positive (increasing address) direction.
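As a sketch of the timestamp idea (not from the lecture; count_lru_faults is a made-up name for illustration), exact LRU over a reference string can be simulated by recording, for each resident page, the time of its last use and evicting the page with the smallest timestamp.

#include <stdio.h>

/* Exact LRU via per-frame "last used" timestamps, as described above. */
int count_lru_faults(const int *refs, int nrefs, int nframes)
{
    int frames[16], last_used[16];
    int used = 0, faults = 0;

    for (int t = 0; t < nrefs; t++) {
        int hit = -1;
        for (int j = 0; j < used; j++)
            if (frames[j] == refs[t]) { hit = j; break; }
        if (hit >= 0) {
            last_used[hit] = t;        /* refresh the page's timestamp   */
            continue;
        }
        faults++;
        if (used < nframes) {
            frames[used] = refs[t];
            last_used[used] = t;
            used++;
        } else {
            int victim = 0;            /* evict the least recently used  */
            for (int j = 1; j < used; j++)
                if (last_used[j] < last_used[victim])
                    victim = j;
            frames[victim] = refs[t];
            last_used[victim] = t;
        }
    }
    return faults;
}

int main(void)
{
    int refs[] = {0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4};
    int n = sizeof refs / sizeof refs[0];

    printf("%d faults\n", count_lru_faults(refs, n, 3));   /* 10 */
    return 0;
}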

 

 

2. Ideas for improving virtual memory performance

 

2.1 Creating better policies for replacement

 

One straightforward option for improving virtual memory performance is to create a better page replacement policy. This is hard to do in practice, however, since many policies have already been invented and studied.

 

2.2 Demand Paging

 

Another option is called demand paging. With demand paging, a page is not loaded into RAM until the program actually references it. This is different from ordinary paging, where we read the whole program into RAM up front and only then begin execution. To compare ordinary paging with demand paging, consider the following:

 

N = number of pages in a program

U = number of pages used (0 < U <= N)

C = average cost of reading in a page to RAM

F = cost of faulting

 

 

                     Total cost        Latency (until execution starts)
Ordinary paging      NC                NC
Demand paging        UC + (U-1)F       C
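As a rough worked example with made-up numbers (an illustration, not figures from the lecture): suppose N = 1000 pages, only U = 100 of them are ever touched, C = 10 ms per page read, and F = 0.1 ms of fault-handling overhead. Ordinary paging then costs 1000 × 10 ms = 10 seconds, all of it spent before the first instruction runs. Demand paging costs 100 × 10 ms + 99 × 0.1 ms ≈ 1.01 seconds in total, and the program starts running after only about 10 ms.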

 

If you think only about total cost, ordinary paging looks faster once the whole program is loaded into memory. But if you look at the computer and the user as a whole, you want both of them working at the same time. If, as a user, you had to wait until the entire program was loaded into RAM, you wouldn't want to use it. Rather, you want something that starts quickly, even at some cost in total work, so that you can use the computer and think about what you want to do next while it loads pages, instead of waiting for the computer to finish before you can start working again. Demand paging is thus especially good for impatient users, which seems to be the majority of all people.

 

2.3 Dirty bit

 

Suppose a victim has never been changed, meaning no one has modified the loaded page since it was read in. This is nice, since we won't have to write the page back to disk upon replacing it or picking it as the victim, which costs a lot of time. But how can we know whether a page has been modified? This marking is referred to as the dirty bit. For each page, there is a dirty bit which indicates whether the page has been modified. If the bit = 0, the page is clean. If the bit = 1, the page has been modified.

 

How can we implement this? There are 3 main options. The first option is for the application to pair every store instruction with a "set dirty bit" instruction. The second option is to implement the dirty bit in hardware; this is the best option since it makes the common case fast and as easy as possible (many processors do this). The third option is to mark all pages as RD_ONLY at first, and on the first write fault upgrade the page to R/W and set the dirty bit implicitly in software. Kernel developers have to design the kernel so that it uses option 2 (the hardware option) and takes advantage of it when it is available, but falls back to option 3 when the hardware support is not there.
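A minimal sketch of option 3 (software-maintained dirty bits), assuming hypothetical kernel helpers pte_lookup(), pte_set_prot(), and page_index(); real kernels use different names and data structures.

/* Pages start RD_ONLY; the first store traps, and the kernel records
   the dirty bit and upgrades the mapping to R/W.  NPAGES and the
   helper functions below are hypothetical. */

#define NPAGES 1024

struct pte;                                      /* page table entry     */
extern struct pte *pte_lookup(void *addr);       /* hypothetical helper  */
extern void pte_set_prot(struct pte *p, int rw); /* hypothetical helper  */
extern int page_index(void *addr);               /* hypothetical helper  */

static char dirty[NPAGES];             /* software-maintained dirty bits */

void write_fault_handler(void *faulting_addr)
{
    struct pte *p = pte_lookup(faulting_addr);

    /* The page really is writable; it was mapped RD_ONLY only so that
       the first store would trap into the kernel. */
    dirty[page_index(faulting_addr)] = 1;   /* remember the modification  */
    pte_set_prot(p, 1);                     /* 1 = read/write from now on */

    /* Returning retries the faulting store, which now succeeds. */
}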

 

2.4 Two classic virtual memory tricks

 

There are 2 classic virtual memory tricks to improve performance.

 

2.4.1 Heap manager of Bourne shell

 

The brk/sbrk system calls used to grow the heap are slow, so some implementations instead install a signal handler for SIGSEGV that allocates more memory when the program runs off the end of its heap. The advantage of this method is that it avoids the sbrk overhead on the common path. The disadvantage, however, is that if your program makes a genuinely bad memory reference, the handler will allocate more space for it anyway, and you lose valuable error checking.
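A minimal user-space sketch of the idea (assumptions: Linux/POSIX, a 4 KiB page size, and an mmap'ed arena rather than the real sbrk-based heap the Bourne shell used; mprotect is not formally async-signal-safe, so this is purely illustrative).

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define ARENA_SIZE (1 << 20)           /* 1 MiB of reserved address space */
#define PAGE 4096                      /* assumed page size               */

static char *arena;                    /* reserved but not yet usable     */
static size_t committed;               /* bytes made read/write so far    */

static void grow_on_fault(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = si->si_addr;

    /* Only faults inside our arena are "allocations"; anything else is a
       genuinely bad reference, so restore the default action and re-raise. */
    if (addr < arena || addr >= arena + ARENA_SIZE) {
        signal(SIGSEGV, SIG_DFL);
        raise(SIGSEGV);
        return;
    }

    /* Commit page-aligned memory up to and including the faulting address. */
    size_t need = (size_t)(addr - arena) + 1;
    size_t newcommit = (need + PAGE - 1) & ~(size_t)(PAGE - 1);
    mprotect(arena + committed, newcommit - committed, PROT_READ | PROT_WRITE);
    committed = newcommit;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = grow_on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* Reserve address space with no access rights: touching it faults, and
       the handler quietly makes it usable; no explicit sbrk() or malloc(). */
    arena = mmap(NULL, ARENA_SIZE, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED)
        return 1;

    arena[0] = 'x';                    /* first touch grows the arena     */
    arena[123456] = 'y';               /* so does this one                */
    printf("%c %c\n", arena[0], arena[123456]);
    return 0;
}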

 

2.4.2 vfork

 

vfork is a system call that acts like fork. There are 2 main differences between vfork and fork. The first difference is that the parent and child share memory and page tables, rather than the child getting its own copies. The second difference is that the parent is frozen until the child exits or calls exec. These differences allow vfork to run without needing to clone the parent's memory. One way to improve the performance of ordinary fork is to do a "lazy fork" (copy on write): the parent and child both get page tables marked RD_ONLY. The OS knows the pages should really be R/W, but the hardware thinks they are RD_ONLY, so pages are cloned only when one side actually writes to them. You still have to clone the page tables, but that is much smaller than cloning all of memory. Using vfork and exec also avoids cloning memory, but vfork is non-orthogonal, which makes it controversial.
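A minimal sketch of the vfork-then-exec pattern (POSIX assumption; in the child it is only safe to call exec or _exit, since memory is shared with the frozen parent).

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = vfork();
    if (pid < 0) {
        perror("vfork");
        return 1;
    }
    if (pid == 0) {
        /* Child: shares the parent's memory, and the parent is frozen
           until we exec or _exit, so do nothing else here. */
        execlp("echo", "echo", "hello from the child", (char *)NULL);
        _exit(127);                    /* exec failed; do NOT call exit() */
    }
    /* Parent: resumes only after the child has exec'd or exited. */
    int status;
    waitpid(pid, &status, 0);
    return 0;
}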

 

 

3. Virtual Memory and File System efficiency

 

One thing we must always take into account when dealing with file systems and virtual memory is the fact that accesses to disk are much slower compared to cache and RAM. We should also make disk access as efficient as possible. What does this entail though? One thing to be concerned with is looking at several disk accesses as a whole. Do these requests attempt to read/write to blocks that are very close to each other on the physical disk? One of the common cases that leads to very poor performance is when two programs are reading from two opposite ends of the disk and their requests are queued in an alternating pattern. Each request is preceded by the worst case seek time. Let's look at some of the disk scheduling algorithms that are available.

 

 

So how can we make this better? Let's first get an understanding of the costs involved in using a disk. There is a cost to read and write a block, and there is a cost to have the disk arm seek to that block. The cost to read/write a block is primarily a hardware constraint, but we can optimize our software to make seeks better. To get a better picture of this, let's assume a simple model in which the whole disk is a linear array (rather than cylinders), with positions measured as fractions of the disk from 0 to 1. If the arm is at position i and seeks to position j, the cost is |i - j|.

 

Looking at FIFO's cost under this model: if the disk arm is currently at the front of the array (position 0), the average seek cost to a uniformly random block is ½ of the disk. If the arm is at the middle, the average cost is ¼. To get the overall average we integrate over all head positions h from 0 to 1: the expected cost at position h is h*(h/2) + (1-h)*((1-h)/2) = (h^2 + (1-h)^2)/2, and integrating this from 0 to 1 gives 1/3. So with FIFO scheduling and random requests, each access costs about a third of a full-disk seek on average.
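A quick numerical sanity check of that 1/3 figure (a sketch, not from the lecture): average the seek distance between random head and request positions on a unit-length disk.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int trials = 1000000;
    double total = 0;

    srand(42);
    for (int k = 0; k < trials; k++) {
        double i = rand() / (double)RAND_MAX;   /* current head position  */
        double j = rand() / (double)RAND_MAX;   /* next requested block   */
        total += (i > j) ? i - j : j - i;       /* seek cost |i - j|      */
    }
    printf("average seek cost = %f (analytically 1/3)\n", total / trials);
    return 0;
}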

 

 

 

 

Are there any other things we can do to make our algorithms better? One thing we can add to any of the algorithms we mentioned is batching and dallying. Batching combines several small requests into one larger disk operation. Dallying has the system stop performing disk accesses for a moment in order to let the pool of pending requests grow; after this happens, we pick a request from the queue according to our algorithm. This is beneficial when a request that will soon arrive in the pool is likely to be the best thing to do next. Dallying is also known as anticipatory scheduling. The next question that arises is how long one should dally for. Studies and implementations usually base this on past reference strings in order to figure out an optimal time.

 

Lastly, there are some robustness constraints for file systems to be concerned with. Our disk scheduling algorithms must also respect the fact that certain requests have to be done in a particular order (remember that of all the algorithms we mentioned, only FIFO performs requests in order). Say we have requests A, B, and C and we are doing a rename. If we do C first and then crash, we have no way of repairing the file system afterward. We can therefore attach dependencies to requests that the scheduling algorithm must respect; in our example, we would say: do C only after A and B have been done. BSD file systems do this, and it is certainly something we should do as well.