Virtual Memory

By Avinash Kumar, Nithin Reddy and Vineet Pandit.

 

One-level page table model:

·         Maps virtual pages to physical RAM.

·         There can be empty (unmapped) virtual pages.

 

off_t swapmap(void *address, pte_t *p)
{
    // Note: & binds less tightly than + in C, so the page-alignment mask
    // must be parenthesized or it would apply to the whole sum.
    return p->swap + ((intptr_t) address & ~(pagesize - 1));
}

 

·         void *address is the address of the failing page.

·         pte_t *p points to the process table entry.

o   This parameter is needed because two processes can page fault at the same time on the same computer. With p, the OS has a way to find out where each process's virtual memory lives on disk.

 

·         This routine should be called when there is a page fault.

·         For example: there is a 0 (an invalid entry) in the page table entry.

 

·         There is a routine that the kernel runs when there is a page fault.

 

void pfault(void *addr, pte_t *p)
{
    off_t o = swapmap(addr, p);       // find out where on disk the OS swaps this page
    if (o == -FAULT)                  // invalid page: it doesn't swap to anywhere
    {
        kill(p, SIGSEGV);             // bad process (just pseudocode)
    }
    else
    {
        // We want to copy the page from the disk to RAM, but in actuality
        // there is never any unused RAM. So we look through physical memory
        // and pick a victim page that is in use but is not being used / will
        // not be used in the near future (page_removal_policy).
        victim_page, victim_process = page_removal_policy();   // victim page: owned by victim_process
        pa = pmap(victim_page, victim_process);                // pa = physical page address
        write page pa to disk @ swapmap(victim_page, victim_process);  // save the old page to disk
        read page pa from disk @ swapmap(addr, p);                     // read the new page into memory
        p->page_table[addr >> page_bits] = pa;                 // update the faulting process's page table
        victim_process->page_table[victim_page] = FAULT;       // invalidate the victim's page table entry
    }
}

·         When the kernel traps on a page fault, the hardware saves the program counter of the instruction that faulted.

·         In other words, on return from the interrupt the same instruction is retried.

·         When the same instruction runs again, this time the page table lookup will work.

 

Q: Say one process faults: should the entire kernel freeze and wait for the write and read to complete?

A: Of course not. While we're handling the page fault we shouldn't stay inside the kernel; we should let other processes run, because the read and write are going to take a while. Even though the read and write look like subroutines, each one actually causes us to return from the interrupt and let somebody else run. When the write finishes we get another trap and it's time to do the read; while we're reading we again let other processes run; then there's another trap and we update the page table entries.

 

 

·         As we mentioned, two processes can have read-only access to the same page, so it's possible that they both page fault at the same time on a page that isn't there. In that case you have to be careful when looking for victim pages not to step on your own work. This makes the implementation more complicated.

 

 

The page_removal_policy

·         How do we decide which page is the victim?

·         2 basic properties of almost all real-world applications:

a)      Spatial locality: accessing page i implies that you will soon access nearby pages (i+1, i-1, i+2, i-2, ...).

b)      Temporal locality: if you access page i at time t, then you will most probably access the same page again shortly after t.

·         In practice these notions are also combined.

 

·         Now we have to implement page_removal_policy.

·         A terrible (but simplest) algorithm is first-in-first-out: whichever page got into the system first is the first to get paged out.

·         The best policy would see the future and remove the page that won't be needed for the longest time (the "oracle" algorithm).

·         Unfortunately, oracles aren't around in the real world.

 

Testing the page_removal_policy

·         We run an application on a busy server and record its reference string.

o   A reference string is the list of page numbers an actual app touches; we use this reference string to evaluate our page_removal_policy.

·         Ideally we'll have a database of reference strings, each with an importance attached to it. We'll take these reference strings and try them out with our algorithm.

·         We can have reference strings for many apps: Oracle, Apache, MySQL, Firefox.

 

 

·         Imagine a small example:

o   A small virtual memory of 5 pages, numbered 0 through 4, with reference string 0 1 2 3 0 1 4 0 1 2 3 4

o   3 pages of physical RAM

o   Assuming first-in-first-out

 

Page | 0   1   2   3   0   1   4   0   1   2   3   4
-----+-----------------------------------------------
  1  | Δ0  0   0   Δ3  3   3   Δ4  4   4   4   4   4
  2  | -   Δ1  1   1   Δ0  0   0   0   0   Δ2  2   2
  3  | -   -   Δ2  2   2   Δ1  1   1   1   1   Δ3  3

(Rows 1-3 are the three physical pages; Δ marks a page fault that loads the page into that row.)

This method results in 9 page faults (the cost of running the program).
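As a sanity check, here is a minimal user-level sketch (not kernel code; the names NFRAMES, ref, and frame are ours) that replays this reference string against a FIFO replacement policy and counts the faults:

#include <stdio.h>

#define NFRAMES 3                           /* physical pages of RAM */

int main(void)
{
    int ref[] = {0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4};
    int nref = sizeof ref / sizeof ref[0];
    int frame[NFRAMES];                     /* resident virtual page numbers */
    int used = 0;                           /* frames filled so far */
    int next = 0;                           /* index of the oldest resident page */
    int faults = 0;

    for (int i = 0; i < nref; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (frame[j] == ref[i])
                hit = 1;
        if (hit)
            continue;
        faults++;                           /* this access is a page fault */
        if (used < NFRAMES) {
            frame[used++] = ref[i];         /* a free frame is available */
        } else {
            frame[next] = ref[i];           /* evict the page that came in first */
            next = (next + 1) % NFRAMES;
        }
    }
    printf("%d page faults\n", faults);     /* prints 9 for NFRAMES == 3 */
    return 0;
}

Because frames are filled in order and then evicted round-robin, the rotating index next always points at the oldest resident page, which is exactly FIFO.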

 

·         Ways to reduce page faults (make the program go faster).

·         2 ways to attack this problem: brute force (more RAM) or a smarter page_removal_policy.

§  Brute force: simply increase RAM.

§  Suppose we had 5 pages of RAM: there would be only 5 page faults.

§  Now with 4:

Page | 0   1   2   3   0   1   4   0   1   2   3   4
-----+-----------------------------------------------
  1  | Δ0  0   0   0   0   0   Δ4  4   4   4   Δ3  3
  2  | -   Δ1  1   1   1   1   1   Δ0  0   0   0   Δ4
  3  | -   -   Δ2  2   2   2   2   2   Δ1  1   1   1
  4  | -   -   -   Δ3  3   3   3   3   3   Δ2  2   2

 

 

This results in 10 page faults.

3 pages of RAM meant 9 page faults, but with 4 there are 10 page faults.

The anomaly: more RAM, but a slower system.

Belady's anomaly: increasing memory (or cache) does not always help.
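In the FIFO sketch above, changing NFRAMES from 3 to 4 reproduces this anomaly: the run prints 9 faults with 3 frames and 10 faults with 4.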

 

The best possible policy would be the 'Oracle' algorithm.

It looks into the future and discards the page that is not needed for the longest time, but this algorithm is impossible to implement.

Using this algorithm on our reference string:

 

The page # being requested is shown across the top.

Page | 0   1   2   3   0   1   4   0   1   2   3   4
-----+-----------------------------------------------
  A  | Δ0  0   0   0   0   0   0   0   0   Δ2  2   2
  B  | -   Δ1  1   1   1   1   1   1   1   1   Δ3  3
  C  | -   -   Δ2  Δ3  3   3   Δ4  4   4   4   4   4

We get 7 page faults in the 3 page machine.
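Offline, where the whole reference string is known, the oracle (often called Belady's MIN) is easy to sketch; the only change from the FIFO sketch above is the eviction choice. Again, all names are ours:

#include <stdio.h>

#define NFRAMES 3

int main(void)
{
    int ref[] = {0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4};
    int nref = sizeof ref / sizeof ref[0];
    int frame[NFRAMES];
    int used = 0, faults = 0;

    for (int i = 0; i < nref; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (frame[j] == ref[i])
                hit = 1;
        if (hit)
            continue;
        faults++;
        if (used < NFRAMES) {
            frame[used++] = ref[i];          /* a free frame is available */
            continue;
        }
        /* Oracle step: evict the resident page whose next use is
           farthest in the future (or that is never used again). */
        int victim = 0, farthest = -1;
        for (int j = 0; j < NFRAMES; j++) {
            int nextuse = nref;              /* "never used again" */
            for (int k = i + 1; k < nref; k++)
                if (ref[k] == frame[j]) { nextuse = k; break; }
            if (nextuse > farthest) { farthest = nextuse; victim = j; }
        }
        frame[victim] = ref[i];
    }
    printf("%d page faults\n", faults);      /* prints 7 here */
    return 0;
}

Ties among resident pages that are never used again can be broken arbitrarily without changing the fault count; this run prints 7, matching the table.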

 

 

 

 

 

In the real world, seeing the future is not possible. So we approximate the future using the information that is available to us: the past.

One of the common algorithms for this is the 'Least Recently Used' (LRU) algorithm.

It assumes that a page that hasn't been used recently won't be used in the near future.

This algorithm assumes temporal locality.

 

Using LRU.

Page | 0   1   2   3   0   1   4   0   1   2   3   4
-----+-----------------------------------------------
  A  | Δ0  0   0   Δ3  3   3   Δ4  4   4   Δ2  2   2
  B  | -   Δ1  1   1   Δ0  0   0   0   0   0   Δ3  3
  C  | -   -   Δ2  2   2   Δ1  1   1   1   1   1   Δ4

 

 

10 page faults on a 3 page machine.

 

For this particular string LRU is worse than FIFO.
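Offline, against a recorded reference string, LRU itself is easy to sketch: just remember each resident page's last-use time (names ours; a real kernel cannot afford this bookkeeping on every access, which is the problem discussed next):

#include <stdio.h>

#define NFRAMES 3

int main(void)
{
    int ref[] = {0, 1, 2, 3, 0, 1, 4, 0, 1, 2, 3, 4};
    int nref = sizeof ref / sizeof ref[0];
    int frame[NFRAMES];
    int lastuse[NFRAMES];                /* time of each frame's last access */
    int used = 0, faults = 0;

    for (int i = 0; i < nref; i++) {
        int hit = -1;
        for (int j = 0; j < used; j++)
            if (frame[j] == ref[i])
                hit = j;
        if (hit >= 0) {
            lastuse[hit] = i;            /* a hit still updates recency */
            continue;
        }
        faults++;
        int victim;
        if (used < NFRAMES) {
            victim = used++;             /* a free frame is available */
        } else {
            victim = 0;                  /* evict the least recently used page */
            for (int j = 1; j < NFRAMES; j++)
                if (lastuse[j] < lastuse[victim])
                    victim = j;
        }
        frame[victim] = ref[i];
        lastuse[victim] = i;
    }
    printf("%d page faults\n", faults);  /* prints 10 here */
    return 0;
}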

 

Implementing LRU:

We cannot implement LRU using the data structures we already have in our page fault algorithm: there is not enough information in the tables to tell you which page was least recently accessed. So we have to add code to the kernel to record when pages are accessed.

 

How to record when pages are accessed

You could use a real-time clock, adding a clock value to each page table entry.

Disadvantage:

This causes the PTE to grow by 32 bits (for a 32-bit clock). This is very expensive, as PTEs take up RAM.

 

Instead of using a real-time clock, there could be a clock with fewer bits that ticks once per page fault.

Example: use a very small (low-resolution) clock, maybe 4 bits, and ideally store it in the few spare bits of the page table entry.

Simple model: every time the clock ticks, increment every entry by 1; any page table entry whose count reaches 15 is counted as old. But incrementing all the entries is a slow process.

Alternative model: instead, let the clock wrap around and stamp each entry with the current clock value. This doesn't record how old the page is; it records the last time the page was accessed. A sketch follows.
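A sketch of what this could look like, with an entirely hypothetical PTE layout (the field names and widths are ours, not any real architecture's):

/* Hypothetical PTE: spare low bits of a 32-bit entry hold a tiny
   wrap-around clock that ticks once per page fault. */
struct pte {
    unsigned frame : 20;        /* physical page number */
    unsigned flags : 8;         /* valid, writable, ... */
    unsigned stamp : 4;         /* clock value at last recorded access */
};

static unsigned fault_clock;    /* global clock; ticks once per page fault */

void record_access(struct pte *e)
{
    fault_clock = (fault_clock + 1) & 0xF;  /* 4-bit clock, wraps at 16 */
    e->stamp = fault_clock;     /* remember when this page last faulted */
}

Note there is no mass increment of all entries here; each entry just remembers the clock value of its own last recorded access.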

Disadvantages:

This only counts accesses that page fault, because the kernel is consulted only on a fault. An access to a page that is already in RAM never enters the kernel, so it is never counted. Since most accesses don't fault, this ends up ignoring most accesses; what we really need to count are the accesses that DIDN'T page fault.

 

So by itself this is a terrible approximation of LRU.

 

There are two other approaches.

Software: the kernel traps for many reasons besides page faults, e.g. hardware and clock interrupts; it traps every 1/50th of a second automatically. So on every trap, sample which pages are being accessed:

-          from the kernel

-          from applications

Hardware: every time a page is used, the hardware sets a bit in its page table entry. For writes this is the dirty bit: the hardware maintains it by setting it every time the page is written.

a.       If it is not maintained by hardware: mark clean pages read-only; writing then causes a fault, so the page fault handler sets the dirty bit, marks the page read-write, and returns.

 

 

 

 

 

How to tell if a page has been modified

Keep a copy of the page in RAM; then:

1.       Compare the page to the copy in RAM  =>  works, but too expensive.

2.       Keep a checksum (say MD5) of the page in the PTE: if it matches, the page hasn't changed, move on.

        1. MD5: 128 bits.
        2. The chance of 2 pages having the same checksum is very small (2^-128).

Problems:

a.       Chance of collision: 2 different blocks of memory hash to the same value, and we will lose data.

b.      The checksumming process is slow (it walks through the whole page)  =>  consumes CPU.

3.       Instead, have a single bit in the page table entry called the dirty bit: cleared if the page has never been changed, set when it is changed.

    1. Ideally, hardware maintains the dirty bit by setting it every time the page is written, i.e. on every write reference to the page.
    2. If it is not maintained by hardware, it can be approximated in software, and it's a perfect approximation. When a page is first pulled in, mark it read-only. The first time someone writes to the page: fault. The page fault handler sets the dirty bit, marks the page read-write, and returns (a sketch follows this list).
      1. Even though there's no hardware support for a dirty bit, as long as the hardware allows you to mark pages read-only, you can implement dirty bit support in software.
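A minimal sketch of that software dirty bit, assuming a hypothetical PTE with protection flags the kernel can set (all names here are ours):

#define PTE_DIRTY     0x1      /* our software-maintained dirty bit */
#define PTE_READONLY  0x2      /* hardware-enforced write protection */

struct pte {
    unsigned long frame;       /* physical page number */
    unsigned flags;
};

/* Called when the page is first brought into RAM. */
void page_in(struct pte *e)
{
    e->flags &= ~PTE_DIRTY;    /* clean so far */
    e->flags |= PTE_READONLY;  /* any write will trap into the kernel */
}

/* Called by the page fault handler on a write-protection fault. */
void write_fault(struct pte *e)
{
    e->flags |= PTE_DIRTY;     /* the page is about to be modified */
    e->flags &= ~PTE_READONLY; /* allow further writes at full speed */
    /* on return from the fault, the write is retried and succeeds */
}

The cost is one extra fault per page, on its first write; after that, writes to the page run at full speed.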

 


Assume:

C = 10 ms to read a random page from disk

F = 0.1 ms of CPU overhead to handle a page fault

N = 10,000 pages in the program

U = 1,000 pages actually used (U <= N)

 

Assume no thrashing.

Which is faster:

1. No demand paging: read all N pages of the program into RAM, then execute it (no page faults).

2. Demand paging: read the first page into RAM and start running it, faulting in pages as needed.

 

 

                     Total cost     Latency (time to start the program)
No demand paging     NC             NC
Demand paging        U(C + F)       C + F

With demand paging, we fault in pages as needed.

Latency: the amount of time it takes to start up the program (for example, the time from the user clicking the icon until the splash screen appears).
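Plugging in the numbers above (a quick worked check of the formulas): with no demand paging, total cost = latency = NC = 10,000 × 10 ms = 100 s. With demand paging, total cost = U(C + F) = 1,000 × 10.1 ms ≈ 10.1 s, and latency = C + F = 10.1 ms. When U is much smaller than N, demand paging wins on both measures; when U == N it loses on total cost, since N(C + F) > NC.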

(See COMMON OPTIMIZATIONS, Spring '09 notes.)

Reading the program into RAM: assume that there's no thrashing. Read all N pages into RAM and then you're finished: no page faults.

The other option is demand paging: don't read the whole program into RAM up front.

Sometimes the total cost is greater with demand paging (if U == N): then demand paging is slower.

In another sense, we're being unfair by charging the same cost C for every read: on disk, it is much cheaper to read N contiguous pages than to read the same number of random pages, because sequential reads avoid seeks. These numbers are fair for flash, but on disk the per-page cost of the big sequential read is considerably smaller. Therefore sometimes ordinary (non-demand) paging will win.

 

 

 

fork vs vfork

We know about fork.

vfork:

A.      The child runs and the parent doesn't (the parent is frozen; it can't even do a wait) until the child does one of:

1.       an exit

2.       an exec, i.e. the child runs some other program

B.      Child and parent share all their memory until 1. or 2. occurs.

-          So no race conditions occur due to this shared memory, as child and parent can't possibly run at the same time.

 

What's the point? Why vfork?

vfork is only good if you want to do an exec, or a few other small things in the child, such as closing file descriptors (e.g. pipe ends).

Because child and parent share memory, they share page tables, so we don't need to copy all the memory to the clone. vfork is a cheap fork, thanks to the existence of virtual memory.

You can do everything with vfork that you do with fork, except that you don't have to copy the page tables, and therefore it runs FASTER. A sketch of the intended usage follows.
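A minimal sketch of the intended pattern (assuming POSIX vfork and exec; the program being exec'd is just an example):

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = vfork();            /* parent is frozen until exec/_exit */
    if (pid < 0) {
        perror("vfork");
        return 1;
    }
    if (pid == 0) {
        /* Child: shares the parent's memory, so touch as little as
           possible; only exec or _exit. */
        execlp("date", "date", (char *) NULL);
        _exit(127);                 /* exec failed; _exit, never exit() */
    }
    /* Parent resumes here only after the child has exec'd or exited. */
    int status;
    waitpid(pid, &status, 0);
    return 0;
}

The child calls _exit rather than exit so it doesn't flush or tear down stdio state it shares with the frozen parent.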

 

 

Forks in general:

Nowadays: people don’t copy the RAM, but a fork now clones the page tables and marks them as read only and copy_on_write which copies the pages if write is called in the child.

To make fork go faster on modern machines: mark the copy to be read only and copy on write. When you try to write into these pages tables, you will fault.

 

e.g. Emacs uses vfork.

 

The reason for the controversy over fork vs. vfork: we have traded complexity for speed.

 

 

 

Distributed Systems

You can do virtual memory with no disk: page "over the net".

This works, but paging is more expensive: the cost of a page fault is higher.

To support this in a distributed system, you need something like:

Read_over_the_net() syscall

Write_over_the_net() syscall

 

Remote Procedure Call (RPC)

Caller is on machine A.

Callee is on machine B (the machine with the disk).

It is like an ordinary function call, except:

1.       Hard modularity.

2.       No call by reference, only call by value: caller and callee do not share an address space (see the marshalling sketch below).

3.       The caller could be x86 and the callee MIPS: they don't need the same instruction set.
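A tiny sketch of why it is call by value only: an RPC stub can't ship a pointer to another machine, so it copies argument values into a message buffer in an agreed byte order. Everything here (the opcode value, the function name) is made up for illustration:

#include <arpa/inet.h>   /* htonl: host to network (big-endian) byte order */
#include <stdint.h>
#include <string.h>

/* Marshal a hypothetical "read this page over the net" request. We cannot
   pass the caller's pointer to machine B; instead we copy the page number
   by value, in a byte order that x86 and MIPS both agree on. */
size_t marshal_read_request(unsigned char *buf, uint32_t page_number)
{
    uint32_t op = htonl(1);              /* made-up opcode for "read page" */
    uint32_t pg = htonl(page_number);
    memcpy(buf, &op, sizeof op);
    memcpy(buf + sizeof op, &pg, sizeof pg);
    return sizeof op + sizeof pg;        /* bytes to send over the net */
}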

 

Disadvantage:

1.       Performance downside: it's slow.