CS 111

Lecture 14, 11/12/09

File System Robustness (cont.) and Virtual Memory

by Gordon Lin and Alfonso Roman

File System Robustness (cont.)

Problem Case: Power Failure

Our file system needs to be robust - even when the power gets pulled.

Simple case:

Suppose we are editing a file with a text editor. This text editor has a [save] button which writes the changes to disk. Suppose we make some changes to the file and hit the save button. As the disk is writing the changes, we lose power to our machine. When we reboot the machine, we want to find the file in either its old state, before the changes were made, or in its new state, after the changes were applied. What we don't want to find is a garbage file caught halfway in the transition from the old version to the new version. In other words, we want to design a file system such that saves are atomic.

One possible solution would be to keep two copies of the file: one copy representing the current state of the file, and one temporary copy used to implement atomic saves. When we save the file, we first write our changes to the temporary copy. Once the changes have been successfully written to the temporary copy, we then write the same changes to the current copy. If we have a power failure during the save, we can reboot the machine, inspect both copies, and decide which one to recover from. If we lost power while writing to the temporary copy, we can recover the old file from the current copy. If we lost power while writing to the current copy, we can recover using the temporary copy.

There is a problem with this solution: after rebooting from a power failure, we don't know which copy to choose. We need additional metadata to help us pick.

Idea 1: Commit Record

One possible solution would be to add a commit record to the temporary copy of the file. This commit record is a one bit flag that will act as a record of intent, and will be written atomically. The commit record is set after the disk finishes writing to the temporary file, and before the disk begins writing to the current file. The commit record is then cleared once the disk finishes writing to the current file.



    Here is the general pattern for use of the commit record.

       Begin
        -Precommit Phase
        -Commit                 (atomic)
        -Post-commit Phase (clean up/performance/clear commit record)
       End

    Notice how you can build large atomic operations from smaller atomic (all-or-nothing) ones. We could nest this pattern several levels deep.
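
    As a concrete illustration, here is a minimal sketch of the pattern in C. The file names, the one-byte commit record, and the assumption that fclose() pushes data to disk are all illustrative - this is not a real file system API.

        #include <stdio.h>

        //Save atomically: temp copy first, then commit record, then main copy.
        //Error checking omitted; assumes fclose() flushes the data to disk.
        void atomic_save( const char *data, size_t len )    {
            FILE *f;

            f = fopen("file.tmp", "w");         //precommit: write the new version
            fwrite(data, 1, len, f);
            fclose(f);

            f = fopen("file.commit", "w");      //COMMIT: set the record
            fputc(1, f);
            fclose(f);

            f = fopen("file.main", "w");        //post-commit: install the new version
            fwrite(data, 1, len, f);
            fclose(f);

            f = fopen("file.commit", "w");      //post-commit: clear the record
            fputc(0, f);
            fclose(f);
        }

    On reboot, recovery reads file.commit: if the record is set, copy file.tmp over file.main and clear the record; if it is clear, leave file.main alone.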
   
    Although simple, there's a catch: this scheme is quite SLOW, since every save writes two full copies of the same file. We might be able to do better.

Idea 2: Use a Journal

In a journal-based system, we split the file system into two pieces. The first piece is the data blocks, which represent the state of the file system; the second piece is the journal. The journal consists of (mostly) append-only entries; each entry details the changes to be made to the data blocks and includes a commit record to keep track of whether the change has been made. If you want to make changes to the file system, you must log them in the journal before changing the actual data blocks.



    Here are the basic steps of how a journal works:
        -When we want to make changes to the data blocks, we first write the changes to be made to the journal.
        -Then we set the commit record.
        -On reboot, we rescan the journal from oldest to newest.
        -We install any changes detailed by journal entries whose commit record is set.
        -Lastly, we clear the commit records.

Should the reboot fail again (say, by another power outage), recovery from the journal is idempotent; that is, repeated recoveries have the same effect as one.
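
Here is a minimal sketch of the recovery scan in C. The journal_entry layout and the write_block() helper are hypothetical, chosen only to illustrate the loop; a real journal would live on disk.

    void write_block( int block, const char *data );    //hypothetical disk write

    struct journal_entry {
        int committed;          //commit record for this entry
        int block_number;       //data block this entry updates
        char new_data[8192];    //new contents for that block
    };

    //Replay committed entries, oldest to newest; safe to run repeatedly.
    void replay_journal( struct journal_entry *j, int n )    {
        for (int i = 0; i < n; i++)
            if (j[i].committed)
                write_block(j[i].block_number, j[i].new_data);
        for (int i = 0; i < n; i++)
            j[i].committed = 0;     //clear commit records last
    }

Installing the same new data twice leaves the blocks in the same state as installing it once, which is why the scan can safely be repeated after another crash.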

When designing a journal-based system, it is important to pick a good journal size. This is an engineering trade-off: if the journal is too small, it will fill up quickly and cause stalling; if the journal is too big, it will waste disk space. A well-designed journal does have certain performance advantages, however. First, writes to the journal are sequential. Second, the journal tail can be cached for faster access. Third, the journal itself can be stored on a separate disk drive from the data blocks.

An extreme case of journal usage is an on-disk journal with a RAM disk holding our data blocks. The data blocks live in fast volatile memory, while the durable journal on disk records every committed change, so the data can be reconstructed after a crash. While this approach is good for performance, it's not ideal for all systems, especially real-time systems (think of a black-box data recorder on an airplane).

Two Types of Journals
Write-Ahead Log: Write the new values to the log, then commit. Requires a post-commit phase in order to copy the new data into the cell data.

Write-Behind Log: Write the old values to the log first, then write the new values into the cells, then commit. No post-commit phase. The advantage of this method is that during recovery we can scan backward through the log (a cheaper operation), restoring old values for any uncommitted changes.
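
A sketch of the two protocols side by side, with hypothetical log_append(), log_commit(), cell_read(), and cell_write() helpers standing in for the real disk operations:

    void log_append( int block, const char *val );      //hypothetical helpers
    void log_commit( void );
    void cell_write( int block, const char *val );
    const char *cell_read( int block );

    void write_ahead( int block, const char *new_val )    {
        log_append(block, new_val);             //1. log the NEW value
        log_commit();                           //2. commit
        cell_write(block, new_val);             //3. post-commit: copy into the cell
    }

    void write_behind( int block, const char *new_val )    {
        log_append(block, cell_read(block));    //1. log the OLD value
        cell_write(block, new_val);             //2. overwrite the cell
        log_commit();                           //3. commit; no post-commit phase
    }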

Aside on the ext4 file system

The man page for ext4 lists in its notes (bugs) section that potential data loss may occur in this file system. This happens because the designers of ext4 wanted to get the most performance possible from the file system. One way they achieved this was by using extents, which allocate a big file very quickly but trade off atomicity. From the wiki: an extent is a range of contiguous physical blocks, improving large file performance and reducing fragmentation. You can read more about ext4 here [link to wiki page: http://en.wikipedia.org/wiki/Ext4 ].

As an example of how ext4 might lose your data, say our system is doing an operation:
    write("...", 8192)
    -> and we crash!
    when we try to read our file after reboot we read in 0's!

BUG: We've taken an atomic operation and split it into two parts:

    (atomic)                   1a. Allocate space
    (atomic for small files)   1b. Copy data into new space

But the operation as a whole is not atomic. The file system's invariants are still satisfied, but performance has been traded off against durability.

Virtual Memory

Problem Case: Unreliable processes with bad memory references. Big Problem!

Solution 1: Fire your programmers. Get better ones...

    -This works, but better programmers also cost more money - an expensive solution.

Solution 2: Add runtime checking to programs

We could routinely check array subscripts to see if we are accessing within bounds (subscript checking like you were taught in CS 31). We could also check for null pointers and make sure we are not accessing invalid memory. However, we would need many conditional branches in order to perform these checks, and that would hurt performance. We would also have to trust applications to do the checking. This would be possible if all our applications were compiled by a single trusted compiler that added all the checks for us, but this is still slow.
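
As a rough sketch of the kind of check such a compiler might insert before each access (the function and its arguments are illustrative):

    #include <stdlib.h>

    //Read a[i], but first verify the pointer and the subscript.
    int checked_read( const int *a, size_t n, size_t i )    {
        if (a == NULL || i >= n)    //extra conditional branch on every access
            abort();
        return a[i];
    }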

Solution 3: Base + Bound registers

In this solution, each process uses two privileged registers, a base register and a bound register, in order to prevent access to invalid memory. The base and bound registers point to two memory addresses, and the hardware is modified so that every memory access R must satisfy base ≤ R < bound. In essence, each process now lives within its own memory jail. However, this does pose several problems.
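
Before looking at those problems, here is a minimal sketch of the check the hardware would perform on every access (the names are illustrative; real base-bounds hardware does this check, and traps on failure, as part of each access):

    #include <stdint.h>

    //Hardware-style check: the access is legal only if base <= addr < bound.
    int access_ok( uintptr_t addr, uintptr_t base, uintptr_t bound )    {
        return base <= addr && addr < bound;
    }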


Problems
    -Processes are stuck with inflexible memory sizes - we can't easily add memory.
    -Processes are not able to share memory among themselves.
    -We cannot run the same program in multiple processes if it uses absolute memory addresses.
    -I/O must be done through the kernel.

Solution 4: Multiple Base-Bounds registers

We can solve the inflexibility of memory sizes and the inability to share memory if we use multiple base-bound register pairs, one per segment. However, this approach still has problems.
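
Before the problems, a hedged sketch of how the hardware might find the segment an access falls in, assuming a hypothetical per-process table of base-bound pairs:

    #include <stdint.h>

    struct segment { uintptr_t base, bound; };      //one base-bound register pair

    //Return the index of the segment containing addr, or -1 to signal a trap.
    int find_segment( const struct segment *segs, int n, uintptr_t addr )    {
        for (int i = 0; i < n; i++)
            if (segs[i].base <= addr && addr < segs[i].bound)
                return i;
        return -1;
    }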



Problems:
    -Because segments are different sizes, we'll get external fragmentation.
    -Not easy to change the size of a single segment.


Page-Based System

A page-based system attempts to solve the problems found in the previous solutions. We chop up RAM into small, fixed-size pieces called pages, and we use a page table to map virtual pages to physical pages. A virtual address consists of a page number, which selects a page table entry, and an offset within the page.


Let's look at an x86 machine and see how a page table works.

    Suppose we use 32-bit addresses. The top 20 bits of the address contain the page number, and the bottom 12 bits hold the offset, so pages are 4 KiB (2^12 bytes). Given a virtual address, the page table maps it to its physical page.



    size_t pmap( void *a )    {
        size_t vpn = (uintptr_t) a >> 12;   //top 20 bits of the address = virtual page number
        return PAGE_TABLE[vpn];             //the pointer to PAGE_TABLE is stored in the %cr3 register
    }                                       //caller combines the result with the 12-bit offset

Although a page-based system solves the issues of external fragmentation and memory sharing, it still has problems of its own. Suppose we use 4 KiB pages. If each page table entry is 4 bytes, the total size of the page table is the entry size multiplied by the number of entries, and with 20-bit page numbers there are 2^20 entries. The page table size is then 4 * 2^20 bytes, or 4 MiB per process! This will not scale well: 1000 processes would require 4 GB of RAM just for page tables.

2-Level Page Table




In order to address the scaling issues of the previous page table design, we now explore the 2-level page table. Instead of keeping a page table with entries for every virtual page, we only keep entries for the pages that actually need to be mapped. Instead of using the top 20 bits as one page number, the virtual address is now split into three parts: HI, LO, and OFFSET. The HI and LO segments use 10 bits each, and the OFFSET uses the last 12 bits. Using the %cr3 register, we access the level-1 page table, called the page directory. The HI segment of the virtual address selects an entry within the page directory, and this entry leads us to a second, level-2 page table. The LO segment of the virtual address then selects the correct entry within the level-2 page table, which directs us to the right memory page. Finally, the 12-bit OFFSET points to the location within that page. The following function is an example of how this may work.

    size_t pmap( void *a )    {
        uintptr_t A = (uintptr_t)a;             //unsigned, so the shifts don't sign-extend
        int hi = A >> 22;                       //top 10 bits: index into the page directory
        int lo = (A >> 12) & ((1 << 10) - 1);   //next 10 bits: index into the level-2 table
        size_t *p = PAGE_TABLE[hi];             //PAGE_TABLE is the page directory, from %cr3
        if (!p)
            return FAULT;                       //no level-2 table here: address is unmapped
        else
            return p[lo];                       //entry gives the page; combine with the offset
    }

Page Faulting

Page faulting is when the page table says 'no page there' or the page is not accessible for this instruction. If this happens we trap and enter the kernel (which is told the faulting address, the PC, etc.). The kernel can then decide to terminate the process if it is suspicious, or send it a SIGSEGV signal (which the process needs to know how to handle, or it will die).

Or the kernel can fix the page table entry to make the access work and restart the instruction. The motivation for this is that a program that needs 4 GB of RAM can now run on 1 GB of physical RAM.
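
A minimal sketch of that policy in C - every helper name here is hypothetical, standing in for the kernel's real page-replacement machinery:

    #include <stdint.h>
    #include <stddef.h>

    size_t pick_victim_page( void );                                //hypothetical helpers
    void write_page_to_disk( size_t physical_page );
    void read_page_from_disk( size_t vpn, size_t physical_page );

    //Evict a victim page, load the needed page into it, and remap.
    void page_fault_handler( uintptr_t faulting_va, size_t *page_table )    {
        size_t vpn = faulting_va >> 12;         //virtual page that faulted
        size_t victim = pick_victim_page();     //choose a physical page to evict
        write_page_to_disk(victim);             //save the victim's contents
        read_page_from_disk(vpn, victim);       //load the needed page from disk
        page_table[vpn] = victim;               //map the virtual page to its new home
        //on return, the hardware restarts the faulting instruction
    }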



Problems:
    -This is very slow since we're going to the disk to get data.
    -Page thrashing also kills this idea.

Can we cheat this approach? 

    -We can try to write to the %cr3 register -- but the kernel already makes this a privileged register. 
    -We can also try to write to the page table itself -- the kernel doesn't let the application see its own page table. 

So we can't cheat this approach so long as the kernel does its job.