CS111 Scribe Notes - 11/24/09

Chris McAndrews
Steven Pease
Sarkis Khachatryan
Gevork Palyan

Robustness of NFS

Process 1:

clock_gettime(CLOCK_REALTIME, &t1);   /* record timestamp t1 */
read(fd, buf, sizeof(buf));           /* then read the start of the file */

Process 2:

write(fd, buf, sizeof(buf));          /* overwrite the start of the file */
clock_gettime(CLOCK_REALTIME, &t2);   /* then record timestamp t2 */

Suppose process 2 writes to the start of the file, its execution precedes process 1's, and the two processes run at roughly the same time. Is it possible for process 1's read to retrieve the file's old data, before process 2 overwrites it?

On a distributed system such as NFS, the two processes' timestamps may disagree, because the processes could be running on different client machines whose clocks differ. This discrepancy is called clock skew, and it can undermine the robustness of the system.

To fix clock skew, one could use NTP (the Network Time Protocol) to synchronize the clients' and server's clocks. NTP can bring the machines into agreement to within a few milliseconds.

While this approach may work for some machines, it will not be sufficient if the OS requires nanosecond accuracy. So why bother keeping timestamps at all?

The Unix command 'make' uses timestamps to make sure files are compiled in order. In its first implementation (1975), ties meant 'rebuild'. As processors grew faster, more and more ties occurred. This led to a revision (1977) where ties meant 'don't rebuild'.

With NFS, we cannot assume that reads and writes will agree on timestamps. NFS does not have write-to-read consistency, and changes to a file cannot be committed via a write.

However, NFS does provide close-to-open consistency: a close commits a client's changes to the server (which is part of why close is typically slow), so a subsequent open of the file will see them.

The fsync() system call can be used to flush all of a file's data (including its metadata) to stable storage before continuing with any other operation. fdatasync() is similar, except that it does not flush metadata.
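As a rough sketch of how a client might use these calls (the function name write_durably, the path argument, and the error handling are all made up for illustration, not part of NFS itself):

#include <fcntl.h>
#include <unistd.h>

/* Sketch: write a buffer and make sure it reaches stable storage
   before any other client opens the file. */
int write_durably(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len) {
        close(fd);
        return -1;
    }
    if (fsync(fd) != 0) {      /* flush data AND metadata before going on */
        close(fd);
        return -1;
    }
    /* fdatasync(fd) would flush the data but not metadata such as mtime */
    return close(fd);          /* close commits the changes (close-to-open) */
}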

This scenario is an example of a waterbed effect, as in order to increase the robustness of NFS we must sacrifice performance (closes are slower than writes).

In a parallel NFS implementation, a client connects to a central server, which keeps track of file metadata. The central server talks to several data servers which hold the actual file data. The data servers may utilize the idea of RAID to improve robustness.

RAID

RAID stands for Redundant Array of Inexpensive (later Independent) Disks. It was developed as a way to improve the robustness of a filesystem.

Many flavors of RAID have been developed.

The first, RAID 0, does not actually improve robustness. Its purpose is to increase the size of the filesystem by concatenating many small disks into one large 'disk'. RAID 0 can also be nested, where a RAID 0 system can be a concatenation of many smaller RAID 0 systems.

RAID 0 can be implemented using another method known as striping, where the data written to the concatenated disk is distributed in segments between each of the underlying physical disks. This approach can then use the read/write heads on each of the disks to access the segmented data in unison. This can improve the speed of a read or write by up to n times, where n is the number of physical disks used in the RAID system.
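A minimal sketch of the striping arithmetic, assuming a made-up disk count and stripe size (real implementations choose these parameters differently):

#define N_DISKS            4    /* hypothetical number of physical disks */
#define SECTORS_PER_STRIPE 8    /* hypothetical stripe unit, in sectors */

/* Map a logical sector number onto (disk, sector on that disk). */
void raid0_map(long logical, int *disk, long *physical)
{
    long stripe = logical / SECTORS_PER_STRIPE;   /* which stripe unit */
    long offset = logical % SECTORS_PER_STRIPE;   /* offset within it */
    *disk     = (int)(stripe % N_DISKS);          /* round-robin across disks */
    *physical = (stripe / N_DISKS) * SECTORS_PER_STRIPE + offset;
}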

RAID 0 is not robust. If one physical disk fails, the whole system fails.

RAID 1 uses a mirroring technique to improve robustness. It works by using two (or more) disks as clones of one another. Data is written to each drive in unison, so that if one drive fails, the system still retains the data. There is no storage gain with this technique, and it costs twice (or more) as much for the same amount of usable space. There is no performance gain when writing data, as the whole file must be written to each drive. However, there is a performance gain for reads, as the heads of the different disks can be used in parallel to retrieve different segments of a file.
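A sketch of the mirroring idea, with two hypothetical file descriptors standing in for the two drives:

#include <sys/types.h>
#include <unistd.h>

/* Mirrored write: the same block goes to both drives. */
int mirror_write(int fd0, int fd1, const void *block, size_t len, off_t off)
{
    if (pwrite(fd0, block, len, off) != (ssize_t) len)
        return -1;
    if (pwrite(fd1, block, len, off) != (ssize_t) len)
        return -1;
    return 0;
}

/* Mirrored read: either drive has the data, so reads can alternate
   between the drives to use both heads in parallel. */
ssize_t mirror_read(int fd0, int fd1, void *block, size_t len, off_t off)
{
    static int turn;
    int fd = (turn++ & 1) ? fd1 : fd0;    /* trivial load balancing */
    return pread(fd, block, len, off);
}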

RAID 2 and RAID 3 were made obsolete by RAID 4 and 5.

RAID 4 was developed to address the cost and performance concerns, while maintaining the robustness of RAID 1. RAID 4 borrows the concatenation and striping ideas from RAID 0, but adds another physical disk as a checksum disk. The checksums are often implemented using parity, where the segments of data between the disks are XOR'd together. If a data disk fails, the data can be reconstructed from the remaining drives and the parity.
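A minimal sketch of parity in C (the block size, disk count, and function names are made up): the parity block is the byte-wise XOR of the corresponding data blocks, and a lost block is recovered by XORing the parity with the surviving blocks.

#define BLOCK_SIZE 512   /* hypothetical block size */
#define N_DATA     4     /* data disks; a fifth disk holds the parity */

/* Parity block = XOR of the corresponding blocks on every data disk. */
void compute_parity(unsigned char data[N_DATA][BLOCK_SIZE],
                    unsigned char parity[BLOCK_SIZE])
{
    for (int i = 0; i < BLOCK_SIZE; i++) {
        parity[i] = 0;
        for (int d = 0; d < N_DATA; d++)
            parity[i] ^= data[d][i];
    }
}

/* If data disk 'failed' is lost, its block is the XOR of the parity
   block with the blocks of all surviving data disks. */
void rebuild_block(unsigned char data[N_DATA][BLOCK_SIZE],
                   const unsigned char parity[BLOCK_SIZE], int failed)
{
    for (int i = 0; i < BLOCK_SIZE; i++) {
        unsigned char b = parity[i];
        for (int d = 0; d < N_DATA; d++)
            if (d != failed)
                b ^= data[d][i];
        data[failed][i] = b;
    }
}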

There are some problems with RAID 4, however. Although it can outperform RAID 1, RAID 4 is still bottlenecked by the checksum disk: every write, no matter which data disk it touches, must also compute the parity and update the checksum disk. Also, if the checksum drive fails, the system is reduced to the vulnerability of RAID 0: if a second drive fails before the checksum drive is replaced, the system fails.

RAID 5 is an improvement on RAID 4 in which the checksums are distributed over all the disks, instead of being concentrated on one checksum disk. This improves on RAID 4's robustness, as there is no longer a dedicated checksum disk to lose. However, the system can still fail if more than one disk is lost at a time: it takes time to reconstruct a failed drive's data, and if a second drive fails before the rebuild completes, the data is irrecoverable and the whole system fails.
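One possible way to rotate the parity, sketched below (actual RAID 5 layouts vary by implementation):

/* Rotate the parity round-robin: stripe s places its parity block on
   disk s mod n, where n is the total number of disks in the array. */
int parity_disk(long stripe, int n_disks)
{
    return (int)(stripe % n_disks);
}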

Sun's ZFS is based on ideas from RAID 5. RAID 6 is similar to RAID 5, except that it distributes two parity blocks for each stripe of data across the disks, so the array can survive the loss of two disks at once; this removes RAID 5's vulnerability during the window when one disk has already failed.

RAID can be implemented by both hardware and software. Hardware RAID is implemented via the device controller. Software RAID is controlled by a device driver of the operating system. Software RAID is the more flexible of the two, as it can implement RAID on disk partitions as well as physical disks.

To determine the failure rate of a disk, we use several measures. The mean time to failure (MTTF) is the time from when a disk comes up and is working until it fails. The mean time to repair (MTTR) is the time from when a disk fails until it is back up again. The mean time between failures (MTBF) is the MTTF plus the MTTR. The availability of the drive is the ratio of the MTTF to the MTBF, and the downtime of the disk is 1 minus the availability.
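As a hypothetical worked example: if a disk has an MTTF of 100,000 hours and an MTTR of 10 hours, then MTBF = 100,000 + 10 = 100,010 hours, availability = 100,000 / 100,010, which is roughly 0.9999 (99.99%), and downtime = 1 - 0.9999 = 0.0001, i.e. the disk is down about 0.01% of the time (on the order of an hour per year).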

Typical graphs of failure rates follow a bathtub-curve model: a steep decline from a high failure rate to a low failure rate over a short initial period, followed by a much more gradual increase in failure rate over time.

Annual failure rate (AFR) graphs tend to show a rise in failures after about 2 years of use. A RAID 4 system using 5 disks (4 data disks, 1 parity disk) can achieve failure rates around 0.004%. The highest-end uptimes that can be achieved are near 99.999% ("five nines").

NFS Security

Ordinary filesystem code is trusted, but since NFS runs over a network and may even be run over the Internet, additional protection is needed.

Problem: Network traffic can be insecure and vulnerable to interception or forgery (such as an attacker sending a packet that looks like a valid read request).

Solution: Encrypt traffic between NFS clients and the server.

Problem: Different NFS clients of a given server may assign user IDs differently. For instance, one machine might assign user ID 1000 to "eggert" while another assigns user ID 1000 to "kohler". The filesystem itself stores only the numeric user ID.

Solution: Require credentials from the client to establish the username-to-user-ID correspondence. The credentials server can either be the NFS server itself or a separate machine.

Parallel NFS Security

Parallel NFS splits up filesystem operations over a set of servers. The main NFS server will store filesystem metadata. Block servers store actual file data. NFS service is parallelized by having the main NFS server redirect read requests to the proper block server.

In addition to the concerns with ordinary NFS, Parallel NFS must also contend with additional security issues. For instance, a client could join the network and attempt to read directly from a block server it doesn't have access to.

Security

In the real world, security protects you against attacks via force and fraud.

In the virtual world, force attacks are still somewhat of an issue (laptops can be stolen), but fraud is what we are mainly interested in (e.g. cameras capturing passwords).

There are generally three main types of attacks:

1) Privacy: Attacks against privacy deal with the unauthorized release of information. (reading)

2) Integrity: Attacks against integrity can be unauthorized modification of information. (writing)

3) Service: Attacks against service deny authorized users access to the system. (a denial-of-service attack)

Defender's Goals

- Allow authorized access: this is easy to check, since we can test the system with real authorized users.

- Disallow unauthorized access: This is hard to test.

- Good performance: we can easily test this as well.