Lecture 16 Scribe Notes

 

by Kevin Nguyen, Andrew Kirchhoff, Tak Chiu Chan

 

RPC Performance Issues (again)

 

Problem 1: Some time is spent unmarshalling, computing, and marshalling, but most of the time is spent waiting for requests and responses to cross the network. We waste a lot of cycles just waiting.





Solution 1: Use batching to process more data per request, much like batching reads from a physical disk. Another approach is asynchronous requests: issue new requests before the previous ones have finished.
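
As a rough sketch of the batching idea (rpc_read() below is a made-up stand-in for "send one request and wait for its response", not a real RPC or NFS call):

    #include <stddef.h>

    /* Hypothetical stub: pretend this sends one request and waits for the reply. */
    static size_t rpc_read(int fh, size_t off, void *buf, size_t n)
    {
        (void)fh; (void)off; (void)buf;    /* a real client would do I/O here */
        return n;
    }

    enum { BLOCK = 8192 };

    /* Naive client: one round trip per block, so most time is spent waiting. */
    static void read_file_naive(int fh, char *buf, size_t len)
    {
        for (size_t off = 0; off < len; off += BLOCK) {
            size_t n = len - off < BLOCK ? len - off : BLOCK;
            rpc_read(fh, off, buf + off, n);
        }
    }

    /* Batched client: ask for many blocks per request, so far fewer round trips. */
    static void read_file_batched(int fh, char *buf, size_t len)
    {
        const size_t BATCH = 64 * BLOCK;
        for (size_t off = 0; off < len; off += BATCH) {
            size_t n = len - off < BATCH ? len - off : BATCH;
            rpc_read(fh, off, buf + off, n);
        }
    }

    int main(void)
    {
        static char buf[1 << 20];                /* a 1 MiB file */
        read_file_naive(7, buf, sizeof buf);     /* 128 round trips */
        read_file_batched(7, buf, sizeof buf);   /*   2 round trips */
        return 0;
    }

For the same amount of data, the batched version makes a small fraction of the round trips, which is where the savings come from.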

 

Problem 2: What if requests depend on each other? In particular, what if one request fails?

 

Solution 2: This can be tricky, and there is usually a tradeoff among performance, correctness, and simplicity. One example of an RPC-style protocol that deals with this situation is HTTP with pipelining.

 

Problem 3: Can we support multiple kinds of file systems on the same machine?

 

Solution 3: Yes! (Just look at Lab 3.) We use an object-oriented approach: the kernel specifies an interface that every file system must implement, and the kernel uses this interface to know what code to execute when it needs to manipulate a given file system.

 

Problem 3a: C isn't an object-oriented language, so how can we use object-oriented programming?

 

Solution 3a: The kernel specifies a file system struct whose members are pointers to functions. When a program needs to operate on a file system, the kernel calls through the appropriate function pointer to run that file system's code.
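
A minimal sketch of the idea in C (fs_ops, myfs_read, and vfs_read are invented names for illustration; the real kernel interface has many more operations):

    #include <stdio.h>

    /* An "interface": a struct whose members are pointers to code. */
    struct fs_ops {
        int (*read)(int ino, char *buf, int n);
        int (*write)(int ino, const char *buf, int n);
    };

    /* One hypothetical file system's implementation of that interface. */
    static int myfs_read(int ino, char *buf, int n)        { (void)ino; (void)buf; return n; }
    static int myfs_write(int ino, const char *buf, int n) { (void)ino; (void)buf; return n; }

    static const struct fs_ops myfs_ops = { myfs_read, myfs_write };

    /* "Kernel" code: it manipulates any file system through the pointers,
       without knowing which file system it is talking to. */
    static int vfs_read(const struct fs_ops *fs, int ino, char *buf, int n)
    {
        return fs->read(ino, buf, n);
    }

    int main(void)
    {
        char buf[16];
        printf("read %d bytes\n", vfs_read(&myfs_ops, 1, buf, sizeof buf));
        return 0;
    }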



 

 

Network File System (NFS)

 

NFS uses RPC to send requests from client processes to a server; the server holds the file system and handles all the operations.

 



 

NFS problems and design issues

What happens when packets are lost?

Partial solution: We can resend a packet if it is lost. This approach only works if the requests are idempotent (repeating a request has the same effect as performing it once).

 



 

Requests such as unlink are not idempotent, so we can lose track of whether an operation actually happened.

Suppose our code has:

  if (unlink("foo") != 0)

         error();

 

If the response times out, the code reports an error, even though the unlink may actually have succeeded (or may succeed later when a retransmitted request arrives).

The other solution is to have a server cache.

In this scheme, the server caches recently answered requests and simply reports "OK" again if it gets a duplicate from the same client.  One problem is that the cache has a limited size, so a request could still end up being executed twice if there is a long enough delay between the two copies.  This is most likely to happen when the server is overloaded, because the cache has to overwrite older requests with newer ones once it runs out of space.
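
A sketch of such a reply cache (all of the names below are invented; real NFS servers keep a more elaborate "duplicate request cache"):

    /* Server-side duplicate-request cache, keyed on (client id, request id). */
    enum { CACHE_SLOTS = 128 };            /* limited size: old entries get evicted */

    struct reply_cache_entry {
        long client_id;
        long xid;                          /* request id chosen by the client */
        int  status;                       /* the reply we sent the first time */
        int  valid;
    };

    static struct reply_cache_entry cache[CACHE_SLOTS];

    /* If this request was already answered, report the old reply again. */
    static int cache_lookup(long client_id, long xid, int *status)
    {
        struct reply_cache_entry *e = &cache[(unsigned long)xid % CACHE_SLOTS];
        if (e->valid && e->client_id == client_id && e->xid == xid) {
            *status = e->status;
            return 1;
        }
        return 0;
    }

    /* Remember a reply; this may overwrite an older entry, which is exactly
       the weakness described above. */
    static void cache_remember(long client_id, long xid, int status)
    {
        struct reply_cache_entry *e = &cache[(unsigned long)xid % CACHE_SLOTS];
        e->client_id = client_id;
        e->xid       = xid;
        e->status    = status;
        e->valid     = 1;
    }

    int main(void)
    {
        int status;
        if (!cache_lookup(42, 7, &status)) {   /* first copy of request 7 */
            status = 0;                        /* ...perform the real operation... */
            cache_remember(42, 7, status);
        }
        /* A retransmitted copy of request 7 now gets the cached "OK". */
        return cache_lookup(42, 7, &status) ? status : -1;
    }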

 

Here is a possible workaround in user code:

              if (unlink("foo") != 0 && !nfs())

                            error();

Though this is not a great solution, it is better than leaving the !nfs() check out, because NFS is so unreliable.  For this reason, NFS is not POSIX compliant!

 

“Stateless” server:

The contents of RAM don't matter: if the server reboots, the clients simply wait until it comes back.  The advantage is that this design is simple and reliable; the disadvantage is that a command such as "$ rm file" can take a while to actually finish, because the server could reboot while the command is running.

Code such as this should still work:

fd = open("foo", …)

<< server reboots >>

read(fd, …)

 

NFS file handle:

A file handle uniquely identifies a file on a server.  Unlike a file descriptor, which would be lost if the server were to reboot, a file handle remains valid, so the file is still accessible after a reboot.

              Unix: file handle = dev_t (filesystem ID) + ino_t (inode #) + serial number, all as seen on the server
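
As a sketch, a server-side file handle might be packed roughly like this (the exact layout is implementation-specific):

    #include <sys/types.h>

    struct nfs_fh {
        dev_t         dev;   /* which filesystem on the server */
        ino_t         ino;   /* which inode within that filesystem */
        unsigned long gen;   /* serial (generation) number: detects inode reuse
                                after the original file was deleted */
    };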

 

NFS protocol:

(dirfh is the file handle of the directory; fh is the file handle for the file in that directory)

              LOOKUP(dirfh, "name") → fh + attrs (attributes: ex. permissions)

              CREATE(dirfh, "name", attrs) → fh + attrs

              MKDIR(dirfh, "name", attrs) → dirfh + attrs

              REMOVE(dirfh, "name") → status

              READ(fh) → data + status

              WRITE(fh, data) → status
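
To see how these calls fit together, here is a hedged sketch of a client resolving "dir/foo" from the root file handle and then reading it; nfs_lookup() and nfs_read() are invented stand-ins, each representing one LOOKUP or READ round trip, not a real client library:

    struct fh    { unsigned char data[32]; };   /* opaque file handle */
    struct attrs { int mode; long size; };      /* a few of the attributes */

    /* Hypothetical stubs: each call stands for one request/response pair. */
    static int nfs_lookup(struct fh dirfh, const char *name,
                          struct fh *out, struct attrs *a)
    {
        (void)dirfh; (void)name;       /* a real client would send a LOOKUP RPC */
        *out = (struct fh){{0}};
        *a = (struct attrs){0};
        return 0;                      /* 0 means success */
    }

    static int nfs_read(struct fh f, long off, char *buf, int n)
    {
        (void)f; (void)off; (void)buf; /* a real client would send a READ RPC */
        return n;                      /* pretend n bytes came back */
    }

    /* Resolve "dir/foo" starting from the root file handle, then read it. */
    static int read_dir_foo(struct fh rootfh, char *buf, int n)
    {
        struct fh dirfh, filefh;
        struct attrs a;

        if (nfs_lookup(rootfh, "dir", &dirfh, &a) != 0)   /* LOOKUP dir   */
            return -1;
        if (nfs_lookup(dirfh, "foo", &filefh, &a) != 0)   /* LOOKUP foo   */
            return -1;
        return nfs_read(filefh, 0, buf, n);               /* READ the file */
    }

    int main(void)
    {
        struct fh root = {{0}};
        char buf[512];
        return read_dir_foo(root, buf, sizeof buf) < 0;
    }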

This is the Unix file system on wheels.  There is also CIFS, which is the Microsoft counterpart of NFS; it is basically the Windows file system on wheels.  The Samba server is able to support CIFS atop a Unix file system.

 

Now suppose we have a process that sends a lot of writes via NFS:

Most clients will cheat: instead of waiting for the response from the server, they assume each write succeeded, even though they may find out later that it actually failed.  The server is supposed to be stateless, so it is supposed to fsync() each write to disk before replying, but servers often cheat too, or put a log into flash, in order to increase performance.  The client can increase performance with asynchronous writes, speculative reads, and caching of data.

 

Another problem:

Client A does a read that takes a while; client B does a write afterwards, but B's response arrives before A's (because of a faster network, etc.).  This leads to an inconsistent view of the data, so we cannot assume that reads and writes of file contents are synchronized across clients.  However, NFS does attempt to provide close-to-open consistency, even though this costs some performance.  This means that if client A does a close(), then a later open() by client B will succeed and see all the data client A wrote.  This works because close() is expensive and no caching of unflushed data is allowed, so all the data is guaranteed to be written out before B's open().

 

What else can go wrong and how to fix it?

Power failures

We fixed the problem of power failures with careful writes or a log, so that the machine is always in an acceptable state.

Network failures

We do not have as good a solution to this problem, but we partially address it with a cache and by replaying requests.

Media failure – when a disk drive fails

Examples include a crack in the disk causing read() to fail, or a power outage.

Should such failures occur, we assume a single device has failed rather than multiple devices at once.

One solution is RAID (redundant array of independent disks); here are a few of its levels:

RAID0 uses concatenation: many disks are put together to form a single large virtual drive.


This allows faster reads but does not address data failure beyond limiting the amount of data lost (i.e., if one disk fails, the others are still operational).

Also, by interleaving (striping), RAID0 can allow both faster reads and faster writes.

RAID1 uses mirroring: two drives hold exactly the same data.


This increases reliability and gives faster reads, but writes are slower because the same data must be written to both drives.

Also, it requires double the resources, in the form of twice as many drives.

RAID4 takes an approach similar to RAID1 but does not require double the resources; instead it consists of multiple data disks coupled with one parity disk.


The parity disk is used to recover from a failure; it does not do error checking, on the assumption that the hardware will do so.

This method also assumes that should one of the disks fail, it would be replaced before another disk fails.

The parity disk is computed by taking the exclusive or (^) of all the data disks (see the XOR sketch after this list).

Disadvantages:

The parity disk is hot: every write to a data disk also requires a write to the parity disk, causing an I/O bottleneck, while the load on each individual data disk is far lower.

It also requires synchronized disk arms to update the parity disk whenever there is a write.
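
Here is a small self-contained example of the parity idea (toy sizes; a real array works on whole disk blocks): the parity is the XOR of all the data disks, and a lost disk can be rebuilt by XOR-ing the survivors with the parity.

    #include <stdio.h>

    enum { NDISKS = 4, NBYTES = 8 };

    int main(void)
    {
        unsigned char data[NDISKS][NBYTES] = { "diskAAA", "diskBBB",
                                               "diskCCC", "diskDDD" };
        unsigned char parity[NBYTES] = {0};
        unsigned char rebuilt[NBYTES];
        int d, i;

        /* Parity disk = XOR of all data disks. */
        for (d = 0; d < NDISKS; d++)
            for (i = 0; i < NBYTES; i++)
                parity[i] ^= data[d][i];

        /* Suppose disk 2 dies: rebuild it from the survivors plus parity. */
        for (i = 0; i < NBYTES; i++) {
            rebuilt[i] = parity[i];
            for (d = 0; d < NDISKS; d++)
                if (d != 2)
                    rebuilt[i] ^= data[d][i];
        }
        printf("recovered: %s\n", rebuilt);   /* prints diskCCC */
        return 0;
    }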

RAID5 is a combination of RAID0 and RAID4 in which each disk contains both data and part of the parity (see the parity-placement sketch after this list).


This gets rid of RAID4's parity-disk bottleneck.

However, it becomes more complex to add a new disk or to recover from a failure.

To add a new disk, all the parity blocks must be reallocated as well as recalculated.

Recovery from a failure would be simpler if the parity were laid out so that a broken drive's data could be recovered using only the parity blocks on the other disks.

RAID5 is also less flexible because recovery requires reading from all the other disks, which can take a long time if others are accessing them.
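
A minimal sketch of how RAID5 can spread the parity around (real layouts differ in detail; this one simply rotates the parity disk every stripe):

    #include <stdio.h>

    int main(void)
    {
        int ndisks = 5;
        int stripe;

        for (stripe = 0; stripe < 10; stripe++) {
            int parity_disk = stripe % ndisks;   /* parity moves each stripe */
            printf("stripe %2d: parity on disk %d, data on the other %d disks\n",
                   stripe, parity_disk, ndisks - 1);
        }
        return 0;
    }

Because the parity block lives on a different disk in each stripe, parity writes are spread over all the disks instead of hammering a single parity disk.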

What is the disk failure rate?


[Graph: annualized failure rate of RAID4, assuming failed disks are never replaced (estimated)]


This shows that RAID4 is better in the short run (a single failed disk can be reconstructed from parity) but much worse in the long run (with more disks and no replacement, a second failure eventually occurs and the array is lost).

 

Mean Time to Repair (MTTR) on RAID4

Operators need to:

Hear alert

Walk to machine room

Find a good drive

Find the bad drive and take it out

Put in the new drive

Rebuild data on the new drive

This requires the most time and could take hours, leaving us vulnerable!

The rebuild time also depends on whether others are accessing the system, which can slow the rebuilding operation.

What if you buy a lot of disks in bulk and the whole batch is faulty? If one fails, it’s likely that more will fail in the batch.

We could use disks from different manufacturers, but there may be compatibility issues.

We could use disks made at different times so that they are not from the same batch.