Lecture 16 - Robustness, parallelism, and NFS

 

 

Remote Procedure Call

The goal of a remote procedure call (RPC) is to make a call to code on another machine look like an ordinary local call.

- However, it can be slower than a local call.

+ The caller and callee can run on machines with different architectures.

+ It adds flexibility in where code runs.

 

Marshalling is when a system converts all of the arguments into network byte order, picking a byte representation for each object. Serializing and pickling are other terms used for marshalling.

 

Marshalling is a sensible solution for simple objects. However, it gets harder for more complex objects. For example, how would a hash table be marshalled? In practice you would not marshall the hash table directly, because it is too much hassle.

 

The standard approach is to take an arbitrary data structure and marshall it. For example, take any tree-structured data, such as XML, send it over the network, and reconstruct it on the other side. The caller marshalls the data; the callee unmarshalls it, taking all the bytes and putting them back in the right locations and order. However, marshalling and unmarshalling take time, specifically CPU time. In addition, network latency comes into play, because it takes time for the bytes to travel across the network.
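
As a minimal sketch (not from the lecture), here is what marshalling a simple argument pair into network byte order might look like in C, using the standard htonl/ntohl routines; the pixel_args structure is a hypothetical RPC argument list.

    #include <arpa/inet.h>   /* htonl, ntohl */
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical RPC arguments: the (x, y) coordinates of a pixel. */
    struct pixel_args { uint32_t x, y; };

    /* Marshal: the caller picks a byte representation (big-endian) for each field. */
    static size_t marshal_pixel(const struct pixel_args *a, unsigned char *buf)
    {
        uint32_t x = htonl(a->x), y = htonl(a->y);
        memcpy(buf,     &x, sizeof x);
        memcpy(buf + 4, &y, sizeof y);
        return 8;                       /* bytes to send over the network */
    }

    /* Unmarshal: the callee puts the bytes back in the right order and location. */
    static void unmarshal_pixel(const unsigned char *buf, struct pixel_args *a)
    {
        uint32_t x, y;
        memcpy(&x, buf,     sizeof x);
        memcpy(&y, buf + 4, sizeof y);
        a->x = ntohl(x);
        a->y = ntohl(y);
    }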

 

Let's take a look at the total cost of a remote procedure call. There is the cost of the local call, the cost of marshalling, the network latency, the cost of unmarshalling, and finally the cost of doing the real work. To send the result back to the caller, the same process has to happen in reverse, which essentially doubles the cost per call. Typically, network latency is the major bottleneck.

 

What can go wrong with protocols?

1. The server can be overloaded. A server can take a long time to send a response or not respond at all, and sitting at your desktop you cannot tell the difference. As the client, you can send the request again, but the same problem can happen again. This is not something you would experience with ordinary system calls, because the kernel would just do the work.

2. Messages can get lost or corrupted. In the kernel, an instruction that is issued simply executes; over the network, messages can be lost or corrupted, and this happens fairly often.

3. Messages can be delivered out of order. Messages do not necessarily take the same route to the server, so the second message can reach the server before the first. For example, a write might arrive before a read, producing a different order of operations than intended.

4. Messages can be delivered very slowly.

 

How do we fix the problems?

Suppose you detect corruption: the checksum does not match. The options are:

- Ask the other side to resend.

- Do nothing. We do not know what the other side wanted anyway, so we just drop the corrupted message; in practice this is what everybody does (the sender's timeout takes care of retrying).

Suppose we are the client and we do not get a response.

- Use a timeout, but it is either going to be too large or too small. The first rule of timeouts is that they are always wrong.

- Resend the request (and keep retrying if it times out) until we get a response. This is called at-least-once RPC, and it is suitable for calls like draw-pixel because they are idempotent operations: no matter how many times the request is executed, the result is the same. This is not the case for all operations, however (see the sketch after this list).

- Suppose we want to debit one account and credit another. We send one request:

o If we get no response, we give up and report an error. This is called at-most-once RPC, which is for operations that had better not happen more than once (transactions, etc.).

- What we really want is exactly-once RPC. How?

o It requires a cache of recent transactions in the server, which records requests so that duplicates can be detected and discarded. (What size should the cache be? Whatever you pick is always wrong.)

o There is no perfect solution.
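
As a rough illustration of the at-least-once behavior described above, here is a client-side retry loop in C. The send_request/recv_response helpers and the timeout parameter are hypothetical placeholders, not part of any real RPC library.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical transport: send one request, then wait up to timeout_ms
     * for a response; recv_response() returns false on timeout. */
    bool send_request(const void *req, size_t len);
    bool recv_response(void *resp, size_t len, int timeout_ms);

    /* At-least-once RPC: keep resending until a response arrives.
     * Only safe for idempotent operations such as "draw pixel". */
    bool rpc_at_least_once(const void *req, size_t req_len,
                           void *resp, size_t resp_len, int timeout_ms)
    {
        for (;;) {
            if (!send_request(req, req_len))
                continue;                    /* send failed: try again */
            if (recv_response(resp, resp_len, timeout_ms))
                return true;                 /* got a reply */
            /* Timed out: resend and hope.  An at-most-once client would
             * give up here and report an error instead of looping. */
        }
    }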

 

Even a simple action can make us wait because of network latency. For example, if a client asks the server to change one pixel at a time, it would take a long time just to fill a small 20x10-pixel square: 200 separate requests, plus 200 responses.

 

Figure 1: Pixel by pixel                                                                                            Figure 2: Asynchronous

 

So to draw a rectangle, instead of sending one pixel at a time, send a single draw-rectangle request. This is an example of batching, which is a huge win in RPC, but it is not always easy. What if you were trying to draw a cloud? You would want the client to send the server a lot of requests at a time without caring about the responses right away; you would check later that the requests were processed correctly.

 

There is a huge performance gain from doing "asynchronous RPC": the caller does not wait for the callee to respond. In the draw-pixel case, the client does not wait for the server to confirm that each pixel has been drawn; it just sends the requests out. However, the caller cannot assume the actions have been done when it resumes.
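
A sketch of the batching idea, assuming a hypothetical rpc_call() stub and made-up message formats (these notes do not specify the actual protocol): 200 individual draw-pixel requests versus one draw-rectangle request.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical message formats for a pixel-drawing service. */
    struct draw_pixel_req { uint32_t x, y, color; };
    struct draw_rect_req  { uint32_t x, y, width, height, color; };

    void rpc_call(const void *req, size_t len);      /* hypothetical RPC stub */

    /* Unbatched: a 20x10 square costs 200 round trips, each paying latency. */
    void draw_square_slow(uint32_t x0, uint32_t y0, uint32_t color)
    {
        for (uint32_t dy = 0; dy < 10; dy++)
            for (uint32_t dx = 0; dx < 20; dx++) {
                struct draw_pixel_req r = { x0 + dx, y0 + dy, color };
                rpc_call(&r, sizeof r);
            }
    }

    /* Batched: one request describes the whole rectangle. */
    void draw_square_fast(uint32_t x0, uint32_t y0, uint32_t color)
    {
        struct draw_rect_req r = { x0, y0, 20, 10, color };
        rpc_call(&r, sizeof r);
    }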

 

Problems with asynchronous RPC

Idempotency becomes even more important with asynchronous RPC. If you want to use asynchronous RPC, make sure later requests do not depend on the success of earlier requests that are still outstanding. In addition, as the client, be prepared for errors that arrive asynchronously. The draw-cloud example is fine for asynchronous RPC.

 

HTTP is asynchronous

                Version 1.1 of HTTP has the notion of pipelining: the client can issue a bunch of GET requests without waiting for the responses. This is used for images nowadays because the resulting images do not rely on each other. An example is sketched below.
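
A sketch of what pipelining might look like from C, assuming sock is a TCP socket already connected to a web server; the host name and image paths are made up. The client writes several GETs back to back, then reads the responses, which HTTP/1.1 requires the server to return in the same order.

    #include <string.h>
    #include <unistd.h>

    void fetch_images_pipelined(int sock)
    {
        /* Three GET requests written back to back, no waiting in between. */
        const char *reqs =
            "GET /images/a.png HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
            "GET /images/b.png HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
            "GET /images/c.png HTTP/1.1\r\nHost: www.example.com\r\n\r\n";
        write(sock, reqs, strlen(reqs));
        /* ... now read the three responses, in order ... */
    }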

 

 

 

Combining the last two major subjects, RPC and file systems, you get remote file systems.

Usually one system has to talk to multiple computers on the network. One way to use multiple file systems at the same time is to put the file system name in front of every file name, so that every file is tagged with the file system in which it is located. This approach is awkward, however: it requires every application to know about the different file systems. Ideally, users and applications should know nothing about the different file systems at all; there should appear to be one big file system, which is actually made up of many smaller drives.

 

How does Unix deal with this?

Unix solves this with mount. The mount system call tells the kernel how to present multiple small file systems as one large conglomerate. We choose one file system as the root. For example, suppose we have three file systems: A, B, and C. If A is the root, then B and C are accessed through particular leaves of A. All of this is done inside the kernel, and the kernel lies about the drive boundaries. For example, if we try to ".." out of the root directory of C, the kernel points to the parent directory in A to which the root of C is attached. At the points where the sub file systems are glued onto the root file system, the kernel lies to maintain the illusion of one large file system.
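
For concreteness, this is roughly how a file system gets glued onto the tree from user space with the mount(2) system call; the device path and mount point below are made up, and error handling is minimal.

    #include <stdio.h>
    #include <sys/mount.h>   /* mount(2), Linux-specific */

    int main(void)
    {
        /* Attach the ext4 file system on /dev/sdb1 (file system "C") at the
         * leaf /mnt/c of the root file system ("A").  From now on the kernel
         * lies: path lookups that cross /mnt/c continue into C, and ".." from
         * C's root leads back up into A. */
        if (mount("/dev/sdb1", "/mnt/c", "ext4", 0, NULL) != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }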

 

Downsides of the Unix Implementation

The consequence of this method of organization is that files can no longer be identified simply by inode number: the same inode number can refer to a different file in each file system, so we need the device number as well as the inode number.

Another problem involves hard links. In a single file system, a hard link always works unless there is a permission error or the file does not exist; with multiple file systems, hard links may fail. Directory entries contain only a name and an inode number, so a directory entry can only link to files inside its own file system. We could solve this by adding another field to the directory entry specifying the device number, but what happens if that device is unmounted? The approach is brittle and prone to failure. For this reason, hard links are not used across file systems; only symlinks are.

We can also create loops with mounts, e.g. A is the root, we mount C onto A, and then mount A onto C. For this reason, Linux prohibits mounting the same file system multiple times, which rules out mount loops.

Another problem is that files can vanish, i.e. become inaccessible: simply mount a tiny file system on top of an important directory such as /usr and its contents disappear. An attacker could also create a drive containing a setuid executable, mount the drive, and run the executable to gain privilege. The solution is to make mount a privileged operation.

Finally, suppose we try to unmount C while a running process still has a file in C open. Unix prevents the user from unmounting; the alternative would be to unmount C anyway and simply inform the process that it cannot read the file because the file has vanished.
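
The (device, inode) pair that identifies a file is visible from user space through stat(2); a quick sketch:

    #include <stdio.h>
    #include <sys/stat.h>

    /* A file is uniquely identified only by the pair (st_dev, st_ino):
     * the same inode number can appear once per mounted file system. */
    int main(int argc, char **argv)
    {
        struct stat st;
        if (argc < 2 || stat(argv[1], &st) != 0) {
            perror("stat");
            return 1;
        }
        printf("dev=%lu ino=%lu\n",
               (unsigned long)st.st_dev, (unsigned long)st.st_ino);
        return 0;
    }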

 

Implementing a mount

The object that we deal with at the top level is called the file system; it is a superclass, and the subclasses are the concrete file systems such as ext3, ext4, VFAT, etc. The class defines all the methods you need in order to support file system operations, i.e. creating a file, linking a file, and so on; the subclasses provide the implementations of these methods.

 

(Diagram: the file system class defining the create/link methods, with the subclasses providing the implementations.)

 
 

 

Object Oriented?

The Linux kernel is written in C, which is not object oriented, but we can still write in an OO style. The members of the file system struct include function pointers that point to the actual code specifying, say, what a VFAT file system is and how to create files on it. In reality we are writing OO code in C.
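
A simplified sketch of the idea; the real Linux VFS structures (struct file_system_type, struct super_operations, and friends) are more elaborate, and the names below are invented for illustration.

    /* "Superclass": the set of operations every file system must provide. */
    struct fs_ops {
        int (*create)(const char *path, int mode);
        int (*link)(const char *oldpath, const char *newpath);
        int (*unlink)(const char *path);
    };

    /* "Subclass": VFAT supplies its own implementations of the methods. */
    static int vfat_create(const char *path, int mode)       { /* ... */ return 0; }
    static int vfat_link(const char *oldp, const char *newp) { /* ... */ return 0; }
    static int vfat_unlink(const char *path)                 { /* ... */ return 0; }

    static const struct fs_ops vfat_ops = {
        .create = vfat_create,
        .link   = vfat_link,
        .unlink = vfat_unlink,
    };

    /* The generic layer calls through the function pointers and never needs
     * to know which concrete file system is underneath. */
    static int generic_create(const struct fs_ops *fs, const char *path, int mode)
    {
        return fs->create(path, mode);
    }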

 

 

How can we implement a network file system in the kernel that looks to the rest of the system just like ext4? We want an NFS implementation that uses RPC to communicate with a file server located remotely on the network. When someone calls the "create file" operation of NFS, we send a create message via the NFS protocol to the NFS server.

In a lookup request you send a directory file handle and a name; the response is a file handle for that file plus some attributes, i.e. size, owner, etc. This message resembles the Unix system call stat. There is also a create message, in which you specify a directory file handle, a name, and file attributes; it returns the file handle of the file that was created and its attributes. There are also messages for mkdir, rmdir, remove, read, write, etc.; they can be found in Table 4-1 on page 4-46 of the course reader. In fact, one way of thinking of NFS is that it takes this Unix API and puts it on wheels.
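
A rough, simplified sketch of what the lookup exchange carries. The real protocol defines these messages in XDR; the field names below are invented and the attribute set is abbreviated.

    #include <stdint.h>

    /* Opaque file handle: the client never interprets its contents. */
    struct nfs_fhandle { unsigned char data[32]; };

    /* Abbreviated file attributes, roughly what stat(2) reports. */
    struct nfs_fattr {
        uint32_t type, mode, uid, gid;
        uint64_t size;
    };

    /* LOOKUP request: a directory's file handle plus a name ... */
    struct nfs_lookup_args {
        struct nfs_fhandle dir;
        char name[256];
    };

    /* ... and the reply: the named file's handle and its attributes. */
    struct nfs_lookup_res {
        uint32_t status;            /* 0 on success, an error code otherwise */
        struct nfs_fhandle file;
        struct nfs_fattr attributes;
    };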

 

Statelessness

NFS is the Unix file system OO API turned into a network protocol. NFS is unusual in that it is supposed to be stateless. The worry is this: a client is connected to a server over the network and is writing to a disk attached to that server, and we want this to work even if the server crashes. If the server crashes and reboots, we want everything to still work; ideally the clients should be oblivious to the crash altogether. The clients may hang for a bit, but once the server reboots, everything continues to work. Stateless means that the state of the server does not matter from the perspective of correctness, i.e. the server's RAM contains nothing important.

 

 

Identifying files on a remote server

We need a unique ID for files. Unix has inodes; in NFS we have file handles, which are like inodes, i.e. a large integer that identifies a file on a file system. We want these file handles to persist across server crashes, so that clients can keep looking up files even if the server crashes and reboots. How can one implement this? We can use inode numbers as file handles, i.e. ship inode numbers across the wire. But note that the server can use mount to mount multiple file systems, so there can be two different files with the same inode number in two locations. From the server's perspective, the file handle may therefore have to be the inode number plus the device number, to identify which device contains the file.

 

 

Suppose that client 1 does an unlink system call. It is sent as a remove message that specifies the file handle of the file to be removed. In the meantime, the NFS server gets a remove request for the same file from client 2, and suppose client 2's message reaches the server first. If client 2 then creates a new file and the server reuses the same inode for it, client 1's remove will delete client 2's newly created file: a race condition on the file-handle mapping. For this reason, Unix file systems keep an extra number, the serial number: it starts at one and counts up each time the inode is reused for a new file. This prevents a reused inode from looking like the old file. In the example above, the file handle of the newly created file will differ from the file handle of the file that client 1 wished to remove, so the stale remove does nothing.
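
Putting the pieces together, a server might build its file handles from something like the following (a sketch only; real servers pad and obscure the handle contents):

    #include <stdint.h>

    /* Enough to name a file uniquely on this server, even across inode reuse:
     *  - device: which mounted file system the file lives on
     *  - inode:  which file within that file system
     *  - serial: generation number, bumped each time the inode is reused, so
     *    a stale handle held by client 1 no longer matches client 2's new file. */
    struct file_handle_contents {
        uint64_t device;
        uint64_t inode;
        uint32_t serial;
    };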

 

Performance on NFS systems

Asynchronous reads

Suppose we read from an NFS file: the client issues read requests and the server responds. It is very common to issue many reads in parallel, and the server responds to them one after another. NFS reads are done asynchronously, and the user application is oblivious to this; it invokes the read system call once, even though underneath the call is implemented asynchronously.

 

Why can’t we do this for writes?

This works well for reads, but not for writes. Suppose the first write fails, e.g. the disk is full: an asynchronous call only guarantees that the request has been sent, not that the data is actually written to disk. The application thinks the write worked when we do not actually know whether it succeeded. NFS has a flag for this reason: if the flag is on, the server waits until the write has completed before responding, and the client in turn waits for the server's response before issuing another request. The problem is that this is very slow; by giving up asynchronous writes we wait far longer than necessary. Neither approach, therefore, is good.

 

The solution?

In NFSv4, the writes that still need to be done are cached on the client's disk. While a write is going on, we write both to the NFS server and to the client's disk; if the write fails, we can use the cached copy to reattempt it later. The advantage is a better combination of reliability and performance. The downside is increased complexity, and we also need to know when and whether the server reboots, i.e. we must discard the idea of the stateless server.
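
One way such a client-side write cache could be organized, as a sketch only (this is not the actual NFSv4 mechanism, whose details these notes do not cover): keep each pending write in a log on the client's disk, drop the entry once the server acknowledges the write, and replay the log if the write fails or the server is later found to have rebooted. The helper functions are hypothetical.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helpers: a log on the client's own disk, plus one
     * WRITE RPC to the NFS server. */
    uint64_t log_append(uint64_t offset, const void *data, size_t len);
    void     log_remove(uint64_t entry_id);
    bool     nfs_write(uint64_t offset, const void *data, size_t len);

    /* Record locally first, then write through to the server.  If the
     * server write fails, the logged copy lets us retry later instead of
     * silently losing the data. */
    bool cached_write(uint64_t offset, const void *data, size_t len)
    {
        uint64_t id = log_append(offset, data, len);  /* persist a local copy */
        if (nfs_write(offset, data, len)) {
            log_remove(id);        /* server has the data: forget local copy */
            return true;
        }
        return false;              /* entry stays in the log; retried later */
    }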

 

Parallelism over NFS

Another performance problem being addressed is parallelism. Picture a big client, i.e. a cluster or cloud of computers, and suppose one NFS server is trying to service all of those clients. The idea is that the single NFS server should control only the metadata (inodes, file handles, etc.) and farm the data out to other data servers. This increases parallelism: clients consult the metadata server only to find out where to read or write, while the actual I/O, the most time-consuming operation, is done against the data servers.