CS 111 Lecture 11 Scribe Notes, Spring 2012: File System Design

Prepared by Sushma Murthy and Virginia Pham for a lecture by Professor Paul Eggert on May 10, 2012

Table of Contents

  1. Cost of Seek Time
  2. Shortest Seek Time First
  3. Elevator Algorithm
  4. Circular Elevator Algorithm
  5. Anticipatory Scheduling
  6. File Systems
  7. Eggert's 1974 (Simple) File System
  8. FAT File System
  9. 1975 Unix File System

Cost of Seek Time


Figure 1: Hard disk model.
In Figure 1, we look at a hard disk model. The cost of a seek is |b - h|, where b is the block we want to reach and h is where the head currently is.

We are always worried about seek time, but we're still going to estimate it incorrectly; we will tend to overestimate the cost. Another simplification: we ignore rotational latency and head acceleration. |b - h| is a back-of-the-envelope estimate.
Let's assume the simplest disk scheduling algorithm we can think of: first come first served (FCFS). You service requests in the order they arrive: queue them up and service them in order.
If we assume random access, what is the cost going to be? h is random because that's wherever the previous request left the head, and b is random as well. What is the average cost of an access? Scale the disk to the interval [0, 1] and fix h; choosing b at random gives an average distance that depends on h. As in Figure 2, its graph is a parabola. It's calculus! We then integrate that average with respect to h.

Figure 2: Graph of average cost vs. h
Equation 1: Calculus in Operating Systems! For a fixed h, the average distance is the integral from 0 to 1 of |b - h| db = h^2/2 + (1 - h)^2/2; averaging that over h gives the integral from 0 to 1 of (h^2/2 + (1 - h)^2/2) dh = 1/3. So FCFS with random requests costs about 1/3 of the maximum seek distance per access.

Let's say we have two processes.

Table 1: Our two processes

Process 1: reads blocks 1000, 1001, 1002, 1003, ... (nice, simple, sequential)
Process 2: reads blocks 91631, 28193, 1572, 69317 ... (random)
Figure 3: The randomness of Process 2

As in Figure 3, process 2 might be getting in the way of process 1. We should satisfy process 1's requests first: if we're trying to minimize seek costs, it's silly to insist that the oldest request always goes first.

Shortest Seek Time First (SSTF)


A different algorithm: Shortest Seek Time First (SSTF).
Always schedule the request that's closest to h (where your disk head currently is). It's easy to see that you'll get better throughput, because you always choose the minimal seek time (though we're still ignoring latency and acceleration); it beats FCFS on throughput. It seems like this algorithm should always dominate. But suppose process 1 requests blocks 1000, 1001, 1000, 1001, and so on: the head will stay parked on process 1's blocks, and process 2 will starve if its request is far off at, say, 91631. A sketch of the pick step is below.
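A minimal sketch of SSTF's pick step in C; the request type and function names here are made up for illustration, not taken from any real kernel:

#include <stdlib.h>

struct request { long block; struct request *next; };

/* Pick the pending request whose block is closest to head position h. */
struct request *sstf_pick(struct request *pending, long h)
{
    struct request *best = pending;
    for (struct request *r = pending; r; r = r->next)
        if (labs(r->block - h) < labs(best->block - h))
            best = r;
    return best;   /* note: a far-away request can lose this race forever */
}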

Elevator Algorithm


How about the "elevator algorithm"? It is like SSTF, but it remembers which direction it was going. (It's actually modeled after real-life elevators.)
*About starvation in elevators under SSTF: say you're in a ten-story building, on the tenth floor. Most people are on the first few floors, and the entrance is the first floor. Under SSTF, the elevator would always handle the call closest to where it currently is, so it would tend to stay among those first few floors, and it might never come up to you at the top!
Schedule the request closest to h in the direction you're already moving (see the sketch below). This takes care of the starvation problem, but some people are treated less fairly than others: those waiting at the bottom or top floors. The elevator algorithm is unfair to the edges, since they are the furthest away.
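Here is a sketch of the pick step with a remembered direction, reusing struct request from the SSTF sketch (again, illustrative names only; this is the variant that reverses when nothing lies ahead):

/* dir is +1 (moving up) or -1 (moving down); flip it when no request
   lies ahead of the head in the current direction. */
struct request *elevator_pick(struct request *pending, long h, int *dir)
{
    struct request *best = NULL;
    for (int pass = 0; pass < 2 && !best; pass++) {
        for (struct request *r = pending; r; r = r->next) {
            long d = (r->block - h) * *dir;   /* distance ahead, in travel direction */
            if (d >= 0 && (!best || d < (best->block - h) * *dir))
                best = r;
        }
        if (!best)
            *dir = -*dir;                     /* nothing ahead: turn around */
    }
    return best;
}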

Circular Elevator Algorithm


To remove the unfairness, there is the "circular elevator algorithm." The idea is that the elevator isn't on a line but on a circle. The physical properties of the elevator (and our disk) haven't changed, but you always schedule in the positive direction: it (always) goes up. (Of course a real elevator can't always go up, but that's how the schedule works: in the hypothetical ten-story building, after level ten comes level one.) For example, imagine the Boelter Hall elevator always going up; when it reaches the top floor, it travels all the way back down to the first floor before serving anyone again. This fixes the fairness problem: everyone is treated equally badly. You lose a bit of performance, though, since you occasionally move from top to bottom without doing any work along the way. That's just the cost of being fair. A sketch of the changed pick step is below.
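The circular version only changes the pick step: always sweep upward, and when nothing is ahead, wrap to the lowest pending request (a sketch with the same illustrative types as above):

struct request *circular_pick(struct request *pending, long h)
{
    struct request *best = NULL, *lowest = NULL;
    for (struct request *r = pending; r; r = r->next) {
        if (!lowest || r->block < lowest->block)
            lowest = r;                        /* candidate for the wrap-around */
        if (r->block >= h && (!best || r->block < best->block))
            best = r;                          /* closest request at or above h */
    }
    return best ? best : lowest;               /* wrap: nobody waits more than one sweep */
}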
Don't we all care about performance, though? Why would we care about fairness so much? In real-time systems, you care more about predictability than you do about performance.

Suppose the requests come in interleaved, in the order shown in Table 1.
We satisfy process 1's read of block 1000, then process 2's read of 91631; by the time process 1 asks for block 1001, the head has already moved far away. It's an inefficient schedule. We would like an "oracle" to tell us what's happening next...

Anticipatory Scheduling



Another idea that might help performance: anticipatory scheduling. We'll be doing some speculation!
What the kernel does: it remembers the access pattern of each process, speculates about what the process will do next, and delays (dallies), hoping another nearby request is coming. For example, wait 1 ms after each of that process's requests finishes.
Starvation is still an issue, so this is what a typical anticipatory scheduler does: it dallies only a bounded number of times (or for a bounded total time). A sketch of the dallying step is below.
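A sketch of the dallying step, with the 1 ms figure from above; the callback and the polling interval are assumptions for illustration:

#include <stdbool.h>
#include <unistd.h>

/* Wait up to dally_us microseconds for the previous process to issue its
   next request, instead of immediately seeking away to serve someone else. */
bool dally(bool (*next_request_arrived)(void), long dally_us)
{
    const long step_us = 100;                 /* poll every 100 us */
    for (long waited = 0; waited < dally_us; waited += step_us) {
        if (next_request_arrived())
            return true;                      /* speculation paid off: no seek */
        usleep(step_us);
    }
    return false;                             /* budget spent: go serve others */
}

Calling dally(check, 1000) implements the 1 ms wait; bounding how often you dally is what keeps the other processes from starving.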

File Systems



It may seem that all these algorithms are old fashioned!
File systems are built from:
- disk: the key cost is seek time
- flash: the key cost is IOPS (I/O operations per second)
- other stuff
- network file systems: the key costs are network latency and throughput
It's common these days to have network file systems (e.g., Dropbox, Google Drive). These file systems might be local, or they might talk to somewhere in the cloud.

When you design a future file system, you'll have to worry about all that stuff too -- disk, flash, etc. You will still run into fairness issues. The metric you're trying to minimize will just be different.

File System Design and Organization
The idea is tricky: a file system is a data structure that lives in two places at once, on disk and cached in RAM. For now, let's just worry about the disk.
Think of it as a data structure that lives on disk and provides the abstraction of a set of organized files. What's really on the disk is just a bunch of blocks.

Eggert's 1974 (Simple) File System


A simple file system (1974 Eggert...before he knew any better!)
Figure 4: Eggert's very simple file system (1974)

Each file starts on a sector boundary. A file's start location is not an arbitrary byte offset: it is a sector offset, a multiple of 512 bytes.

Comparing this to the RT-11 File System


Similar to the RT-11 file system
What are some properties about this kind of approach?
When you create a file, you must specify its size.
We get fragmentation issues, specifically external fragmentation: we have enough free space, but it's scattered all over the place. We also get internal fragmentation: because each file starts on a sector boundary, (-size) % 512 bytes are wasted at the end of each file (taking a nonnegative mod, i.e., 512 - size % 512 when size isn't a multiple of 512).
Some advantages of this approach:
+ it's really simple! We're willing to give up some performance.
+ sequential reads or writes are fast. (A performance advantage)
We have one directory containing several directory entries; a sketch of one is below.
(But you must know the maximum number of files when you create the file system.)
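A C sketch of this layout's bookkeeping; the field names and sizes are assumptions for illustration, not the real RT-11 on-disk format:

#include <stdint.h>

/* One entry in the single, fixed-size directory: a file is a contiguous
   run of sectors, so a start and a length describe it completely. */
struct rt11ish_entry {
    char     name[12];    /* fixed-width file name */
    uint32_t start;       /* first sector of the file */
    uint32_t nsectors;    /* file occupies sectors start .. start+nsectors-1 */
    uint32_t size;        /* size in bytes */
};

/* Internal fragmentation: bytes wasted in the file's last sector. */
uint32_t wasted_bytes(uint32_t size)
{
    return (512 - size % 512) % 512;   /* the (-size) % 512 from above */
}

For example, a 1000-byte file occupies two sectors and wastes 24 bytes of the second one.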

FAT File System


FAT (File Allocation Table) file system.
Here's the new thing that didn't appear in RT-11: we keep track of where the free blocks are and where the file blocks are, and files no longer need to be contiguous.
Figure 5: A FAT file system

A FAT is an array of block numbers: entry b holds the number of the block that follows block b in its file (or a marker meaning end-of-file or free).
The directory is a file that contains directory entries. It's a powerful notion! Instead of saying the directory is a special region on disk with special properties, it's just a file, and each entry looks like what you see in Figure 6.
Figure 6: A directory

We store the file size in the entry to make things easy: we don't have to read the whole file to find its end, and we know how many bytes of the last block are used or wasted. (See Figure 6, and the sketch below.)
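A C sketch of the FAT bookkeeping (the sentinel values, names, and sizes are illustrative):

#include <stdint.h>

#define N_BLOCKS 65536                /* illustrative disk size in blocks */
enum { FAT_FREE = 0, FAT_EOF = -1 };  /* illustrative sentinel entries */

/* fat[b] holds the number of the block that follows block b in its file,
   or FAT_EOF at the end of a chain, or FAT_FREE if b is unallocated. */
int32_t fat[N_BLOCKS];

/* A directory entry: the directory itself is just a file of these. */
struct fat_dirent {
    char     name[11];
    int32_t  first_block;             /* head of this file's chain in fat[] */
    uint32_t size;                    /* bytes, so we know how much of the
                                         last block is used vs. wasted */
};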

Both RT-11 and FAT have the problem that they aren't robust if the disk is flaky: if you lose a sector, and it's the wrong sector, you can lose other data as well. For FAT:
+ easy to grow a file
+ no external fragmentation
+ no artificial limit to the number of files
- sequential reads can be really slow and can require lots of seeking (to read the next data block, you may have to do two seeks: one back to the FAT and one to the data). How do you deal with this? A defragmenter can fix it.

How do you implement lseek? lseek jumps to an arbitrary spot in a file. On FAT it turns into an O(N) operation, which slows down applications that want to do random access: you have to walk the chain from the first block, as sketched below. Random access in FAT is slower than in RT-11, where files are contiguous and the target block is a single calculation away.
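Here's why it's O(N), reusing the fat[] sketch above: to find the block holding byte offset off, you must walk the chain from the beginning, one hop per preceding block.

/* Map a byte offset to a block number by walking the FAT chain. */
int32_t fat_block_of(const struct fat_dirent *e, uint32_t off)
{
    int32_t b = e->first_block;
    for (uint32_t i = 0; i < off / 512; i++)
        b = fat[b];        /* each hop may mean another read of the FAT */
    return b;
}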

1975 Unix File System


In 1975, the Unix file system was based on inodes. (An inode is a data structure that describes a file.) An inode comes in two forms: the on-disk inode, and a cached in-memory copy (the disk contents plus extra info).
Instead of a single file allocation table, each file's inode holds its own block numbers, and block number 0 is special: it indicates there is no block. You're allowed to have, in effect, null pointers. Another way of thinking about it: files can have holes. Holes read as zeros, and if you write into a hole, the file system automatically allocates space for it.
Figure 7: An old Unix file system
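A sketch of the on-disk inode's block pointers, assuming a V7-style layout (10 direct pointers plus single, double, and triple indirect; the exact counts varied across early Unix versions):

#include <stdint.h>

struct inode {
    uint32_t size;         /* file size in bytes */
    uint32_t direct[10];   /* block numbers of the first 10 data blocks;
                              0 means "no block here", i.e., a hole */
    uint32_t indirect;     /* a block full of data-block numbers */
    uint32_t indirect2;    /* a block of blocks of block numbers */
    uint32_t indirect3;    /* three levels deep, for truly huge files */
};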

For example, to create a file with an enormous hole (in C):

fd = open("f", O_CREAT | O_TRUNC | O_WRONLY, 0666);
lseek(fd, 1000000000000000, SEEK_SET);
write(fd, "h", 1);

To do this on SEASNET:
$ dd if=/etc/passwd of=big seek=10T
dd is a command that reads from the input file (if=) and writes to the output file (of=); seek=10T tells it to skip 10T blocks of the output before writing, leaving a giant hole in big.
$ ls -l big
$ cat big

Suppose SEASnet tries to back up this file as you're doing this... if you do this before the 2am backup, oops: you'll run into problems. The backup program reads the holes as real zero bytes, so it's no better than cat here.

Directories = arrays of directory entries. See Figure 8.

Figure 8: A Unix directory vs. a BSD directory

Compare to fragmentation (Figure 9)
$ ln a b
$ rm a
# like the unlink syscall
# marks the entry's type field as not there

Figure 9: What happens after $ rm a

External fragmentation: how does it work here compared to the FAT-based file system? It works the same; allocation is block by block, so neither suffers from it.
Internal fragmentation: if you want as much internal fragmentation as possible, create a file holding just 1 byte of real data and make it as long as possible, as with dd above. There will be lots and lots of holes, and the few blocks that do get allocated are almost entirely wasted space. (Seen in Figure 10.)
Figure 10: How to use internal fragmentation to waste space

How about lseek? It's O(1): there are at most 4 operations, following at most the inode's pointer and the chain of indirect blocks. Sure, it's a constant factor, but it can require 4 seeks! A sketch of the mapping is below.
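A sketch of that constant-time mapping for the direct and singly indirect cases, reusing struct inode from above (read_block is an assumed helper that returns a block's contents as an array of block numbers; the double and triple indirect cases repeat the same step):

#define BSIZE 512
#define NPTRS (BSIZE / sizeof(uint32_t))   /* block numbers per indirect block */

/* Map a byte offset to a block number: a bounded number of lookups,
   no matter how large the file is. Returns 0 for a hole. */
uint32_t bmap(const struct inode *ip, uint64_t off,
              uint32_t *(*read_block)(uint32_t))
{
    uint64_t bn = off / BSIZE;
    if (bn < 10)
        return ip->direct[bn];             /* no extra reads at all */
    bn -= 10;
    if (bn < NPTRS) {
        if (ip->indirect == 0)
            return 0;                      /* hole spanning the indirect block */
        return read_block(ip->indirect)[bn];   /* one extra read, still O(1) */
    }
    /* indirect2, indirect3: same pattern, at most a few more reads */
    return 0;
}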
If you have a corrupt file system, there will still be problems here!
