CS111: Lecture 10

File system performance

Notes for February 13, 2013
By: Gabriel Lopez and Jose Rodriguez

File System

What do we want in a file system?

  • persistent storage
  • organization
  • fast (or easy) to find stuff (if you know where it is)
  • lots of space
        It's very common to have files that hold more data than you can fit in RAM.
  • security
        Keep attackers out.
  • reliability
        The software may be buggy or the hardware may be flaky.
  • PERFORMANCE matters!
        We generally take the items above for granted, but if we don't get them right, nobody will care how fast our system performs. People often worry too much about performance: they write optimized code and forget the big picture. Still, performance is important, and we have to take care of it.
Let's think about performance:
History of hard disk performance:

Hard disk performance improves only slowly over time. This is not an entirely fair comparison, because we now get more data per revolution than we did back then, but the mechanics remain a physical limitation that affects the file system.



Design of a file system:

To read data from a random cylinder:
  1. accelerate arms in the right direction
  2. decelerate once you're about halfway to where you want to go
  3. settle on the proper track
  4. wait for data to rotate under heads
Data is read by the heads and goes into the disk buffer. From there, it could be copied to the CPU and RAM.

How long does it take?
Typical Seagate Barracuda LP 2 TB drive:
    300 MB/s external transfer (burst) rate (how much data you can get out of the disk buffer per second, over the bus)
*  90 MB/s internal transfer (sustained) rate (for large transfers, the limiting factor is how fast the disk's internal mechanism can operate)
    32 MB cache
*  5.5 ms avg. latency (assuming your read head is already on the right cylinder, how long it takes for the sector to rotate under the head)
*  10 ms avg. seek time (how long it takes to move the arm from one cylinder to another random cylinder)
    5900 rpm spindle speed = 98.33 Hz
    50,000 contact start/stops (the number of times the drive can spin down and back up before it wears out)
    10^-14 nonrecoverable read errors/bit (one bad bit per 10^14 bits read)
    0.32% AFR (annualized failure rate)
    6.8 Watts average operating power
    5.5 Watts idle power
    70 G operating shock
    300 G nonoperating shock
    2.5 bels acoustic noise (idle)
    2.6 bels acoustic noise (operating)

These are the characteristics you need to think about when you're working on performance.
The characteristics with an asterisk (*) are the main focus for performance.
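As a quick sanity check on the starred latency figure: at 5900 rpm, one revolution takes 60/5900 s ≈ 10.2 ms. On average, the sector you want is half a revolution away, so the expected rotational delay is about 5.1 ms, close to the quoted 5.5 ms average latency.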
Can't we do better than 10ms?
It takes one ns to do one instruction.
1 ns = 1 insn
So 10 ms = 10,000,000 insn

It takes about as long to get the first 500 bytes of your file from disk into RAM as it does to execute 10,000,000 instructions. 10 ms is a long time. One solution is to go with flash.

Flash "disk" "SSD" (not really a disk though)
    Corsair Force GT
    6 Gb/s external (SATA) transfer rate
    sustained transfer rate:
        280 MB/s read (sequential)
        270 MB/s write (sequential)
    MTBF (mean time between failure): 1,000,000 hours (about 100 years)
    1000 G shock
    2.0 Watts maximum power

There's a catch to using flash!
-This drive is 60 GB and costs $90.00, while a 2 TB hard drive costs $150.00. Hard drives still have a huge capacity advantage per dollar, and they are going to be around for a while.
-Writes to flash tend to be slower than reads. Random writes are especially slow because flash can't really overwrite in place: all you can do is erase a block and then write into the erased space. The drive makes this process transparent to the user. Flash also wears out electrically for low-level reasons (unlike hard disks): if you keep writing to the same low-level block, it stops working pretty fast, so the drive moves writes around (wear leveling) to avoid hammering any single block. The 1,000,000-hour MTBF assumes the user won't be writing too much. In conclusion, you don't want to be constantly writing to flash.



General strategies to get file system performance up!
SPECULATION: guessing what the application will do and doing some of the work ahead of time.
    e.g. PREFETCHING (a special case of SPECULATION): guess what will be read, and read it ahead of time.
Is PREFETCHING like caching?
    Not really. A cache holds data that has already been fetched because it was requested in the past. PREFETCHING is guessing! It fetches data even if that data has never been requested.
    e.g. BATCHING (a special case of PREFETCHING): the application asks for 512 bytes, but the file system reads 8192 bytes.
        It makes sense to get all the blocks in the neighborhood because we're likely to keep reading sequentially; see the sketch below.
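To make BATCHING concrete, here is a minimal sketch of how a read might be served out of a read-ahead buffer. This is illustrative, not the real implementation: dev_read() is a hypothetical driver call that fetches one 8192-byte block, and a real file system would keep a cache of many blocks rather than a single static buffer.

	#include <string.h>

	#define BLOCK_SIZE 8192

	static char buf[BLOCK_SIZE];   /* the read-ahead buffer */
	static long buf_offset = -1;   /* file offset buf holds; -1 = empty */

	extern int dev_read(long offset, char *dst, int n);   /* hypothetical driver call */

	/* Read n bytes at the given offset (for brevity, assume n <= BLOCK_SIZE
	   and that the request doesn't cross a block boundary). */
	int batched_read(long offset, char *dst, int n)
	{
		long block = offset / BLOCK_SIZE * BLOCK_SIZE;
		if (block != buf_offset) {
			/* Miss: fetch the whole 8192-byte neighborhood,
			   not just the n bytes that were asked for. */
			if (dev_read(block, buf, BLOCK_SIZE) < 0)
				return -1;
			buf_offset = block;
		}
		/* Hit (or freshly filled buffer): copy from RAM, no device access. */
		memcpy(dst, buf + (offset - buf_offset), n);
		return n;
	}

A sequential reader that asks for 512 bytes at a time now touches the device only once every 16 calls.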

The above strategies are mostly aimed at reading. When writing, a different strategy should be considered.

DALLYING: the application tells the file system what to do, and the file system waits a bit before doing it, with the idea of saving work overall. If it waits, the "boss" might ask for something else that can be done at the same time. It's like BATCHING for writes. Unlike SPECULATION, we wait to see whether there is more work that can be done together, instead of guessing what work will be needed.

Writing example: the application says "here are 512 bytes that I want written out to disk." The file system says "okay, I'll get to it" and waits to see whether the application has another write request. 512 bytes can be appended to a buffer 16 times, until there are 8192 bytes, and only then is the buffer finally written to disk.
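A minimal sketch of this, under the same caveats as before: dev_write() is a hypothetical driver call that writes one 8192-byte block, and a real file system would also flush a partial buffer on close or sync and check that the caller's bytes actually fit.

	#include <string.h>

	#define BLOCK_SIZE 8192

	static char wbuf[BLOCK_SIZE];   /* the dallying write buffer */
	static int  wlen = 0;           /* bytes accumulated so far */

	extern int dev_write(const char *src, int n);   /* hypothetical driver call */

	/* Accept the caller's bytes right away; only touch the disk once a
	   full block has accumulated (16 calls of 512 bytes each). */
	int dallying_write(const char *src, int n)
	{
		memcpy(wbuf + wlen, src, n);   /* assume it fits; a real FS checks */
		wlen += n;
		if (wlen == BLOCK_SIZE) {
			if (dev_write(wbuf, BLOCK_SIZE) < 0)
				return -1;
			wlen = 0;
		}
		return n;   /* report success immediately, before the disk write */
	}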
Does it make sense to DALLY while reading?
    It can help even when reading. If reads are being requested from different cylinders, it might make sense to dally to see whether more read requests arrive for the same cylinder, to avoid moving the read arm back and forth so much.

LOCALITY OF REFERENCE: file access is not random in practice (it has patterns).
    SPATIAL LOCALITY (close in space): references tend to be clustered in space.
        Near in space/distance: our next access is often very close to recent accesses.

    TEMPORAL LOCALITY (close in time): references tend to be clustered in time.
        Near in time: we will often access the same data again very soon.

    example: accesses at times 1, 2, 3 might be in the same general area; accesses at times 4, 5, 6 are in a different cluster; but later accesses (7, 8) are back in the first cluster.

We want a file system that takes advantage of these types of localities.

What are the metrics of file system performance? How can you tell if the file system has good performance?

latency : delay between request & response
throughput : how many requests per second can my file system handle?

It might be better to have a file system with high throughput than one with low latency, because it gets more requests done per second and therefore more work done overall. Or maybe you care more about latency and not so much about throughput. Latency and throughput can be competing goals.

utilization (of CPU) = % of CPU devoted to useful work (as opposed to waiting).
utilization (of disk) = % of disk devoted to useful data.

Imaginary machine (simple #s)
    1 ns cycle time (1 insn per cycle)
    1 PIO (programmed input/output) = 1000 cycles = 1 μs (1000 cycles because it has to go through the bus and coordinate with various devices)
    5 PIOs to send a command to "disk" (or SSD) = 5 μs
    50 μs "disk" latency
    125 cycles/byte = 0.125 μs/byte of computation on the data that was read (125 insns to figure out what to do with each byte)
    1 PIO/byte to read data

UNBATCHED

Let's start out with the worst program we can think of (simple, but with slow performance):

simple slow program:
	for(;;) {
		char c;
		if(sys_read(&c) < 0)
			break;
		process(c);
	}
	/* This simple program has a simple loop that reads a single byte each time. */
			
	In order to read a byte, we had to send 5 PIOs,
	wait 50 μs for the disk to wake up and do its work (latency),
	and read the data using inb (the "in byte" instruction, 1 PIO).
			
	send cmd        latency       inb        compute
	    5μs     +     50μs    +   1μs    +   0.125μs    =   56.125μs/byte
	|________________________________|
	        latency = 56μs
latency: it takes 56μs from the time we ask for the read to the time we have the data.
throughput = 1/56.125μs * 1,000,000 = 17,817 bytes/sec.
utilization = 0.125/56.125 ≈ 1/500 = 0.2%
-This utilization isn't very good: the CPU is doing useful work only 0.2% of the time.

BATCH (40 bytes)

Instead of reading a single byte at a time, we can do batching to improve utilization.

	for(;;) {
		char buf[40];
		if(sys_read(buf,40) < 0)
			break;
		process(buf,40);
	}
	send cmd        latency       inb           compute      / buffer size
	(   5μs     +     50μs    +   40μs   +   (0.125 * 40)μs) / 40 bytes    =   2.5μs/byte
	|_________________________________|
	        latency = 95μs
latency: 95μs
throughput = 1/2.5μs * 1,000,000 = 400,000 bytes/sec.
utilization = 5/100 = 5%
-This utilization is a lot better than the unbatched 0.2%.

Are we 40 times faster because we have a 40-byte buffer?
    No, we are not quite that good, because the inb and computation times also grow with the buffer. In this case, the 40-byte buffer makes us about 22 times faster (400,000 / 17,817 ≈ 22).

What if we change the buffer size to 4 million bytes?
    Real devices won't let you read blocks that big. But in theory, the latency would grow to roughly 5 seconds: 4,000,000 inb PIOs at 1μs each is already 4 seconds, plus half a second of computation. So if you want to read even a single byte, it will take seconds. Users normally start walking away if the latency is more than 100ms; 5 seconds is unacceptable.

Still, it feels like we should be able to do better than 5% utilization.

I. overlap input & computation
BATCH (40 bytes) + interrupts

We'll use device interrupts:
pseudo:
	do {
		block until there's an interrupt;
		handle interrupt; //issue PIOs for the next request, then compute
	} while(more data);
While we are computing, we'll be doing I/O at the same time. We will again read 40 bytes at a time and compute on 40 bytes at a time, but this time we'll try to do the two in parallel.
	(send cmd + latency||compute + inb) / buffer size
	(  5μs    +    50μs||5μs     + 40μs) / 40 bytes   =   2.375μs/byte
The compute time "vanished" because it happens while we are waiting for the disk. However, the improvement is small: the throughput is not much better than before, because the compute time (5μs) is small compared to the I/O time (50μs).

How do we fix the problem? The buffer is too small. You want the time you spend computing to be roughly the same as the time you spend doing I/O; with such a balanced approach, the computation and the I/O proceed exactly in parallel. Here we can multiply the compute time by 10 to match the 50μs latency, so a 400-byte buffer should work!
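Here is a sketch of the balanced, overlapped loop using double buffering. start_read() and wait_read() are hypothetical: start_read() issues the 5μs of command PIOs and returns immediately, and wait_read() blocks (e.g., on the interrupt) until the pending data has arrived.

	#define BUFSIZE 400   /* 400 bytes * 0.125μs/byte = 50μs of compute,
	                         matching the 50μs disk latency */

	extern void start_read(char *dst, int n);   /* hypothetical: issue PIOs and return */
	extern int  wait_read(void);                /* hypothetical: block until the data
	                                               is in; < 0 on EOF or error */
	extern void process(char *buf, int n);

	void overlapped_loop(void)
	{
		char buf[2][BUFSIZE];
		int cur = 0;

		start_read(buf[cur], BUFSIZE);   /* prime the pipeline */
		for (;;) {
			if (wait_read() < 0)
				break;
			start_read(buf[1 - cur], BUFSIZE);   /* the next I/O runs... */
			process(buf[cur], BUFSIZE);          /* ...while we compute */
			cur = 1 - cur;
		}
	}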

II. DMA (direct memory access)
We have a smart disk controller that can talk directly to memory. We issue commands to it, and it interrupts us when it's done. We no longer have to do an expensive PIO per byte to fetch the data out of the disk buffer, because the controller copies the data into RAM itself. You still need the PIOs to set up the operation (5μs). Assume we have a block size of 40 bytes and 50μs of latency (done in parallel with computation). There are no per-byte PIOs to worry about.

    DMA: (5μs + 50μs + 0 PIOs/byte) / 40 bytes (or 55μs/40 bytes)
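Working out the metrics from this formula: 55μs / 40 bytes = 1.375μs/byte, so throughput = 1/1.375μs * 1,000,000 ≈ 727,000 bytes/sec, and utilization = 5μs of compute per 55μs batch ≈ 9%. That's already better than the 5% we got from batching alone.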

If we keep adding zeros to the 40-byte buffer size, will our performance keep improving indefinitely?
    No. As the buffer grows, the computation time takes over; it will eventually exceed the 50μs of latency. That point is reached at a block size of about 400 bytes (400 × 0.125μs/byte = 50μs).

Something was left out: the interrupt handler! When an interrupt arrives, the hardware traps and you have to save state (such as registers) into RAM. This extra interrupt-handling overhead takes quite a few instructions, say 5μs.

If a file system does a lot of I/O, the interrupt handler can become a bottleneck.

III. DMA + polling
The code is pretty much the same as our simple program, but we don't tell the disk controller to interrupt us. There are no interrupts! Instead, we issue the commands to the disk controller, do our computation, and then start polling: "are you ready?" If we balance our I/O and computation time, the answer will be yes by the time we ask. So we avoid the overhead of interrupt handlers, which increases throughput. If we "played our cards right," our utilization will be about 50μs/55μs ≈ 91%. We normally don't want to busy-wait because it wastes the CPU, but if we've timed the I/O and computation correctly, we should be OK.
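A compact sketch of this loop, mirroring the double-buffered version above but with the blocking wait replaced by a busy poll. start_dma() and dma_done() are hypothetical stand-ins for the controller interface, and termination/EOF handling is omitted.

	extern void start_dma(char *dst, int n);   /* hypothetical: program the controller, return */
	extern int  dma_done(void);                /* hypothetical: nonzero once data is in RAM */
	extern void process(char *buf, int n);

	void polled_loop(void)
	{
		char buf[2][400];
		int cur = 0;

		start_dma(buf[cur], 400);
		for (;;) {
			while (!dma_done())
				continue;                    /* brief busy wait if balanced */
			start_dma(buf[1 - cur], 400);        /* kick off the next transfer */
			process(buf[cur], 400);              /* ~50μs of useful work meanwhile */
			cur = 1 - cur;
		}
	}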

In I, II, and III we made the assumption that we have only one CPU and one application, an assumption that holds for embedded systems. But in practice, you'll almost always have multiple applications.

IV. Multitasking
	while ((inb(0x1f7) & 0xc0) != 0x40) //while disk isn't ready
		/*instead of looping with continue, use yield*/
		yield(); //let somebody else run
Here we are taking the CPU away from a thread that can't use it and assigning it to a thread that can make better use of it. We want the CPU to do useful work as much as possible, to push the utilization toward 1. The idea of multitasking (giving up the CPU to threads with useful work to do) is so powerful that for many applications we don't need I, II, or III at all. The read() system call is implemented like this.

Next lecture: disk scheduling performance and the actual system design