CS 111 - Spring 2010

Lecture 11: File System Design

Written by: David Chen, Steven Liu, Flint Luu

Lecture Topics

Flash vs. Disk
File System Overview
RT-11 File System
FAT File System
Berkeley Fast File System

Flash vs. Disk

In the last lecture, Professor Eggert briefly mentioned flash drive technology and the differences between flash and regular disk drives.

A Flash drive, or SSD (Solid State Drive), has no moving parts and generally has lower latency for read/write operations in comparison to conventional disk drives. Random access is fast because there are no seek times involved, however blocks must be erased before writing. Flash technology is still very young and it is constantly evolving.

Below is a table comparing 2 different variations of a particular model of SSD (the Corsair Nova Series SSD's).

Capacity	Read	Write	Price
320 GB	192 MB/s	75 MB/s	$100
256 GB	250 MB/s	195 MB/s	$725

Also of interest is a comparison to a disk drive of similar specs. The following are the specs for a Seagate hard drive.

1500 GB

300 MB/s

$100

As we can see the hard drive beats the SSD's in read/write speeds but in general most numbers/figures thrown out by the manufacturers are bogus and don't correspond to actual operation speeds. In general, flash drives will beat out hard drives in write/read speeds.

Flash technology seems good and advantageous, but there are some drawbacks:

Flash wears out after numerous writes. (NAND Flash can be written about 1,000,000 times and NOR Flash 100,000)
Very expensive as capacity increases.

As flash technology progresses and becomes more mainstream, the cost will probably decrease just like it did for harddrives. For the problem of wearing out, a lot of flash drives nowadays use a technique called "wear-leveling" to prolong drive life. "Wear-leveling" simply means writes are directed to the least-written blocks in the device, and thus all the blocks age more uniformly. This technique can be implemented in the OS or the device itself (firmware). However "wear-leveling" has its own problem, as the logic overhead needed to implement it incurs a performance loss on writes.

File System Design

A File system is defined as an on-disk data structure that represents a set of files that are organized by a tree-like directory hierarchy.

RT-11 File System

FAT File System

FAT (File Allocation Table) File System

Beginning in the 1980s, the FAT file system is another computer file system architecture we went over in class which is found in many memory devices now such as memory cards. The memory lay out is shown in the figure below (with the description below that).

fat_layout

Boot Sector

The first sector in the file system (index 0). This sector includes the BIOS Parameter Block and contains the OS’s boot loader code.
For some devices, this is not necessarily the boot sector; for partitioned devices, it is the Master Boot Record.

Super Block

The super block is the next sector in the FAT file system.
This block holds the version, size of the file system, the blocks in use, and the location of the root directory.

File Allocation Table

The file allocation table is a list of entries mapping each cluster on the partition.
It holds the metadata about the files.
It holds integers, one entry for each data block in respective ordering (the first integer is the first data block, second is the second data block, etc.).
The integers are as follows:
- -1: Unused block
- 0: EOF (the last block of the file)
- 0: Index of the next block in the file
Data Blocks
- These are the data blocks in the FAT file system.
- The data blocks can vary, but are typically 4KiB.

For the FAT file system, there can be regular files and directories.

Regular files have their data stored in the assigned blocks, with the file allocation table giving the respective block numbers of the file.
Directories have assigned blocks, much like a file but in the blocks, it is a table of directory entries.
- The directories have a file name, type, size, and the first block number of the file.
- The block is shown below in the figure:

Berkeley Fast File System (Unix-like filesystem)

Developed by a Berkeley graduate student, this file system has been used by many Unix and Unix-like operating system. There have been minor variants to this version but for the most part, the BFFS has a solid foundation. The disk space structure is divided into the following:

Boot Sector: Information on disk booting and loading up the OS.
Super Block: Contains metadata about the filesystem itself (version number, size, amount of free space, and location of rootdir).
Bitmap Block:Indicates used or unused blocks Free blocks are indicated by 0 and used blocks are indicated by 1. Unlike FAT, this method is more efficient because one bit indicates the availability of a block while FAT allocates data corresponding to an array and fragments over time.
Inode Table: This table stores all the inodes, which contain all the file information and form the basis of this file system.
Data blocks: Contains the data; depending on the size of the file, these blocks could instead be pointers that divide the file into even more blocks.

The idea of inodes introduces a few new implementations and the idea of holes. The reason for inodes is we want to find files fast (thus a pointer-type hierarchy) and we want those files to hold a lot of data (data blocks with pointers to more data).

Inodes (Index nodes)

Index nodes are fixed-size file descriptors that hold a file's metadata such as size, permissions, owner, links, and timestamps. A single inode represents either a file or directory, but the interesting thing is they DO NOT hold file/directory names. Read under links for more details. Each inode holds at least a certain number of direct block pointers, a few indirect block pointers, and possibly even doubly indirect block pointers. These pointers point to the actual data that the file stores and depending on the type of UNIX file system, block sizes may vary.

Assuming each block is 1024 (1 KB) and pointers are 4 bytes, then each data block can hold 256 pointers. That means an indirect block can hold up to 256*1024 = 260 KB and that doubly indirect blocks can hold up to 67 KB. This doesn't seem like much but I'm assuming quite a low block size count. The illustration below provides a simple look at inode hierarchy:

inode_hiearchy

Links

Earlier, we mentioned that inodes do not have file/directory names, or to put it more clearly, files and directories in linux are not explicitly identified by their names. Each file has its own inode number and you can use the ls command in linux to find that number and more inode information. Because of identification by inode number, unique files can have the same name.

There are two types of links. The first is a symbolic link, also known as a soft link. Think of this as the shortcuts you create to files when using Windows. If you delete the original file, you can no longer use the symbolic link because it is now broken or dangling. In other words, symbolic links are just pointers to the actual files and its contents are the filename it is pointing to. Note: It is possible for loops to occur if symbolic links point to each other.

The second type of link is a hard link. Each inode has a link count and it refers to the number of hard links that refer to a specific file. A hard link not only points to the original inode but also possesses its same contents by cloning them. As long as the link count is not 0, a file will exist. Otherwise, the file's data block will finally be deallocated.

Holes

The main issue with the inode hierarchy is the existence of holes. In BFFS, file allocation is contiguous and gets rid of the issue of external fragmentation. Unfortunately, in this system, we have to replace deallocated addresses with 0's and that creates holes. When we read from these, we get 0's and when we write, these absent blocks get allocated and hold data. But once those are allocated, they become new holes as well. Holes are quite small and insignificant but they do still take away some memory.

inode_hole

CS111 Operating Systems Principles, UCLA. Paul Eggert. April 27, 2010.