CS 111: Spring 2010

Lecture 13: File System Robustness

By Khai Nguyen

Terminology

 

Error: mistake make by designer (in designer’s head)

Fault: bug in code, not triggered yet

Failure: bug is triggered, user notices.

File System Goals

 

Durability: survive (limited) failure in hardware (need a failure model)

                Ex: power failure while writing a block

 

Atomicity: change are either made or not made

 

Performances: throughput + latency

Case Study

 

System call: rename( “a”, “b”)

 

Simple Pseudo-code

1.       Find data block

2.       Read it to RAM

3.       Change ‘a’ to ‘b’

4.       Write block to disk

 

First glance:

·         Power failure happen at any step is just fine, it would not affect File System integrity

 

Missing:

·         What if ‘b’ already exists?

·         Updated Pseudo-code

1.       Find data block

2.       Read it to RAM

§  If ‘b’ already exist

·         Read B data block

·         Decrement link count

·         Update bitmap

·         Write out to disk

3.       Change ‘a’ to ‘b’

4.       Write block to disk

 

First glance:

·         If power failure happens, our disk data will inconsistent

o   Sector write is atomic, but block write is not.

·         Attempt: why don’t rename() fail if file exist?

o   Ok, what if you try to rename a directory? Create a directory is not atomic.

·         No matter how you fix the code, it will not work correctly

o   Extreme example:

§  rename(“d/f”, “e/g”)

§  Resolving the path, etc… a lot of windows for failure

What we want

 

File system correctness invariants

1.       Every block in file system is used for exactly 1 purpose: boot block, super block, bitmap, inode, data.

a.       No 2 table entry point to same block, except for hardlink

2.       All reasonable blocks are properly initialized

3.       All reachable data are marked used in bitmap

4.       All unreachable data block are marked.

5.       Every inode link count is no less than the directory entry for that inode.

6.       Every inode link count is no GREATER than the directory entry for that inode.

 

Consequences of violation

1.       Disaster!

2.       Disaster!

3.       Disaster!

4.       Memory leak!

5.       Disaster!

6.       Memory leak!

What can we do

 

Recover from power failure:

·         fsck command

o   check for file system, looking for leaks, fixes bitmap accordingly

o   file system should be unmounted while doing fsck

o   also check for disaster and repair them as best (by some heuristic like looking indirect block, etc…)

o   disaster file store in /usr/lost+found/

 

In order to help recovery: write block has to be atomic

 

Q: How can we implement atomic write atop hardware that doesn’t?

A: Think of write as a slow phase between scene a and scene b. The slow phase is undefined and contain garbage data

 

Simple approaches:

 

Write 3 copies:

 

Time t ----->

A

X

?

Y

Y

Y

Y

Y

Y

B

X

X

X

?

Y

Y

Y

Y

C

X

X

X

X

X

?

Y

Y

 

read:

if (A == B)

                return A;

else

                return C;

 

Write 2 copies:

 

Time t ---à

A

X

?

Y

Y

Y

B

X

X

X

?

Y

 

read:

if (A == B)

                return A;

else

                return B;

 

 

Lampson-Sturgi’s assumption

1.       writes may fail, or may corrupt other piece of storage. A later read can detect the bad write

2.       storage may spontaneously decay

3.       Errors are rare such that you can correct them before they cascade.

 

Performance as the enemy of robustness

 

Q: is there any way to speed up write?

 

1.       File system need 3 dozens write done

a.       Sort them by disk such that disk arm can write in one rotation

b.      Write in sorted order

c.       Problem: HAS read/write dependency

d.      Solution: Add extra variable to put dependency while sorting

 

Emacs:

 

Saving file

 

Write()

 

<-------- suppose to crash here = violate Golden Rule of Atomicity: never write over your last copy of file

 

Write()

Write()

 

How to fix:

 

Idea #1: Commit record

                Mark high level atomic by look at commit record

 

High level atomic operation

Begin

                Write write write

Commit

Post commit phase

End

 

Idea #2: Journaling

1.       Divide disk into 2 partion

a.       Data cells

b.      Journal log

                                                               i.      Size depend on the disk

                                                             ii.      May put on disk to increase performance

                                                            iii.      Fsck know your journal system

                                                           iv.      Write to log is sequential

 

A/ Logging protocol: write ahead logging

A.      Begin

a.       Log update into journal

b.      Commit (write commit record)

c.       Write new data to cell

B.      End

 

B/ Logging protocol: write behind logging

A.      Begin

a.       Log all old value

b.      Install new value into the cell

c.       Commit

B.      End

 

Advantage of A

1.       Need bit read the old value

 

Advantage of B

1.       Better for bugging system

2.       Faster reboot

 

Additional Reading: wiki ext4