CS 111 Lecture 16 Scribe Notes - Spring 2012

by Henish Shah & Dony George for a lecture by Professor Paul Eggert on May 29, 2012

 

Continuing the previous class's discussion of the consequences of power failure, this class began with a discussion of disk failure.

One possible way to address this issue is journaling: whenever the file system is changed, the change is first recorded in a journal. When the system crashes, the recovery code can scan the journal, see which operations were in progress at the time of the crash, and take appropriate measures so that no operation is left half-done or corrupts the file system. However, this approach does not help in the case of a disk failure, since the journal itself lives on the disk.
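To make the ordering concrete, here is a minimal sketch of one journaled write in C using POSIX I/O. The record layout, the 512-byte block size, and the function name are invented for illustration; the essential point is the ordering: the intent record is forced to disk before the data is modified, and the "done" record is written only afterwards, so recovery can replay any intent record that has no matching done record.

    /* Minimal write-ahead journaling sketch (hypothetical record layout).
     * Protocol: (1) append an intent record to the journal and force it to
     * disk, (2) apply the change to the data file and force it to disk,
     * (3) append a "done" record. After a crash, recovery replays any
     * transaction that has an intent record but no done record. */
    #include <sys/types.h>
    #include <unistd.h>

    struct record {
        long  txid;        /* transaction id */
        int   done;        /* 0 = intent, 1 = committed */
        off_t offset;      /* where the data block goes */
        char  block[512];  /* new contents of that block (size assumed) */
    };

    int journaled_write(int journal_fd, int data_fd, struct record *r)
    {
        r->done = 0;
        if (write(journal_fd, r, sizeof *r) != sizeof *r) return -1;
        if (fsync(journal_fd) != 0) return -1;   /* intent is durable first */

        if (pwrite(data_fd, r->block, sizeof r->block, r->offset) < 0)
            return -1;
        if (fsync(data_fd) != 0) return -1;      /* data is now in place */

        r->done = 1;                             /* mark transaction complete */
        if (write(journal_fd, r, sizeof *r) != sizeof *r) return -1;
        return fsync(journal_fd);
    }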

RAID

RAID 0 - Concatenation

With concatenation, several physical disks are joined end to end so that they appear as one larger virtual disk. There is no redundancy: if any one disk fails, that disk's data is lost.

[Figure: RAID 0 concatenation layout]

RAID 1 - Mirroring

With mirroring, every write is performed on two (or more) disks holding identical copies, so the array survives the failure of one disk at the cost of halving the usable capacity.

[Figure: RAID 1 mirroring layout]

 

RAID 4 - Parity

RAID 4 adds a dedicated parity disk to a set of data disks: each parity block is the exclusive OR of the corresponding blocks on the data disks, so the contents of any one failed disk can be reconstructed from the survivors.

[Figure: RAID 4 layout with a dedicated parity disk]

RAID 5 - Parity with striping

RAID 5 works like RAID 4, except that the parity blocks are striped across all of the disks instead of living on a single dedicated disk, which removes the parity disk as a write bottleneck.

[Figure: RAID 5 layout with parity striped across all disks]
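The parity arithmetic in RAID 4 and RAID 5 is just XOR. The sketch below shows both directions: computing a stripe's parity block, and rebuilding a lost disk's block from the survivors plus parity. The block size and stripe width are assumptions for illustration, not numbers from the lecture.

    /* RAID 4/5 parity sketch: the parity block is the XOR of the data
     * blocks in the stripe. XORing the surviving blocks with the parity
     * block reproduces the missing block. */
    #include <stddef.h>

    #define NDATA 4        /* data disks per stripe (assumed) */
    #define BLOCK 512      /* bytes per block (assumed) */

    /* Compute the parity block for one stripe. */
    void compute_parity(const unsigned char data[NDATA][BLOCK],
                        unsigned char parity[BLOCK])
    {
        for (size_t i = 0; i < BLOCK; i++) {
            unsigned char p = 0;
            for (int d = 0; d < NDATA; d++)
                p ^= data[d][i];
            parity[i] = p;
        }
    }

    /* Rebuild disk `lost` from the survivors plus parity: the same XOR. */
    void rebuild(unsigned char data[NDATA][BLOCK],
                 const unsigned char parity[BLOCK], int lost)
    {
        for (size_t i = 0; i < BLOCK; i++) {
            unsigned char p = parity[i];
            for (int d = 0; d < NDATA; d++)
                if (d != lost)
                    p ^= data[d][i];
            data[lost][i] = p;
        }
    }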

 

Typical Disk Failure Characteristics

Disk drive manufacturers quote a Mean Time To Failure (MTTF) of around 300,000 hours for a typical disk (300,000 h ÷ 8,760 h/year ≈ 34 years). In practice, however, disks are kept for only about 5 years: after that it is cheaper to buy a new disk than to keep running the old one, because newer disks consume less power.

[Figure: probability density of failure over time for a single disk]

The above figure shows the probability density function (PDF) of a single disk failing over time. Interestingly, there is a relatively high probability of failure in the initial period, due to manufacturing defects. This is followed by a long, stable period with a low failure rate. As the disk wears out, the curve rises steeply again.

[Figure: probability density of failure for a RAID 4 array whose disks are never replaced]

The above figure shows the failure PDF for a RAID 4 system installed in space, that is, a RAID 4 system whose disks cannot be replaced when they fail. As in the previous figure, there is a noticeable probability of failure in the initial period, but it is lower than for a single disk because of the redundancy introduced by the parity disk. As time goes on, though, the failure rate climbs to as much as four times that of a single disk: once one disk has died and cannot be replaced, a failure of any one of the remaining disks (four of them, in a five-disk array) loses data.

[Figure: probability density of failure for a RAID 4 array whose failed disks are promptly replaced]

The above figure shows the failure PDF for a RAID 4 system installed in a typical server room. In this scenario an operator is expected to replace each disk as soon as it fails, so data is lost only if a second disk dies during the short window before the first one is replaced and rebuilt. As a result, the failure rate does not climb steeply after the five-year mark the way it does when disks are never replaced.
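A small simulation makes the replace-versus-don't-replace difference vivid. This is only a sketch under strong assumptions that were not made in lecture: exponential disk lifetimes (which ignores the bathtub shape above), a five-disk array with one parity disk, and a 24-hour rebuild after each failure.

    /* Monte Carlo sketch: mean hours until data loss for a RAID 4 array,
     * with and without replacing failed disks. Exponential lifetimes
     * (memoryless) are assumed, so "first failure among n disks" has
     * rate n/MTTF. Data is lost at the second failure while degraded. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define MTTF     300000.0   /* hours per disk (from lecture) */
    #define NDISKS   5          /* array size (assumed) */
    #define REBUILD  24.0       /* hours to replace + rebuild (assumed) */
    #define TRIALS   10000

    static double exp_rand(double rate)   /* exponential sample, mean 1/rate */
    {
        return -log(1.0 - drand48()) / rate;
    }

    /* Never replace disks: with one parity disk, the array survives the
     * first failure and dies at the second. */
    static double life_no_replace(void)
    {
        double t = exp_rand(NDISKS / MTTF);        /* first failure */
        return t + exp_rand((NDISKS - 1) / MTTF);  /* second one kills it */
    }

    /* Replace promptly: data is lost only if another disk dies during
     * the rebuild window. */
    static double life_with_replace(void)
    {
        double t = 0;
        for (;;) {
            t += exp_rand(NDISKS / MTTF);              /* some disk fails */
            double next = exp_rand((NDISKS - 1) / MTTF);
            if (next < REBUILD)                        /* failed mid-rebuild */
                return t + next;
            t += REBUILD;                              /* rebuilt; start over */
        }
    }

    int main(void)
    {
        double sum1 = 0, sum2 = 0;
        srand48(12345);
        for (int i = 0; i < TRIALS; i++) {
            sum1 += life_no_replace();
            sum2 += life_with_replace();
        }
        printf("mean hours to data loss, no replacement:   %.0f\n",
               sum1 / TRIALS);
        printf("mean hours to data loss, with replacement: %.0f\n",
               sum2 / TRIALS);
        return 0;
    }

With these numbers the no-replacement array loses data within a few years' worth of hours on average, while the prompt-replacement array survives for orders of magnitude longer, which matches the flatter curve in the figure above.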

RAID does not make backups redundant, however: RAID protects against disk failure, not against user error. If a user accidentally deletes or corrupts a file, RAID faithfully replicates the mistake on every disk.

Distributed Systems and RPC

Remote Procedure Call (RPC)

In a remote procedure call, the caller invokes what looks like an ordinary function, but a client stub marshals the arguments into a request message, sends it to a server machine, and blocks until the reply message comes back, at which point the result is unmarshaled and returned. Unlike a local call, the caller and callee share no address space, and every call requires a network round trip.

[Figure: RPC request/response message flow between client and server]

As the diagram shows, a call can therefore suffer very long delays: the request and the response each have to cross the network.
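Here is a tiny sketch of the stub pattern in C. The message format and operation codes are entirely hypothetical, and the network is replaced by a direct function call so the sketch runs locally; in a real RPC system the commented send/receive step is where the round-trip delay occurs.

    /* Minimal RPC stub sketch for an add(a, b) call (hypothetical
     * message format). The client stub marshals the arguments into a
     * request message, "sends" it, blocks until the reply arrives, and
     * unmarshals the result. */
    #include <stdio.h>
    #include <stdint.h>

    struct msg { uint32_t op; int32_t arg1, arg2, result; };

    enum { OP_ADD = 1 };

    /* Server-side dispatch: in a real system this runs on another machine. */
    static void server_handle(struct msg *m)
    {
        if (m->op == OP_ADD)
            m->result = m->arg1 + m->arg2;
    }

    /* Client stub: looks like a local call, hides the message exchange. */
    static int32_t rpc_add(int32_t a, int32_t b)
    {
        struct msg req = { .op = OP_ADD, .arg1 = a, .arg2 = b };
        /* send(req); ... network delay ... reply = recv(); */
        server_handle(&req);          /* stand-in for the round trip */
        return req.result;
    }

    int main(void)
    {
        printf("2 + 3 = %d\n", rpc_add(2, 3));  /* caller never sees the net */
        return 0;
    }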

NFS Protocol



