CS 111: Lecture 16 - RAID and Distributed Systems

Scribe Notes for Tuesday, November 23, 2010

By Andrew Reyes
Some images used are from the RAID Wikipedia article
and the Failure Trends in a Large Disk Drive Population article by Google

On Failures

So far we have only covered the kinds of failures where our machine loses power. Now we will assume that the disk drives go bad while we still have power. This is called a Media Failure.

Recall the Lampson-Sturgis assumptions:

- A write to storage may fail, or may corrupt another piece of storage.
- A later read can detect the bad block.
- Storage can spontaneously decay: a block that was written correctly can go bad on its own.
- Errors are rare.

The idea is that we will either detect the failure when we write, or when we read and see that the value is wrong. Media faults are typically found on reads, but sometimes on writes.

Some people are paranoid about this. So what do they do?

Read-After-Write: They want to detect these failures as soon as possible, so they read after every write.
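
A minimal sketch of the idea in C (careful_write is a made-up name; a real implementation would also have to bypass the buffer cache, e.g. with O_DIRECT, or the read-back may never touch the disk):

    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Write a block, read it straight back, and compare; a mismatch
       means a media fault. */
    int careful_write(int fd, const char *buf, size_t size, off_t off) {
        char check[512];
        if (size > sizeof check)
            return -1;                     /* keep the sketch simple  */
        if (pwrite(fd, buf, size, off) != (ssize_t)size)
            return -1;                     /* the write itself failed */
        if (pread(fd, check, size, off) != (ssize_t)size)
            return -1;                     /* the read-back failed    */
        return memcmp(buf, check, size) == 0 ? 0 : -1;
    }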

But all of the techniques we've talked about so far are helpless in the presence of media faults! Media faults cannot be solved by logging, because the log itself goes bad. "You're still hosed." You might have duplicate copies, but in general, you have data on the disk once and then you lose it.

RAID To The Rescue

Redundant Arrays of Inexpensive Disks

The original motivation was that this solution was inexpensive. Of course, if you wanted to make a lot of money selling RAID solutions, you'd want to charge more for them.
Now, RAID stands for Redundant Arrays of Independent Disks.

History of RAID

RAID was introduced in a 1988 Berkeley paper by Patterson, Gibson, and Katz, which argued that an array of many inexpensive commodity disks could beat the single large expensive disks of the day on cost and performance.

RAID 0: Striping Without Parity Or Mirroring

[Figure: RAID 0 striping layout]

RAID 1: Mirroring Without Parity Or Striping

[Figure: RAID 1 mirroring layout]

Can We Combine Concatenation and Mirroring?

Yes, we can concatenate mirrored disks!
We can also mirror concatenated disks.

Which is better? This is apparently a controversy, but in a high-level analysis, they are the same.


RAID 4: Striping with Dedicated Parity

Mirroring is great because it gives you redundancy, but it comes at a cost.
If you want 1TB of space, you have to buy 2TB worth of drives.
The idea of RAID 4 is to bring the redundancy overhead down from 100% (RAID 1) to, say, 25%: with four data disks plus one dedicated parity disk, you only pay for one extra drive per four drives of real data.

[Figure: RAID 4 layout with a dedicated parity disk]
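
To make the parity idea concrete, here is a small C sketch (my own illustration, not from the lecture): the parity block is just the XOR of the corresponding data blocks, so any single lost block can be rebuilt by XORing the survivors.

    /* RAID 4 parity sketch: parity = XOR of all data blocks. */
    #include <stdio.h>

    #define NDISKS 4          /* data disks; a 5th disk would hold parity */
    #define BLOCK  8          /* bytes per block, tiny for the example    */

    int main(void) {
        unsigned char data[NDISKS][BLOCK] = {
            "disk0..", "disk1..", "disk2..", "disk3.."
        };
        unsigned char parity[BLOCK] = {0};

        /* Writing: compute parity = d0 ^ d1 ^ d2 ^ d3, byte by byte. */
        for (int d = 0; d < NDISKS; d++)
            for (int i = 0; i < BLOCK; i++)
                parity[i] ^= data[d][i];

        /* Suppose disk 2 dies.  Rebuild it from parity and the rest. */
        unsigned char rebuilt[BLOCK];
        for (int i = 0; i < BLOCK; i++) {
            rebuilt[i] = parity[i];
            for (int d = 0; d < NDISKS; d++)
                if (d != 2)
                    rebuilt[i] ^= data[d][i];
        }
        printf("recovered: %s\n", (char *)rebuilt);  /* prints "disk2.." */
        return 0;
    }

RAID 5 computes parity the same way; only the placement of the parity blocks differs.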

RAID 5: Striping with Distributed Parity

How can we fix the hot spot/bottleneck problem of RAID 4?
Let's make every disk a parity disk! The parity blocks are distributed across all of the disks, so no single disk becomes the bottleneck.

[Figure: RAID 5 layout with parity distributed across all disks]

Failure Terminology and Analysis


MTTF: Mean Time To Failure (how long, on average, until it breaks?)
MTBF: Mean Time Between Failures (time from one failure to the next; MTBF = MTTF + MTTR)
MTTR: Mean Time To Repair (how long does it take to fix it?)
Availability: MTTF / MTBF
Downtime: 1 - Availability
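
For example (numbers assumed for illustration): a machine that runs an average of 990 hours before failing (MTTF = 990 h) and takes 10 hours to repair (MTTR = 10 h) has MTBF = 990 + 10 = 1,000 h, availability = 990/1000 = 99%, and downtime = 1%.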

MTTF is a very misleading measure. Western Digital claims its disks' MTTF = 300,000 hours, which is about 34 years. Obviously, they didn't test a hard drive for 34 years. Instead, they ran a whole bunch of disks in parallel and calculated from there, most likely with the equation:

    MTTF = (length of the test * number of disks tested) / (number of disks that failed during the test)
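
For instance (made-up but plausible numbers): run 1,000 disks for 3,000 hours each, and suppose 10 of them fail during the test. Then MTTF = (3,000 * 1,000) / 10 = 300,000 hours, even though no individual disk was ever observed running anywhere near that long.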

Annualized Failure Rate (AFR) is a much better measure!
AFR: if you run a disk for a year, what is the probability that it fails during that year?
AFR is difficult for disk drive manufacturers to calculate. It's much easier to calculate if you actually run your disks 24/7.
In 2007, Google published a paper that calculated the AFR for a large population of real disks.

Below is the actual graph published by Google, with Eggert's expected curve added by me.

[Figure: Google's measured AFR by drive age, with Eggert's expected bathtub curve overlaid]

Eggert's expected curve suggests many drives should fail within the first year, but Google's data shows otherwise.
Eggert suspects that disk manufacturers catch many of the flaky hard drives and repair them or throw them away before they reach consumers, thus starting everyone off near the bottom of the bathtub curve.

What is the failure rate of a RAID 5 box?

Annual Failure Rate of a RAID 5 Box

[Figure: annual failure rate of a RAID 5 box over its lifetime]

Initially, if one disk fails, we're still OK. If two drives fail, then we're hosed! RAID 5 must suffer two near-simultaneous failures (the second striking before the first is repaired) to lose data.
The reliability of a RAID box is therefore heavily dependent on the operator. In practice, the operator is unreliable.
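
A rough back-of-the-envelope estimate (my own sketch, with assumed numbers): suppose the box has N = 5 disks, each with AFR f = 3%, and a dead disk takes MTTR = 24 hours to replace. The box loses data only if a second disk dies during that repair window, so roughly:

    AFR(box) ~= N*f * (N-1)*f * (MTTR / 8760 hours)
             ~= 5(0.03) * 4(0.03) * (24/8760)
             ~= 0.00005, i.e., about 0.005% per year

That is far better than the 3% AFR of a single disk, provided the operator actually replaces failed disks promptly.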

One way to increase reliability is to have hot spares: spare disks that are connected but sit unused until another disk fails.
The big assumption we make is that failures are independent; this is one of the motivations for the name Redundant Arrays of Independent Disks. If failures weren't independent, all of RAID would be pointless: during a power spike, all the drives can fail at once, and RAID won't help. To avoid this problem, people distribute the drives geographically. Banks do that. UCLA does that.

Distributed Systems

Distributed systems use message passing as their means of communication. In order for Computer A to get work done, it sends a request to Computer B, and B sends back a response: a request-response pair. This is all in contrast to function calls or system calls.

[Figure: a request/response pair between Computer A and Computer B]

One of the most common ways of abstracting request/response is by using Remote Procedure Calls, or RPC. A message could look like close(19). That looks like a function call, but underneath it's different.

Differences between RPC and traditional system/function calls:

- The caller and callee do not share an address space: there is no call by reference, so arguments and results must be copied.
- The caller and callee do not share a stack.
- The caller and callee may run on different architectures, with different byte orders or word sizes.

Disadvantages of RPC:

- Messages can be lost or corrupted in transit.
- The network, or the machine at the other end, can be down or slow.
- In general, the caller cannot tell whether a request that got no response was actually executed.

Ways to deal with architectural differences:

1. The sender converts the data into the receiver's format before sending it.
2. The sender ships the data in its own format, and the receiver converts it.
3. Both sides agree on one standard format for data on the wire; the sender converts into it, and the receiver converts out of it.

Most people choose option three.

When we do option one or option three, we need to perform Marshalling.

Marshalling (aka Serialization, Pickling)

Marshalling means converting data into a standard network order, sending it across the wire in that order, and converting it back into whatever format the receiver needs.

[Figure: marshalling data into network byte order and back at the receiver]

In the big endian/little endian example, a little endian machine would have to convert the data it wants to send into big endian, send it across the network, and then the machine at the other end converts it from big endian into whatever format it needs.
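
For 32-bit values, the standard htonl/ntohl conversions do exactly this; a minimal C sketch:

    #include <arpa/inet.h>   /* htonl, ntohl */
    #include <stdint.h>
    #include <string.h>

    /* Sender: put a 32-bit argument on the wire in network
       (big-endian) byte order. */
    size_t marshal_int32(unsigned char *wire, int32_t value) {
        uint32_t net = htonl((uint32_t)value);   /* host -> big endian */
        memcpy(wire, &net, sizeof net);          /* raw bytes, no padding */
        return sizeof net;
    }

    /* Receiver: pull the bytes off the wire and convert back. */
    int32_t unmarshal_int32(const unsigned char *wire) {
        uint32_t net;
        memcpy(&net, wire, sizeof net);
        return (int32_t)ntohl(net);              /* big endian -> host */
    }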

It would be such a pain to marshal by hand every time you wanted to do a function call. What actually happens in an RPC-based system is that the application calls a simple function, like close or write. Underneath, in the network code, there are a bunch of sends and receives. And in between, there are some stubs, which are automatically generated from a protocol description.
So in effect, you write a little program (in a C-like language) that describes what the network protocol looks like and what stubs you want. Then a compiler automatically generates the code that does this network interface.
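
Here is roughly the kind of code a stub generator emits for close, hand-written as a C sketch (the opcode value and the send_to_server/recv_from_server helpers are made up for illustration):

    #include <arpa/inet.h>    /* htonl, ntohl */
    #include <stddef.h>
    #include <stdint.h>

    enum { OP_CLOSE = 6 };    /* opcode; value chosen arbitrarily */

    /* Stand-ins for the generated network code. */
    extern void send_to_server(const void *msg, size_t len);
    extern void recv_from_server(void *msg, size_t len);

    /* Client-side stub: looks like close(fd) to the application, but
       it marshals the opcode and argument, ships them off, and waits
       for the marshalled return value. */
    int rpc_close(int fd) {
        uint32_t req[2] = { htonl(OP_CLOSE), htonl((uint32_t)fd) };
        uint32_t reply;

        send_to_server(req, sizeof req);
        recv_from_server(&reply, sizeof reply);
        return (int)(int32_t)ntohl(reply);    /* server's return value */
    }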

An example of RPC in action is the X Windowing System

[Figure: RPC between an emacs client and the X server]

In this example, our emacs code wants to make a pixel on our monitor blue. It sends an RPC request to do so, gets a response back, and the pixel turns blue on our screen.

Can you think of an RPC that you use every day?
It turns out HTTP, the Hypertext Transfer Protocol, is an example of an RPC. You send a request, you get a response back.
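
For example, a minimal HTTP exchange is just a request message followed by a response message (simplified, with made-up values):

    GET /index.html HTTP/1.1
    Host: www.cs.ucla.edu

    HTTP/1.1 200 OK
    Content-Type: text/html
    Content-Length: 1234

    <html>...</html>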

How RPC Fails

A big advantage is that we get hard modularity, but we get some disadvantages as well.

RPC Failure Modes:

- Messages can get lost.
- Messages can get corrupted.
- The network might be down, or just slow.
- The server might be down, or just slow.

Solutions:

- If the message was corrupted or no response arrives, resend the request, and keep trying until a response comes back.
- If no response arrives, give up and report an error to the caller.

The above are the two major simple ways of handling errors in RPC.

AT-LEAST-ONCE-RPC: keep resending the request until a response arrives. The server may end up executing the operation more than once, so this is best suited to idempotent operations (e.g., reading a block).

AT-MOST-ONCE-RPC: send the request once; if no response arrives, give up and return an error to the caller. The operation will never execute twice, so this is safer for non-idempotent operations (e.g., transferring money).
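
The two policies, sketched in C (send_request and wait_for_response are hypothetical stand-ins for the network layer; the latter returns 0 on success and -1 on timeout):

    struct request;                       /* opaque request object */
    extern void send_request(struct request *r);
    extern int  wait_for_response(struct request *r);

    /* At-least-once: keep resending until a response arrives.  The
       server may execute the call several times, so this suits
       idempotent operations like reading a block. */
    int at_least_once(struct request *r) {
        for (;;) {
            send_request(r);
            if (wait_for_response(r) == 0)
                return 0;
        }
    }

    /* At-most-once: one try; on timeout, report the error.  The call
       never executes twice, so this suits non-idempotent operations
       like a bank transfer. */
    int at_most_once(struct request *r) {
        send_request(r);
        return wait_for_response(r);      /* 0 on success, -1 on error */
    }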

While all this is good, what we want is EXACTLY-ONCE-RPC. Unfortunately, it is very hard to implement. So hard that we've practically given up on it. It is simply not done.

Performance Issues That Could Really Hurt RPC

Suppose with emacs, we're drawing a bunch of blue pixels on the screen. Machine A sends a request to draw a blue pixel. Machine B responds OK and the pixel is drawn. Now A sends another request to draw another blue pixel. Machine B responds OK and draws another pixel.

The performance is going to stink. You're spending most of your time waiting for the bytes to fly over the wire, you have some decoding overhead, and the actual amount of time you're doing any real work is practically zero!

One way to avoid this is to coalesce calls: instead of issuing the drawPixel command over and over, use DrawRectangle((10,20,5),Blue).
However, this complicates the API unnecessarily.

Asynchronous Calls

Instead, what we do is send requests in rapid succession, as fast as we can, without waiting for responses. The responses will also come back in rapid succession. If the amount of work needed to satisfy each request is relatively small, this approach can be way faster. (The assumption is that there is a single wire between A and B, so you can't really send requests in parallel.)

What we have, in effect, is asynchronous calls.
What if an outstanding request fails? We could either make all outstanding requests independent of one another, or be prepared to handle such asynchronous failures.
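
A minimal C sketch of the pipelined approach (send_request and recv_response are hypothetical stand-ins for the network layer):

    #include <stddef.h>

    extern void send_request(int id);     /* fire off request #id  */
    extern int  recv_response(void);      /* returns -1 on failure */

    void draw_pixels_pipelined(size_t n) {
        /* Phase 1: send every request back to back, without waiting;
           the wire stays busy the whole time. */
        for (size_t i = 0; i < n; i++)
            send_request((int)i);

        /* Phase 2: the responses come back in rapid succession.  A
           failure here arrives long after the call that caused it,
           so the caller must be prepared to handle it asynchronously. */
        for (size_t i = 0; i < n; i++) {
            if (recv_response() == -1)
                /* handle the asynchronous failure */ ;
        }
    }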

Next week, we will combine RAID and RPC in our next topic: Network File Systems.
