CS 111

Lecture 16 (5/29/12)

Abimael Arevalo

Media Faults

Disk failures: solvable via logging (journaling)

FIGURE 1
If an unchanged sector fails, the data cannot be reconstructed using the journal.

RAID (Redundant Array of Independent Disks)

  • Simulate a large drive with a lot of little drives.
  • Save money and gain reliability.
RAID levels a la Berkeley
  • RAID 0: no redundancy, just a simulated big disk.
    • Concatenation: performance of the virtual disk is roughly that of a physical disk.
      FIGURE 2
    • Striping: Divide A, B, C, and D into pieces and place the pieces across the drives. Each drive can run in parallel to extract the data, so virtual disk performance is roughly four times that of a single physical disk.
      FIGURE 3
    • Growing is easier in concatenation than in striping.
  • RAID 1: Mirroring
    FIGURE 4
    • Write to both drives.
    • Read from either (can pick the closest disk head).
      ASSUMPTION: reads can detect faults.
  • RAID 2,3,4,5,6,7,...
  • RAID 4
    FIGURE 5
    • Reads are like RAID 0 concatenation, so read performance is worse than RAID 0 striping.
    • Writes are like RAID 1 (the parity drive E must be read and rewritten on every write).
    • If C fails, it can be rebuilt: C = A ^ B ^ D ^ E ('^' = XOR); see the sketch after this list.
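A minimal sketch of the two ideas above, assuming 4 data drives (A-D) plus one parity drive (E); the constants and function names are illustrative and real disk I/O is omitted. It shows how striping maps a virtual block to a (drive, block) pair, and how RAID 4 rebuilds a lost block by XOR-ing the surviving blocks.

     #include <stddef.h>
     #include <stdint.h>

     #define NDATA 4        /* data drives A..D; a fifth drive E holds parity */

     /* RAID 0 striping: map a virtual block number to (drive, block on drive). */
     static void stripe_map(size_t vblock, size_t *drive, size_t *pblock)
     {
         *drive  = vblock % NDATA;   /* round-robin across the data drives */
         *pblock = vblock / NDATA;   /* same offset on every drive */
     }

     /* RAID 4: the parity block on E is the XOR of the corresponding data
        blocks, so any ONE lost block is the XOR of the surviving four. */
     static void rebuild_block(const uint8_t *survivors[NDATA],  /* the 4 good blocks */
                               uint8_t *lost, size_t blocksize)
     {
         for (size_t i = 0; i < blocksize; i++) {
             uint8_t x = 0;
             for (int d = 0; d < NDATA; d++)
                 x ^= survivors[d][i];
             lost[i] = x;            /* e.g. C = A ^ B ^ D ^ E */
         }
     }

The same XOR that rebuilds a block is also how the parity on E is computed in the first place, which is why every write must touch E.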
Disk Drive Reliability
  • Mean time to failure is (typically) 300,000 hours (about 34 years), but in reality drives get replaced every 5 years; a back-of-the-envelope survival calculation follows this list.
    FIGURE 6
  • Probability distribution function for single disk failure.
    FIGURE 7
  • Probability distribution function for RAID 4 (never replace drives).
    FIGURE 8
  • Probability distribution function for RAID 4 (assuming failed disks are replaced).
    • Disk fails.
    • (60 minutes later) operator replaces it.
    • (8 hours later) rebuilding finishes; the rebuild time depends on drive size.
  • RAID schemes can be nested.
  • Q: Does RAID make backups obsolete?
    A: No, we still need backups for user errors.
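A back-of-the-envelope calculation of the survival curves sketched above, assuming independent drives whose lifetimes are exponentially distributed with the quoted 300,000-hour MTTF and no replacement (i.e., the "never replace drives" case). The 5-drive RAID 4 array survives as long as at most one drive has failed.

     #include <math.h>
     #include <stdio.h>

     int main(void)
     {
         const double mttf = 300000.0;              /* hours, quoted drive MTTF */
         const double t    = 5.0 * 365 * 24;        /* 5 years of use, in hours */
         double p = exp(-t / mttf);                 /* P(one drive survives to t) */

         /* RAID 4 = 4 data drives + 1 parity drive; with no replacement the
            array survives iff at most one of the 5 drives has failed. */
         double raid4 = pow(p, 5) + 5.0 * pow(p, 4) * (1.0 - p);

         printf("after 5 years: single drive %.3f, RAID 4 (no replacement) %.3f\n",
                p, raid4);
         return 0;
     }

Replacing failed drives (the 60-minute/8-hour timeline above) shrinks the window in which a second failure can cause data loss, which is why the replacement curve looks so much better.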

Distributed Systems

RPC (Remote Procedure Calls) vs. System Calls and Function Calls

  • Caller sees: x = fft (buf, n);
    Implementation (the client stub):
         send (buf, n);      // marshal the arguments, send them to the server
         receive (buf, n);   // wait for the server's response
         return ...;         // unmarshal the result and hand it back to the caller
  • Caller and callee do not share address space. There is no call by reference (at least, not efficiently).
  • Caller and callee may be different architectures (ARM vs. SPARC or little vs. big endian).
    Requires conversion (marshalling/unmarshalling):
    FIGURE 9
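For the byte-order part of that conversion, a standard approach is to marshal every integer into network byte order on the wire; a minimal sketch using the POSIX htonl/ntohl routines:

     #include <arpa/inet.h>   /* htonl, ntohl: host <-> network byte order */
     #include <stdint.h>
     #include <string.h>

     /* Put a 32-bit argument into the request buffer in network byte order,
        so a little-endian ARM caller and a big-endian SPARC server agree. */
     static size_t marshal_u32(uint8_t *buf, uint32_t value)
     {
         uint32_t wire = htonl(value);
         memcpy(buf, &wire, sizeof wire);
         return sizeof wire;
     }

     static uint32_t unmarshal_u32(const uint8_t *buf)
     {
         uint32_t wire;
         memcpy(&wire, buf, sizeof wire);
         return ntohl(wire);
     }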

RPC has different failure modes

  • PRO: Callee cannot trash caller's memory and vice versa (hard modularity).
  • CON: Messages get lost.
  • CON: Messages get corrupted.
  • CON: Messages get duplicated.
  • CON: The network can go down or be slow.
  • CON: The server can go down or be slow.
What should a stub/wrapper do?
  • If the response is corrupted - resend.
  • If there is no response - the possibilities are:
    • Keep trying - at-least-once RPC (suitable for idempotent operations); see the sketch after this list.
    • Give up and return an error - at-most-once RPC (suitable for transactional operations).
    • Exactly-once RPC (the Holy Grail of RPC).
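A hypothetical sketch of an at-least-once client stub; the transport primitives send_request, recv_reply, and reply_is_corrupt are placeholders, not a real RPC library. An at-most-once stub would instead return an error after the first timeout.

     #include <stdbool.h>
     #include <stddef.h>

     /* Placeholder transport primitives -- assumptions, not a real API. */
     bool send_request(const void *req, size_t reqlen);
     bool recv_reply(void *resp, size_t resplen, int timeout_ms); /* false on timeout */
     bool reply_is_corrupt(const void *resp, size_t resplen);     /* e.g. bad checksum */

     /* At-least-once RPC: keep resending until an intact reply arrives.
        Only safe for idempotent operations, because the server may end up
        executing a duplicated request more than once. */
     bool call_at_least_once(const void *req, size_t reqlen,
                             void *resp, size_t resplen)
     {
         for (;;) {
             if (!send_request(req, reqlen))
                 continue;                      /* message lost: resend */
             if (!recv_reply(resp, resplen, 1000))
                 continue;                      /* no response yet: resend */
             if (reply_is_corrupt(resp, resplen))
                 continue;                      /* corrupted: resend */
             return true;
         }
     }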
RPC examples:
  • HTTP client -> "GET /foo/bar.html HTTP/1.1\r\n"
    Server response -> "HTTP/1.1 200 OK\r\n"
  • SOAP (Simple Object Access Protocol)
  • X - Remote screen display
    FIGURE 10
    • Works even if client and server are on the same machine.
    • Use of higher level primitives (e.g. fillRectangle).

Performance Issues with RPC

FIGURE 11
Solutions:
  • Have higher level primitives.
  • Asynchronous RPC - better performance, but it can complicate the caller (see the sketch after this list).
  • Cache in caller (for simple stuff)
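A hypothetical sketch of the asynchronous idea: issue a batch of requests before waiting for any reply, so the round trips overlap instead of adding up. send_request and wait_reply are placeholders keyed by a request id, not a real API.

     #include <stddef.h>

     /* Placeholder primitives -- assumptions, not a real RPC library. */
     void send_request(int request_id, const void *req, size_t reqlen);
     void wait_reply(int request_id, void *resp, size_t resplen);

     enum { NREQ = 8 };

     /* Synchronous RPC pays one full network round trip per call;
        pipelining the calls pays roughly one round trip for the whole batch. */
     void fetch_all(const void *reqs[NREQ], size_t reqlen,
                    void *resps[NREQ], size_t resplen)
     {
         for (int i = 0; i < NREQ; i++)
             send_request(i, reqs[i], reqlen);   /* fire off every request */
         for (int i = 0; i < NREQ; i++)
             wait_reply(i, resps[i], resplen);   /* then collect the replies */
     }

The cost is the one noted above: the caller now has to track outstanding requests and match replies to them.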

NFS (Network File System): a file system built atop RPC

The NFS protocol is like the UNIX file system but on wheels.
  • LOOKUP (dirfh, name) // Request fh and attributes (size, owner, etc.)
    fh = file handle, a unique id for a file within a file system
  • CREATE (dirfh, name, attr) // Returns file handle and attributes
  • REMOVE (dirfh, name) // Returns status
  • READ (fh, size, offset) // Returns data
  • WRITE (fh, size, offset, data) // Returns status
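A sketch of how a client stub might use the LOOKUP call to resolve a path one component at a time, starting from the root directory's file handle. The struct fh/fattr types and nfs_lookup are placeholders standing in for the protocol messages above, not real library calls.

     #include <stdbool.h>
     #include <string.h>

     struct fh    { unsigned char opaque[32]; };   /* file handle (placeholder) */
     struct fattr { long size; int owner; };       /* attributes (placeholder)  */
     bool nfs_lookup(struct fh dir, const char *name,
                     struct fh *out, struct fattr *attr);   /* the LOOKUP RPC */

     /* Resolve "a/b/c" with one LOOKUP per path component.  Every call names
        the directory by file handle rather than by any server-side "current
        directory", which is part of what keeps the server stateless. */
     bool resolve(struct fh root, char *path, struct fh *result)
     {
         struct fh cur = root;
         struct fattr attr;
         for (char *name = strtok(path, "/"); name; name = strtok(NULL, "/"))
             if (!nfs_lookup(cur, name, &cur, &attr))
                 return false;                     /* component not found, etc. */
         *result = cur;
         return true;
     }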
We want our NFS to be reliable even if the file server reboots.
FIGURE 12
"Stateless Server
FIGURE 13
  • Whenever the client does a write, it has to wait for a response before continuing.
  • The NFS server can't respond to a write request until the data hits the disk.
  • NFS will be slow for writes because it forces writes to be synchronous.
  • To fix this problem we "cheat":
    • Use flash on the server to store pending write requests.
    • Writes don't really wait for the server to respond; if a write fails, a later "close" will fail.
      Can use "fsync" (flush the file's data and metadata to disk) and "fdatasync" (flush just the file's data) to make sure the data is written, but these calls hurt performance.

  • In general, most clients won't see a consistent state.
  • NFS by design doesn't have read/write consistency (for performance reasons).
  • It does have open/close consistency through fsync and fdatasync.
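A minimal sketch of forcing data out with the fsync/fdatasync calls mentioned above (standard POSIX calls; error handling kept minimal). Both block until the disk, or the NFS server, confirms, which is exactly the synchronous-write cost described above.

     #include <fcntl.h>
     #include <unistd.h>

     /* Write a buffer and force it to stable storage before returning.
        fdatasync flushes the file's data; fsync would also flush metadata. */
     int write_durably(const char *path, const void *buf, size_t len)
     {
         int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
         if (fd < 0)
             return -1;
         if (write(fd, buf, len) != (ssize_t)len || fdatasync(fd) != 0) {
             close(fd);
             return -1;
         }
         return close(fd);   /* on NFS, close can still report a delayed write error */
     }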