Buffer Management Algorithms

                  CS560 NOTES on I/O Buffers in Unix

0. REFERENCES: Notes on Unix I/O buffer Management
               M.J. Bach: The Design of the UNIX Operating System, Chapter 3

1. Purpose:

   A set of NBUF buffers in K space is used as a cache memory between
   block devices, e.g. disks, and processes doing block read/write.
   The goal is to reduce the number of actual I/O operations.

                       Basic principle:

   When a process wants to read from (dev,blk), it first searches the
   buffer cache for a buffer assigned to this (dev,blk).  If such a
   buffer exists with valid data, it simply reads data from the buffer
   without incurring any I/O operation.  If such a buffer does not exist,
   it tries to find a free buffer, assigns the buffer to (dev,blk), issues 
   a DiskRead() operation, waits for I/O completion, then reads data from 
   the buffer.  Once a (dev,blk) is read in, the buffer remains in the 
   buffer cache to serve later read requests for the same (dev,blk) by ANY
   process.

   When a process wants to write to (dev,blk), it writes to a buffer 
   assigned to (dev,blk).  Actual writing to the device may take place 
   much later.
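
   The read path described above can be sketched in C as follows.  This is
   only a minimal sketch under stated assumptions: cache_read(), disk_read(),
   struct cbuf and the round-robin choice of a victim buffer are hypothetical
   names and simplifications, not the Unix code.  The sketch only shows that
   a second read of the same (dev,blk) is served from the cache without I/O.

```c
#include <assert.h>
#include <string.h>

#define NBUF   4      /* hypothetical cache size, small for illustration */
#define BLKSZ  512

struct cbuf {
    int  valid;       /* does this buffer hold data for (dev, blk)? */
    int  dev, blk;
    char data[BLKSZ];
};

static struct cbuf cache[NBUF];
static int next_victim;    /* round-robin replacement: an assumption, not the Unix policy */
static int disk_reads;     /* number of simulated device reads */

/* Stand-in for DiskRead(): fill the buffer with a pattern and count the I/O. */
static void disk_read(int dev, int blk, char *dst)
{
    memset(dst, (dev * 31 + blk) & 0xff, BLKSZ);
    disk_reads++;
}

/* Return the data for (dev, blk); touch the "disk" only on a cache miss. */
char *cache_read(int dev, int blk)
{
    assert(dev >= 0 && blk >= 0);
    for (int i = 0; i < NBUF; i++)
        if (cache[i].valid && cache[i].dev == dev && cache[i].blk == blk)
            return cache[i].data;            /* hit: no I/O at all */

    struct cbuf *bp = &cache[next_victim];   /* miss: pick a buffer to reuse */
    next_victim = (next_victim + 1) % NBUF;
    bp->valid = 1;
    bp->dev = dev;
    bp->blk = blk;
    disk_read(dev, blk, bp->data);
    return bp->data;
}

int cache_disk_reads(void) { return disk_reads; }
```

   Calling cache_read(1, 10) twice in a row should leave cache_disk_reads()
   at 1; only a read of a different block adds another device read.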

2. Block I/O Functions (Single_CPU/Uniprocessor Unix)

6221  readi(aip)    // basic logic
      {
         // aip points at the inode (which contains dev);
         // from the current offset, compute the logical blk, then the physical blk number.
         bp = bread(dev, blk);  // returns a pointer to a BUFFER containing the data
         iomove(bp, ....);      // copy data from this BUFFER to the dest location
      }

4754  struct buf *bread(dev, blk)
       {
          bp = getblk(dev,blk);
          if (bp->b_flags & B_DONE)    // buffer already holds valid data
              return bp;
          bp->b_flags |= B_READ;
          bp->b_wcount=256;
          put bp into devtab's I/O queue; (may start I/O)
          iowait(bp);  // sleep on bp until DONE
          return bp;
       }

Goodie: breada(): also starts an ASYNC read of the next block (read-ahead),
        so a later read of that block need not wait for DONE


6276 writei(aip):
      aip points at inode (containing dev)
          map logical block to physical block blk;
          if (writing whole block)
             bp = getblk(dev, blk);   // find a bp 
          else
             bp = bread(dev, blk);    // read in the block into *bp
          iomove(data into *bp)
          Then either  bawrite(bp) OR // write in order
                       bdwrite(bp);   // random access files

4809 bwrite(bp)   // sync write; wait for I/O done
     {
       mark bp for B_WRITE;
       put bp into dev's I/O queue (may start I/O)
       iowait(bp);  // wait until DONE
       brelse(bp);  // release the buffer
     }

4836 bdwrite(bp)   // delayed write
     {
       mark bp DELWRI and DONE;
       release the buffer;
     }

     bawrite(bp)   // async write; do not wait for I/O done
     {
       mark bp ASYNC
       call bwrite(bp) to put bp into dev I/O queue;
       but do NOT wait for DONE; 
       // bp will be released by interrupt handler
     }      
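
The difference between the three write calls can be sketched as below.  This
is a hedged illustration, not the Unix code: the flag values, struct wbuf, the
one-slot "I/O queue" and the sk_ prefixed functions are all hypothetical, and
io_complete() stands in for the disk interrupt handler.  The synchronous wait
in sk_bwrite() is simulated by running the completion immediately.

```c
#include <assert.h>
#include <stddef.h>

enum { F_BUSY = 1, F_DONE = 2, F_ASYNC = 4, F_DELWRI = 8 };  /* illustrative values */

struct wbuf { int flags; };

static struct wbuf *pending;   /* one-slot stand-in for the device I/O queue */
static int writes_done;        /* completed device writes */

static void release(struct wbuf *bp) { bp->flags &= ~(F_BUSY | F_ASYNC); }

/* Stand-in for the disk interrupt handler: finish the pending write. */
void io_complete(void)
{
    struct wbuf *bp = pending;
    if (bp == NULL)
        return;
    pending = NULL;
    writes_done++;
    bp->flags |= F_DONE;
    if (bp->flags & F_ASYNC)
        release(bp);           /* async: the handler releases the buffer */
}

/* Queue bp for writing; the data is about to go out, so it is no longer delayed. */
static void start_write(struct wbuf *bp)
{
    bp->flags &= ~(F_DONE | F_DELWRI);
    pending = bp;
}

void sk_bwrite(struct wbuf *bp)    /* synchronous write */
{
    start_write(bp);
    io_complete();             /* simulates sleeping until the I/O is DONE */
    release(bp);
}

void sk_bawrite(struct wbuf *bp)   /* asynchronous write */
{
    bp->flags |= F_ASYNC;
    start_write(bp);           /* return at once; io_complete() runs later */
}

void sk_bdwrite(struct wbuf *bp)   /* delayed write */
{
    bp->flags |= F_DELWRI | F_DONE;
    release(bp);               /* no device I/O happens here */
}

int sk_writes_done(void) { return writes_done; }
```

In this sketch only sk_bwrite() and a later io_complete() ever reach the
device; sk_bdwrite() just marks the buffer dirty and lets it go.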



2. Buffer Management in Unix:

(1). Each block device, e.g. a disk, has a dev number, and a corresponding 
     device-table 

4551          struct devtab{ .........};
     
     Each devtab maintains a dev_list containing I/O buffers currently 
     assigned to the device, and an IO-queue containing bufs for pending I/O 
     operations on the device.

(2). A set of NBUF I/O buffers 

4520-4586     struct buf{ ..........} buf[NBUF]; 

     is allocated in K space.  Each struct buf is a buffer header 
     containing fields for buffer management.  Each buffer header
     has two sets of link pointers: (b_forw, b_back) for the dev_list and
     (av_forw, av_back) for the av_list, plus a data pointer pointing at 
     the buf's actual data area

(4720)        char   buffer[NBUF][514]; 
  

(3). Initialization of bufs:

5055 binit():  is called during system booting. It links all bufs 
               into two doubly linked lists headed by a special buf

4567          struct buf  bfreelist;
     
     Initially, all bufs are on the av_list (the free list).  Whenever a buf
     is assigned to a (dev,blk), it is inserted into the dev_list of the
     device's devtab structure.  While a buf is in use, it is marked BUSY
     and is off the av_list.  A BUSY buf may be in the I/O queue of a devtab,
     linked through its av_list pointers.  When a buffer is no longer BUSY,
     it is released back to the av_list but remains on the dev_list.
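
     The two sets of links can be sketched as below.  This is a simplified
     illustration under stated assumptions, not the Unix source: the array
     name bufs, the single list head devhead (standing in for one devtab),
     and the sk_ prefixed functions are hypothetical, and in this sketch the
     dev_list simply starts out empty.  It shows that a released buf is on
     the av_list and the dev_list at the same time.

```c
#include <assert.h>
#include <stddef.h>

#define NBUF 4   /* hypothetical; small for illustration */

struct buf {
    struct buf *b_forw, *b_back;     /* dev_list links */
    struct buf *av_forw, *av_back;   /* av_list (free list) links */
    int busy;
};

static struct buf bufs[NBUF];
static struct buf bfreelist;   /* head of the av_list */
static struct buf devhead;     /* stand-in for one devtab's dev_list head */

static void av_remove(struct buf *bp)
{
    bp->av_back->av_forw = bp->av_forw;
    bp->av_forw->av_back = bp->av_back;
}

static void av_append(struct buf *bp)    /* insert at the tail of av_list */
{
    bp->av_forw = &bfreelist;
    bp->av_back = bfreelist.av_back;
    bfreelist.av_back->av_forw = bp;
    bfreelist.av_back = bp;
}

static void dev_insert(struct buf *bp)   /* insert at the front of dev_list */
{
    bp->b_forw = devhead.b_forw;
    bp->b_back = &devhead;
    devhead.b_forw->b_back = bp;
    devhead.b_forw = bp;
}

/* Make both lists empty circles, then put every buf on the av_list. */
void sk_binit(void)
{
    bfreelist.av_forw = bfreelist.av_back = &bfreelist;
    devhead.b_forw = devhead.b_back = &devhead;
    for (int i = 0; i < NBUF; i++) {
        bufs[i].b_forw = bufs[i].b_back = NULL;
        bufs[i].busy = 0;
        av_append(&bufs[i]);
    }
}

/* Assign the first free buf to the device: off av_list, onto dev_list, BUSY. */
struct buf *sk_assign(void)
{
    struct buf *bp = bfreelist.av_forw;
    if (bp == &bfreelist)
        return NULL;                     /* av_list is empty */
    av_remove(bp);
    if (bp->b_forw != NULL) {            /* leave its old dev_list first */
        bp->b_back->b_forw = bp->b_forw;
        bp->b_forw->b_back = bp->b_back;
    }
    dev_insert(bp);
    bp->busy = 1;
    return bp;
}

/* Release: back to the tail of av_list, but still on the dev_list. */
void sk_release(struct buf *bp)
{
    assert(bp->busy);
    bp->busy = 0;
    av_append(bp);
}

int av_count(void)
{
    int n = 0;
    for (struct buf *p = bfreelist.av_forw; p != &bfreelist; p = p->av_forw)
        n++;
    return n;
}

int dev_count(void)
{
    int n = 0;
    for (struct buf *p = devhead.b_forw; p != &devhead; p = p->b_forw)
        n++;
    return n;
}
```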

3. Block read/write algorithms
   As shown above, bread, bwrite and bdwrite depend on getblk()/brelse().

4. getblk()/brelse() algorithms: (M.J. Bach TEXT: Figure 3.4, Figure 3.6)

4921 bp = getblk(dev,blk){
        loop:
        (1). search devtab's dev_list for a bp = (dev,blk);
        (2). if (found such a bp){
                if (bp is BUSY){
                   mark bp WANTED; 
                   sleep on bp;
                   ************** 
                   goto loop;
                }
                /* bp not BUSY */
                take bp out of av_list; mark bp BUSY;
                return(bp); 
              } /* end found */
         ----------------------------------------------
        (3). /* not found; try to allocate a free buf from av_list */
             if (bfreelist's av_list is empty){
                 mark bfreelist WANTED;
                 sleep on bfreelist;
                ************************
                 goto loop;
             }
        (4). /* at least one buf on av_list */
             take first bp out of av_list; 

             if (this bp is for DELAYed WRITE){ 
                 write bp out ASYNC;
                 *******************
                 goto loop; 
             }

        (5). mark bp BUSY; assign bp to (dev,blk);
             relink bp to (new) dev_list;
             return(bp);
     }
                             
4869 brelse(bp){
       if (bp is WANTed)
           wakeup() ALL sleeping on bp;
       if (bfreelist is WANTed)
           wakeup() ALL sleeping on bfreelist;
       put bp back to the (tail of ) av_list; 
     }
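
The decision logic of getblk()/brelse() can be traced with the single-threaded
sketch below.  It is an illustration under stated assumptions, not the Unix
code: a real kernel sleeps where this sketch returns GB_SLEEP_ON_BUF or
GB_SLEEP_ON_FREELIST, the av_list is modeled as a small FIFO of indices, the
ASYNC write of a delayed-write buffer is assumed to complete at once, and all
gb_ names and struct sbuf are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NBUF 3   /* hypothetical; small for illustration */

enum gb_result { GB_FOUND, GB_NEW, GB_SLEEP_ON_BUF, GB_SLEEP_ON_FREELIST };

struct sbuf { int dev, blk, assigned, busy, delwri; };

static struct sbuf pool[NBUF];
static int avq[NBUF], avn;     /* av_list as a FIFO of pool indices: head is avq[0] */
static int async_writes;       /* delayed-write buffers flushed by gb_getblk() */

void gb_init(void)
{
    avn = 0;
    async_writes = 0;
    for (int i = 0; i < NBUF; i++) {
        pool[i] = (struct sbuf){0};
        avq[avn++] = i;
    }
}

static void av_take(int idx)   /* remove pool index idx from the FIFO */
{
    int j = 0;
    for (int i = 0; i < avn; i++)
        if (avq[i] != idx)
            avq[j++] = avq[i];
    avn = j;
}

void gb_brelse(struct sbuf *bp)   /* release to the tail; identity is kept */
{
    assert(avn < NBUF);
    bp->busy = 0;
    avq[avn++] = (int)(bp - pool);
}

enum gb_result gb_getblk(int dev, int blk, struct sbuf **out)
{
    *out = NULL;
    for (;;) {
        /* (1),(2): is (dev,blk) already in the cache? */
        struct sbuf *bp = NULL;
        for (int i = 0; i < NBUF; i++)
            if (pool[i].assigned && pool[i].dev == dev && pool[i].blk == blk)
                bp = &pool[i];
        if (bp != NULL) {
            if (bp->busy)
                return GB_SLEEP_ON_BUF;     /* real code: sleep, then goto loop */
            av_take((int)(bp - pool));
            bp->busy = 1;
            *out = bp;
            return GB_FOUND;
        }
        /* (3): no free buffer at all */
        if (avn == 0)
            return GB_SLEEP_ON_FREELIST;    /* real code: sleep, then goto loop */
        /* (4): take the head of the av_list */
        bp = &pool[avq[0]];
        av_take(avq[0]);
        if (bp->delwri) {
            bp->delwri = 0;
            async_writes++;
            gb_brelse(bp);                  /* simulated write completes at once */
            continue;                       /* goto loop */
        }
        /* (5): reassign the buffer */
        bp->assigned = 1;
        bp->dev = dev;
        bp->blk = blk;
        bp->busy = 1;
        *out = bp;
        return GB_NEW;
    }
}

int gb_async_writes(void) { return async_writes; }
```

A second gb_getblk() for a block whose buffer is still BUSY reports that the
caller would sleep; after gb_brelse() the same call finds the same buffer, so
no (dev,blk) ever gets two buffers.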
--------------------------------------------------------------------------

              COMMENTS on getblk()/brelse():

(1).  Data Consistency:
      In order to ensure data consistency, getblk() must never assign
      two buffers to the same (dev,blk).  =====> It must go back to the retry
      loop after waking up from sleep(), because the buffer it wanted may
      already have been set up by another process in the meantime.
     
      During a WRITE operation, data are written to a buffer, which is
      marked DELWRI (Delayed Write) but remains in the buffer pool 
      until it is to be reassigned to a different (dev,blk). 

      Dirty buffers are written out before they are reassigned.

(2). Cache effect:
      Cache effect is achieved mainly by:
      brelse(bp) puts bp back on the (tail of the) av_list but lets it remain
      in the dev_list and retain its (dev,blk) identity until it is grabbed
      for reassignment.

      Once a bp is assigned to a specific (dev,blk), all efforts are 
      made to prolong its life, e.g. by 
        Delayed Write, and releasing to the tail, but grabbing from the
        front, of the av_list. (LeastRecentlyUsed principle).

(3). Critical Regions:
      Disk interrupt handlers may manipulate the buf lists, e.g. 
      dequeue a bp from a devtab's IO-queue, change its status and
      call brelse(bp).
      So, in getblk()/brelse(), disk interrupts are masked out in these
      critical regions.

(4). Shortcomings of the algorithm:

     1. Inefficiency: the algorithm relies on re-try loops after sleep()/
        wakeup().
    
     2. No concurrent reads (for multiprocessor kernel).

     3. Possible starvation.

     4. Uses sleep()/wakeup(), which is good only for a Uniprocessor kernel.
-----------------------------------------------------------------------


                     CS 560 TAKEHOME EXAM
                      
        NOTE: THIS IS AN EXAM! ABSOLUTELY INDEPENDENT WORK !!!!

                     PROBLEM SPECIFICATIONS:

  Use P,V on counting semaphores to design a set of NEW I/O buffer 
  management algorithms: 
             Part A: for UniProcessor   (UP) Kernel
             Part B: for MultiProcessor (MP) Kernel
  that meet the following requirements:

  (NOTE: The conditions are ordered by their relative importance, which will 
         also be the basis of GRADing)


                    PART A: DUE Nov 9, 2011
 
   Assume Uniprocessor Kernel (One process at a time)

    (1). Data consistency.

    (2). Cache effect.
    ********  (1) and (2) are the same as in Unix ********

    (3). Efficiency:    
         No re-try loops.
         No unnecessary process "wakeups", i.e. a blocked process
         is not "awakened" unless it can actually get a buffer.

    (4). Free of starvation.

  
                             NOTE AGAIN:
 1. Merely replacing sleep()/wakeup() in Unix algorithms by P()/V() on
    semaphores is NOT an acceptable solution. You MUST redesign the algorithms 
    by using semaphores ONLY. 

 2. (1)(2)(3)(4) are the ranking of their relative importance.
    For example, if an algorithm cannot guarantee data consistency, 
    it would be INCORRECT no matter how efficient it is. Similarly,
    since (4) ranks far below (2), your algorithm must ensure no 
    starvation BUT NOT AT THE EXPENSE OF REDUCED CACHE EFFECT.


   Express your algorithm(s) in Pseudo-C with lots of comments 
   and/or a separate document to explain your algorithm(s).

   TEXT EDIT your work suitable for printing hard copies.
 
   *************************************************************************
   Nov 9,2011 : Part A due.     

   Grading   : Each of your algorithms will be graded in 2 steps:
               First draft on the posted due date.
               ONE revision 1 week after original due date.

  =========================================================================
   PART B: First draft DUE : Nov 16, 2011
           Revision    DUE : Nov 30, 2011

   Assume Multiprocessor Kernel. Buffers are maintained in hash queues, 
   as in Bach Chapters 3, 12.  Add these additional requirements:

   (0). High degree of concurrency (which MP algorithms must have)

   (5). Allow concurrent readers on the same buffer.

   (6). Free of starvation and deadlock.


                    NOTE FOR MP ALGORITHMS:

   In addition to P()/V(), you may define any other "primitive" operations 
   on semaphores, e.g. CP (Conditional P) as in Bach's Chapter 12.
============================================================================
                      Time Table:

Algorithms: Completed by Nov 30, 2011
Project   : Before Finals Week.

                      PROJECT
-----------------------------------------------------------------------------
Implement AND demo your algorithms on a Multitasking platform that simulates
either UP or MP kernel (to be introduced before Thanksgiving break).

Close Week: Project demonstration.
-----------------------------------------------------------------------------