32-bit MTX Using Dynamic Paging

    The third version of 32-bit protected mode MTX, MTX32.3, uses dynamic 
paging. As in MTX32.2, the 4GB virtual address space is divided into two equal 
halves, except that in this case the kernel's virtual address space is mapped 
high, from 2GB to the high end of physical memory, and the user mode virtual 
address space is mapped low, from 0 to Umode image size. This organization 
conforms with most other Unix-like systems, e.g. Linux and xv6. In Linux, user
space is from 0 to 3GB and kernel space is 3GB and above. In xv6, user space is
from 0 to 2GB and kernel space is 2GB and above, which is the same as in MTX. As
before, we assume that the Umode image size is 4MB. The virtual address space 
mapping is accomplished in several steps, which resemble those of multi-stage 
booting. The following describes these steps.

1. SETUP:
    During booting, the booter loads SETUP to 0x90200 and the MTX kernel to 
0x10000. Then it jumps to 0x90200 to run SETUP. The initial GDT in SETUP is

   setup_gdt:  .quad  0x0000000000000000 # null descriptor
	       .quad  0x00cF9a000000FFFF # kcs 00cF 9=PpLS=1001
               .quad  0x00cF92000000FFFF # kds

which only defines two 4GB kernel code and data segments. As before, SETUP moves
the GDT to 0x9F000 and loads the GDTR register pointing at the GDT. Then, it 
enters protected mode, moves the MTX kernel to 1MB and does an ljmp to the entry
address of the MTX kernel at 1MB. Unlike the two previous versions of MTX, the
initial GDT here is only temporary. Its only purpose is to provide the initial
4GB flat segments for the MTX kernel to get started. 
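
    To see why each of these descriptors describes a flat 4GB segment, the 
8-byte value can be decoded field by field. The following is a minimal sketch
for a host machine, not part of MTX; the names seg_desc, decode_desc and 
seg_size are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an x86 segment descriptor (8 bytes in the GDT). */
typedef struct {
    uint32_t base;      /* 32-bit segment base address          */
    uint32_t limit;     /* 20-bit limit, in bytes or 4KB units  */
    unsigned present;   /* P bit                                */
    unsigned dpl;       /* descriptor privilege level, 0 to 3   */
    unsigned code;      /* 1 = code segment, 0 = data segment   */
    unsigned gran4k;    /* G bit: 1 = limit counts 4KB units    */
} seg_desc;

seg_desc decode_desc(uint64_t d)
{
    seg_desc s;
    uint32_t access = (uint32_t)(d >> 40) & 0xFF;   /* e.g. 0x9a, 0x92 */
    s.base    = (uint32_t)((d >> 16) & 0xFFFFFF) | ((uint32_t)(d >> 56) << 24);
    s.limit   = (uint32_t)(d & 0xFFFF) | ((uint32_t)((d >> 48) & 0xF) << 16);
    s.present = (access >> 7) & 1;
    s.dpl     = (access >> 5) & 3;
    s.code    = (access >> 3) & 1;
    s.gran4k  = (uint32_t)(d >> 55) & 1;
    return s;
}

/* Segment size in bytes; 64-bit result because 4GB overflows 32 bits. */
uint64_t seg_size(seg_desc s)
{
    assert(s.limit <= 0xFFFFF);          /* limit is a 20-bit field */
    uint64_t units = (uint64_t)s.limit + 1;
    return s.gran4k ? units * 4096 : units;
}
```

    For the kcs value 0x00cF9a000000FFFF this gives base 0, limit 0xFFFFF and
G=1, i.e. 0x100000 units of 4KB, which is 4GB. The access byte 0x9a decodes to
P=1, DPL=0, code segment.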

2. The entry.s File.

    pm_entry is the entry point of the MTX kernel. Upon entry, the CPU is 
already in 32-bit protected mode. In order to support paging, entry.s defines 
the following data structures: a page directory, pgdir, at the offset 0x1000, a
page table, pg0, at 0x2000, a new GDT at 0x3000, and a kernel page directory,
kpgdir, at 0x4000, as shown below. 
#----------------------------------------------------------------------------
.org 0x1000
pgdir:	                     # page directory
        .long 0x00102007     # point at pg0 at 0x102000
        .fill 511,4,0        # fill with 511 0 entries
        .long 0x00102007     # point at pg0 at 0x102000
        .fill 511,4,0        # fill with 511 0 entries
.org 0x2000               
pg0:	                     # page table: to be set in entry.s
.org 0x3000
kgdt: 	.quad	0x0000000000000000 # null descriptor
	.quad	0x00cF9a000000FFFF # kcs 00cF 9=PpLS=1001
	.quad	0x00cF92000000FFFF # kds
	.quad	0x0000000000000000 # tss
	.quad	0x00cFFa000000FFFF # ucs 00cF F=PpLS=1111
	.quad	0x00cFF2000000FFFF # uds
gdt_desc:
	.word	.-kgdt-1
	.long	0x80103000  # hard coded but can be altered if needed

# IDT table will be installed here at 0x3040

.org 0x4000
kpgdir:
        .fill 1024,4,0
.org 0x5000

#               actual MTX kernel code begins here
#----------------------------------------------------------------------------
    In the page directory, pgdir, entry 0 points to the page table pg0, which is
filled with page frames from 0 to 4MB. This creates an identity mapping of the 
virtual address range VA=[0 to 4MB] to PA=[0 to 4MB]. Entry 512 of pgdir also 
points to pg0, which maps the virtual address range VA=[2GB to 2GB+4MB] to 
PA=[0 to 4MB] also. The new GDT, kgdt, defines 6 segments, where tss, ucs and uds 
are for the TSS, user mode code and user mode data segments of the current 
running process. The actions of entry.s are as follows.
	
(1). Set up initial page table pg0, which maps VA=[0-4MB] to PA=[0-4MB]
	    movl    $pg0-KVA, %ebx     # KVA = 0x80000000
            movl    $0x007,   %eax
            movl    $1024,    %ecx
     loop0:
	    movl    %eax,   0(%ebx)
	    addl    $4,       %ebx
	    addl    $4096,    %eax
            loop    loop0
	
(2). load CR3 with physical address of pgdir and turn on paging.
            movl    $pgdir-KVA,%eax    # physical address of pgdir
	    movl    %eax, %cr3         # load CR3 with PA(pgdir)
	    movl    %cr0, %eax         # enable paging
	    orl     $0x80000000,%eax
	    movl    %eax, %cr0

(3). Do a jmp to flush the instruction pipeline.
	        jmp 1f
         1:
(4). Do another jmp to force the CPU to use virtual address
  	        movl    $2f,%eax
	        jmp     *%eax
         2: 

     The second jmp is tricky but essential. Before this jmp, the CPU was 
     executing at low addresses in the range 0-4MB, through the identity 
     mapping. Since the kernel is linked at KVA, the label 2 is a high address,
     so this jmp forces the CPU to switch to virtual addresses in the range 
     0x80000000 to 0x80000000+4MB. 

(5). load GDTR with the new gdt descriptor.
	       lgdt gdt_desc      # load GDT at 0x103000
     The new GDT defines 2 kernel mode segments, a TSS and 2 user mode segments.
     All code and data segments are flat 4GB segments, as required by paging.

(6). Set the stack pointer to PROC[0]'s kstack. Then, call init() in init.c to 
     initialize the MTX kernel.
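
     The effect of the initial pgdir and pg0 can be checked with a small host 
     simulation of the two-level page walk. This is only a sketch: the array 
     phys stands in for the first 16KB of the kernel at 1MB, and the names 
     word, build_initial_tables, translate and NOMAP are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define KBASE   0x100000u            /* kernel load address: 1MB          */
#define PGDIR   (KBASE + 0x1000)     /* pgdir at offset 0x1000 in entry.s */
#define PG0     (KBASE + 0x2000)     /* pg0 at offset 0x2000              */
#define NOMAP   0xFFFFFFFFu          /* hypothetical "no mapping" marker  */

/* Simulated physical memory holding only the first 16KB of the kernel. */
static uint32_t phys[0x4000 / 4];

static uint32_t *word(uint32_t pa)   /* host pointer for a simulated PA */
{
    assert(pa >= KBASE && pa < KBASE + sizeof(phys) && pa % 4 == 0);
    return &phys[(pa - KBASE) / 4];
}

void build_initial_tables(void)
{
    uint32_t i;
    for (i = 0; i < 1024; i++)
        *word(PGDIR + 4 * i) = 0;
    for (i = 0; i < 1024; i++)               /* pg0: page frames 0 to 4MB */
        *word(PG0 + 4 * i) = (i << 12) | 7;
    *word(PGDIR + 4 * 0)   = PG0 | 7;        /* VA 0 to 4MB               */
    *word(PGDIR + 4 * 512) = PG0 | 7;        /* VA 2GB to 2GB+4MB         */
}

/* Two-level walk: VA = 10-bit pgdir index, 10-bit table index, offset. */
uint32_t translate(uint32_t va)
{
    uint32_t pde = *word(PGDIR + 4 * (va >> 22));
    uint32_t pte;
    if (!(pde & 1))
        return NOMAP;                        /* P bit clear in pgdir      */
    pte = *word((pde & 0xFFFFF000u) + 4 * ((va >> 12) & 0x3FF));
    if (!(pte & 1))
        return NOMAP;                        /* P bit clear in page table */
    return (pte & 0xFFFFF000u) | (va & 0xFFF);
}
```

     With these tables, translate(0x00100abc) and translate(0x80100abc) both 
     yield 0x00100abc, which is why the kernel keeps running across the second 
     jmp, while addresses beyond either 4MB window are unmapped.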

3. init.c file

   init() first initializes the video driver to make printf() work. At this 
moment, the kernel's virtual address space is limited to 4MB. The next step is 
to expand kernel's virtual address range to the entire available physical 
memory. Assume 512MB physical memory. Then, the kernel mode virtual address 
space of every process is from 2GB to 2GB+512MB. When the MTX kernel starts, 
only the virtual address range from 2GB to 2GB+4MB is mapped to physical memory
0-4MB. Assume that the MTX kernel itself occupies 2MB, i.e. from 1MB to 3MB. 
Then the area between 3MB and 4MB is free. We shall build new page tables in 
this area, which map the kernel's virtual address range [2GB to 2GB+512MB] to 
the entire physical memory [0 to 512MB]. This is done by the kpgtable() function
shown below. 

#define KPG_DIR 0x80104000             // VA of kpgdir: 1MB + 0x4000 in entry.s
#define KPG_TAB 0x80300000             // VA of PA=3MB
#define NPGTABLES      128             // 512MB PA needs 512/4=128 pgtables
#define PGSIZE        4096
extern u32 kpgdir[];                   // array label defined in entry.s

void kpgtable(void)
{
  int i, j;
  u32 *pd = (u32 *)KPG_DIR;             // kernel page directory (kpgdir)
  u32 *pt = (u32 *)KPG_TAB;             // pgtables begin at PA=3MB
  u32 pte = (u32  )0x3;                 // begin PA=0, |Kpage|W|P
  for (i=512; i<512+NPGTABLES; i++){    // from kpgdir[512], 128 entries
    pd[i] = PA(pt)+3;                   // pointing at pgtables.
    for (j=0; j<1024; j++){             // pgtable, 4KB each
      pt[j] = pte;                      
      pte += 4096;
    }
    pt += 1024;
  }
}
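
    The arithmetic in kpgtable() can be verified on a host machine. The sketch
below mirrors its loops on ordinary arrays; pd and tabs stand in for kpgdir and
the page tables at 3MB, and kpgtable_sim and kva_to_pa are hypothetical names. 
It assumes the same 512MB of physical memory as the text.

```c
#include <assert.h>
#include <stdint.h>

#define NPGTABLES 128                  /* 512MB / 4MB per page table  */
#define TAB_PA    0x300000u            /* page tables start at PA 3MB */

static uint32_t pd[1024];                    /* stands in for kpgdir  */
static uint32_t tabs[NPGTABLES][1024];       /* stands in for PA 3MB+ */

void kpgtable_sim(void)
{
    uint32_t pte = 0x3;                      /* PA 0, flags W|P       */
    int i, j;
    for (i = 0; i < NPGTABLES; i++) {
        /* PA of table i: 4KB per table, starting at 3MB */
        pd[512 + i] = (TAB_PA + (uint32_t)i * 4096) + 3;
        for (j = 0; j < 1024; j++) {
            tabs[i][j] = pte;
            pte += 4096;
        }
    }
}

/* Translate a kernel VA through the simulated tables. */
uint32_t kva_to_pa(uint32_t va)
{
    uint32_t pde = pd[va >> 22];
    uint32_t pte;
    assert(pde & 1);                         /* VA must be mapped     */
    pte = tabs[((pde & 0xFFFFF000u) - TAB_PA) / 4096][(va >> 12) & 0x3FF];
    return (pte & 0xFFFFF000u) | (va & 0xFFF);
}
```

    The 128 page tables take 128 x 4KB = 512KB, so they occupy 3MB to 3.5MB and
fit in the free area below 4MB. The last entry of the last table maps the page
frame at 512MB-4KB.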

    As in MTX32.2, the kernel page directory, kpgdir, plays two roles. First, it
will be the pgdir of the initial process P0, which runs (in Kmode) whenever no 
other process is runnable. Second, it will be the prototype of the page directory
of all other processes. Each process has its own page directory and associated 
page tables. Since the kernel mode spaces of all processes are the same, the 
last 512 entries of all page directories are identical. So, when creating a new 
process we simply copy the last 512 entries of kpgdir into the process pgdir. 
The low (0 to 511) entries of a process pgdir define the user mode page tables
of that process. These entries will be set up when the process is created.
 
3-1. Switch CR3 to the new kernel page directory, kpgdir. This allows the kernel
to access all the physical memory from 0 to 512MB.
    
3-2. kernel_init():

     Then, init() calls kernel_init() (in t.c file) to initialize the MTX kernel
data structures, such as free PROC lists, readyQueue and sleepList, etc. Then,
it creates the initial process P0, which uses kpgdir as its page directory. In 
P0's TSS, the level-0 interrupt stack is set to P0's kstack. The system is now 
running the initial process P0.

3-3. Remap IRQ interrupt vectors. Set up IDT and install exception/interrupt
handlers. These steps are required for all protected mode operations, which are
described in 15.2. The GDT is located at the physical address 0x103000 (VA 
0x80103000). The IDT contains 256 8-byte entries, which requires 2KB space. It 
is placed in the same page as the GDT, at 0x103040 (VA 0x80103040). init() 
proceeds to initialize the IDT table and install the exception and I/O interrupt
vectors in the IDT. Exception handler entry points are defined in traps.s. 
Exception handlers are in trapc.c. I/O interrupt entry points are defined in 
ts.s by calls to the INTH macro. I/O interrupt handler functions are in the 
various device drivers in the driver/ directory. Among the interrupts, vector 
0x80 is for system calls.

3-4. Initialize I/O buffers, device drivers and timer.
     After setting up the IDT and interrupt vectors, init() continues to 
initialize I/O buffers and I/O device drivers. When initializing the hard disk 
driver, the driver reads the HD partition table to find out the MTX partition. 
Then it initializes the file system and mounts the MTX partition as the root file
system.

3-5. Free Page List.
    MTX uses a free page list, pfreeList, for allocation/deallocation of page
frames. pfreeList is constructed by
  
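   u32 *pfreeList, *pfreeTail;            // head and tail of the free page list

   // start_va, end_va: page-aligned kernel virtual addresses
   u32 *free_page_list(char *start_va, char *end_va) 
   {
      u32 *p;
      printf("build pfreeList : start=%x end=%x\n", start_va, end_va);
      p = (u32 *)(end_va - 4096);         // highest page frame is the list head
      while (p > (u32 *)start_va){
           *p = (u32)(p - 1024);          // link to the next lower page (by VA)
           p -= 1024;                     // 1024 u32 = 4096 bytes
      }
      pfreeTail = p;  *p = 0;             // lowest page frame ends the list
      return (u32 *)(end_va - 4096);
   }
   // build the list over PA 4MB to 512MB, passed as kernel virtual addresses
   pfreeList = free_page_list((char *)(VA(0x400000)),
                              (char *)(VA(0x20000000)));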

pfreeList threads all the free page frames from 512MB to 4MB in a linked list. 
Each element of pfreeList contains the address of the next page frame. As usual,
the list ends with a 0 pointer. In order for the kernel to access the entries of
pfreeList, the link pointers must use virtual addresses of the page frames. When
allocating a page frame, the virtual address of the page must be converted to 
a physical address. Conversion between virtual and physical addresses is 
done by the PA/VA macros.

   #define  PA(x)  ((u32)(x) - 0x80000000)
   #define  VA(x)  ((u32)(x) + 0x80000000)

   With the free page linked list, palloc() allocates a free page from pfreeList,
and pdealloc(VA(page frame)) inserts a deallocated page frame to pfreeList. The
simplest way to implement pdealloc() is to insert the deallocated page frame to
the beginning of pfreeList. However, this tends to always use the pages in the 
front part of pfreeList. In order to "verify" that all the pages are used
correctly, we add a pfreeTail, which points at the last element of pfreeList, 
and insert deallocated pages to the tail of pfreeList.  
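
   The head-allocate, tail-insert policy described above might look as follows.
This is a host-side sketch, not the MTX code: a small static pool stands in for
the free physical memory, the link is stored as an ordinary pointer in each 
page, and the page type and free_page_list_init are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define PGSIZE 4096
#define NPAGES 8                        /* small pool for illustration */

typedef union page {
    union page *next;                   /* link, stored in the page itself */
    unsigned char bytes[PGSIZE];
} page;

static page pool[NPAGES];               /* stands in for free page frames */
static page *pfreeList, *pfreeTail;

void free_page_list_init(void)
{
    int i;
    /* thread the pages from the highest address down to the lowest */
    for (i = NPAGES - 1; i > 0; i--)
        pool[i].next = &pool[i - 1];
    pool[0].next = NULL;
    pfreeList = &pool[NPAGES - 1];
    pfreeTail = &pool[0];
}

page *palloc(void)                      /* take a page from the head */
{
    page *p = pfreeList;
    if (p == NULL)
        return NULL;                    /* out of page frames */
    pfreeList = p->next;
    if (pfreeList == NULL)
        pfreeTail = NULL;               /* list became empty */
    return p;
}

void pdealloc(page *p)                  /* append a page at the tail */
{
    assert(p != NULL);
    p->next = NULL;
    if (pfreeTail == NULL)
        pfreeList = p;                  /* list was empty */
    else
        pfreeTail->next = p;
    pfreeTail = p;
}
```

   Because freed pages go to the tail, a page that was just released is reused
only after every other free page has been handed out, so all pages get 
exercised over time.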
 
3-6. call main() in t.c file

4. main():

   In main(), P0 calls kfork("/bin/init") to create the INIT process P1, which
has /bin/init as its Umode image. Then, P0 switches process to run P1. P1 forks 
a login process and waits for ZOMBIE children. Then the login process runs, and
MTX is ready for use.

5. fork.c file.

   The fork.c file contains fork1(), kfork() and fork(). fork1() is the 
beginning part of both kfork() and fork(). In order to support dynamic paging, 
fork1() is modified as follows.

PROC *fork1()
{ 
   int i; u32 *pgtable, pte;
  (1). create a new proc, p, initialize its TSS, as in MTX32.2

  (2). p->res->pgdir = palloc();   // allocate a pgdir from pfreeList
  (3). for (i=0; i<1024; i++)      // copy kpgdir into p->res->pgdir; 
         p->res->pgdir[i] = kpgdir[i];
  (4). pgtable = palloc();         // allocate a pgtable 
       p->res->pgdir[0] = PA((u32)pgtable) + 7; // record its PA+7 in pgdir[0]
  (5). // create pgtable entries that map PROC's [0-4MB] to these page frames 
      for (i=0; i<1024; i++){ 
          pte = (u32)palloc();      // get a page frame (palloc returns its VA)
          pgtable[i] = PA(pte) + 7; // record its PA+7 in the page table
      }
      }
      return p;
}

    After creating a new proc p, fork1() allocates a pgdir and builds a pgtable
for the new proc. The 1024 pgtable entries map the proc's 4MB virtual address 
space to the physical addresses of the page frames. All the page frames are
dynamically allocated from pfreeList.

6. loader:

   kfork() is only used by P0 to create the init process P1. After creating the
pgdir and pgtable of P1, it loads the Umode image, /bin/init, into the page
frames of P1. In order to load image files to page frames, which may not be
contiguous, the MTX loader is also modified accordingly. Instead of loading the 
(1KB) blocks of a file to a linear address, it now loads every 4 blocks of a 
file to a page frame in the process page table. Since the modification is almost
trivial, it is not shown. If we extend the (EXT2) file system of MTX to 4KB 
block size, then the loader can load each file block directly to a page frame.
This will be implemented in future extensions of MTX. 
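
   Although the modified loader is not shown, its core idea (1KB block number 
b goes to page frame b/4 at offset (b%4)*1KB) can be sketched as below. This is
an illustration under assumptions, not the MTX loader: read_block_fn is a 
hypothetical block reader, frames[] holds kernel-addressable page frame 
addresses, and demo_read is a stand-in for a 5-block file.

```c
#include <assert.h>
#include <string.h>

#define BLKSIZE 1024
#define PGSIZE  4096
#define BLKS_PER_PAGE (PGSIZE / BLKSIZE)

/* Hypothetical block reader: copies file block blk into buf and returns
   the number of bytes read (0 at end of file). */
typedef int (*read_block_fn)(int blk, unsigned char *buf);

/* Load a file into page frames that need not be contiguous.
   frames[i] is the kernel-addressable start of the i-th page frame.
   Returns the number of bytes loaded. */
long load_image(read_block_fn read_block, unsigned char *frames[], int nframes)
{
    long total = 0;
    int blk, n;
    assert(read_block != NULL && frames != NULL && nframes >= 0);
    for (blk = 0; blk < nframes * BLKS_PER_PAGE; blk++) {
        /* every 4 consecutive blocks fill one page frame */
        unsigned char *dst = frames[blk / BLKS_PER_PAGE]
                           + (blk % BLKS_PER_PAGE) * BLKSIZE;
        n = read_block(blk, dst);
        if (n <= 0)
            break;                      /* end of file */
        total += n;
    }
    return total;
}

/* Demo reader over a made-up 5-block file; block b is filled with b+1. */
static int demo_read(int blk, unsigned char *buf)
{
    if (blk >= 5)
        return 0;
    memset(buf, blk + 1, BLKSIZE);
    return BLKSIZE;
}
```

   With 4KB file blocks the inner offset computation disappears, since each 
block fills exactly one page frame.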
 
7. fork():

   After P1, every new process is created by fork(). In order to support dynamic
paging, fork() only needs a slight modification. After creating a child process,
fork() copies the Umode image of the parent to the child. In this case, 
copy_image() must copy the contents of the parent page frames to those of the 
child. In general, copy_image() must traverse the parent pgdir to find the page 
tables, from which to locate the page frames of the image. Then it copies each 
page frame to a corresponding page frame of the child. Since we assume that 
every Umode image is 4MB, copy_image() is much simpler since the pgdir has only 
one entry (0) and there is only one page table with 1024 page frames. The 
following code shows the simplified version of copy_image().

// copy parent's page frames to child's page frames
void copy_image(PROC *parent, PROC *p)
{
  int i; u32 *ppgtable,*cpgtable,*ppa,*cpa;
  // pgdir and pgtable entries hold PAs; convert to VAs before use
  ppgtable = (u32 *)(VA(parent->res->pgdir[0]&0xFFFFF000));
  cpgtable = (u32 *)(VA(p->res->pgdir[0]&0xFFFFF000));
  for (i=0; i<1024; i++){
      ppa = (u32 *)(VA(ppgtable[i]&0xFFFFF000));
      cpa = (u32 *)(VA(cpgtable[i]&0xFFFFF000));
      memcpy(cpa, ppa, PGSIZE);
  }
}
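
   For the general case mentioned above, where an image may span several page 
tables, the traversal might be sketched as follows. This is a host simulation 
under stated assumptions, not MTX code: mem is a tiny simulated physical 
memory, va() plays the role of the VA macro, and the child is assumed to have
a present entry wherever the parent has one.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PGSIZE  4096
#define NFRAMES 16
#define PRESENT 0x1u
#define FRAME(e) ((e) & 0xFFFFF000u)        /* frame address in an entry */

/* Simulated physical memory; a "PA" is a byte offset into mem. */
static uint32_t mem[NFRAMES * PGSIZE / 4];

static uint32_t *va(uint32_t pa)            /* plays the role of VA()    */
{
    assert(pa % PGSIZE == 0 && pa / PGSIZE < NFRAMES);
    return &mem[pa / 4];
}

/* Copy every present Umode page of the parent to the child.
   Returns the number of page frames copied. */
int copy_image_general(uint32_t parent_pgdir_pa, uint32_t child_pgdir_pa)
{
    uint32_t *ppd = va(parent_pgdir_pa), *cpd = va(child_pgdir_pa);
    int i, j, copied = 0;
    for (i = 0; i < 512; i++) {             /* Umode entries only        */
        uint32_t *ppt, *cpt;
        if (!(ppd[i] & PRESENT))
            continue;                       /* no page table here        */
        assert(cpd[i] & PRESENT);
        ppt = va(FRAME(ppd[i]));
        cpt = va(FRAME(cpd[i]));
        for (j = 0; j < 1024; j++) {
            if (!(ppt[j] & PRESENT))
                continue;                   /* no page frame here        */
            assert(cpt[j] & PRESENT);
            memcpy(va(FRAME(cpt[j])), va(FRAME(ppt[j])), PGSIZE);
            copied++;
        }
    }
    return copied;
}
```

   With a fixed 4MB image only pgdir entry 0 is present and all 1024 entries of
its page table are in use, which reduces this loop to the simplified version.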

8. vfork(): The Intel paging hardware can be used to support Copy-On-Write 
(COW) pages: shared pages are marked read-only, so a write to one of them 
causes a page fault. This feature can be used to implement vfork() without 
actually copying the Umode image. Using COW, vfork() creates a child which 
shares the same Umode page frames of the parent until either process tries to 
write to a shared page, at which time the shared COW page is split. MTX does 
not yet implement this kind of vfork(). It is left as an exercise.

9. exec():

   In general, when a process exec(), it deallocates the old image pages and 
then allocates pages for the new image. Since we assume that every Umode image 
size is 4MB, exec() can be simplified. Instead of releasing the old image pages,
it uses the same page frames of the old image. So, it only needs to call the 
(new) loader to load the new image into the old page frames. Thus, no changes 
are needed in exec(). How to handle the general case in which the image size may
differ is also left as an exercise.

10. Address Parameters in Syscalls

    The process pgdir defines both Umode and Kmode pgtables of a process. Umode
addresses are translated by low (0 to 511) entries in the process pgdir and
the corresponding page tables. When a process executes in kernel (at privilege 
level 0), it can access its Umode pages, which have privilege level 3. This 
implies that all Umode virtual addresses are also valid in Kmode. Therefore, the
address parameters in syscalls can be used directly in Kmode without any
conversion. Conversions between VA and PA are needed only in Kmode when 
manipulating pgdir and pgtables. Such tables contain physical addresses but 
must be accessed by their virtual addresses in kernel.

11. Demand Paging:

    In the current implementation, MTX does not support demand paging, mainly 
because we want to keep the system simple. Another reason is that demand paging
is unnecessary in the current MTX because all the Umode programs need only a few
pages to execute. However, we can demonstrate the principle of demand paging by
simulating page faults and then handling them.

    In fork1(), when allocating pages for a new process, we can intentionally
mark some of the entries in the page table as not present, e.g. by setting the 
P bit of pgtable[1] or pgtable[2] to 0. When the process executes in Umode, if
the missing page is within the Umode virtual address range, it would generate a
page fault, even though the page frame is already loaded with the image code or
data. When the process traps to Kmode, the exception handler recognizes the page
fault, fixes up the "faulty page" and lets the process return to Umode to 
continue. The page fault exception handler is a part of ehandler() in trapc.c. 
The algorithm of the page fault handler is shown below.

    void ehandler(u32 signal, ... u32 err_nr, ..) // err_nr is the error number
    {
       u32 cr2, cr3, pgentry, *pgtable;
       if (exception occurred in Umode, i.e. running->inkmode==1){
            if (err_nr==14){               // page fault   
                  cr2 = CPU.CR2 register;  // get offending VA from CR2
                  pgentry = cr2/4096;      // convert VA to pgtable entry number
                  pgtable = (u32 *)(VA(running->res->pgdir[0]&0xFFFFF000));
                  pgtable[pgentry] |= 0x1; // mark page table entry PRESENT
                  load CPU.CR3 with PA(running->res->pgdir); // flush CPU.TLB
            }
            return;  // return to Umode to continue 
       }
       // page fault occurred in Kmode: PANIC and stop
    }

  In the above pseudo code, the page fault handler gets the virtual address that
caused the page fault (in CR2), converts it to a pgtable entry index, marks the
"missing" page entry as Present and reloads the CR3 register with the process 
pgdir to flush the CPU's TLB cache. When the process returns to Umode, it will 
re-execute the instruction that caused the page fault earlier. This time it will
not cause any page fault because the page table entry is now Present. In the 
general case, the "missing" page may not exist. If so, the page fault handler 
must allocate a page frame, load the needed page contents into the page frame 
and mark the page entry present before letting the process continue. 
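
  Both cases (frame already loaded but marked absent, and frame missing 
altogether) can be exercised in a host simulation. The sketch below is not the 
MTX handler: page_fault and access_va are hypothetical names, the allocator 
simply hands out frame addresses upward from 4MB, and loading the page contents
and flushing the TLB are left out.

```c
#include <assert.h>
#include <stdint.h>

#define PRESENT 0x1u
#define UFLAGS  0x7u                    /* User | Writable | Present     */

static uint32_t pgtable[1024];          /* the one Umode page table      */
static uint32_t next_frame = 0x400000u; /* hypothetical allocator: hands
                                           out frames upward from 4MB    */
static int faults;                      /* page faults handled so far    */

/* Page fault handler for a faulting Umode address (the CR2 value).
   Returns 0 if handled, -1 if the address is outside the 4MB image. */
int page_fault(uint32_t cr2)
{
    uint32_t idx = cr2 / 4096;
    if (idx >= 1024)
        return -1;
    faults++;
    if ((pgtable[idx] & 0xFFFFF000u) == 0) {   /* no frame yet: allocate  */
        pgtable[idx] = next_frame | UFLAGS;
        next_frame += 4096;
    } else {
        pgtable[idx] |= PRESENT;               /* frame exists: mark it   */
    }
    return 0;
}

/* Simulated memory access: translate va, faulting once if necessary.
   Returns the physical address, or 0xFFFFFFFF on a fatal fault. */
uint32_t access_va(uint32_t va)
{
    uint32_t idx = va / 4096;
    if (idx >= 1024 || !(pgtable[idx] & PRESENT)) {
        if (page_fault(va) < 0)
            return 0xFFFFFFFFu;
        /* the CPU would now re-execute the faulting instruction */
    }
    return (pgtable[idx] & 0xFFFFF000u) | (va & 0xFFF);
}
```

  The sketch treats a zero frame address as "no frame", which is safe here only
because page frame 0 is never part of a Umode image.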

12. Page Replacement:

    Since MTX does not support demand paging, the problem of page replacement
also does not arise. Nevertheless, it is still worth discussing the principles
of page replacement in demand paging. In the page fault handler, if a new page 
frame must be allocated but there are no free pages available, some existing 
pages must be evicted to make room for the needed page frame. This is known as 
the page replacement problem in virtual memory. Page replacement rules are 
discussed extensively in many OS books. The readers should consult such books 
for more information. Here we only focus on some of the page replacement issues.

1. Local Vs. Global

   When trying to evict some existing pages, the first question is where to look
for such pages? The search is local if we only try to evict pages of the current
process. It is global if we may evict pages of other processes in the system. In
the latter case, non-runnable processes, i.e. blocked or sleeping, are usually 
the candidates.

2. Replacement Rule:

   Among the many page replacement rules, the most popular one is based on the
Least-Recently-Used (LRU) principle. Since the page table entry only has the A 
(Accessed) and D (Dirty) bits, it cannot support LRU directly. However, the 
kernel may clear these bits periodically (by timer interrupts). After clearing
the bits, A=0 means the page has not been accessed since the last time it was 
cleared, and D=0 means it has not been modified since then. The A bit can be 
used as an approximation of LRU. The D bit can be used to decide whether or not
to write the page back to disk when it is to be replaced. Based on the (A,D) 
bits, a simple page replacement algorithm is as follows.

      (A, D)                       choice
      ------: --------------------------------------------------------------
      (0, 0): best candidate since page is not accessed nor modified;
      (0, 1): this cannot happen if (A,D) are cleared at the same time; 
      (1, 0): second best, the page will probably be accessed again soon.
      (1, 1): worst candidate, replacing this page must write it back first.
      ----------------------------------------------------------------------
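
   The table can be turned into a victim search over a page table. The 
following is a sketch only; cost, choose_victim and clear_ad are hypothetical 
names, and the position given to the (0,1) state is an assumption, since the 
table treats it as impossible. On x86 the A and D bits of a page table entry 
are bits 5 and 6.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_P 0x01u      /* Present  */
#define PTE_A 0x20u      /* Accessed */
#define PTE_D 0x40u      /* Dirty    */

/* Replacement cost: 0 for (A,D)=(0,0), 1 for (1,0), 3 for (1,1).
   (0,1) gets 2; ranking it there is an assumption of this sketch. */
static int cost(uint32_t pte)
{
    return ((pte & PTE_A) ? 1 : 0) + ((pte & PTE_D) ? 2 : 0);
}

/* Return the index of the cheapest present page, or -1 if none.
   *writeback is set to 1 if the victim must be written to disk first. */
int choose_victim(const uint32_t *pgtable, int n, int *writeback)
{
    int i, best = -1;
    assert(pgtable && writeback);
    for (i = 0; i < n; i++) {
        if (!(pgtable[i] & PTE_P))
            continue;                   /* nothing to evict here */
        if (best < 0 || cost(pgtable[i]) < cost(pgtable[best]))
            best = i;
    }
    if (best >= 0)
        *writeback = (pgtable[best] & PTE_D) != 0;
    return best;
}

/* Periodic sweep: clear A and D together, as the text assumes.
   A real kernel would have to save dirty pages (or remember D
   elsewhere) before clearing D, or the modification is lost. */
void clear_ad(uint32_t *pgtable, int n)
{
    int i;
    for (i = 0; i < n; i++)
        pgtable[i] &= ~(PTE_A | PTE_D);
}
```

   Ties are broken in favor of the lowest index, which is an arbitrary choice.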

3. Page Replacement by LRU
  
   In order to support LRU, additional information must be kept for the page
frames. Each page frame can be maintained in a data structure containing a 
usage count. The OS kernel examines and clears the A bit of every page entry 
periodically. Whenever it finds the A bit set to 1, the kernel increments the 
page's usage count by 1. The count value represents the number of recent time 
periods in which the page was referenced. This is a much better approximation 
of LRU, although at the expense of additional overhead in the OS kernel.
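
   A minimal sketch of such usage counting is shown below. It assumes that the
kernel samples the A bits from the timer interrupt; sample_usage, least_used 
and touch are hypothetical names, and touch() merely stands in for the MMU 
setting the A bit. Since several references within one period set the A bit 
only once, the count kept here is the number of periods in which a page was 
referenced.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_P 0x01u
#define PTE_A 0x20u
#define NPG   4                         /* pages tracked in this sketch */

static uint32_t pgtable[NPG];
static unsigned usage[NPG];             /* periods in which page was used */

/* Called from the timer at the end of each period: credit every page
   whose A bit was set by the hardware, then clear the bit. */
void sample_usage(void)
{
    int i;
    for (i = 0; i < NPG; i++) {
        if (pgtable[i] & PTE_A) {
            usage[i]++;
            pgtable[i] &= ~PTE_A;
        }
    }
}

/* Victim = present page with the smallest usage count; -1 if none. */
int least_used(void)
{
    int i, best = -1;
    for (i = 0; i < NPG; i++) {
        if (!(pgtable[i] & PTE_P))
            continue;
        if (best < 0 || usage[i] < usage[best])
            best = i;
    }
    return best;
}

/* Stand-in for the MMU setting the A bit on a memory reference. */
void touch(int i)
{
    assert(i >= 0 && i < NPG);
    pgtable[i] |= PTE_A;
}
```

   A practical variant would also age the counts, e.g. by halving them 
periodically, so that old references eventually stop protecting a page.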

13. Unified Page and I/O Buffer Management.

   In many OS which support demand paging, I/O buffers for file blocks are 
treated as pages. When a process needs a file block, it maps the file block to 
a page in its virtual address space. Such pages are initially marked invalid or 
absent. When a process attempts to access such a file block, it will generate a
page fault. The paging subsystem can allocate a page frame for the missing page,
page-in the data, if necessary, and let the process continue. Once a file block
is in the paged buffer cache, it can be reused to avoid physical I/O. This 
scheme integrates file block I/O with demand paging. It is used in Linux, as 
well as many other Unix-like systems. The primary advantage of using a paged 
buffer cache is that it maximizes the use of the physical memory. Almost all 
spare physical memory is used for file system caching. The primary disadvantage
of using paged buffer cache is that, under intensive file I/O demands, it may 
lead to thrashing, which seriously degrades the system performance. For this 
reason, most OS limit the size of paged buffer cache in order to prevent
thrashing. In addition, they also use an internal buffer cache outside the 
paging subsystem for maintaining important system information, such as the 
superblock, bitmaps and directories, etc. 

Conclusion: In this chapter we presented 3 versions of 32-bit MTX for protected
mode operations. MTX32.1 uses protected segmentation. MTX32.2 uses static paging
and MTX32.3 uses dynamic paging. In all cases, we only need to add virtual 
address mapping and set up the IDT for exception/interrupt processing in 
protected mode to the MTX kernel. Other than that, it requires either very 
little or no modifications at all to the MTX kernel. These facts provide strong
support for the argument that when studying the principles of OS design, the 
actual memory management hardware is unimportant. What we have shown is that a 
well designed OS kernel for real mode operation also works in protected mode. 

Extensions of 32-bit MTX
   The main advantage of 32-bit protected mode is that it opens the door for 
many possible extensions to the MTX system. The following is a list of such 
areas.

1. EXT4 file system: Extend the file system to EXT3/EXT4 with 4KB block size.
2. ELF executables:  Modify the loader to support a.out or ELF executable files.
                     Modify the execution environment by using shared and 
                     dynamic libraries.
3. Unix compatible sh: port or develop a Unix compatible sh.
4. Development Environment: port editors and GCC to MTX for program development.
5. Device drivers for PCI bus: Support SATA drives and USB devices.
6. Networking: Develop network drivers and port TCP/IP for networking.
7. Other Unix software: port Unix utilities and X-Windows, etc.

    These extensions would greatly enhance the capability of MTX, making it a
more useful system.