32-bit MTX Using Dynamic Paging
The third version of 32-bit protected mode MTX, MTX32.3, uses dynamic
paging. As in MTX32.2, the 4GB virtual address space is divided into two equal
halves, except that in this case the kernel's virtual address space is mapped
high, from 2GB up to 2GB plus the size of physical memory, and the user mode
virtual address space is mapped low, from 0 to Umode image size. This
organization is consistent with most other Unix-like systems, e.g. Linux and
xv6. In Linux, user space is from 0 to 3GB and kernel space is 3GB and above. In
xv6, user space is from 0 to 2GB and kernel space is 2GB and above, which is the
same as in MTX. As before, we assume that the Umode image size is 4MB. The
virtual address space mapping is accomplished in several steps, which resemble
those of multi-stage booting. The following describes these steps.
1. SETUP:
During booting, the booter loads SETUP to 0x90200 and the MTX kernel to
0x10000. Then it jumps to 0x90200 to run SETUP. The initial GDT in SETUP is
setup_gdt: .quad 0x0000000000000000 # null descriptor
.quad 0x00cF9a000000FFFF # kcs 00cF 9=PpLS=1001
.quad 0x00cF92000000FFFF # kds
which only defines two 4GB kernel code and data segments. As before, SETUP moves
the GDT to 0x9F000 and loads the GDTR register pointing at the GDT. Then, it
enters protected mode, moves the MTX kernel to 1MB and ljmp to the entry address
of the MTX kernel at 1MB. Unlike the two previous versions of MTX, the initial
GDT here is only temporary. Its only purpose is to provide the initial 4GB flat
segments for the MTX kernel to get started.
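The descriptor values in the GDT can be checked by unpacking their fields. The following is a minimal sketch in C, meant to be compiled on a host rather than in the kernel. The type seg_desc and the functions decode_desc() and seg_bytes() are names made up for this illustration; they are not part of MTX.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an x86 segment descriptor (illustrative type, not from MTX) */
struct seg_desc {
    uint32_t base;      /* 32-bit segment base address               */
    uint32_t limit;     /* 20-bit limit, in bytes or in 4KB units    */
    unsigned present;   /* P bit                                     */
    unsigned dpl;       /* descriptor privilege level, 0 to 3        */
    unsigned code;      /* for code/data segments: 1 = code, 0 = data */
    unsigned gran4k;    /* G bit: 1 = limit counts 4KB units         */
};

/* Unpack an 8-byte descriptor value as written in a .quad directive */
struct seg_desc decode_desc(uint64_t d)
{
    struct seg_desc s;
    uint32_t access = (uint32_t)((d >> 40) & 0xFF);  /* bits 40-47 */
    uint32_t flags  = (uint32_t)((d >> 52) & 0xF);   /* bits 52-55 */

    s.limit   = (uint32_t)(d & 0xFFFF)
              | ((uint32_t)((d >> 48) & 0xF) << 16);
    s.base    = (uint32_t)((d >> 16) & 0xFFFFFF)
              | ((uint32_t)((d >> 56) & 0xFF) << 24);
    s.present = (access >> 7) & 1;
    s.dpl     = (access >> 5) & 3;
    s.code    = (access >> 3) & 1;
    s.gran4k  = (flags >> 3) & 1;
    return s;
}

/* Size of the segment in bytes, taking the granularity bit into account */
uint64_t seg_bytes(const struct seg_desc *s)
{
    assert(s != 0);
    return ((uint64_t)s->limit + 1) * (s->gran4k ? 4096u : 1u);
}
```

For the kcs value 0x00cF9a000000FFFF, decode_desc() yields base 0 and limit 0xFFFFF with 4KB granularity, i.e. a flat 4GB code segment at privilege level 0.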
2. The entry.s File.
pm_entry is the entry point of the MTX kernel. Upon entry, the CPU is
already in 32-bit protected mode. In order to support paging, entry.s defines
the following data structures: a page directory, pgdir, at the offset 0x1000, a
page table, pg0, at 0x2000, a new GDT at 0x3000, and the kernel page directory,
kpgdir, at 0x4000, as shown below.
#----------------------------------------------------------------------------
.org 0x1000
pgdir: # page directory
.long 0x00102007 # point at pg0 at 0x102000
.fill 511,4,0 # fill with 511 0 entries
.long 0x00102007 # point at pg0 at 0x102000
.fill 511,4,0 # fill with 511 0 entries
.org 0x2000
pg0: # page table: to be set in entry.s
.org 0x3000
kgdt: .quad 0x0000000000000000 # null descriptor
.quad 0x00cF9a000000FFFF # kcs 00cF 9=PpLS=1001
.quad 0x00cF92000000FFFF # kds
.quad 0x0000000000000000 # tss
.quad 0x00cFFa000000FFFF # ucs 00cF F=PpLS=1111
.quad 0x00cFF2000000FFFF # uds
gdt_desc:
.word .-kgdt-1
.long 0x80103000 # hard coded but can be altered if needed
# IDT table will be installed here at 0x3040
.org 0x4000
kpgdir:
.fill 1024,4,0
.org 0x5000
# actual MTX kernel code begins here
#----------------------------------------------------------------------------
In the page directory, pgdir, entry 0 points to the page table pg0, which is
filled with page frames from 0 to 4MB. This creates an identity mapping of the
virtual address range VA=[0 to 4MB] to PA=[0 to 4MB]. Entry 512 of pgdir also
points to pg0, which maps the virtual address range VA=[2GB to 2GB+4MB] to
PA=[0 to 4MB] as well. The new GDT, kgdt, has 6 entries, where tss, ucs and uds
are for the TSS, user mode code and user mode data segments of the current
running process. The actions of entry.s are as follows.
(1). Set up initial page table pg0, which maps VA=[0-4MB] to PA=[0-4MB]
movl $pg0-KVA, %ebx # KVA = 0x80000000
movl $0x007, %eax
movl $1024, %ecx
loop0:
movl %eax, 0(%ebx)
addl $4, %ebx
addl $4096, %eax
loop loop0
(2). load CR3 with physical address of pgdir and turn on paging.
movl $pgdir-KVA,%eax # physical address of pgdir
movl %eax, %cr3 # load CR3 with PA(pgdir)
movl %cr0, %eax # enable paging
orl $0x80000000,%eax
movl %eax, %cr0
(3). Do a jmp to flush the instruction pipeline.
jmp 1f
1:
(4). Do another jmp to force the CPU to use virtual address
movl $2f,%eax
jmp *%eax
2:
The second jmp is tricky but essential. Before this jmp, the CPU was
executing at low addresses in the range 0-4MB, which work only because of the
identity mapping. This jmp forces the CPU to switch to virtual addresses in the
range 0x80000000 to 0x80000000+4MB.
(5). load GDTR with the new gdt descriptor.
lgdt gdt_desc # load GDT at 0x103000
The new GDT defines 2 kernel mode segments, a TSS and 2 user mode segments.
All code and data segments are flat 4GB segments, as required by paging.
(6). Set the stack pointer to PROC[0]'s kstack. Then call init() in init.c to
initialize the MTX kernel.
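The effect of the initial pgdir and pg0 can be traced with a small host-side simulation of the two-level lookup done by the paging hardware. This is only a sketch: the arrays sim_pgdir and sim_pg0, the constant NOT_MAPPED and the functions build_initial_tables() and translate() are hypothetical names for illustration, and only the single page table pg0 is modeled.

```c
#include <assert.h>
#include <stdint.h>

#define PG0_PA     0x00102000u  /* physical address of pg0, as in entry.s   */
#define NOT_MAPPED 0xFFFFFFFFu  /* marker only; never a valid PA in this model */

uint32_t sim_pgdir[1024];       /* stands in for pgdir */
uint32_t sim_pg0[1024];         /* stands in for pg0   */

/* Fill the tables the way entry.s does: pg0 maps PA 0-4MB,
   and pgdir entries 0 and 512 both point at pg0. */
void build_initial_tables(void)
{
    uint32_t i;
    for (i = 0; i < 1024; i++) {
        sim_pgdir[i] = 0;
        sim_pg0[i] = i * 4096 + 7;
    }
    sim_pgdir[0]   = PG0_PA + 7;
    sim_pgdir[512] = PG0_PA + 7;
}

/* Translate a virtual address as the MMU would: the top 10 bits index
   the page directory, the next 10 bits index the page table, and the
   low 12 bits are the offset within the page frame. */
uint32_t translate(uint32_t va)
{
    uint32_t pde = sim_pgdir[va >> 22];
    uint32_t pte;

    if (!(pde & 1))
        return NOT_MAPPED;
    assert((pde & 0xFFFFF000u) == PG0_PA);  /* only pg0 exists here */
    pte = sim_pg0[(va >> 12) & 0x3FF];
    if (!(pte & 1))
        return NOT_MAPPED;
    return (pte & 0xFFFFF000u) | (va & 0xFFFu);
}
```

With these tables, translate(0x00103000) and translate(0x80103000) both give 0x00103000, which is why the kernel can keep running across the jmp from low to high addresses. Any address beyond the first 4MB of either half is unmapped.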
3. init.c file
init() first initializes the video driver to make printf() work. At this
moment, the kernel's virtual address space is limited to 4MB. The next step is
to expand kernel's virtual address range to the entire available physical
memory. Assume 512MB physical memory. Then, the kernel mode virtual address
space of every process is from 2GB to 2GB+512MB. When the MTX kernel starts,
only the virtual address range from 2GB to 2GB+4MB is mapped to physical memory
0-4MB. Assume that the MTX kernel itself occupies 2MB, i.e. from 1MB to 3MB.
Then the area between 3MB and 4MB is free. We shall build new page tables in
this area, which map the kernel's virtual address range [2GB to 2GB+512MB] to
the entire physical memory [0 to 512MB]. This is done by the kpgtable() function
shown below.
#define KPG_DIR 0x80104000 // VA of kpgdir (PA=0x104000)
#define KPG_TAB 0x80300000 // VA of pgtables (PA=3MB)
#define NPGTABLES 128 // 512MB PA needs 512/4=128 pgtables
#define PGSIZE 4096
extern u32 kpgdir[]; // defined in entry.s
void kpgtable(void)
{
    int i, j;
    u32 *pd = (u32 *)KPG_DIR; // kpgdir is at offset 0x4000 in entry.s
    u32 *pt = (u32 *)KPG_TAB; // pgtables begin at 3MB
    u32 pte = (u32 )0x3; // begin PA=0, |Kpage|W|P
    for (i=512; i<512+NPGTABLES; i++){ // from kpgdir[512], 128 entries
        pd[i] = PA(pt)+3; // pointing at pgtables.
        for (j=0; j<1024; j++){ // pgtable, 1024 entries, 4KB each
            pt[j] = pte;
            pte += 4096;
        }
        pt += 1024; // advance to the next pgtable
    }
}
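The arithmetic of kpgtable() can be checked on a host with ordinary arrays in place of kernel memory. The sketch below assumes the same layout as the text (128 page tables starting at PA=3MB, mapping PA 0 to 512MB); the names sim_pd, sim_pt and sim_kpgtable() are invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NPGTABLES  128            /* 512MB / 4MB per page table       */
#define KPG_TAB_PA 0x00300000u    /* page tables start at PA = 3MB    */

uint32_t sim_pd[1024];               /* stands in for kpgdir          */
uint32_t sim_pt[NPGTABLES * 1024];   /* stands in for the 128 tables  */

/* Same loop structure as kpgtable(), but the physical address of each
   page table is tracked explicitly instead of being derived by PA(). */
void sim_kpgtable(void)
{
    uint32_t pte = 0x3;              /* PA=0, kernel page, W, P        */
    uint32_t pt_pa = KPG_TAB_PA;     /* PA of the current page table   */
    uint32_t *pt = sim_pt;
    int i, j;

    for (i = 0; i < 1024; i++)
        sim_pd[i] = 0;
    for (i = 512; i < 512 + NPGTABLES; i++) {
        sim_pd[i] = pt_pa + 3;       /* directory entry: PA of table + flags */
        for (j = 0; j < 1024; j++) {
            pt[j] = pte;
            pte += 4096;
        }
        pt += 1024;
        pt_pa += 4096;
    }
    assert(pt_pa <= 0x00400000u);    /* tables must fit below 4MB      */
}
```

The 128 page tables occupy 512KB, so they end at 3.5MB and fit in the free area between 3MB and 4MB. The last page table entry maps the page frame at 512MB-4KB.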
As in MTX32.2, the kernel page directory, kpgdir, plays two roles. First, it
will be the pgdir of the initial process P0, which runs (in Kmode) whenever no
other process is runnable. Second, it will be the prototype of the page
directory of all other processes. Each process has its own page directory and
associated page tables. Since the kernel mode spaces of all processes are the
same, the last 512 entries of all page directories are identical. So, when
creating a new process we simply copy the last 512 entries of kpgdir into the
process pgdir. The low (0 to 511) entries of a process pgdir define the user
mode page tables of that process. These entries will be set up when the process
is created.
3-1. Switch CR3 to the new kernel page directory, kpgdir. This allows the kernel
to access all the physical memory from 0 to 512MB.
3-2. kernel_init():
Then, init() calls kernel_init() (in t.c file) to initialize the MTX kernel
data structures, such as free PROC lists, readyQueue and sleepList, etc. Then,
it creates the initial process P0, which uses kpgdir as its page directory. In
P0's TSS, the level-0 interrupt stack is set to P0's kstack. The system is now
running the initial process P0.
3-3. Remap IRQ interrupt vectors. Set up IDT and install exception/interrupt
handlers. These steps are required for all protected mode operations, which are
described in 15.2. The GDT is located at the physical address 0x103000 (VA
0x80103000). The IDT contains 256 8-byte entries, which requires 2KB space. It
is placed in the same page as the GDT, at 0x103040 (VA 0x80103040). init()
proceeds to initialize the IDT and install the exception and I/O interrupt
vectors in the IDT. Exception handler entry points are defined in traps.s.
Exception handlers are in trapc.c. I/O interrupt entry points are defined in
ts.s by calls to the INTH macro. I/O interrupt handler functions are in the
various device drivers in the driver/ directory. Among the interrupts, vector
0x80 is for system calls.
3-4. Initialize I/O buffers, device drivers and timer.
After setting up the IDT and interrupt vectors, init() continues to
initialize I/O buffers and I/O device drivers. When initializing the hard disk
driver, the driver reads the HD partition table to find the MTX partition.
Then it initializes the file system and mounts the MTX partition as the root
file system.
3-5. Free Page List.
MTX uses a free page list, pfreeList, for allocation/deallocation of page
frames. pfreeList is constructed by
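u32 *free_page_list(char *start_va, char *end_va)
{
    u32 *p;
    printf("build pfreeList : start=%x end=%x\n", (u32)start_va, (u32)end_va);
    p = (u32 *)(end_va - 4096); // highest page frame is the list head
    while (p > (u32 *)start_va){
        *p = (u32)(p - 1024); // link to the next lower page frame
        p -= 1024; // 1024 u32 entries = 4096 bytes
    }
    pfreeTail = p; *p = 0; // lowest page frame ends the list
    return (u32 *)(end_va - 4096);
}
// build the list over 4MB to 512MB, using kernel virtual addresses
pfreeList = free_page_list((char *)VA(0x400000), (char *)VA(0x20000000));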
pfreeList threads all the free page frames from 512MB down to 4MB in a linked
list. Each element of pfreeList contains the address of the next page frame. As
usual, the list ends with a 0 pointer. In order for the kernel to access the
entries of pfreeList, the link pointers must use virtual addresses of the page
frames. When allocating a page frame, the virtual address of the page must be
converted to a physical address. Conversion between virtual address and physical
address is done by the PA/VA macros.
#define PA(x) ((u32)(x) - 0x80000000)
#define VA(x) ((u32)(x) + 0x80000000)
With the free page linked list, palloc() allocates a free page from pfreeList,
and pdealloc(VA(page frame)) inserts a deallocated page frame to pfreeList. The
simplest way to implement pdealloc() is to insert the deallocated page frame to
the beginning of pfreeList. However, this tends to always use the pages in the
front part of pfreeList. In order to "verify" that all the pages are used
correctly, we add a pfreeTail, which points at the last element of pfreeList,
and insert deallocated pages to the tail of pfreeList.
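The source does not list palloc() and pdealloc(). The following host-side sketch shows one way the head/tail discipline described above could work, using a small static arena in place of physical memory. The names arena, free_head, free_tail, build_free_list(), sim_palloc(), sim_pdealloc() and free_count() are assumptions for this illustration, not MTX code.

```c
#include <assert.h>
#include <stddef.h>

#define PGSIZE 4096
#define NPAGES 8

/* A page frame; while free, its first word links to the next free frame */
union page {
    union page *next;
    unsigned char bytes[PGSIZE];
};

union page arena[NPAGES];            /* stands in for physical memory */
union page *free_head, *free_tail;   /* roles of pfreeList, pfreeTail */

/* Thread the frames from high to low, as free_page_list() does */
void build_free_list(void)
{
    int i;
    for (i = NPAGES - 1; i > 0; i--)
        arena[i].next = &arena[i - 1];
    arena[0].next = NULL;
    free_head = &arena[NPAGES - 1];
    free_tail = &arena[0];
}

/* Allocate from the head of the list; returns NULL if no frame is free */
union page *sim_palloc(void)
{
    union page *p = free_head;
    if (p == NULL)
        return NULL;
    free_head = p->next;
    if (free_head == NULL)
        free_tail = NULL;
    p->next = NULL;
    return p;
}

/* Release a frame to the tail, so that all frames get cycled through */
void sim_pdealloc(union page *p)
{
    assert(p != NULL);
    p->next = NULL;
    if (free_tail == NULL)
        free_head = p;
    else
        free_tail->next = p;
    free_tail = p;
}

/* Number of frames currently on the free list */
int free_count(void)
{
    int n = 0;
    union page *p;
    for (p = free_head; p != NULL; p = p->next)
        n++;
    return n;
}
```

Because allocation takes from the head and deallocation appends to the tail, a released frame is reused only after every other free frame has been handed out once, which is the "verification" effect mentioned above.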
3-6. Call main() in t.c file.
4. main():
In main(), P0 calls kfork("/bin/init") to create the INIT process P1, which
has /bin/init as its Umode image. Then, P0 switches process to run P1. P1 forks
a login process and waits for ZOMBIE children. Then the login process runs, and
MTX is ready for use.
5. fork.c file.
The fork.c file contains fork1(), kfork() and fork(). fork1() is the
beginning part of both kfork() and fork(). In order to support dynamic paging,
fork1() is modified as follows.
PROC *fork1()
{
    int i; u32 *pgtable, pte;
(1). create a new proc, p, initialize its TSS, as in MTX32.2
(2). p->res->pgdir = palloc(); // allocate a pgdir from pfreeList
(3). for (i=0; i<1024; i++) // copy kpgdir into p->res->pgdir
         p->res->pgdir[i] = kpgdir[i];
(4). pgtable = palloc(); // allocate a pgtable
     p->res->pgdir[0] = PA((u32)pgtable) + 7; // record its PA+7 in pgdir[0]
(5). // create pgtable entries that map PROC's [0-4MB] to these page frames
     for (i=0; i<1024; i++){
         pte = palloc(); // get a page frame (by its VA)
         pgtable[i] = PA(pte) + 7; // record PA of the frame, |Upage|W|P
     }
     return p;
}
After creating a new proc p, fork1() allocates a pgdir and builds a pgtable
for the new proc. The 1024 pgtable entries map the proc's 4MB virtual address
space to the physical addresses of the page frames. All the page frames are
dynamically allocated from pfreeList.
6. loader:
kfork() is only used by P0 to create the init process P1. After creating the
pgdir and pgtable of P1, it loads the Umode image, /bin/init, into the page
frames of P1. In order to load image files to page frames, which may not be
contiguous, the MTX loader is also modified accordingly. Instead of loading the
(1KB) blocks of a file to a linear address, it now loads every 4 blocks of a
file to a page frame in the process page table. Since the modification is almost
trivial, it is not shown. If we extend the (EXT2) file system of MTX to 4KB
block size, then each file block fills exactly one page frame, and the loader
can load file blocks directly into page frames. This will be implemented in
future extensions of MTX.
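The block-to-frame arithmetic of such a loader can be sketched as follows. This is not the MTX loader: load_blocks() and its parameters are hypothetical, the image is taken from a memory buffer, and memcpy() stands in for reading a 1KB disk block.

```c
#include <assert.h>
#include <string.h>

#define BLKSIZE       1024
#define PGSIZE        4096
#define BLKS_PER_PAGE (PGSIZE / BLKSIZE)   /* 4 blocks fill one frame */

/* Copy nblks 1KB blocks of an image into page frames that need not be
   contiguous. frames[i] is the address of the i-th page frame of the
   image. Returns the number of blocks loaded, or -1 if the image needs
   more frames than are available. */
int load_blocks(const unsigned char *image, int nblks,
                unsigned char *frames[], int nframes)
{
    int blk;

    assert(image != NULL && frames != NULL);
    for (blk = 0; blk < nblks; blk++) {
        int page   = blk / BLKS_PER_PAGE;             /* which frame     */
        int offset = (blk % BLKS_PER_PAGE) * BLKSIZE; /* where inside it */
        if (page >= nframes)
            return -1;
        memcpy(frames[page] + offset, image + blk * BLKSIZE, BLKSIZE);
    }
    return nblks;
}
```

Block number blk lands in frame blk/4 at byte offset (blk%4)*1024, so blocks 0 to 3 fill the first frame, blocks 4 to 7 the second, and so on, regardless of where the frames lie in memory.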
7. fork():
After P1, every new process is created by fork(). In order to support dynamic
paging, fork() only needs a slight modification. After creating a child process,
fork() copies the Umode image of the parent to the child. In this case,
copy_image() must copy the contents of the parent's page frames to those of the
child. In general, copy_image() must traverse the parent pgdir to find the page
tables, from which it locates the page frames of the parent. It then copies each
page frame to the corresponding page frame of the child. Since we assume that
every Umode image is 4MB, copy_image() is much simpler: the pgdir has only one
Umode entry (0) and there is only one page table with 1024 page frames. The
following code shows the simplified version of copy_image().
// copy parent's page frames to child's page frames
void copy_image(PROC *parent, PROC *p)
{
    int i; u32 *ppgtable,*cpgtable,*ppa,*cpa;
    // pgdir[0] holds the PA of the only Umode page table; access it by VA
    ppgtable = (u32 *)VA(parent->res->pgdir[0]&0xFFFFF000);
    cpgtable = (u32 *)VA(p->res->pgdir[0]&0xFFFFF000);
    for (i=0; i<1024; i++){
        ppa = (u32 *)VA(ppgtable[i]&0xFFFFF000); // parent's page frame
        cpa = (u32 *)VA(cpgtable[i]&0xFFFFF000); // child's page frame
        memcpy(cpa, ppa, PGSIZE);
    }
}
8. vfork(): The Intel paging hardware supports write-protected pages, which the
kernel can use to implement Copy-On-Write (COW). This feature can be used to
implement vfork() without actually copying the Umode image. Using COW, vfork()
creates a child which shares the Umode page frames of the parent until either
process tries to write to a shared page, at which time the shared COW page is
split. MTX does not yet implement this kind of vfork(). It is left as an
exercise.
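As a starting point for the exercise, the core bookkeeping of COW can be sketched as a host-side simulation. Everything here is an assumption for illustration: the per-frame reference counts, the use of page table entry bit 9 (one of the bits the paging hardware leaves to the OS) as a software COW mark, and the functions cow_share() and cow_write_fault().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PGSIZE   4096
#define NFRAMES  4
#define PTE_P    0x001u   /* present                                  */
#define PTE_W    0x002u   /* writable                                 */
#define PTE_COW  0x200u   /* bit 9: OS-available, used as a COW mark  */

unsigned char frame_mem[NFRAMES][PGSIZE];  /* simulated page frames   */
int frame_refs[NFRAMES];                   /* sharers per frame, 0 = free */

/* Let the child share the parent's frame: both entries become
   read-only and are marked COW. */
void cow_share(uint32_t *parent_pte, uint32_t *child_pte)
{
    uint32_t frame = *parent_pte >> 12;
    assert(frame < NFRAMES);
    *parent_pte = (*parent_pte & ~PTE_W) | PTE_COW;
    *child_pte = *parent_pte;
    frame_refs[frame]++;
}

/* Handle a write fault on a COW entry.
   Returns 0 on success, -1 if no free frame is available. */
int cow_write_fault(uint32_t *pte)
{
    uint32_t old = *pte >> 12, f;

    assert(*pte & PTE_COW);
    if (frame_refs[old] > 1) {             /* still shared: split it  */
        for (f = 0; f < NFRAMES && frame_refs[f] != 0; f++)
            ;
        if (f == NFRAMES)
            return -1;
        memcpy(frame_mem[f], frame_mem[old], PGSIZE);
        frame_refs[old]--;
        frame_refs[f] = 1;
        *pte = (f << 12) | (*pte & 0xFFFu);
    }
    *pte = (*pte & ~PTE_COW) | PTE_W;      /* private and writable    */
    return 0;
}
```

The last sharer of a frame does not need a copy; its entry is simply made writable again. A real kernel would also have to flush the affected TLB entries after changing a page table entry.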
9. exec():
In general, when a process calls exec(), it deallocates the old image pages
and then allocates pages for the new image. Since we assume that every Umode
image size is 4MB, exec() can be simplified. Instead of releasing the old image
pages, it reuses the page frames of the old image. So, it only needs to call the
(new) loader to load the new image into the old page frames. Thus, no changes
are needed in exec(). How to handle the general case in which the image size may
differ is also left as an exercise.
10. Address Parameters in Syscalls
The process pgdir defines both Umode and Kmode pgtables of a process. Umode
addresses are translated by low (0 to 511) entries in the process pgdir and
the corresponding page tables. When a process executes in kernel (at privilege
level 0), it can access its Umode pages, which have privilege level 3. This
implies that all Umode virtual addresses are also valid in Kmode. Therefore, the
address parameters in syscalls can be used directly in Kmode without any
conversion. Conversions between VA and PA are needed only in Kmode when
manipulating pgdir and pgtables. Such tables contain physical addresses but
must be accessed by their virtual addresses in kernel.
11. Demand Paging:
In the current implementation, MTX does not support demand paging, mainly
because we want to keep the system simple. Another reason is that demand paging
is unnecessary in the current MTX because all the Umode programs need only a few
pages to execute. However, we can demonstrate the principle of demand paging by
simulating page faults and then handling them.
In fork1(), when allocating pages for a new process, we can intentionally
mark some of the entries in the page table as not present, e.g. by setting the
P bit of pgtable[1] or pgtable[2] to 0. When the process executes in Umode and
references a virtual address in such a page, it generates a page fault, even
though the page frame is already loaded with the image code or data. When the
process traps to Kmode, the exception handler recognizes the page fault, fixes
up the "faulty page" and lets the process return to Umode to continue. The page
fault exception handler is a part of ehandler() in trapc.c. The algorithm of the
page fault handler is shown below.
void ehandler(u32 signal, ... u32 err_nr, ..) // err_nr is the error number
{
u32 cr2, cr3, pgentry, *pgtable;
if (exception occurred in Umode, i.e. running->inkmode==1){
if (err_nr==14){ // page fault
cr2 = CPU.CR2 register; // get offending VA from CR2
pgentry = cr2/4096; // convert VA to pgtable entry number
pgtable = VA(running->res->pgdir[0]&0xFFFFF000);
pgtable[pgentry] |= 0x1; // mark page table entry PRESENT
load CPU.CR3 with PA(running->res->pgdir); // flush CPU.TLB
}
return; // return to Umode to continue
}
// page fault occurred in Kmode: PANIC and stop
}
In the above pseudo code, the page fault handler gets the virtual address that
caused the page fault (in CR2), converts it to a pgtable entry index, marks the
"missing" page entry as Present and reloads the CR3 register with the process
pgdir to flush the CPU's TLB cache. When the process returns to Umode, it will
re-execute the instruction that caused the page fault earlier. This time it will
not cause any page fault because the page table entry is now Present. In the
general case, the "missing" page may not exist. If so, the page fault handler
must allocate a page frame, load the needed page contents into the page frame
and mark the page entry present before letting the process continue.
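The fault-and-retry cycle can be tried out on a host with a simulated page table. In the sketch below, sim_access() plays the role of the CPU making one Umode memory reference and sim_page_fault() mirrors the page fault branch of ehandler(). All names, and the made-up frame numbers, are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_P 0x1u

uint32_t sim_pgtable[1024];  /* the single Umode page table (4MB image) */
int fault_count;             /* number of simulated page faults handled */

/* Fill the table as fork1() would, then hide one page to provoke a fault */
void sim_setup(uint32_t hidden)
{
    uint32_t i;
    assert(hidden < 1024);
    for (i = 0; i < 1024; i++)
        sim_pgtable[i] = (i << 12) | 7;   /* made-up frame numbers */
    sim_pgtable[hidden] &= ~PTE_P;        /* frame is loaded, but not present */
    fault_count = 0;
}

/* Counterpart of the page fault branch in ehandler() */
void sim_page_fault(uint32_t cr2)
{
    uint32_t pgentry = cr2 / 4096;        /* VA to pgtable entry number */
    assert(pgentry < 1024);
    sim_pgtable[pgentry] |= PTE_P;        /* mark entry PRESENT */
    fault_count++;
}

/* Model one Umode access: fault and retry if the page is not present.
   Returns the physical address the access finally reaches. */
uint32_t sim_access(uint32_t va)
{
    uint32_t pte;
    assert(va < 0x400000u);               /* Umode image is 4MB */
    pte = sim_pgtable[va >> 12];
    if (!(pte & PTE_P)) {
        sim_page_fault(va);               /* va is what CR2 would hold */
        pte = sim_pgtable[va >> 12];      /* re-execute the access */
    }
    return (pte & 0xFFFFF000u) | (va & 0xFFFu);
}
```

After sim_setup(2), the first access to any address in page 2 takes exactly one fault, and later accesses to the same page take none, which is the behavior described above.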
12. Page Replacement:
Since MTX does not support demand paging, the problem of page replacement
also does not arise. Nevertheless, it is still worth discussing the principles
of page replacement in demand paging. In the page fault handler, if a new page
frame must be allocated but there are no free pages available, some existing
pages must be evicted to make room for the needed page frame. This is known as
the page replacement problem in virtual memory. Page replacement rules are
discussed extensively in many OS books. The readers should consult such books
for more information. Here we only focus on some of the page replacement issues.
1. Local Vs. Global
When trying to evict some existing pages, the first question is where to look
for such pages. The search is local if we only try to evict pages of the current
process. It is global if we may evict pages of other processes in the system. In
the latter case, non-runnable processes, i.e. blocked or sleeping, are usually
the candidates.
2. Replacement Rule:
Among the many page replacement rules, the most popular one is based on the
Least-Recently-Used (LRU) principle. Since a page table entry has only the A
(Accessed) and D (Dirty) bits, it cannot support LRU directly. However, the
kernel may clear these bits periodically (by timer interrupts). After clearing
the bits, A=0 means the page has not been accessed since the last time it was
cleared, and D=0 means it has not been modified since the last time. The A bit
can be used as an approximation of LRU. The D bit can be used to decide whether
or not to write the page back to disk when it is to be replaced. Based on the
(A,D) bits, a simple page replacement algorithm is as follows.
(A, D) choice
------: --------------------------------------------------------------
(0, 0): best candidate since page is not accessed nor modified;
(0, 1): this cannot happen if (A,D) are cleared at the same time;
(1, 0): second best, the page will probably be accessed again soon.
(1, 1): worst candidate, replacing this page must write it back first.
----------------------------------------------------------------------
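The ranking in the table can be expressed in a few lines of C. The sketch below assumes the standard x86 bit positions for the A and D bits (bits 5 and 6 of a page table entry); ad_rank(), pick_victim() and needs_writeback() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_P 0x01u   /* Present           */
#define PTE_A 0x20u   /* Accessed, bit 5   */
#define PTE_D 0x40u   /* Dirty, bit 6      */

/* Rank of a page as a replacement candidate: lower is better.
   (0,0)=0, (0,1)=1, (1,0)=2, (1,1)=3, the same order as the table. */
int ad_rank(uint32_t pte)
{
    int a = (pte & PTE_A) != 0;
    int d = (pte & PTE_D) != 0;
    return 2 * a + d;
}

/* Return the index of the best present page to evict among n entries,
   or -1 if no entry is present. Ties go to the lowest index. */
int pick_victim(const uint32_t *pgtable, int n)
{
    int i, best = -1;

    assert(pgtable != 0);
    for (i = 0; i < n; i++) {
        if (!(pgtable[i] & PTE_P))
            continue;
        if (best < 0 || ad_rank(pgtable[i]) < ad_rank(pgtable[best]))
            best = i;
    }
    return best;
}

/* A dirty page must be written back before its frame is reused */
int needs_writeback(uint32_t pte)
{
    return (pte & PTE_D) != 0;
}
```

The same function serves a local or a global policy: the caller decides whether to pass the page table of the current process only or to scan the tables of other processes as well.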
3. Page Replacement by LRU
In order to support LRU, additional information must be kept for the page
frames. Each page frame can be maintained in a data structure containing a
usage count. The OS kernel clears the A bit of every page entry periodically.
Whenever the A bit changes to 1, the kernel increments the page's usage count
by 1. The count value represents the number of page references in the last time
period. This is a much better approximation of LRU, although at the expense
of additional overhead in the OS kernel.
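A minimal sketch of this sampling scheme is given below. Since software cannot observe each individual reference, the sketch counts the number of sampling periods in which a page was accessed, which is one common reading of the usage count. The arrays lru_pgtable and usage and the functions sample_usage() and least_used() are invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_P  0x01u   /* Present          */
#define PTE_A  0x20u   /* Accessed, bit 5  */
#define NPAGES 4

uint32_t lru_pgtable[NPAGES];  /* simulated page table entries        */
unsigned usage[NPAGES];        /* periods in which each page was used */

/* Called once per sampling period, e.g. from the timer interrupt
   handler: credit every page whose A bit is set, then clear the bit. */
void sample_usage(void)
{
    int i;
    for (i = 0; i < NPAGES; i++) {
        if (lru_pgtable[i] & PTE_A) {
            usage[i]++;
            lru_pgtable[i] &= ~PTE_A;
        }
    }
}

/* Return the present page with the smallest usage count, or -1 if no
   page is present. Ties go to the lowest index. */
int least_used(void)
{
    int i, best = -1;
    for (i = 0; i < NPAGES; i++) {
        if (!(lru_pgtable[i] & PTE_P))
            continue;
        if (best < 0 || usage[i] < usage[best])
            best = i;
    }
    return best;
}
```

In a real kernel, clearing the A bits must be followed by a TLB flush, since otherwise the hardware may not set the bit again on the next access. A plain count also never forgets old activity; halving the counts periodically is one possible remedy.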
13. Unified Page and I/O Buffer Management.
In many operating systems that support demand paging, I/O buffers for file
blocks are treated as pages. When a process needs a file block, it maps the file
block to a page in its virtual address space. Such pages are initially marked
invalid or
absent. When a process attempts to access such a file block, it will generate a
page fault. The paging subsystem can allocate a page frame for the missing page,
page-in the data, if necessary, and let the process continue. Once a file block
is in the paged buffer cache, it can be reused to avoid physical I/O. This
scheme integrates file block I/O with demand paging. It is used in Linux, as
well as many other Unix-like systems. The primary advantage of using a paged
buffer cache is that it maximizes the use of the physical memory. Almost all
spare physical memory is used for file system caching. The primary disadvantage
of using paged buffer cache is that, under intensive file I/O demands, it may
lead to thrashing, which seriously degrades the system performance. For this
reason, most OS limit the size of paged buffer cache in order to prevent
thrashing. In addition, they also use an internal buffer cache outside the
paging subsystem for maintaining important system information, such as the
superblock, bitmaps and directories, etc.
Conclusion: In this chapter we presented 3 versions of 32-bit MTX for protected
mode operations. MTX32.1 uses protected segmentation. MTX32.2 uses static paging
and MTX32.3 uses dynamic paging. In all cases, we only need to add virtual
address mapping and set up the IDT for exception/interrupt processing in
protected mode to the MTX kernel. Other than that, it requires either very
little or no modifications at all to the MTX kernel. These facts provide strong
support for the argument that when studying the principles of OS design, the
actual memory management hardware is unimportant. What we have shown is that a
well designed OS kernel for real mode operation also works in protected mode.
Extensions of 32-bit MTX
The main advantage of 32-bit protected mode is that it opens the door for
many possible extensions to the MTX system. The following is a list of such
areas.
1. EXT4 file system: Extend the file system to EXT3/EXT4 with 4KB block size.
2. ELF executables: Modify the loader to support a.out or ELF executable files.
Modify the execution environment by using shared and
dynamic libraries.
3. Unix compatible sh: port or develop a Unix compatible sh.
4. Development Environment: port editors and GCC to MTX for program development.
5. Device drivers for PCI bus: Support SATA drives and USB devices.
6. Networking: Develop network drivers and port TCP/IP for networking.
7. Other Unix software: port Unix utilities and X-Windows, etc.
These extensions would greatly enhance the capability of MTX, making it a
more useful system.