MTX in 32-bit Protected Mode

1. Introduction to 32-bit Protected Mode

   When the Intel x86 CPU starts, it is in 16-bit real mode. While in real mode,
the CPU uses segment registers and 16-bit offsets to generate memory addresses.
It uses the interrupt vectors in the low 1KB memory to process interrupts and a
few exceptions. The CPU can be switched to 32-bit protected mode by setting bit
0 of the CPU's control register CR0 to 1. Once in protected mode, the CPU's 
operating environment changes completely. First, the meaning of the segment 
registers changes. Instead of holding a segment base address, each segment 
register becomes a segment selector, which selects a segment descriptor in 
either a Global Descriptor Table (GDT) or a Local Descriptor Table (LDT). Each 
segment descriptor specifies the segment's base address and size limit. A 
logical address consists of a segment selector and a 32-bit offset in the form
[segment:offset]. The linear address is the segment base plus the offset. 

Second, exception and interrupt vectors are no longer in the low 1KB memory as 
they are in real mode. Instead, they are represented by Interrupt Descriptors 
in an Interrupt Descriptor Table (IDT). The CPU automatically uses the GDT or 
LDT for address translation, and uses the IDT for exception and interrupt 
processing. These tables must be set up properly prior to switching the CPU to 
protected mode. The following is a brief description of protected mode 
operations.

1.1 Memory Management

    Memory management includes both address translation and protection. The x86
processor in protected mode can use either segmentation or paging for memory 
management. 

1.1.1. Segmentation 

    An x86 processor in protected mode has 6 segment registers, denoted by cs,
ds,ss,es,fs and gs. Among these, only cs,ds and ss are needed by a program. Each 
segment register is 16 bits but its content is no longer a base address as in 
real mode. Instead, it specifies the index of a segment descriptor in a 
descriptor table. The segment descriptor contains the base address and size
limit of the intended segment. The linear address, which is also the physical 
address when paging is not used, is the segment's base address plus the offset, 
both 32 bits. The format of a segment register is

           15              3 2 1 0  
           ------------------------
           |     index      |T|RPL|
           |   (13 bits)    | |   |
           ------------------------
           
where index = 13-bit index into a descriptor table, T=0 means the Global 
Descriptor Table (GDT), T=1 means the Local Descriptor Table (LDT), and RPL is 
the requested privilege level used for protection. The 2-bit privilege level 
varies from 00, which is the highest, to 11, which is the lowest. In general, a 
program executes at the privilege level of its code segment selector. It can 
access segments at the same or lower privilege level, but not those at a higher 
level. The four privilege levels form protection rings, which can be used to 
implement secure operating systems with layers of protection. Most Unix-like 
systems, e.g. Linux and MTX, use only two levels, kernel mode and user mode, for 
protection. In order to simplify the discussion, we shall assume two privilege 
levels, 00 for kernel mode and 11 for user mode. When a process executes in 
kernel mode at privilege level 00, it can access any segment and execute any 
instruction. When a process executes in user mode at privilege level 11, it 
cannot access any segment whose privilege level is 00. This prevents a user mode 
program from executing kernel code or accessing kernel data segments. As usual, 
it can enter kernel mode only through interrupts, exceptions or explicit 
syscalls. 

In protected mode, a logical address consists of two parts: a 16-bit segment 
selector and a 32-bit offset which specifies the relative address within the 
segment. Given a logical address LA=[segment:offset], the CPU uses the 
selector's T bit to access either the GDT or the LDT. If T=0, it uses the GDT, 
which is pointed to by the CPU's GDTR register. If T=1, it uses the LDT, which 
is pointed to by the CPU's LDTR register. A system typically has only one GDT, 
which specifies the kernel code and data segments that are common to all 
processes in kernel mode. Each process may have its own LDT, which specifies the 
user mode address space of that process.
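
The selector format can be illustrated by a short C sketch that packs and 
unpacks the three fields. The names make_selector, decode_selector and 
selector_fields are hypothetical helpers chosen for this illustration; they are 
not part of MTX.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 16-bit segment selector: | index (13) | T (1) | RPL (2) | */
struct selector_fields {
    uint16_t index; /* descriptor index in the GDT or LDT */
    uint8_t  t;     /* 0 = GDT, 1 = LDT */
    uint8_t  rpl;   /* 0 = kernel, 3 = user */
};

/* Build a selector from its three fields. */
static uint16_t make_selector(uint16_t index, uint8_t t, uint8_t rpl)
{
    assert(index < 8192 && t <= 1 && rpl <= 3);
    return (uint16_t)((index << 3) | (t << 2) | rpl);
}

/* Split a selector back into its fields. */
static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = sel >> 3;        /* bits 15-3 */
    f.t     = (sel >> 2) & 1;  /* bit 2 */
    f.rpl   = sel & 3;         /* bits 1-0 */
    return f;
}
```

For example, make_selector(1, 0, 0) yields 0x08, i.e. entry 1 of the GDT at 
kernel privilege, and decode_selector(0x1B) yields index=3, T=0, RPL=3.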

1.1.2. Segment Descriptors

Each segment is described by an 8-byte Segment Descriptor. Segment Descriptors 
are in either the GDT or the LDT. The address and size of the GDT are contained
in the CPU's GDTR register. Similarly, the address and size of the current LDT 
are contained in the CPU's LDTR register. The format of a Code or Data segment 
descriptor is

      63                48 47     40 39                16 15        0
      ----------|----|----|----|----|---------|----------|------------
      |  base   |GD0A| Lm |PpLS|type|        base        |  limit    |
      | (31-24) |    |(4) |    |    |       (23-0)       | (15-0)    |
      ----------------------------------------------------------------
                   Format of Code/Data Descriptor

base  = 32-bit base address in bytes
limit = 20-bit segment size limit (Lm = high 4 bits of limit) 
G = Granularity: 0 in bytes, 1 in 4KB blocks
D = 1 for 32-bit operands
A = available for use by system software
P = 1 if segment is present
pL (2 bits) = Privilege level (00=kernel, 11=user)
S = 1 for code or data segment, 0 for system segment (e.g. TSS or LDT)
type (4 bits): code or data segment and R/W access

In addition to Code and Data segment descriptors, the GDT may also contain
Task State Segment Descriptors (TSSD). A TSSD refers to a Task State Segment 
(TSS), which is a data structure used by the CPU to save the processor registers
during hardware task switching. It also supplies the kernel mode stack pointer
used when an interrupt or exception changes the privilege level. 
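
As an illustration of the descriptor layout, the following C sketch packs a 
base, a 20-bit limit, the access byte (P, pL, S, type) and the flag nibble 
(G, D, 0, A) into a 64-bit descriptor. The function make_descriptor and the 
macro names are assumptions made for this example, not MTX code.

```c
#include <assert.h>
#include <stdint.h>

/* Access byte (bits 47-40): P | pL(2) | S | type(4) */
#define ACC_CODE_KERNEL 0x9Au /* P=1, pL=00, S=1, type=1010 (code, readable) */
#define ACC_DATA_KERNEL 0x92u /* P=1, pL=00, S=1, type=0010 (data, writable) */
/* Flag nibble (bits 55-52): G | D | 0 | A */
#define FLAG_G_D        0xCu  /* G=1 (4KB units), D=1 (32-bit operands) */

/* Pack the fields of a code or data segment descriptor into 8 bytes. */
static uint64_t make_descriptor(uint32_t base, uint32_t limit,
                                uint8_t access, uint8_t flags)
{
    assert(limit <= 0xFFFFFu && flags <= 0xFu);
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFFu);             /* limit 15-0  -> bits 15-0  */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;      /* base 23-0   -> bits 39-16 */
    d |= (uint64_t)access << 40;                  /* P,pL,S,type -> bits 47-40 */
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48;  /* limit 19-16 -> bits 51-48 */
    d |= (uint64_t)flags << 52;                   /* G,D,0,A     -> bits 55-52 */
    d |= (uint64_t)(base >> 24) << 56;            /* base 31-24  -> bits 63-56 */
    return d;
}
```

With base=0 and limit=0xFFFFF, make_descriptor(0, 0xFFFFF, ACC_CODE_KERNEL, 
FLAG_G_D) gives 0x00CF9A000000FFFF, a kernel code segment covering 4GB.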

1.2. Segmentation Models

(1). Flat model: In the flat segmentation model, at least 2 segment descriptors 
are needed; one for code segment and one for data segment. The segments are 
mapped to the entire 4GB address space by setting base=0, G=1 (for 4KB units) 
and limit=1M (the actual limit value is one less, i.e. 0xFFFFF). In this case, 
the mapping of logical address to physical address is one-to-one, i.e. a 32-bit 
virtual address is an offset relative to 0, which is also the physical address.

(2). Protected flat model: Segments are mapped to existing memory only by 
setting the limit to the available physical memory. Attempts to access memory 
outside the limit will generate a protection error.

(3). Multi-segment model:
In this model, each process has its own segments. The CPU has 6 segment 
registers: cs,ds,ss,es,fs,gs, each of which may be used to select a segment. 
For MTX, only cs,ds and ss are needed. Since ds and ss refer to the same 
segment, only two segments, code and data, are needed.
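
The relation between the limit field, the G bit and the segment size in these 
models can be checked with a small C sketch. The function segment_size is a 
hypothetical helper introduced only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of a segment with the given 20-bit limit field and G bit. */
static uint64_t segment_size(uint32_t limit, int g)
{
    assert(limit <= 0xFFFFFu);
    uint64_t units = (uint64_t)limit + 1;  /* the limit field holds size-1 */
    return g ? units * 4096 : units;       /* G=1: 4KB units, G=0: bytes */
}
```

For the flat model, segment_size(0xFFFFF, 1) is 4GB. For a protected flat model 
on a machine with, say, 64MB of memory, the limit field would be 0x3FFF with 
G=1.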

When using segmentation, the CPU's memory management functions are summarized
in the following diagram, which is explained below by the labels (1) to (5).

   -------------------------------------------------------------------------
           (1)                                                (2)
         CPU.GDTR                                           CPU.LDTR
  [GDT_limit|GDT_address]                            [LDT_limit|LDT_address]
            |                      (3)                         |
            |                 segment_selector                 |
           GDT                ----------------                LDT
      ---------------   VA =  | index |T|RPL |:offset   -----------------
      | descriptor  |         ----|-----------          |  user_segment |
      ---------------             |                     -----------------
      |             |             |                     |  user_segment |
      | descriptor  |<-- (T=0)----|----(T=1) -----------|> descriptor   |
      -------|-------                                   -------|---------    
             |                     (4)                         |
         base_address                                     base_address

   -----------------------------------------------------------------------
              (5).       PA = base_address + offset
                               protection:
                          (within descriptor limit)    
                         ( RPL: privilege permitting )
  ------------------------------------------------------------------------

(1). CPU's GDTR register is loaded with a GDT descriptor=[u16 GDT_limit| u32 
     GDT_address], which points to the GDT. The GDT contains 8-byte global 
     segment descriptors. Each descriptor has a 32-bit base_address, a size 
     limit and a 2-bit privilege level.  

(2). Like the GDT, the LDT also contains 8-byte local segment descriptors. Each
     descriptor has a 32-bit base_address, a size limit and a 2-bit privilege 
     level. However, unlike the GDTR, which can be loaded with a GDT descriptor
     anywhere in memory, the LDTR must be loaded with the selector of an LDT 
     descriptor in the GDT. Thus, in order to use any LDT, the LDT's descriptor 
     must be placed in the GDT, and the LDTR must be loaded with its selector 
     in the GDT. To beginners, this may be quite confusing.
        
(3). A virtual address VA = [16-bit segment_selector:32-bit offset]. The 
     13-bit index is used to access a segment descriptor in either the GDT
     (T=0) or the LDT (T=1). 

(4). The selected descriptor supplies the segment's base_address, size limit
     and privilege level.

(5). For a data segment, the privilege level of the currently executing code 
     segment and the selector's RPL must both be numerically <= the 
     descriptor's privilege level, and the offset must be within the segment 
     limit. If so, the linear or physical address PA = segment base_address + 
     offset.
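
The translation and protection checks can be modeled in C as follows. This is 
only a simplified sketch for data segments with paging disabled; the type 
struct segment and the function seg_translate are hypothetical, and the limit 
is assumed to be already expressed in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified view of a data segment descriptor (hypothetical type). */
struct segment {
    uint32_t base;   /* 32-bit base address */
    uint32_t limit;  /* highest valid offset, in bytes */
    uint8_t  pl;     /* descriptor privilege level: 0 = kernel, 3 = user */
};

/*
 * Translate [selector:offset] to a physical address. cpl is the privilege
 * level of the running code. Returns false where the real CPU would raise
 * a protection exception.
 */
static bool seg_translate(const struct segment *seg, uint16_t selector,
                          uint8_t cpl, uint32_t offset, uint32_t *pa)
{
    assert(seg != NULL && pa != NULL);
    uint8_t rpl = selector & 3;
    uint8_t effective = cpl > rpl ? cpl : rpl;  /* weaker of the two levels */
    if (effective > seg->pl)
        return false;                 /* privilege violation */
    if (offset > seg->limit)
        return false;                 /* outside the segment limit */
    *pa = seg->base + offset;
    return true;
}
```

With a kernel data segment of base 0x100000 and limit 0xFFFF, an access at 
offset 0x1234 from kernel mode yields PA=0x101234, while the same access from 
user mode (privilege level 11) is rejected.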

1.3. Paging:

   In protected mode, memory can also be managed by paging. When using paging,
we first create a flat segment of 4GB. Then, enable paging by setting bit 31 
(PG) of the control register CR0 to 1. With paging, a 32-bit linear address is 
treated by the CPU's Memory Management Unit (MMU) as a triple, 
[directory|table|offset], as in

         31      22 21       12 11        0
        ------------------------------------
        | directory|   table   |  offset   |
        ------------------------------------
           Format of a Linear address
 
This is a two-level paging scheme, in which directory refers to an entry in the 
level-1 page table, table refers to an entry in a level-2 page table, and offset
is the relative address in the page. Each page is 4KB in size. Given a 32-bit 
linear address
           | 10 bits | 10 bits | 12 bits|
      LA = |directory|  table  | offset |

The CPU first uses the control register CR3 to locate the directory page table.
Then, it uses the 10-bit directory to access the entry in the directory page 
table. Each page table entry has the format

        31                         12 11                    0
        ------------------------------------------------------
        | page frame address (31-12) |AVAIL|0 0 D A P P U R P|
        |                                           C W / /  |
        |                                           D T S W  |
        -----------------------------------------------------

P   = 1 if page present; 0 if not
R/W = 0 for read-only, 1 for writable
U/S = user or system flag for protection (0 = system only, 1 = user)
PWT = page write-through
PCD = page cache disable
A   = accessed
D   = dirty or modified
AVAIL = available for systems programmer use

Assume that the directory page table entry is present and the access checking
is OK. Then, it uses the 20-bit page frame address to locate the level-2 page
table. Then, it uses the 10-bit table to locate the entry in the level-2 page
table. Assuming that the page entry is present and access checking is OK also,
the level-2 page table entry contains the page frame address in memory. The 
final physical address is

       (physical page frame address << 12) + offset

The translation procedure is depicted in the diagram below.
 
  ----------------------------------------------
  | directory  |   table      |    offset      |
  -|------------------|------------------|------  PageFrame
   |                  |                  |      -------------
   |   level-1        |     level-2      |      |           |
   |   PageTable      |     PageTable    |----->|-> operand |
   |  -----------     |   ------------          -------------
   |  |         |     |   |          |                |
   |->|PGTentry |--|  |-->| PGTentry |-----------------
      |         |  |      |          |
      -----------  |      ------------
          |        |          |
CR3 -------        |-----------
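
The two-level translation can be simulated in C on a small array standing in 
for physical memory. This is a sketch under stated assumptions: phys_mem, 
NFRAMES, read_entry, write_entry and page_translate are hypothetical names, 
access checking other than the P bit is omitted, and masking an entry with 
~0xFFF is equivalent to shifting the 20-bit page frame address left by 12.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u
#define PTE_P     0x001u   /* present */
#define PTE_RW    0x002u   /* writable */
#define PTE_US    0x004u   /* user accessible */

/* A tiny simulated physical memory: 8 page frames (hypothetical size). */
#define NFRAMES 8u
static uint8_t phys_mem[NFRAMES * PAGE_SIZE];

/* Read the 32-bit entry number idx of the page table at table_pa. */
static uint32_t read_entry(uint32_t table_pa, uint32_t idx)
{
    uint32_t e;
    assert(table_pa + idx * 4 + 4 <= sizeof(phys_mem));
    memcpy(&e, &phys_mem[table_pa + idx * 4], sizeof(e));
    return e;
}

/* Write the 32-bit entry number idx of the page table at table_pa. */
static void write_entry(uint32_t table_pa, uint32_t idx, uint32_t e)
{
    assert(table_pa + idx * 4 + 4 <= sizeof(phys_mem));
    memcpy(&phys_mem[table_pa + idx * 4], &e, sizeof(e));
}

/* Walk the two-level tables rooted at cr3; false models a page fault. */
static bool page_translate(uint32_t cr3, uint32_t la, uint32_t *pa)
{
    uint32_t dir    = la >> 22;             /* bits 31-22 */
    uint32_t table  = (la >> 12) & 0x3FFu;  /* bits 21-12 */
    uint32_t offset = la & 0xFFFu;          /* bits 11-0  */

    uint32_t pde = read_entry(cr3 & ~0xFFFu, dir);
    if (!(pde & PTE_P))
        return false;                       /* level-2 table not present */
    uint32_t pte = read_entry(pde & ~0xFFFu, table);
    if (!(pte & PTE_P))
        return false;                       /* page not present */
    *pa = (pte & ~0xFFFu) + offset;
    return true;
}
```

For example, with the level-1 table in frame 1, a level-2 table in frame 2 and 
a page mapped to frame 5, the linear address 0x0040302A (directory=1, table=3, 
offset=0x2A) translates to the physical address 0x502A.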

Since paging relies on a flat segmentation model, protection by checking the 
segment limit no longer makes sense. With paging, protection is enforced by the
individual page table entries. A page is either present or not present. An 
attempt to access a non-present page generates a page fault. In addition, a 
page table entry can be marked as either read-only or writable. The accessed 
(A) and dirty (D) bits are used to implement page replacement in demand paging.

Translation Lookaside Buffer (TLB)

In order to speed up the paging translation process, the CPU stores the most 
recently used page table entries in an internal cache, called the TLB. Most
address translations are performed by using the contents of the TLB. Bus cycles 
to the page tables are performed only when a new page is used. Whenever the page 
tables are changed, the OS kernel must flush the TLB in order to prevent the 
CPU from using old page entries in the TLB. Flushing the TLB can be done by 
reloading the CR3 control register. Individual entries in the TLB can also be
flushed by the INVLPG instruction.

1.4. Static Paging:

     The simplest paging is static paging. In this scheme, all the pages of a 
process image are allocated physical page frames at once, and the page tables 
are set up accordingly. Once an image is loaded into memory, its pages are 
always present.

1.5. Demand Paging:

     In demand paging, the page tables of a process image are built according 
to the image size, but not all the pages are allocated page frames. Those pages
which do not have page frames are marked not present. The page frame address in
an absent page table entry may point to its location in a physical device, e.g.
a block number in a file system or a block number in a swap disk containing the
page image. During execution, when a process attempts to reference a page that 
is not present, it generates a page fault, which traps to the OS kernel. The OS
kernel can allocate a physical page frame for the page, load the page into the 
page frame, and change the page table entry to present and pointing to the page 
frame. Then it lets the process continue with the valid page table entry. Demand
paging supports virtual memory in which the virtual address space of a process 
image can be much larger than the physical memory allocated to it. However, this 
scheme depends on a good page replacement policy in order to reduce the number 
of page swaps.
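
The demand paging steps can be sketched as a small C simulation. All names 
here (page_entry, image_init, handle_page_fault, access_page) and the sizes 
NPAGES and NFRAMES are assumptions for illustration. The sketch has no page 
replacement policy and does not actually read the disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NPAGES  8   /* pages in the process image (hypothetical) */
#define NFRAMES 4   /* physical frames available (hypothetical) */

struct page_entry {
    bool     present;   /* P bit */
    uint32_t frame;     /* frame number if present, else disk block */
};

static struct page_entry page_table[NPAGES];
static int next_free_frame = 0;
static int page_faults = 0;

/* Build the page table for the image: every page starts out absent. */
static void image_init(uint32_t first_disk_block)
{
    for (int i = 0; i < NPAGES; i++) {
        page_table[i].present = false;
        page_table[i].frame = first_disk_block + (uint32_t)i;
    }
    next_free_frame = 0;
    page_faults = 0;
}

/* Page fault handler: allocate a frame and mark the page present. */
static void handle_page_fault(int page)
{
    assert(next_free_frame < NFRAMES);  /* no replacement policy here */
    page_faults++;
    /* A real kernel would read disk block page_table[page].frame here. */
    page_table[page].frame = (uint32_t)next_free_frame++;
    page_table[page].present = true;
}

/* Reference a page; returns the frame that backs it. */
static uint32_t access_page(int page)
{
    assert(page >= 0 && page < NPAGES);
    if (!page_table[page].present)
        handle_page_fault(page);        /* trap to the kernel, then retry */
    return page_table[page].frame;
}
```

The first access to a page causes one page fault; later accesses to the same 
page find the entry present and cause none.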

1.6. Interrupt and Exception Processing in Protected Mode

     Interrupt and exception processing in protected mode differs from that of 
real mode in two areas. First, in protected mode the first 32 interrupt vectors, 
0x00 to 0x1F, are reserved for exceptions, which are

Exception  | Description
-----------|-------------------------------------------------
  0x00     | Divide error
  0x01     | Single-step/debug exception
  0x02     | Nonmaskable interrupt
  0x03     | Breakpoint by INT 3 instruction
  0x04     | Overflow
  0x05     | Bounds check
  0x06     | Invalid opcode
  0x07     | Coprocessor not available
  0x08     | Double fault
  0x09     | Coprocessor segment overrun
  0x0A     | Invalid TSS
  0x0B     | Segment not present
  0x0C     | Stack exception
  0x0D     | General protection violation
  0x0E     | Page fault
  0x0F     | (Reserved)
  0x10     | Coprocessor error
  0x11-0x1F| (Reserved)
---------------------------------------------------------------

  Since the exception vectors overlap with the traditional interrupt vectors of
IRQ0 to IRQ7 (0x08 to 0x0F), the IRQ interrupt vectors must be remapped to
different vector locations.
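
On a PC, the IRQ vectors are assigned by the two 8259 interrupt controllers 
(PICs), so remapping amounts to re-initializing them with new vector bases. 
The following C sketch shows the usual initialization sequence. It is an 
illustration only: outb here is a stand-in that records port writes in a log 
instead of executing the x86 out instruction, and the choice of vector bases, 
e.g. 0x20 and 0x28, is an assumption, not taken from MTX.

```c
#include <assert.h>
#include <stdint.h>

/* Standard I/O ports of the two 8259 PICs on a PC. */
#define PIC1_CMD  0x20
#define PIC1_DATA 0x21
#define PIC2_CMD  0xA0
#define PIC2_DATA 0xA1

/* Stand-in for the x86 out instruction: records writes in a log. */
struct port_write { uint16_t port; uint8_t value; };
static struct port_write port_log[16];
static int port_log_len = 0;

static void outb(uint16_t port, uint8_t value)
{
    assert(port_log_len < 16);
    port_log[port_log_len].port = port;
    port_log[port_log_len].value = value;
    port_log_len++;
}

/* Remap IRQ0-7 to vectors base1.. and IRQ8-15 to vectors base2.. */
static void pic_remap(uint8_t base1, uint8_t base2)
{
    assert((base1 & 7) == 0 && (base2 & 7) == 0);
    outb(PIC1_CMD,  0x11);   /* ICW1: start initialization, expect ICW4 */
    outb(PIC2_CMD,  0x11);
    outb(PIC1_DATA, base1);  /* ICW2: vector base of the master PIC */
    outb(PIC2_DATA, base2);  /* ICW2: vector base of the slave PIC */
    outb(PIC1_DATA, 0x04);   /* ICW3: slave attached to IRQ2 */
    outb(PIC2_DATA, 0x02);   /* ICW3: slave cascade identity 2 */
    outb(PIC1_DATA, 0x01);   /* ICW4: 8086 mode */
    outb(PIC2_DATA, 0x01);
}
```

After pic_remap(0x20, 0x28), IRQ0 would be delivered as vector 0x20, just above 
the 32 reserved exception vectors.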

  Second, the exception vectors are no longer in the low 1KB memory area as in 
real mode. Instead, they are defined as interrupt descriptors in an Interrupt 
Descriptor Table (IDT), which is pointed to by the CPU's IDTR register. The 
contents of the IDT are essentially "descriptors" but Intel chooses to call them
interrupt or trap gates. The format of interrupt and trap gates is

   63                48 47            32 31          16 15         0
   -----------------|-------------------|--------------|-------------
   |  offset        |PpL0|TYPE|000-|----|    segment   |  offset    |
   | (31-16)        |                   |    selector  |  (15-0)    |
   ------------------------------------------------------------------
                   Format of  Interrupt/Trap Gate

Where P is the present bit, pL is the privilege level, TYPE=1110 for interrupt
gates and 1111 for trap gates. The difference is that invoking an interrupt gate
automatically disables interrupts but invoking a trap gate does not. Since 
hardware interrupts and exceptions are processed in kernel mode, the privilege 
level, pL, is normally set to 00, which prevents user mode programs from 
invoking the gate by an INT n instruction. It can be set to 11 to allow user 
mode programs to invoke a gate by software generated interrupts, which is 
needed only for a few vectors, e.g. a system call gate. 
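
A gate can be assembled from its fields as in the following C sketch. The 
function make_gate and the macro names are hypothetical; the sketch always 
sets P=1 and handles only 32-bit interrupt and trap gates.

```c
#include <assert.h>
#include <stdint.h>

#define GATE_INTERRUPT 0xEu  /* TYPE=1110 */
#define GATE_TRAP      0xFu  /* TYPE=1111 */

/* Pack a 32-bit interrupt or trap gate, with P=1. */
static uint64_t make_gate(uint32_t offset, uint16_t selector,
                          uint8_t type, uint8_t pl)
{
    assert((type == GATE_INTERRUPT || type == GATE_TRAP) && pl <= 3);
    uint64_t g = 0;
    g |= (uint64_t)(offset & 0xFFFFu);    /* offset 15-0  -> bits 15-0  */
    g |= (uint64_t)selector << 16;        /* selector     -> bits 31-16 */
    g |= (uint64_t)type << 40;            /* TYPE         -> bits 43-40 */
    g |= (uint64_t)pl << 45;              /* pL           -> bits 46-45 */
    g |= (uint64_t)1 << 47;               /* P            -> bit 47     */
    g |= (uint64_t)(offset >> 16) << 48;  /* offset 31-16 -> bits 63-48 */
    return g;
}
```

For a handler at offset 0x12345678 in the segment selected by 0x08, 
make_gate(0x12345678, 0x08, GATE_INTERRUPT, 0) gives 0x12348E0000085678.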

In addition to interrupt and trap gates, the IDT may also contain task gates.
(Call gates reside in the GDT or LDT, not in the IDT.) An interrupt or 
exception vectored through a task gate triggers a task switch by hardware. The 
current Linux kernel does not use hardware task switching. Likewise, MTX also 
does not use task gates and hardware task switching for a number of reasons. 
First, task switching involves much more than just switching the hardware 
context of tasks. Second, even for hardware context only, task switching by 
task gates is not necessarily faster than software task switching. Hardware 
task switching may save a few instruction fetch cycles. With the CPU's 
instruction cache, the saving is most likely insignificant. Software task 
switching is more flexible and is under direct control of the OS designer. Last 
but not least, hardware task switching is only supported in 32-bit x86 CPUs. It 
is no longer supported in the 64-bit mode of x86 CPUs. Using hardware features 
that will soon become obsolete makes little sense. After all, the concept of 
context switching by calling an interrupt or task gate is somewhat misleading 
and confusing. Unlike a function, which can be called, calling an interrupt or 
task gate cannot cause a task switch unless the environment of the intended 
task already exists. Therefore, we shall not use hardware task switching. 
However, the x86 CPU does require a TSS descriptor in the GDT. Like the LDTR 
register, the CPU's Task Register, TR, must be loaded with a selector to a TSS 
descriptor in the GDT; the actual TSS structure is located somewhere else, e.g. 
in the process table. The role of the TSS descriptor is as follows. When the 
CPU is executing, its Task Register (TR) points at a TSS structure of the form

     /***** Contents of TSS **********/
     u32 link       // back link to previous TSS; low 2 bytes
     u32 esp0
     u32 ss0        // low 2 bytes
     u32 esp1
     u32 ss1        // low 2 bytes
     u32 esp2
     u32 ss2        // low 2 bytes
     u32 CR3
     u32 eip,eflags,eax,ecx,edx,ebx,esp,ebp,esi,edi
     u32 es,cs,ss,ds,fs,gs // all in low 2 bytes
     u32 ldt        // LDT selector; low 2 bytes
     u32 iomap      // I/O map base address; high 2 bytes
    /*********************************/

The minimum size of a TSS is 26*4=104 bytes. The fields from eip to gs are for 
saving the "hardware context" of the current process during hardware task 
switching. The most important fields are esp0 and ss0, which define the CPU's 
interrupt stack. Assume that the CPU is executing a process. Then, [ss0:esp0]
must point to the process kernel mode stack. This is because when an interrupt 
or exception occurs in user mode, the CPU automatically uses [ss0:esp0] in the 
TSS as the interrupt stack. The following helps clarify this point. 
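
The TSS layout can be written as a C structure whose size is checked at compile 
time. This is a sketch, not the MTX definition: the field name link and the 
helper tss_set_kstack are assumptions, and the kernel stack is described by 
plain 32-bit addresses so that the example does not depend on the host's 
pointer size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit TSS; 16-bit fields sit in the low 2 bytes of a 32-bit word. */
struct tss {
    uint32_t link;                 /* back link to previous TSS */
    uint32_t esp0, ss0;            /* kernel (level 0) stack */
    uint32_t esp1, ss1;
    uint32_t esp2, ss2;
    uint32_t cr3;
    uint32_t eip, eflags;
    uint32_t eax, ecx, edx, ebx, esp, ebp, esi, edi;
    uint32_t es, cs, ss, ds, fs, gs;
    uint32_t ldt;
    uint32_t iomap;
};

_Static_assert(sizeof(struct tss) == 26 * 4, "TSS must be 104 bytes");
_Static_assert(offsetof(struct tss, esp0) == 4, "esp0 follows the link");

/* Point [ss0:esp0] at the high end of a process kernel stack. */
static void tss_set_kstack(struct tss *t, uint16_t kernel_ss,
                           uint32_t kstack_base, uint32_t kstack_bytes)
{
    assert(t != NULL && (kstack_bytes & 3) == 0);
    t->ss0 = kernel_ss;
    t->esp0 = kstack_base + kstack_bytes;  /* stack grows downward */
}
```

In a kernel that switches processes by software, a call such as 
tss_set_kstack would typically be made whenever a new process is scheduled, so 
that the next interrupt from user mode lands on that process's kernel stack.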

    Assume that the CPU is executing a process in user mode. At this moment, 
the process kernel mode stack is empty. In the CPU's TSS (which is pointed to by 
the TR register), [ss0:esp0] points to the (high end of) the process kernel 
stack, as shown in the following diagram.

 CPU.TR -->   TSS          hi                                 low
            ---------      -------------------------------------- 
            [ss0:esp0] --->| (empty kernel mode stack)
                           --------------------------------------
                                 PROC.kstack[ ]
 
When an interrupt occurs, the CPU saves uSS,uesp,uflags,uCS,ueip into the 
interrupt stack, whose contents become

 CPU.TR -->   TSS          hi                   esp              low
            ---------      -------------------------------------- 
            [ss0:esp0] --->|uSS|uesp|uflag|uCS|ueip|
                           --------------------------------------
                                 PROC.kstack[ ]

where the prefix u denotes user mode registers. If an exception occurs, the 
situation is exactly the same, except that for some exceptions, the CPU also 
pushes an error number, err#, onto the interrupt stack, which becomes 

 CPU.TR -->   TSS          hi                   esp              low
            ---------      -------------------------------------- 
            [ss0:esp0] --->|uSS|uesp|uflag|uCS|ueip|err#|
                           --------------------------------------
                                 PROC.kstack[ ]

Upon entry to the interrupt/trap handler routine, the process kernel stack 
contents are as shown in the above diagram. While in kernel mode, if another 
interrupt or trap occurs, the CPU continues to use the same interrupt stack to 
push on one more layer of interrupted context. However, if the CPU is already in
kernel mode, re-entering kernel mode does not involve a privilege change. So the 
saved context only has |kflags|kCS|keip|, as shown in the next diagram.

 CPU.TR -->   TSS          hi                   esp              low
            ---------      ---------------------------------------------------- 
            [ss0:esp0] --->|uSS|uesp|uflag|uCS|ueip|err#|......|kflags|kCS|keip
                           ----------------------------------------------------
                                 PROC.kstack[ ]

When returning from an interrupt/trap handler routine, the iret operation checks
whether changing from the current code segment to the next involves a change of
privilege. If no change in privilege, i.e. from kernel back to kernel, it only 
pops the saved kernel mode context. If the privilege changes, i.e. from kernel 
back to user mode, it pops the saved user mode context, which includes the saved
uesp and uSS.
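
The saved context and the privilege check made on return can be pictured with 
the following C sketch. It assumes the two-level scheme used above (00 for 
kernel, 11 for user), a handler running in kernel mode, and that any err# has 
already been removed from the stack. The names int_frame, frame_from_user and 
iret_pop_words are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Saved context as laid out in memory, lowest address (esp) first. */
struct int_frame {
    uint32_t eip;
    uint32_t cs;      /* low 2 bytes */
    uint32_t eflags;
    /* The next two are present only if the privilege level changed. */
    uint32_t esp;
    uint32_t ss;      /* low 2 bytes */
};

/* True if the interrupted code ran in user mode (privilege level 11). */
static bool frame_from_user(const struct int_frame *f)
{
    assert(f != NULL);
    return (f->cs & 3) == 3;
}

/* Number of 32-bit words that iret pops for this frame. */
static int iret_pop_words(const struct int_frame *f)
{
    return frame_from_user(f) ? 5 : 3;
}
```

A frame whose saved cs is a user mode selector, e.g. 0x1B, is popped as five 
words, while a frame saved in kernel mode, e.g. with cs 0x08, is popped as 
three.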
 
    Summarizing, for protected mode operation, we must remap the IRQ vectors, 
install interrupt/trap handlers as interrupt/trap gates in the IDT, and have a 
TSS, which defines the CPU's interrupt stack. As in real mode, knowing the 
kernel mode stack contents is key to understanding how to set up the execution 
environment of a process.