MTX in 32-bit Protected Mode
1. Introduction to 32-bit Protected Mode
When the Intel x86 CPU starts, it is in 16-bit real mode. While in real mode,
the CPU uses segment registers and 16-bit offsets to generate memory addresses.
It uses the interrupt vectors in the low 1KB memory to process interrupts and a
few exceptions. The CPU can be switched to 32-bit protected mode by setting bit
0 of CPU's control register CR0 to 1. Once in protected mode, the CPU's
operating environment changes completely. First, the meaning of the segment
registers changes. Instead of holding a segment base address, each segment
register becomes a segment selector, which selects a segment descriptor in
either a Global Descriptor Table (GDT) or a Local Descriptor Table (LDT). Each
segment descriptor specifies the segment's base address and size limit. A
logical address consists of a segment selector and a 32-bit offset in the form
[segment:offset]. The linear address is the segment base plus the offset.
Second, exception and interrupt vectors are no longer in the low 1KB memory as
they are in real mode. Instead, they are represented by Interrupt Descriptors
in an Interrupt Descriptor Table (IDT). The CPU automatically uses the GDT or
LDT for address translation, and uses the IDT for exception and interrupt
processing.
These tables must be set up properly prior to switching the CPU to protected
mode. The following is a brief description of protected mode operations.
1.1 Memory Management
Memory management includes both address translation and protection. The x86
processor in protected mode can use either segmentation or paging for memory
management.
1.1.1. Segmentation
An x86 processor in protected mode has 6 segment registers, denoted by cs,ds,
ss,es,fs and gs. Among these, only cs,ds and ss are needed by a program. Each
segment register is 16 bits but its content is no longer a base address as in
real mode. Instead, it specifies the index of a segment descriptor in a
descriptor table. The segment descriptor contains the base address and size
limit of the intended segment. The linear address, which is also the physical
address when paging is not used, is the segment's base address plus the offset,
both 32 bits. The format of a segment register is
  15              3  2  1 0
 ---------------------------
 |      index       |T |RPL|
 |    (13 bits)     |  |   |
 ---------------------------
where index = 13-bit index into a descriptor table, T=0 means the Global
Descriptor Table (GDT), T=1 means the Local Descriptor Table (LDT), and RPL is
the requested privilege level used for protection. The 2-bit privilege level
varies from 00, which is the highest, to 11, which is the lowest. In general, a
program executes at the privilege level of its code segment selector. It can
access segments at the same or lower privilege level, but not those at a higher
level. The four privilege levels form protection rings, which can be used to
implement secure operating systems with layers of protection. Most Unix-like
systems, e.g. Linux and MTX, use only two levels, kernel mode and user mode, for
protection. In order to simplify the discussion, we shall assume two privilege
levels, RPL=00 for kernel mode and 11 for user mode. When a process executes in
kernel mode at privilege level 00, it can access any segment and execute any
instruction. When a process executes in user mode at privilege level 11, it
cannot access any segment whose privilege level is 00. This prevents a user
mode program from executing kernel code or accessing kernel data segments. As
usual, it can enter kernel mode only through interrupts, exceptions or explicit
syscalls.
In protected mode, a logical address consists of two parts: a 16-bit segment
selector and a 32-bit offset which specifies the relative address within the
segment. Given a logical address LA=[segment:offset], the CPU uses the segment
selector's T bit to access either the GDT or the LDT. If T=0, it uses the GDT,
which is pointed to by the CPU's GDTR register. If T=1, it uses the LDT, which
is pointed to by the CPU's LDTR register. A system typically has only one GDT,
which specifies the kernel code and data segments that are common to all
processes in kernel mode. Each process may have its own LDT, which specifies
the user mode address space of that process.
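The selector layout described above can be sketched in C. This is only an
illustration: the helper names (make_selector, sel_index and so on) are
hypothetical and are not part of MTX.

```c
#include <assert.h>
#include <stdint.h>

/* Layout of a 16-bit segment selector: | index (13) | T (1) | RPL (2) | */
#define SEL_GDT 0u   /* T=0: descriptor is in the GDT */
#define SEL_LDT 1u   /* T=1: descriptor is in the LDT */

/* Build a selector from its three fields (hypothetical helper). */
static uint16_t make_selector(unsigned index, unsigned t, unsigned rpl)
{
    assert(index < 8192 && t < 2 && rpl < 4);
    return (uint16_t)((index << 3) | (t << 2) | rpl);
}

/* Extract the individual fields again. */
static unsigned sel_index(uint16_t sel) { return sel >> 3; }
static unsigned sel_table(uint16_t sel) { return (sel >> 2) & 1u; }
static unsigned sel_rpl(uint16_t sel)   { return sel & 3u; }

/* Byte offset of the selected 8-byte descriptor within its table. */
static unsigned sel_byte_offset(uint16_t sel) { return sel_index(sel) * 8u; }
```

For example, a kernel selector for GDT entry 1 is 0x08, and a user mode
selector for LDT entry 1 is 0x0F.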
1.1.2. Segment Descriptors
Each segment is described by an 8-byte Segment Descriptor. Segment Descriptors
are in either the GDT or the LDT. The address and size of the GDT are contained
in the CPU's GDTR register. Similarly, the address and size of the current LDT
are contained in the CPU's LDTR register. The format of a Code or Data segment
descriptor is
 63     56 55  52 51  48 47  44 43  40 39             16 15            0
-------------------------------------------------------------------------
|  base   | GD0A |  Lm  | PpLS | type |      base       |     limit     |
| (31-24) |      | (4)  |      |      |     (23-0)      |    (15-0)     |
-------------------------------------------------------------------------
                     Format of Code/Data Descriptor
base  = 32-bit base address in bytes
limit = 20-bit segment size limit (Lm = high 4 bits of limit)
G = Granularity: 0 in bytes, 1 in 4KB blocks
D = 1 for 32-bit operands
A = available for use by system software
P = 1 = segment present
pL (2 bits) = Privilege level (00=kernel, 11=user)
S = 1 for a code or data segment, 0 for a system segment
type (4 bits): code or data segment and R/W access
In addition to Code and Data segment descriptors, the GDT may also contain
Task State Segment Descriptors (TSSD). A TSSD refers to a Task State Segment
(TSS), which is a data structure used by the CPU to locate the kernel mode
stack during interrupts or exceptions. It is also used to save the processor
registers in hardware task switching.
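As an illustration of the descriptor format, the following C sketch packs the
fields into a 64-bit value and unpacks the base and limit again. The function
and macro names are hypothetical, and the type values shown (execute/read code
and read/write data) are the ones commonly used for kernel segments.

```c
#include <assert.h>
#include <stdint.h>

/* Access byte (bits 47-40): P | pL (2 bits) | S | type (4 bits) */
#define ACC_PRESENT   0x80u
#define ACC_PL(pl)    (((pl) & 3u) << 5)
#define ACC_CODEDATA  0x10u   /* S=1: code or data segment */
#define TYPE_CODE_RX  0x0Au   /* execute/read code segment */
#define TYPE_DATA_RW  0x02u   /* read/write data segment */

/* Flag nibble (bits 55-52): G | D | 0 | A */
#define FLAG_G  0x8u
#define FLAG_D  0x4u

/* Pack base, 20-bit limit, access byte and flag nibble into a descriptor. */
static uint64_t make_descriptor(uint32_t base, uint32_t limit,
                                unsigned access, unsigned flags)
{
    assert(limit <= 0xFFFFFu && access <= 0xFFu && flags <= 0xFu);
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFFu);              /* limit 15-0  -> bits 15-0  */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;       /* base 23-0   -> bits 39-16 */
    d |= (uint64_t)access << 40;                   /* P,pL,S,type -> bits 47-40 */
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48;   /* limit 19-16 -> bits 51-48 */
    d |= (uint64_t)flags << 52;                    /* G,D,0,A     -> bits 55-52 */
    d |= (uint64_t)(base >> 24) << 56;             /* base 31-24  -> bits 63-56 */
    return d;
}

/* Reassemble the 32-bit base from its two pieces. */
static uint32_t desc_base(uint64_t d)
{
    return (uint32_t)((d >> 16) & 0xFFFFFFu) | ((uint32_t)(d >> 56) << 24);
}

/* Reassemble the 20-bit limit from its two pieces. */
static uint32_t desc_limit(uint64_t d)
{
    return (uint32_t)(d & 0xFFFFu) | ((uint32_t)((d >> 48) & 0xFu) << 16);
}
```

With base=0, limit=0xFFFFF, G=1 and D=1, a kernel code descriptor packs to
0x00CF9A000000FFFF and a kernel data descriptor to 0x00CF92000000FFFF.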
1.2. Segmentation Models
(1). Flat model: In the flat segmentation model, at least 2 segment descriptors
are needed; one for the code segment and one for the data segment. The segments
are mapped to the entire 4GB address space by setting base=0, G=1 (for 4KB
units) and limit=1M (the actual limit value is one less, i.e. 0xFFFFF). In this
case, the mapping of logical address to physical address is one-to-one, i.e. a
32-bit virtual address is an offset relative to 0, which is also the physical
address.
(2). Protected flat model: Segments are mapped to existing memory only by
setting the limit to the available physical memory. Attempts to access memory
outside the limit will generate a protection error.
(3). Multi-segment model:
In this model, each process has its own segments. The CPU has 6 segment
registers: cs,ds,ss,es,fs,gs, each of which may be used to select a segment.
For MTX, only cs,ds and ss are needed. Since ds and ss select the same segment,
only two segments, one for cs and one for ds, are needed.
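The limit arithmetic behind the flat and protected flat models can be checked
with a short C sketch. With G=1 the 20-bit limit field counts 4KB units, so a
limit field of 0xFFFFF spans 4GB. The function names are hypothetical, and the
memory size is assumed to be a multiple of 4KB.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_UNIT 4096u   /* granularity unit when G=1 */

/* Number of bytes covered by a 20-bit limit field with granularity bit g. */
static uint64_t segment_span(uint32_t limit, int g)
{
    assert(limit <= 0xFFFFFu);
    return ((uint64_t)limit + 1) * (g ? PAGE_UNIT : 1u);
}

/* Limit field (with G=1) for a protected flat segment covering mem_bytes.
   mem_bytes is assumed to be a nonzero multiple of 4KB, at most 4GB. */
static uint32_t flat_limit_for(uint64_t mem_bytes)
{
    assert(mem_bytes > 0 && mem_bytes % PAGE_UNIT == 0);
    assert(mem_bytes <= (1ULL << 32));
    return (uint32_t)(mem_bytes / PAGE_UNIT - 1);
}
```

For a machine with 64MB of memory, flat_limit_for(64MB) gives a limit field of
0x3FFF, so any offset at or above 64MB falls outside the segment.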
When using segmentation, the CPU's memory management functions are summarized
in the following diagram, which is explained below by the labels (1) to (5).
-------------------------------------------------------------------------
        (1)                                              (2)
      CPU.GDTR                                         CPU.LDTR
 [GDT_limit|GDT_address]                       [LDT_limit|LDT_address]
        |                        (3)                      |
        |                 segment_selector                |
       GDT                ----------------               LDT
 ---------------    VA = | index |T|RPL |:offset  -----------------
 | descriptor  |         ----|-----------         | user_segment  |
 ---------------             |                    -----------------
 |             |             |                    | user_segment  |
 | descriptor  |<-- (T=0)----|----(T=1) ----------|> descriptor   |
 -------|-------                                  -------|---------
        |                     (4)                        |
   base_address                                     base_address
-------------------------------------------------------------------------
 (5). PA = base_address + offset
      protection:
        (within descriptor limit)
        (RPL: privilege permitting)
-------------------------------------------------------------------------
(1). CPU's GDTR register is loaded with a GDT descriptor=[u16 GDT_limit| u32
GDT_address], which points to the GDT. The GDT contains 8-byte global
segment descriptors. Each descriptor has a 32-bit base_address, a size
limit and a 2-bit privilege level.
(2). Like the GDT, the LDT also contains 8-byte local segment descriptors. Each
     descriptor has a 32-bit base_address, a size limit and a 2-bit privilege
     level. However, unlike the GDTR, which can be loaded with a GDT descriptor
     anywhere in memory, the LDTR must be loaded with an LDT selector in the
     GDT. Thus, in order to use any LDT, we must place the LDT's descriptor in
     the GDT and load the LDTR with its selector in the GDT. To beginners, this
     may be quite confusing.
(3). A virtual address VA = [16-bit segment_selector:32-bit offset]. The
     13-bit index is used to access a segment descriptor in either the GDT
     (T=0) or the LDT (T=1).
(4). The selected descriptor supplies the segment's base_address, size limit
     and privilege level. The access is permitted only if the privilege level
     of the currently executing code segment and the selector's RPL are both
     numerically <= the descriptor's privilege level.
(5). If so, the linear or physical address PA = segment base_address + offset,
     where the offset must be within the segment limit.
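The translation steps can be summarized by a small C model. It is a sketch
only: struct seg is a simplified, unpacked stand-in for a real 8-byte
descriptor, the names are hypothetical, and the privilege rule shown is the
one for data segments (the larger of the current privilege level and the
selector's RPL must not exceed the descriptor's privilege level).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, unpacked view of a descriptor (hypothetical, for illustration). */
struct seg {
    uint32_t base;    /* segment base address */
    uint32_t limit;   /* highest valid offset, in bytes */
    unsigned pl;      /* descriptor privilege level: 0=kernel, 3=user */
};

enum { XLATE_OK = 0, XLATE_BAD_INDEX, XLATE_PRIVILEGE, XLATE_LIMIT };

/* Translate [selector:offset] to a linear address, stored in *pa.
   cpl is the privilege level of the currently executing code segment. */
static int seg_translate(const struct seg *gdt, unsigned gdt_n,
                         const struct seg *ldt, unsigned ldt_n,
                         unsigned cpl, uint16_t selector, uint32_t offset,
                         uint32_t *pa)
{
    assert(pa != NULL && cpl < 4);
    unsigned index = selector >> 3;          /* 13-bit table index */
    unsigned t     = (selector >> 2) & 1u;   /* 0 = GDT, 1 = LDT */
    unsigned rpl   = selector & 3u;

    const struct seg *table = t ? ldt : gdt;
    unsigned n = t ? ldt_n : gdt_n;
    if (index >= n)
        return XLATE_BAD_INDEX;

    const struct seg *d = &table[index];
    unsigned effective = cpl > rpl ? cpl : rpl;   /* weaker of the two levels */
    if (effective > d->pl)
        return XLATE_PRIVILEGE;
    if (offset > d->limit)
        return XLATE_LIMIT;

    *pa = d->base + offset;
    return XLATE_OK;
}
```

A user mode access through an LDT selector succeeds within the segment limit,
while the same process is refused access to a kernel segment in the GDT.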
1.3. Paging:
In protected mode, memory can also be managed by paging. When using paging,
we first create a flat segment of 4GB. Then, enable paging by setting bit 31 of
the control register CR0 to 1. With paging, a 32-bit linear address is treated
by the CPU's Memory Management Unit (MMU) as a triple, [directory|table|offset],
as in
 31       22 21       12 11         0
--------------------------------------
| directory |   table   |   offset   |
--------------------------------------
      Format of a Linear address
This is a two-level paging scheme, in which directory refers to an entry in the
level-1 page table, table refers to an entry in a level-2 page table, and offset
is the relative address in the page. Each page is 4KB in size. Given a 32-bit
linear address
     | 10 bits | 10 bits | 12 bits |
LA = |directory|  table  | offset  |
The CPU first uses the control register CR3 to locate the directory page table.
Then, it uses the 10-bit directory to access the entry in the directory page
table. Each page table entry has the format
 31                        12 11  9 8               0
------------------------------------------------------
| page frame address (31-12) |AVAIL|0 0 D A P P U R P|
|                            |     |        C W / /  |
|                            |     |        D T S W  |
------------------------------------------------------
P = 1 if page present; 0 if not
R/W = 0 for read-only; 1 for read and write
U/S = user or system flag for protection
PWT = page write-through
PCD = page cache disable
A = accessed
D = dirty or modified
AVAIL = available for systems programmer use
Assume that the directory page table entry is present and the access checking
is OK. Then, it uses the 20-bit page frame address to locate the level-2 page
table. Then, it uses the 10-bit table to locate the entry in the level-2 page
table. Assuming that the page entry is present and access checking is OK also,
the level-2 page table entry contains the page frame address in memory. The
final physical address is
(physical page frame address << 12) + offset
The translation procedure is depicted in the diagram below.
     ----------------------------------------------
     |  directory  |      table      |   offset   |
     -|------------------|------------------|------      PageFrame
      |                  |                  |          -------------
      |   level-1        |    level-2       |          |           |
      |   PageTable      |    PageTable     |--------->|-> operand |
      |  -----------     |   ------------              -------------
      |  |         |     |   |          |                    |
      |->|PGTentry |--|  |-->| PGTentry |---------------------
         |         |  |      |          |
         -----------  |      ------------
         |            |      |
   CR3 ---            |-------
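The translation procedure can also be written as a short C simulation. This is
a sketch under stated assumptions: phys_mem is a small array standing in for
physical memory, and read_phys32, write_phys32 and page_walk are hypothetical
names. Masking an entry with 0xFFFFF000 yields the same value as shifting the
20-bit page frame address left by 12.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_P       0x001u         /* present bit */
#define FRAME_MASK  0xFFFFF000u    /* page frame address, bits 31-12 */

/* Split a 32-bit linear address into its three fields. */
static unsigned la_dir(uint32_t la)    { return la >> 22; }
static unsigned la_table(uint32_t la)  { return (la >> 12) & 0x3FFu; }
static unsigned la_offset(uint32_t la) { return la & 0xFFFu; }

/* Simulated physical memory holding the page tables (256KB, hypothetical). */
#define MEM_WORDS (64u * 1024u)
static uint32_t phys_mem[MEM_WORDS];

static uint32_t read_phys32(uint32_t pa)
{
    assert(pa % 4 == 0 && pa / 4 < MEM_WORDS);
    return phys_mem[pa / 4];
}

static void write_phys32(uint32_t pa, uint32_t value)
{
    assert(pa % 4 == 0 && pa / 4 < MEM_WORDS);
    phys_mem[pa / 4] = value;
}

/* Walk the two-level page tables rooted at cr3. Returns 0 and stores the
   physical address in *pa, or -1 if either entry is not present. */
static int page_walk(uint32_t cr3, uint32_t la, uint32_t *pa)
{
    uint32_t pde = read_phys32((cr3 & FRAME_MASK) + 4u * la_dir(la));
    if (!(pde & PTE_P))
        return -1;                       /* level-1 entry not present */
    uint32_t pte = read_phys32((pde & FRAME_MASK) + 4u * la_table(la));
    if (!(pte & PTE_P))
        return -1;                       /* level-2 entry not present */
    *pa = (pte & FRAME_MASK) + la_offset(la);
    return 0;
}
```

For the linear address 0x00403025, directory=1, table=3 and offset=0x025. If
the level-2 entry holds page frame address 0x00ABC, the physical address is
0x00ABC025.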
Since paging relies on a flat segmentation model, protection by checking the
segment limit no longer makes sense. With paging, protection is enforced by the
individual page table entries. A page is either present or not present. An
attempt to access a non-present page generates a page fault. In addition, a
page table entry can be marked as either read-only or writable. The accessed
(A) and dirty (D) bits are used to implement page replacement in demand paging.
Translation Lookaside Buffer (TLB)
In order to speed up the paging translation process, the CPU stores the most
recently used page table entries in an internal cache, called the TLB. Most
paging is performed by using the contents of the TLB. Bus cycles are performed
only when a new page is used. Whenever the page tables are changed, the OS
kernel must flush the TLB to dispose of its page table entries in order to
prevent it from using old page entries in the TLB. Flushing the TLB can be done
by reloading the CR3 control register. Individual entries in the TLB can also be
flushed by the INVLPG instruction.
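The effect of the two flush operations can be illustrated with a toy software
model of a TLB. This is not how the hardware is programmed: on a real CPU the
flushes are done by privileged instructions (reloading CR3, or INVLPG for one
page). The direct-mapped organization, the size of 8 entries and all names
are assumptions made for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_SIZE 8u   /* hypothetical size; real TLBs differ */

struct tlb_entry {
    int      valid;
    uint32_t vpn;     /* virtual page number (linear address >> 12) */
    uint32_t pfn;     /* cached page frame number */
};
static struct tlb_entry tlb[TLB_SIZE];

/* Look up a virtual page number; return 1 and set *pfn on a hit. */
static int tlb_lookup(uint32_t vpn, uint32_t *pfn)
{
    assert(pfn != NULL);
    struct tlb_entry *e = &tlb[vpn % TLB_SIZE];
    if (e->valid && e->vpn == vpn) {
        *pfn = e->pfn;
        return 1;
    }
    return 0;   /* miss: the page tables in memory must be walked */
}

/* Cache a translation after a page table walk. */
static void tlb_insert(uint32_t vpn, uint32_t pfn)
{
    struct tlb_entry *e = &tlb[vpn % TLB_SIZE];
    e->valid = 1;
    e->vpn = vpn;
    e->pfn = pfn;
}

/* Models the effect of reloading CR3: every entry is discarded. */
static void tlb_flush_all(void)
{
    for (unsigned i = 0; i < TLB_SIZE; i++)
        tlb[i].valid = 0;
}

/* Models the effect of INVLPG: only the entry for one page is discarded. */
static void tlb_invalidate(uint32_t vpn)
{
    struct tlb_entry *e = &tlb[vpn % TLB_SIZE];
    if (e->valid && e->vpn == vpn)
        e->valid = 0;
}
```

After the kernel changes the page table entry of a single page, invalidating
only that page keeps the other cached translations usable.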
1.4. Static Paging:
The simplest paging is static paging. In this scheme, all the pages of a
process image are allocated physical page frames at once, and the page tables
are set up accordingly. Once an image is loaded into memory, its pages are
always present.
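A minimal C sketch of static paging follows. It assumes, for simplicity, that
the image fits in one level-2 page table (at most 4MB) and that it is loaded
into consecutive page frames starting at first_frame; neither assumption is
required by the scheme itself. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define PG_P   0x001u   /* present */
#define PG_RW  0x002u   /* writable */
#define PG_US  0x004u   /* user accessible */

/* Fill a level-2 page table so that an image of image_bytes is mapped to
   consecutive page frames starting at first_frame. All pages are marked
   present at once. Returns the number of pages mapped. */
static unsigned map_image_static(uint32_t pt[1024], uint32_t first_frame,
                                 uint32_t image_bytes)
{
    assert(image_bytes <= 1024u * PAGE_SIZE);
    unsigned npages = (image_bytes + PAGE_SIZE - 1) / PAGE_SIZE;  /* round up */
    assert(first_frame <= 0x100000u - npages);  /* frame numbers are 20 bits */

    for (unsigned i = 0; i < 1024; i++)
        pt[i] = 0;                               /* not present */
    for (unsigned i = 0; i < npages; i++)
        pt[i] = ((first_frame + i) << 12) | PG_P | PG_RW | PG_US;
    return npages;
}
```

An image of 10000 bytes needs 3 pages, so entries 0 to 2 are present and all
other entries remain 0.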
1.5. Demand Paging:
In demand paging, the page tables of a process image are built according
to the image size, but not all the pages are allocated page frames. Those pages
which do not have page frames are marked not present. The page frame address in
an absent page table entry may point to its location in a physical device, e.g.
a block number in a file system or a block number in a swap disk containing the
page image. During execution, when a process attempts to reference a page that
is not present, it generates a page fault, which traps to the OS kernel. The OS
kernel can allocate a physical page frame for the page, load the page into the
page frame, and change the page table entry to present and to point to the page
frame. Then it lets the process continue with the valid page table entry. Demand
paging supports virtual memory in which the virtual address space of a process
image can be much larger than the physical memory allocated to it. However,
this scheme depends on a good page replacement policy in order to reduce the
number of page swaps.
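The page fault path of demand paging can be sketched as a small C simulation.
Everything here is hypothetical: the image has 8 pages, only 4 page frames are
free, frames are handed out in order, and loading the page contents from the
disk is omitted. When no frame is free, a real kernel would choose a victim
page to replace; the sketch simply reports failure.

```c
#include <assert.h>
#include <stdint.h>

#define PG_P     0x001u
#define NPAGES   8u     /* pages in the process image (hypothetical) */
#define NFRAMES  4u     /* free physical page frames (hypothetical) */

static uint32_t page_table[NPAGES];   /* all entries start as not present */
static unsigned next_free_frame;      /* trivial page frame allocator */
static unsigned page_faults;          /* number of faults taken so far */

/* Page fault handler: allocate a frame for the page and mark it present.
   Returns 0 on success, or -1 if no page frame is free. */
static int handle_page_fault(unsigned page)
{
    assert(page < NPAGES);
    page_faults++;
    if (next_free_frame >= NFRAMES)
        return -1;
    page_table[page] = ((uint32_t)next_free_frame++ << 12) | PG_P;
    return 0;
}

/* Reference a page: fault it in on first use, then return its page frame
   number, or -1 if the fault could not be resolved. */
static int touch_page(unsigned page)
{
    assert(page < NPAGES);
    if (!(page_table[page] & PG_P) && handle_page_fault(page) != 0)
        return -1;
    return (int)(page_table[page] >> 12);
}
```

The first reference to a page causes a fault; later references to the same
page use the now valid entry without faulting.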
1.6. Interrupt and Exception Processing in Protected Mode
Interrupt and exception processing in protected mode differs from that of real
mode in two areas. First, in protected mode the first 32 interrupt vectors, 0x00
to 0x1F, are reserved for exceptions, which are
 Exception  | Description
------------|-------------------------------------------------
 0x00       | Divide error
 0x01       | Single-step/debug exception
 0x02       | Nonmaskable interrupt
 0x03       | Breakpoint by INT 3 instruction
 0x04       | Overflow
 0x05       | Bounds check
 0x06       | Invalid opcode
 0x07       | Coprocessor not available
 0x08       | Double fault
 0x09       | Coprocessor segment overrun
 0x0A       | Invalid TSS
 0x0B       | Segment not present
 0x0C       | Stack exception
 0x0D       | General protection violation
 0x0E       | Page fault
 0x0F       | (Reserved)
 0x10       | Coprocessor error
 0x11-0x1F  | (Reserved)
--------------------------------------------------------------
Since the exception vectors overlap with the traditional interrupt vectors of
IRQ0 to IRQ7 (0x08 to 0x0F), the IRQ interrupt vectors must be remapped to
different vector locations.
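A sketch of the remapping is shown below, assuming the standard PC
configuration of two 8259 interrupt controllers and the commonly chosen new
vector bases 0x20 and 0x28 (other bases above 0x1F would also work). The
routine out_byte is a hypothetical stand-in for the kernel's port output
function; here it only records the writes so that the sequence can be
inspected.

```c
#include <assert.h>
#include <stdint.h>

#define PIC1_CMD   0x20   /* master 8259: command port */
#define PIC1_DATA  0x21   /* master 8259: data port */
#define PIC2_CMD   0xA0   /* slave 8259: command port */
#define PIC2_DATA  0xA1   /* slave 8259: data port */

/* Stand-in for port output: record each write instead of doing I/O. */
struct port_write { uint16_t port; uint8_t value; };
static struct port_write port_log[16];
static unsigned port_log_n;

static void out_byte(uint16_t port, uint8_t value)
{
    assert(port_log_n < 16);
    port_log[port_log_n].port = port;
    port_log[port_log_n].value = value;
    port_log_n++;
}

/* Reprogram both controllers so that IRQ0-7 use vectors base1..base1+7 and
   IRQ8-15 use vectors base2..base2+7. */
static void remap_irq(uint8_t base1, uint8_t base2)
{
    assert(base1 % 8 == 0 && base2 % 8 == 0);
    out_byte(PIC1_CMD, 0x11);   out_byte(PIC2_CMD, 0x11);    /* ICW1: initialize */
    out_byte(PIC1_DATA, base1); out_byte(PIC2_DATA, base2);  /* ICW2: vector base */
    out_byte(PIC1_DATA, 0x04);  out_byte(PIC2_DATA, 0x02);   /* ICW3: slave on IRQ2 */
    out_byte(PIC1_DATA, 0x01);  out_byte(PIC2_DATA, 0x01);   /* ICW4: 8086 mode */
}

/* Vector number delivered for an IRQ after remapping. */
static unsigned irq_to_vector(unsigned irq, unsigned base1, unsigned base2)
{
    assert(irq < 16);
    return irq < 8 ? base1 + irq : base2 + (irq - 8);
}
```

With bases 0x20 and 0x28, the timer (IRQ0) arrives at vector 0x20, which no
longer collides with the double fault exception at 0x08.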
Second, the exception vectors are no longer in the low 1KB memory area as in
real mode. Instead, they are defined as interrupt descriptors in an Interrupt
Descriptor Table (IDT), which is pointed to by the CPU's IDTR register. The
contents of the IDT are essentially "descriptors" but Intel chooses to call them
interrupt or trap gates. The format of interrupt and trap gates is
 63             48 47               32 31             16 15              0
---------------------------------------------------------------------------
|     offset      |PpL0|TYPE|000-|----|     segment     |     offset      |
|     (31-16)     |                   |    selector     |     (15-0)      |
---------------------------------------------------------------------------
                      Format of Interrupt/Trap Gate
where P is the present bit, pL is the privilege level, TYPE=1110 for interrupt
gates and 1111 for trap gates. The difference is that invoking an interrupt gate
automatically disables interrupts but invoking a trap gate does not. The
privilege level, pL, is checked only when a gate is invoked by a software INT n
instruction. For hardware interrupts and exceptions, which are processed in
kernel mode, pL should be set to 00, which also prevents user mode programs
from invoking them by INT n. It is set to 11 only for gates that user mode
programs are allowed to invoke by INT n, e.g. a system call gate.
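The gate format can be illustrated by a C function that packs a handler
address, a code segment selector, the gate type and the privilege level into
an 8-byte gate with P=1. The names make_gate, GATE_INTERRUPT and GATE_TRAP are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GATE_INTERRUPT 0xEu   /* TYPE=1110: interrupts disabled on entry */
#define GATE_TRAP      0xFu   /* TYPE=1111: interrupts left unchanged */

/* Build an interrupt or trap gate for the handler at the given offset in
   the code segment named by selector. */
static uint64_t make_gate(uint32_t handler, uint16_t selector,
                          unsigned type, unsigned pl)
{
    assert((type == GATE_INTERRUPT || type == GATE_TRAP) && pl < 4);
    uint64_t g = 0;
    g |= (uint64_t)(handler & 0xFFFFu);       /* offset 15-0  -> bits 15-0  */
    g |= (uint64_t)selector << 16;            /* selector     -> bits 31-16 */
    g |= (uint64_t)type << 40;                /* TYPE         -> bits 43-40 */
    g |= (uint64_t)pl << 45;                  /* pL           -> bits 46-45 */
    g |= (uint64_t)1 << 47;                   /* P=1          -> bit 47     */
    g |= (uint64_t)(handler >> 16) << 48;     /* offset 31-16 -> bits 63-48 */
    return g;
}
```

For a handler at 0x00123456 in the kernel code segment 0x08, an interrupt gate
with pL=00 packs to 0x00128E0000083456.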
In addition to interrupt and trap gates, the IDT may also contain task gates.
An interrupt or exception through a task gate triggers a task switch by
hardware. The current Linux kernel does not use hardware task switching.
Likewise, MTX also does not use TSS gates and hardware task switching for a
number of reasons. First, task switching involves much more than just switching
the hardware context of tasks. Second, even for hardware context only, task
switching by TSS gates is not necessarily faster than software task switching.
Hardware task switching may save a few instruction fetch cycles. With the CPU's
instruction cache, the saving is most likely rather insignificant. Software task
switching is more flexible and is under direct control of the OS designer. Last
but not least, hardware task switching is only supported in 32-bit x86 CPUs. It
is no longer supported in the 64-bit mode of x86 CPUs. Using hardware features
that will soon become obsolete makes little sense. After all, the concept of
context switching by calling an interrupt or task gate is somewhat misleading
and confusing. Unlike a function, which can be called, calling an interrupt or
task gate cannot cause a task switch unless the environment of the intended
task already exists. Therefore, we shall not use hardware task switching. However,
the x86 CPU does require a TSS descriptor in the GDT. Like the LDTR register,
the CPU's Task Register, TR, must be loaded with a selector to a TSS descriptor
in the GDT; the actual TSS structure is located somewhere else, e.g. in the
process table. The role of the TSS descriptor is as follows. When the CPU is
executing, its Task Register (TR) points at a TSS structure of the form
/***** Contents of TSS **********/
struct tss {
  u32 link;      // back link to previous TSS, low 2 bytes
  u32 esp0;
  u32 ss0;       // low 2 bytes
  u32 esp1;
  u32 ss1;       // low 2 bytes
  u32 esp2;
  u32 ss2;       // low 2 bytes
  u32 cr3;
  u32 eip, eflags, eax, ecx, edx, ebx, esp, ebp, esi, edi;
  u32 es, cs, ss, ds, fs, gs;   // all in low 2 bytes
  u32 ldt;       // low 2 bytes
  u32 iomap;     // I/O map base address, high 2 bytes
};
/*********************************/
The minimum size of a TSS is 26*4=104 bytes. The fields from eip to gs are for
saving the "hardware context" of the current process during hardware task
switching. The most important fields are esp0 and ss0, which define the CPU's
interrupt stack. Assume that the CPU is executing a process. Then [ss0:esp0]
must point to the process kernel mode stack. This is because when an interrupt
or exception occurs, the CPU automatically uses [ss0:esp0] in the TSS as the
interrupt stack. The following helps clarify this point.
Assume that the CPU is executing a process in user mode. At this moment,
the process kernel mode stack is empty. In the CPU's TSS (which is pointed to by
the TR register), [ss0:esp0] points to the (high end of) the process kernel
stack, as shown in the following diagram.
   CPU.TR --> TSS         hi                                 low
            ---------     --------------------------------------
           [ss0:esp0] --->| (empty kernel mode stack)
                          --------------------------------------
                                      PROC.kstack[ ]
When an interrupt occurs, the CPU saves uSS,uesp,uflags,uCS,ueip into the
interrupt stack, whose contents become
   CPU.TR --> TSS         hi                    esp          low
            ---------     --------------------------------------
           [ss0:esp0] --->|uSS|uesp|uflag|uCS|ueip|
                          --------------------------------------
                                      PROC.kstack[ ]
where the prefix u denotes user mode registers. If an exception occurs, the
situation is exactly the same, except that for some exceptions, the CPU also
pushes an error number, err#, onto the interrupt stack, which becomes
   CPU.TR --> TSS         hi                         esp     low
            ---------     --------------------------------------
           [ss0:esp0] --->|uSS|uesp|uflag|uCS|ueip|err#|
                          --------------------------------------
                                      PROC.kstack[ ]
Upon entry to the interrupt/trap handler routine, the process kernel stack
contents are as shown in the above diagram. While in kernel mode, if another
interrupt or trap occurs, the CPU continues to use the same interrupt stack to
push on one more layer of interrupted context. However, if the CPU is already in
kernel mode, re-entering kernel mode does not involve a privilege change. So the
saved context only has |kflags|kCS|keip|, as shown in the next diagram.
   CPU.TR --> TSS         hi                                               esp low
            ---------     --------------------------------------------------------
           [ss0:esp0] --->|uSS|uesp|uflag|uCS|ueip|err#|......|kflags|kCS|keip|
                          --------------------------------------------------------
                                      PROC.kstack[ ]
When returning from an interrupt/trap handler routine, the iret operation checks
whether changing from the current code segment to the next involves a change of
privilege. If there is no change in privilege, i.e. from kernel back to kernel,
it only pops the saved kernel mode context. If the privilege changes, i.e. from
kernel back to user mode, it pops the saved user mode context, which includes
the saved uesp and uSS.
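The saved context and the decision made by iret can be modeled in C. The
structure and function names are hypothetical. The privilege level of the
interrupted code is taken from the low 2 bits of the saved cs. Note that an
error number, if one was pushed, is not popped by iret; the handler must
remove it from the stack before executing iret.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Saved context on the kernel stack, lowest address first. The last two
   words exist only when the interrupt caused a privilege change. */
struct int_frame {
    uint32_t eip;
    uint32_t cs;       /* low 2 bytes; low 2 bits = interrupted privilege level */
    uint32_t eflags;
    uint32_t esp;      /* uesp: present only on a privilege change */
    uint32_t ss;       /* uSS:  present only on a privilege change */
};

/* Does returning to saved_cs from privilege level cpl change privilege? */
static int iret_changes_privilege(uint32_t saved_cs, unsigned cpl)
{
    assert(cpl < 4);
    return (saved_cs & 3u) != cpl;
}

/* Number of 32-bit words that iret pops for this frame. */
static unsigned iret_pop_words(const struct int_frame *f, unsigned cpl)
{
    assert(f != NULL);
    return iret_changes_privilege(f->cs, cpl) ? 5u : 3u;
}
```

For a frame saved from user mode (cs with low bits 11), iret in kernel mode
pops 5 words; for a nested frame saved in kernel mode, it pops only 3.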
Summarizing, for protected mode operation, we must remap the IRQ vectors,
install interrupt/trap handlers as interrupt/trap gates in the IDT, and have a
TSS, which defines the CPU's interrupt stack. As in real mode, knowing the
kernel mode stack contents is key to understanding how to set up the execution
environment of a process.