1. Multiprocessor Systems (MP)
A multiprocessor system consists of two or more processors that share
memory and peripheral devices. If the processors have both local and shared
memories, it is called a nonuniform memory access (NUMA) system. If all
processors share the same main memory, it is called a uniform memory access (UMA)
system. If the roles of the processors are not identical, e.g. some processors
can only do I/O and handle interrupts or only some processors can execute the
operating system functions, it is called an asymmetric MP system. If all the
processors are functionally identical, it is called a symmetric MP (SMP) system.
With the advent of multi-core processors, SMP has become the most popular form
of MP systems.
SMP-compliant Systems.
A SMP system requires much more than just multiple processors or
processor cores. The system architecture must provide additional support for
SMP operations. Intel's Multiprocessor Specification defines a SMP-compliant
system as a PC/AT compatible system with the following features.
(1). Support interrupts routing and inter-processor interrupts:
Interrupts from I/O devices can be configured and routed to individual
processors as needed by the SMP kernel. Processors can interrupt each other by
inter-processor interrupts (IPIs) for communication and synchronization. In a
SMP-compliant PC/AT system, these are provided by a set of Advanced Programmable
Interrupt Controllers (APICs). A SMP-compliant system consists of one or more
IOAPICs and a set of local APICs of the individual processors. Together, the
APICs implement a distributed inter-processor communication protocol, which
supports interrupt routing and IPIs.
(2). An extended BIOS, which can detect and build SMP configuration information
for the operating system to use.
(3). Designate one of the processors as the boot processor (BSP), which executes
the booter when a SMP system starts. All other processors are Application
Processors (APs), which are disabled initially but can receive IPIs from the
boot processor. After booting up, all the processors are functionally identical.
2. Start up a SMP system.
When a SMP system starts, BIOS detects the hardware configuration and
creates structures containing information for the operating system to use. The
structures include a Floating Pointer Structure (FPS), which is required, and a
MP Configuration Table, which is optional but generally present. If the FPS
itself is absent, the system is not SMP-compliant. If the configuration table
does not exist, the operating system may set up the system by using a default
configuration, which is valid for systems with only 2 processors. The FPS may be
in one of three places, which the OS must search in order: (1) in the first 1KB
of the extended BIOS data area, (2) in the last 1KB of real-mode base memory,
(3) in the BIOS read-only memory space between 0xF0000 and 0xFFFFF. The FPS is
a 16-byte structure, which begins with a 4-byte signature "_MP_". The next field
is a 4-byte pointer to the MP configuration table. The configuration table is
divided into three parts: a header, a base section, and an extended section. The
header begins with the four-byte signature "PCMP". It contains OEM information
and the number of entries in the base section. The base section consists of a
set of entries that describe processors, system buses, I/O APICs, I/O interrupt
assignments and local APIC interrupt assignments. The first byte of each entry
is the entry type. For processor entries, the entry length is 20 bytes. All
other entries are 8 bytes. The OS must parse the configuration table to create
OS specific data structures, such as the APIC ID, version and type of each
processor, as well as the address of the system's I/O APIC.
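The search for the FPS described above can be sketched in C as follows. This is a minimal illustration, not the MTX code: it assumes the candidate memory area has already been copied into a byte buffer, that the FPS lies on a 16-byte boundary, and that all 16 bytes of a valid FPS sum to zero modulo 256, as the MP specification requires. The names mp_checksum(), find_fps() and fps_table_addr() are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum of all bytes modulo 256; a valid MP structure sums to 0. */
static uint8_t mp_checksum(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + p[i]);
    return sum;
}

/* Scan a memory area on 16-byte boundaries for "_MP_" with a good checksum.
 * Returns the byte offset of the FPS in the area, or -1 if not found. */
long find_fps(const uint8_t *area, size_t size)
{
    for (size_t off = 0; off + 16 <= size; off += 16) {
        if (memcmp(area + off, "_MP_", 4) == 0 &&
            mp_checksum(area + off, 16) == 0)
            return (long)off;
    }
    return -1;
}

/* The 4-byte little-endian pointer to the MP configuration table
 * follows the signature, at byte offset 4 of the FPS. */
uint32_t fps_table_addr(const uint8_t *fps)
{
    return (uint32_t)fps[4] | (uint32_t)fps[5] << 8 |
           (uint32_t)fps[6] << 16 | (uint32_t)fps[7] << 24;
}
```

An OS would call find_fps() on each of the three areas in the order listed above and stop at the first hit.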
As an example, the MTX mp command scans the MP configuration table of a
SMP-compliant PC. For a VMware virtual machine on a dual-core host PC, it
generates the following outputs. For other dual-core PCs, the outputs vary but
are similar.
/************ Sample outputs of MTX's mp command ***********/
// Search for MP floating pointer structure
Search 1KB in extended BIOS data area segment=0x9F80 (not found)
Search last 1KB of base memory: 0x9FC00 to 0x9FFFF (not found)
Search BIOS ROM area 0xF0000 to 0xFFFFF (found)
segment=0xF680 offset=0x1B0 address=0xF69B0 // FPS location
// MP Floating Pointer Structure
signature = _MP_ // signature
mp_table_addr=0x9FD70 // MP configuration table address
0x1 0x4 0x2A 0x0 // 16-byte length,rev,cksum,feature
0x0 0x0 0x0 0x0 // other feature bytes
//--------- MP Table -----------
signature = PCMP // signature
base_len=268 rev=0x4 cksum=0x6B // byte length, MP version 1.4
0x49 0x4E 0x54 0x45 I N T E // OEM information
0x4C 0x20 0x20 0x20 L
0x34 0x34 0x30 0x42 4 4 0 B
0x58 0x20 0x20 0x20 X
0x20 0x20 0x20 0x20
oemTablePtr=0x0 // no OEM table
oemSize=0 entryCount=25 // number of entries=25
localAPICAddr=0xFEE00000 // local APIC address
// processor entries (type 0)
******* CPU #0 ********
0x0 0x0 0x11 0x3 // APICid=0; 0x3=BSP,enabled
cpu_signature=0x6F6
flags=0xFEBFBFF
******** CPU #1 ********
0x0 0x1 0x11 0x1 // APICid=1; 0x1=AP, enabled
cpu_signature=0x6F6
flags=0xFEBFBFF
// bus entries (type 1)
0x00 PCI // bus ID, string
0x02 PCI // bus ID, string
0x23 ISA // bus ID, string
// IOAPIC entry (type 2)
0x2 0x2 0x11 0x1
addr=0xFEC00000 // I/O APIC address
// IOAPIC Interrupts Assignment entries (type 3)
0x23 0x0 0x2 0x0 // ISA IRQ0 routing
//... One line per IRQ ....
// Local APIC Interrupt Assignment entry (type 4)
0x23 0x0 0xFF 0x0 // ISA IRQ0 to Int0 of all local APICs
The above listing shows that the VMware VM has only one IOAPIC, which is at
the memory-mapped physical address 0xFEC00000 of all the processors. Each
local APIC is at the memory-mapped address 0xFEE00000 of the corresponding
processor. Each processor may use the same address to access its own APIC.
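Walking the base section of the configuration table, as the mp command does, can be sketched as below. This is an illustrative sketch rather than the MTX implementation: it only counts entries, relying on the rule stated earlier that processor entries (type 0) are 20 bytes and all other entries are 8 bytes. The names mp_walk() and struct mp_counts are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum { MP_PROCESSOR = 0, MP_BUS = 1, MP_IOAPIC = 2, MP_IOINT = 3, MP_LINT = 4 };

struct mp_counts { int cpus, buses, ioapics, ioints, lints; };

/* Walk `count` base-section entries starting at `base`.
 * Returns the number of bytes consumed, or 0 on an unknown entry type
 * or if an entry would run past the end of the buffer. */
size_t mp_walk(const uint8_t *base, size_t size, int count, struct mp_counts *c)
{
    size_t off = 0;
    for (int i = 0; i < count; i++) {
        if (off >= size)
            return 0;
        uint8_t type = base[off];              /* first byte is the entry type */
        size_t len = (type == MP_PROCESSOR) ? 20 : 8;
        if (type > MP_LINT || off + len > size)
            return 0;
        switch (type) {
        case MP_PROCESSOR: c->cpus++;    break;
        case MP_BUS:       c->buses++;   break;
        case MP_IOAPIC:    c->ioapics++; break;
        case MP_IOINT:     c->ioints++;  break;
        case MP_LINT:      c->lints++;   break;
        }
        off += len;
    }
    return off;
}
```

As a consistency check on the sample output: 2 processor entries and 23 other entries occupy 2*20 + 23*8 = 224 bytes, which together with the 44-byte table header gives the reported base_len of 268.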
Upon booting up, a SMP kernel running on the BSP must initialize the system
hardware for SMP operations by the following steps.
(1). Configure IOAPIC to route interrupts to local APICs.
(2). Configure and enable BSP's local APIC.
(3). Send INIT and STARTUP IPIs to activate other APs.
(4). Continue to initialize the OS kernel until it is ready to run tasks.
Details of these steps follow.
1. Configure IOAPIC:
(1). Set up IOAPIC interrupt registers:
Most SMP systems have only one IOAPIC. An IOAPIC has 24 (64-bit) interrupt
registers, which specify how to route the interrupts and map IRQs to interrupt
vectors. The registers are accessed indirectly through a pair of IOREGSEL and
IOWIN registers, which are located at 0xFEC00000 and 0xFEC00010, respectively.
Other IOAPIC registers are denoted by byte offsets. All registers must be
accessed by 32-bit reads/writes. To access an IOAPIC register, first select the
register by writing its byte offset to IOREGSEL. Then, read/write a u32 data
from/to the IOWIN register. For each IOAPIC interrupt register, it takes two
operations to read/write the two 32-bit halves of the register. For example,
assuming that we have the functions
select_ioapic(u8 reg); // write reg to IOREGSEL at 0xFEC00000
write_ioapic(u32 dw); // write dw to IOWIN at 0xFEC00010
the following code segment sets up the 0_th IOAPIC interrupt register at the
byte offset 0x10.
select_ioapic(0x10); // select low 32-bit half of IntReg. 0
write_ioapic((u32)0x00000900 + (u8)vector); // write to low 32-bit half
select_ioapic(0x11); // select high 32-bit half of IntReg. 0
write_ioapic((u32)0xFF000000); // write to high 32-bit half of IntReg. 0
Similarly for other interrupt registers. Standard assignments of the IOAPIC
interrupt registers are
IRQ 0-15 : IOAPIC registers 0-15
PCI A-D : IOAPIC registers 16-19
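The indirect access scheme can be tried out without hardware by modeling the IOAPIC as a small register file. The sketch below is a simulation only: struct ioapic and set_int_reg() are hypothetical, and the functions take an explicit pointer to the simulated device, whereas the real select/write functions would access the memory-mapped IOREGSEL and IOWIN registers.

```c
#include <stdint.h>

/* Simulated IOAPIC: IOREGSEL holds the selected register offset and
 * IOWIN reads or writes the selected 32-bit register. */
struct ioapic {
    uint8_t  regsel;
    uint32_t regs[0x40];  /* 0x10..0x3F hold the 24 64-bit interrupt registers */
};

void select_ioapic(struct ioapic *io, uint8_t reg)
{
    io->regsel = reg;
}

void write_ioapic(struct ioapic *io, uint32_t dw)
{
    io->regs[io->regsel & 0x3F] = dw;
}

uint32_t read_ioapic(const struct ioapic *io)
{
    return io->regs[io->regsel & 0x3F];
}

/* Program the n-th 64-bit interrupt register (n = 0..23) in two
 * 32-bit halves, at offsets 0x10+2n (low) and 0x11+2n (high). */
void set_int_reg(struct ioapic *io, int n, uint64_t value)
{
    uint8_t lo = (uint8_t)(0x10 + 2 * n);
    select_ioapic(io, lo);
    write_ioapic(io, (uint32_t)value);
    select_ioapic(io, (uint8_t)(lo + 1));
    write_ioapic(io, (uint32_t)(value >> 32));
}
```

For instance, set_int_reg(&io, 16, v) would program the register that maps PCI interrupt A under the standard assignment.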
For IRQs 0-15 interrupts, they must be set to edge-triggered and active high.
For PCI A-D interrupts, they must be set to level-triggered and active low. For
PCs in protected mode, the first 32 interrupt vectors 0x00-0x1F are reserved.
Each of the IOAPIC's interrupt registers can be programmed with any of the
remaining 224 vectors. The top 4 bits of an interrupt vector are also its
interrupt priority. Higher interrupt vector numbers have higher interrupt
priorities. The standard PIC interrupt priorities are IRQ 0-2,8-15,3,4,5,6,7,
where IRQ0 has the highest priority. These must be mapped by the IOAPIC
interrupt registers to vectors that preserve their priorities. There are 16
different interrupt priorities, 0 to F. Since we cannot use the vectors
0x00-0x1F, this leaves only 14 available interrupt priorities (2 to F). If we
reserve some of the interrupts, e.g. 0x20-0x2F, for special usage, the number of
distinct interrupt priorities is even smaller. Thus, some of the IRQs would have to be
mapped to vectors with the same priority. As an example, if we choose the
vectors 0x30 to 0xAF for the (16) PIC IRQs and assign 2 IRQs to each priority
level, we may assign the IOAPIC's interrupt registers as follows.
PIC IRQs : 0 1 2 8 9 10 11 12 13 14 15 3 4 5 6 7
vectors 0x : A0 A1 90 91 80 81 70 71 60 61 50 51 40 41 30 31
Similarly for the interrupt registers 16-19, which are used to map the PCI
interrupts A-D. In addition to interrupt vectors, each IOAPIC interrupt register
must also be programmed with interrupt delivery mode and destination. The
destination mode can be either physical or logical. The simplest way is to use
logical destination mode with lowest-priority delivery, which routes each
interrupt to the APIC with the lowest priority in its task priority register.
In that case, the IOAPIC interrupt registers
can be set to 0xFF00000000000900 + (u8)vector_number.
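The example vector assignment and the resulting register values can be computed rather than tabulated by hand. The sketch below assumes the example mapping given above (two IRQs per priority level, priorities 0xA down to 0x3); irq_vector() and int_reg_value() are hypothetical helper names.

```c
#include <stdint.h>

/* PIC IRQs in decreasing priority order, as in the example table. */
static const uint8_t irq_by_priority[16] =
    { 0, 1, 2, 8, 9, 10, 11, 12, 13, 14, 15, 3, 4, 5, 6, 7 };

/* Return the vector assigned to a PIC IRQ (0-15), or 0 if irq is invalid.
 * Two IRQs share each priority level, starting from priority 0xA. */
uint8_t irq_vector(int irq)
{
    for (int rank = 0; rank < 16; rank++)
        if (irq_by_priority[rank] == irq)
            return (uint8_t)(((0xA - rank / 2) << 4) | (rank % 2));
    return 0;
}

/* 64-bit interrupt register value for logical destination 0xFF and
 * lowest-priority delivery, i.e. 0xFF00000000000900 + vector. */
uint64_t int_reg_value(uint8_t vector)
{
    return 0xFF00000000000900ULL + vector;
}
```

With this mapping, IRQ0 gets vector 0xA0 and IRQ7 gets vector 0x31, matching the two ends of the table.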
(2). Switch to symmetric I/O mode
In order to route interrupts to local APICs, the system must be switched to
symmetric I/O mode. For PCs that use an Interrupt Mask Control Register (IMCR),
this can be done by
. out_byte(0x70, 0x22); // access Interrupt Mask Control Register (IMCR)
. out_byte(0x01, 0x23); // force PIC IRQs to IOAPIC
(3). Disable the 8259 PICs
After setting up the IOAPIC, the 8259 PICs must be disabled by
. out_byte(0xFF, 0x21); // mask off master 8259 PIC
. out_byte(0xFF, 0xA1); // mask off slave 8259 PIC
Although the principle is simple, programming the IOAPIC properly is no easy
task, despite detailed Intel documentation. It is also worth pointing out that
the full capabilities of the IOAPIC can only be realized in protected mode.
2. Configure local APICs
Each processor has a local APIC at the same base address 0xFEE00000. The
local APIC registers are at offsets from the base address, which can be accessed
directly with 32-bit reads/writes. Some of the local APIC registers and their
usage are listed below. To enable the local APIC, write 0x10F to the spurious
interrupt register at 0x0F0, which also uses 0x0F as the default vector for
spurious interrupts. Examples of how to set up other APIC registers will be
shown in the next section.
Register Contents
------- ----------------------------------------------
0x020 : ID register (unique ID; to be used as CPU ID)
0x030 : version register
0x080 : task priority (as interrupt priority mask)
0x0B0 : EOI (write 0 to signal end-of-interrupt)
0x0F0 : spurious interrupt vector (also for enable APIC)
0x300 : interrupt command register (generating IPIs)
0x320 : LVT timer (APIC timer: mask bit, mode and vector)
0x350 : LVT LINT0 (INTR input)
0x360 : LVT LINT1 (NMI input)
0x370 : LVT error vector table
0x380 : initial timer count
0x390 : current timer count
0x3E0 : timer divider
------------------------------------------------------
(3). Send INIT and STARTUP IPI to activate other APs.
After enabling the local APIC, the BSP can activate other APs by sending
them INIT and STARTUP IPIs. These can be done by writing to the Interrupt
Command register (0x300) as follows.
.write 0x00C4500 to 0x0300; delay(); // issue INIT IPI to all APs
.write 0x00C4611 to 0x0300; delay(); // issue STARTUP IPI to all APs
where the delay is about 20 msec or more. Each AP wakes up to execute a piece of
"trampoline" code in real mode. The trampoline code must begin at a 4KB boundary
in the 1MB real-mode memory. The location is determined by the vector number in
the STARTUP IPI. In the above example, the vector value is 0x11, which tells
each AP to begin execution from (0x1000, 0x1000) in real-mode memory. Each AP
must configure its own local APIC, set up its page table, switch to protected
mode and then enter the OS kernel. During SMP mode operation, a processor may
use IPIs to synchronize with other processors, such as to flush their TLBs and
invalidate page table entries, etc.
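The two command values and the AP entry address can be derived from the fields of the Interrupt Command register. The sketch below assumes the standard local APIC layout of the low 32 bits: vector in bits 0-7, delivery mode in bits 8-10 (5 = INIT, 6 = STARTUP), level assert in bit 14 and the destination shorthand in bits 18-19 (3 = all excluding self). The helper names are hypothetical.

```c
#include <stdint.h>

enum { ICR_INIT = 5, ICR_STARTUP = 6 };  /* delivery modes, bits 8-10 */

#define ICR_LEVEL_ASSERT (1u << 14)      /* level = assert */
#define ICR_ALL_BUT_SELF (3u << 18)      /* destination shorthand */

/* Low 32 bits of the ICR for an IPI sent to all processors except self. */
uint32_t icr_broadcast(unsigned mode, uint8_t vector)
{
    return ICR_ALL_BUT_SELF | ICR_LEVEL_ASSERT | ((mode & 7u) << 8) | vector;
}

/* A STARTUP IPI with vector V starts the AP at physical address V * 4KB. */
uint32_t startup_address(uint8_t vector)
{
    return (uint32_t)vector << 12;
}

/* Real-mode (segment, offset) to linear address. */
uint32_t linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}
```

With vector 0x11 the AP starts at physical address 0x11000, which is the same location as (0x1000, 0x1000) and as (0x1100, 0x0000) in segment:offset form.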
Once a SMP system boots up, all processors are functionally identical. In a
SMP system, processes may run in parallel on different processors. A SMP kernel
must be able to support the parallel executions of multiple processes. Thus, all
the kernel data structures must be protected to prevent corruption and race
conditions. Traditionally, most OS kernels are designed for UP operations
initially. Adapting a UP kernel to SMP is usually done in three stages.
1. Giant Lock Stage:
Early Linux (kernel 2.0) and FreeBSD were adapted to SMP by using a Giant
Kernel Lock (GKL). In this scheme, the entire kernel is treated as a critical
region and protected by a single global lock, which is usually a spin lock. In
order to execute in kernel, a process on a processor must acquire the GKL first.
Only the process which holds the GKL can execute in kernel. The GKL lock is not
released until the process has completed the kernel mode operation. The main
advantage of this approach is its simplicity. It requires very few changes to
a UP kernel to make it work in a SMP environment. The disadvantage is that,
while a process executes in kernel, all other processors must be either busily
waiting for the GKL lock or can only execute in user mode. Such a system
supports MP but does not fully realize the capabilities of a SMP system.
2. Coarse grained locking stage:
In order to improve concurrency, which includes parallel executions on MPs,
a SMP kernel may use separate locks to protect different subsystems that have
very little interactions, such as the file system and networking subsystem. This
allows for some degree of concurrency but with only a limited improvement in
performance. For this reason, most SMP OS use coarse grained locking as a
transitory stage.
3. Fine grained locking stage:
System V, AIX and current versions of Linux (2.6) all use fine grained
locking for SMP. In this scheme, small locks are used to protect individual
kernel data structures to improve both concurrency and system response time.
Although every MP operating system strives for fine grained locking, there is
no general consensus on what constitutes a fine grained lock. The approaches
used by most SMP kernels are
. Identify the kernel data structures that need protection. Use locks to
implement operations on the data structures as critical regions. Examples of
such data structures include process scheduling queue, page tables, inodes
and I/O buffers, etc. This generally leads to concurrency at the individual
data structure level in the kernel.
. To further improve concurrency, try to decompose the data structures into
smaller units and implement operations on the decomposed units as critical
regions. If some of the decomposed parts are logically connected, devise
schemes to synchronize the interactions among the various parts.
As the locking grain size decreases, the overhead due to additional locks and
synchronizations will increase. The process of reducing the locking grain size
must eventually stop when it no longer improves concurrency and overall speed.
The problem of designing a SMP kernel is essentially the same as that of
designing parallel algorithms. Although the principle is simple and well
understood, the difficulty is how to decompose a data structure that allows for
maximal concurrency yet with minimal interactions among the decomposed parts. So
far, there are no definitive answers.
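As a small illustration of decomposing a data structure into separately locked units, the sketch below protects each bucket of a hash table with its own lock instead of one lock for the whole table. It is not taken from any particular kernel: the table layout and all names are hypothetical, and C11 atomics stand in for the kernel's spin lock primitives.

```c
#include <stdatomic.h>

#define NBUCKET 8

struct bucket {
    atomic_flag lock;   /* one small lock per bucket */
    int count;          /* data protected by the lock */
};

static struct bucket table[NBUCKET];

void table_init(void)
{
    for (int i = 0; i < NBUCKET; i++) {
        atomic_flag_clear(&table[i].lock);
        table[i].count = 0;
    }
}

/* Only the bucket selected by the key is locked, so CPUs working on
 * different buckets do not wait for each other. */
void table_add(unsigned key)
{
    struct bucket *b = &table[key % NBUCKET];
    while (atomic_flag_test_and_set(&b->lock))
        ;                               /* spin until the bucket is free */
    b->count++;
    atomic_flag_clear(&b->lock);
}

int table_count(unsigned key)
{
    struct bucket *b = &table[key % NBUCKET];
    while (atomic_flag_test_and_set(&b->lock))
        ;
    int n = b->count;
    atomic_flag_clear(&b->lock);
    return n;
}
```

The trade-off mentioned above is visible here: eight locks allow up to eight concurrent updates, but any operation that must look at all buckets now has to acquire eight locks instead of one.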
2. Adapting MTX kernel to SMP:
In order to support SMP, the MTX kernel must be modified but we shall try to
keep the modifications to a minimum. The following describes the changes to the
MTX kernel in detail.
(1). Running PROCs:
In the UP MTX kernel, running is a global pointer to the current running
PROC. In SMP, a single running pointer is no longer adequate because each CPU
may be executing a different PROC. With N processors, we may define PROC *run[N]
as pointers to the PROCs that are currently running on the N processors. When a
SMP kernel starts, each processor i runs an initial process, which is pointed to by
run[i]. Subsequently, run[i] always points at the PROC that is currently running
on the processor i. In order to reference the correct running PROC, we define
running as a C preprocessor symbol by
#define running run[cpuid()]
where cpuid() returns the processor ID (0 for CPU0, 1 for CPU1, etc). This
allows the same running variable to be used in the SMP MTX kernel code without
any modification. However, the scheme depends on how to maintain the processor
IDs, which must be unique.
2. Processor IDs:
In a SMP system, each processor's local APIC has a unique ID number, which
can be used as the processor ID. The local APICs are at the same (per-processor)
address 0xFEE00000. In a SMP system with virtual memory, when a processor starts,
it can read the local APIC ID and store it in a location in its own virtual
address space. During operation, each processor can read the same virtual
address to find out its ID. However, this scheme does not work for MTX because
MTX operates in real mode in which all processors share the same address space.
Thus, we need a different scheme to maintain the processor IDs. In MTX, the CPU
register ES is normally unused, which can be used to carry the processor ID. For
example, when CPU0 starts, we set its ES to 0. When CPU1 starts, we set its ES
to 1. Then, cpuid() simply returns the CPU's ES register as the unique processor
ID.
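The effect of the running macro can be demonstrated in ordinary C by replacing the ES register with a variable. In the sketch below, current_cpu and set_cpu() are hypothetical stand-ins for the per-CPU ES register; in the MTX kernel, cpuid() would return the value of ES instead.

```c
#define NCPU 2

typedef struct proc {
    int pid;
} PROC;

PROC *run[NCPU];           /* run[i] -> PROC currently running on CPU i */

static int current_cpu;    /* stand-in for the ES register of the executing CPU */

int cpuid(void)
{
    return current_cpu;
}

void set_cpu(int id)
{
    current_cpu = id;
}

/* The same source-level name refers to a different PROC pointer
 * depending on which CPU evaluates it. */
#define running run[cpuid()]
```

Code written for the UP kernel, such as running->pid or running = p, then compiles unchanged and operates on the entry of the executing CPU.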
3. SMP MTX start up sequence:
(1). As in UP, SMP MTX begins from the assembly file ts.x, which calls main()
in t.c. While in main(), it first calls init() to initialize the MTX kernel and
creates an initial process P0. P0 calls kfork("/bin/init") to create the init
process P1. Instead of switching to P1 immediately as in UP, P0 calls smp() in
the smp.c file to set up the PC for SMP operation. Although we only tested SMP
MTX on PCs with 2 processors (both dual-core PCs and VMware VMs hosted on dual-
core PCs), the same principle and technique should also be applicable to multi-
core PCs with more than 2 processors.
(2). During booting, CPU0 is the boot processor (BSP) and CPU1 is the AP. Only
the BSP executes the boot code as in the UP system. The AP is initially held in
the reset state, awaiting an IPI to start up. In smp(), it first finds the
Floating Pointer Structure (FPS). In almost all cases, the FPS is in the BIOS
read only memory area between 0xF0000 and 0xFFFFF. For VMware VMs, it is at
(0xF680, 0x1B0). Since the FPS is in the first 1MB of real memory, it can be
accessed easily in real-mode.
(3). Read/write local APIC information:
The local APIC of every processor is at the memory-mapped address 0xFEE00000.
In order to read/write the local APIC, we need to switch the CPU to protected
mode, even temporarily. However, doing so would require low-level assembly
programming as well as some data structures for protected mode. Instead, we
choose to use INT15-87 of BIOS to access high memory. The MTX kernel is loaded
at the segment 0x1000. We use the long word at (0x0000,0xF000) as the data area
to read/write APIC registers by the functions
u32 get_apic(u16 apic_reg) // read from an APIC register
int put_apic(u32 w, u16 apic_reg) // write to an APIC register.
The function get_apic(u16 apic_reg) first sets up a Global Descriptor Table (GDT)
in which the source address=0xFEE00000 and the destination address=0x0000F000.
Then it issues an INT15-87 to read the APIC register to (0x0000,0xF000). Then,
it uses get_word() to retrieve the 2-byte words and assembles them into a long word
as the return value. Similarly for put_apic(u32 w, u16 apic_reg), which puts the
long word to (0x0000, 0xF000) and then issues an INT15-87 to write it to the
APIC register in high memory. By changing the high memory address to 0xFEC00000
and 0xFEC00010, these functions can also be used to read/write IOAPIC registers.
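The word-by-word handling of the transfer area can be sketched as follows. This is a simulation under stated assumptions: a byte array stands in for segment 0x0000, and peek16()/poke16()/peek32()/poke32() are hypothetical helpers playing the role of MTX's get_word()-style functions. The INT15-87 block move itself is not modeled.

```c
#include <stdint.h>

static uint8_t lowmem[0x10000];   /* stand-in for real-mode segment 0x0000 */

/* Read a 2-byte little-endian word at the given offset. */
uint16_t peek16(uint16_t off)
{
    return (uint16_t)(lowmem[off] | lowmem[(uint16_t)(off + 1)] << 8);
}

/* Write a 2-byte little-endian word at the given offset. */
void poke16(uint16_t w, uint16_t off)
{
    lowmem[off] = (uint8_t)w;
    lowmem[(uint16_t)(off + 1)] = (uint8_t)(w >> 8);
}

/* Assemble two 2-byte words into a long word, low word first. */
uint32_t peek32(uint16_t off)
{
    return peek16(off) | (uint32_t)peek16((uint16_t)(off + 2)) << 16;
}

/* Split a long word into two 2-byte words, low word first. */
void poke32(uint32_t w, uint16_t off)
{
    poke16((uint16_t)w, off);
    poke16((uint16_t)(w >> 16), (uint16_t)(off + 2));
}
```

In this model, a write to an APIC register would call poke32(w, 0xF000) before the block move, and a read would call peek32(0xF000) after it.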
(4). Enable BSP's local APIC and wakeup AP:
During booting, the local APIC of the boot processor BSP is already set up
by BIOS. The BSP is configured to receive all the interrupts from the 8259 PICs.
Other APICs and the IOAPIC are not functional until they are initialized and
enabled. Although we can set up the IOAPIC correctly, it seems that interrupts
routing can take place only in protected mode. For simplicity, we shall not use
the IOAPIC to route interrupts. Instead, we shall let the BSP continue to handle
all the PIC interrupts just as in a UP MTX kernel. This helps reduce the needed
changes to the MTX kernel. When the BSP boots up, its APIC is disabled. In order
to issue IPI to wake up other AP processors, we first enable the BSP's APIC and
then send INIT/STARTUP IPIs to the other AP or APs to activate them, as shown by
the following code segment.
// write 0x10F to Spurious Interrupt register (0x0F0) to enable local APIC
put_apic((u32)0x0000010F, 0x00F0);
// write to IntCommandReg (0x300) to send INIT IPI to all APs except self;
put_apic((u32)0x00C4500, 0x0300); delay();
// issue STARTUP IPI to AP; AP's start up vector=0x11=(0x1000,0x1000)
put_apic((u32)0x00C4611, 0x0300); delay();
When an AP starts up, it begins execution from the 4KB page at (0x1000,0x1000).
Correspondingly, we add the following code segment to the assembly file ts.x.
Since the MTX kernel begins at the segment 0x1000, the added code segment is at
the 4KB aligned page (0x1000,0x1000), which is the entry point of the AP when it
starts up.
.globl _proc1,_procsize,_CPU1
.org 4096 ! aligned at 4KB boundary
mov ax,#0x1000 ! CPU#1: Upon entry, CS=0x1100
mov ds,ax ! let DS=SS=MTX DS
mov ax,2 ! get MTX DS
mov ds,ax
mov ss,ax ! SS = DS, all set to MTX's DS
mov ax,#1 ! set ES=1 as CPU ID of CPU#1
mov es,ax
mov sp,#_proc1 ! set CPU1's sp to high end of proc1
add sp,_procsize
jmpi _CPU1,0x1000 ! execute CPU1() in MTX kernel
When an AP starts, it executes a piece of "trampoline" code in real-mode. In the
trampoline code, the AP continues to initialize itself, such as setting up its
execution environment and configuring its local APIC, etc. Then, it may switch
to protected mode, set up its own virtual address space and eventually enter
the OS kernel to run tasks. In the MTX case, the above "trampoline" code sets up
the AP to execute in the MTX kernel space in real mode. Then, it far jumps to
execute CPU1() in the MTX kernel. The algorithm of CPU1() is
int CPU1()
{
(1). printf("============= CPU #1 starts ============\n");
let run[1] point at proc1, which is the initial PROC of CPU#1.
initialize proc1 with pid = 1234 (differ from 0 to NPROC-1)
(2). read and display CPU1's local APIC information; use APIC ID as CPU ID;
(3). configure local APIC timer to generate periodic interrupts with count =
0x00110000 and vector=0x41. Install CPU1's local timer interrupt handler
as
int cpu1_thandler()
{
printf("CPU%d APIC timer interrupt ", cpuid()); // optional
// schedule tasks on CPU1
put_apic((u32)0x00000000, 0x00B0); // write to EOI register
}
(4). enable local APIC (by writing 0x10F to SpuriousIntReg at 0x0F0);
put_apic((u32)0x0000010F, 0x00F0);
(5). // wait for BSP's notification to run tasks
while(!go_smp); // wait for CPU0 to set go_smp to 1
printf("============= CPU #1 ready to run tasks ===========\n");
(6). // enter scheduling loop to run tasks from ready queue
}
(5). Local APIC timer of CPU1:
Every APIC has a timer, which can be used as a time base of the processor.
The following code segment shows how to set up the APIC timer.
put_apic((u32)0x00020041, 0x320); // mode=periodic, unmasked, vector=0x41
put_apic((u32)0x00110000, 0x380); // initial count = 0x11*2**16
put_apic((u32)0x00110000, 0x390); // current count
put_apic((u32)0x0000000B, 0x3E0); // divisor = 1
The APIC timer uses the bus clock to decrement the current count register. When
the count reaches 0, it interrupts the processor with an interrupt vector 0x41,
reloads the counter with the initial count register and repeats. Since the bus
frequency of PCs differs, the APIC's timer count may need to be adjusted in order
to match the PIC's timer for task scheduling. When running on VMware VMs, a
count of 0x00110000 yields about 60 interrupts per second, which is the rate of
the PIC timer. We need the CPU1's local timer for the following reason. Since
CPU0 receives and handles all the PIC interrupts, CPU1 does not have any
interrupts. In order for CPU1 to run tasks, it must be able to examine the ready
queue and do task scheduling. There are several possible ways to do this.
.By IPI: CPU0 may issue IPIs to inform CPU1 to start an action. As an example,
the code segment
put_apic((u32)0xFF000000,0x310); // write to hi 32-bit of ICR
put_apic((u32)0x000C4042,0x300); // write to low 32-bit of ICR
sends an interrupt IPI to CPU1 with an interrupt vector 0x42.
.By shared memory: CPU1 can monitor some memory contents that are changed by
CPU0, but this requires polling of CPU1.
.By local timer: a local timer can interrupt CPU1 periodically, allowing it to
look for work by itself.
Among these, a local timer is the simplest to implement. With a local timer,
CPU1 can use the same task scheduling algorithm as described in Chapter . When
a processor is ready to run tasks, its scheduling loop is of the form
while(1){
if (readyQueue)
tswitch(running);
else
idle();
}
When a processor finds no work to do, it idles with interrupts enabled. While
a processor idles, any interrupt will cause it to wake up to process the
interrupt and then try to run tasks again. For CPU0, the interrupts include
both the PIC timer and device interrupts. For CPU1, the only interrupts are
from the local APIC timer. In addition, CPU1 also uses the APIC timer for
process scheduling.
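The timer register values used in this section can be derived from their fields. The sketch below assumes the standard LVT timer layout (vector in bits 0-7, mask in bit 16, periodic mode in bit 17) and a simple model in which the interrupt rate equals the clock frequency divided by the divisor and the initial count. The helper names, and the clock frequency used in the note that follows, are illustrative assumptions.

```c
#include <stdint.h>

#define LVT_MASKED   (1u << 16)   /* 1 = timer interrupt masked */
#define LVT_PERIODIC (1u << 17)   /* 1 = periodic, 0 = one-shot */

/* Value for the LVT timer register at offset 0x320. */
uint32_t lvt_timer(int periodic, int masked, uint8_t vector)
{
    return (periodic ? LVT_PERIODIC : 0u) | (masked ? LVT_MASKED : 0u) | vector;
}

/* Initial count that yields rate_hz interrupts per second when the
 * timer is driven at clock_hz through the given divisor. */
uint32_t timer_count(uint32_t clock_hz, uint32_t divisor, uint32_t rate_hz)
{
    return clock_hz / (divisor * rate_hz);
}
```

Working backwards from the figures above, a count of 0x00110000 at 60 interrupts per second with a divisor of 1 corresponds to an effective timer clock of 0x110000 * 60 = 66,846,720 Hz, or about 66.8 MHz. On a machine with a different bus clock, timer_count() gives the count needed for the same rate.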
(6). After setting up the local timer, proc1 running on CPU1 waits for CPU0
to turn on a go_smp flag before entering the scheduling loop.
(7). After starting up CPU1, P0 running on CPU0 continues to initialize the MTX
kernel, including starting the PIC timer. When the MTX kernel is ready to run
tasks, P0 sets the go_smp flag to 1, allowing CPU1 to enter its scheduling loop.
The reader may change the timing sequence to run the PROCs on different CPUs.
For example, if we let P0 wait for the go_smp flag, which is turned on by proc1,
then P1 will start to run on CPU1 first, etc.
(8). Now, MTX is running in the SMP mode. Both CPUs can run tasks from the same
readyQueue.
The SMP MTX kernel:
In order to support SMP, the UP MTX kernel must be modified. The changes to
the MTX kernel are listed below in the order of increasing complexity.
(1). tswitch() in ts.x file:
When a PROC running on a CPU calls tswitch() to switch process, we must
know the calling PROC in order to save the task's context into its kstack[ ].
So, we modify tswitch() to tswitch(running), passing as parameter the running
PROC pointer. After saving its context, the PROC calls nextrun(), which selects
the next runnable PROC. In order to protect the readyQueue, we require that a
process calling tswitch() must acquire the spin lock, srQ, which is released at
the end of tswitch(). In the resume part of tswitch(), we must know the resuming
PROC in order to restore its context. So, we modify nextrun() to return the next
running PROC pointer. Accordingly, we modify the tswitch() code as follows.
! tswitch(PROC *running)
.globl _nextrun, _srQ
_tswitch:
cli
push bp
mov bp,sp
! push ax,bx,cx,dx,bp,si,di,flag registers as before
mov bx, 4[bp] ! get running pointer
mov 2[bx], sp ! save sp into running->ksp
find: call _nextrun ! nextrun() returns pointer->next running PROC
resume: mov bx, ax ! get current running PROC
mov sp, 2[bx] ! restore saved context of current running PROC
! pop flag,di,si,bp,dx,cx,bx,ax as before
mov sp,bp
pop bp
mov ax,#0 ! release spin lock srQ
xchg ax,_srQ
ret
! -------------- end of tswitch() -------------------
(2). Interrupt entry and exit routines:
Since CPU0 handles all the PIC interrupts, the interrupt entry and exit
routines do not need any changes for CPU0. However, PROCs running on CPU1 also
do syscalls (via int 80) and handle the local timer interrupts. In order to know
which PROC is running and on which CPU, we add the following lines of code to
both the INTH macro and _ireturn
.globl _run ! run[ ] is an array of PROC pointers on CPUs
mov ax,es ! ES = CPUID = 0,1, etc
shl ax,#1 ! change CPUID to an offset
mov bx,#_run ! bx->run[ ]
add bx,ax ! bx points at run[CPUID]
mov bx,[bx] ! bx->PROC running on CPU ID
which set bx to point at the interrupted PROC.
(3). Spin locks and semaphores
In SMP, spin locks are for CPUs to compete for shared resources of short
duration. In order to access a shared resource, a process running on a CPU must
acquire a spin lock associated with the resource, as in
u16 spinlock = 0; // initial value = 0
slock(&spinlock); // acquire spinlock
// access resource
sunlock(&spinlock); // release spinlock
After using the resource, the PROC releases the spinlock to allow other CPUs to
acquire the spinlock. The implementations of slock() and sunlock() are
! ------- implement spin lock slock(&x); -----------------
_slock:
push bp
mov bp,sp
push bx
mov bx,4[bp] ! pointer to spin lock x
spin: mov ax,#1
xchg ax,[bx] ! atomic get x and set x=1
bt ax,#0
jc spin ! spin if x was already 1
pop bx ! return only if x was 0
pop bp
ret
! sunlock(&x)
_sunlock:
push bp
mov bp,sp
push bx
mov bx,4[bp] ! pointer to spin lock x
xor ax,ax ! AX=0
xchg ax,[bx] ! atomic set x=0
pop bx
pop bp
ret
!------------- end of slock()/sunlock() ----------------------
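For readers more familiar with C than with assembly, the same exchange-based lock can be written with C11 atomics. This is an analogue of the assembly routines, not part of MTX: atomic_exchange() plays the role of xchg, and the names spin_acquire(), spin_try() and spin_release() are hypothetical.

```c
#include <stdatomic.h>

typedef atomic_uint spinlock_t;   /* 0 = free, 1 = held */

/* Atomically fetch the old value and store 1; spin while the old
 * value was already 1, i.e. while another CPU holds the lock. */
void spin_acquire(spinlock_t *x)
{
    while (atomic_exchange(x, 1u) != 0)
        ;
}

/* Single attempt: returns 1 if the lock was acquired, 0 otherwise. */
int spin_try(spinlock_t *x)
{
    return atomic_exchange(x, 1u) == 0;
}

/* Release the lock by storing 0. */
void spin_release(spinlock_t *x)
{
    atomic_store(x, 0u);
}
```

As with the assembly version, the lock word must be initialized to 0, and the holder should keep the critical region short because waiting CPUs spin rather than sleep.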
In the original MTX kernel, semaphore operations are simplified because of the
UP environment. In order to support SMP, we modify the semaphore structure and
semaphore operations as follows.
typedef struct semaphore{
u16 lock; // per-semaphore spin lock;
int value;
struct proc *queue;
}SEMAPHORE;
where the per-semaphore spin lock is to ensure that every operation on a
semaphore is a critical region. The modified P and V operations are
int P(struct semaphore *s)
{
PROC *p; int ps;
ps=int_off(); // disable CPU interrupts
slock(&s->lock); // spin lock the semaphore
s->value--;
if (s->value < 0){
running->status=BLOCK;
running->sem = s;
enqueue(&s->queue, running);
sunlock(&s->lock); // release spin lock
tswitch(running); // give up CPU
}
else
sunlock(&s->lock); // release spin lock
int_on(ps); // enable CPU interrupts
}
int V(struct semaphore *s)
{
PROC *p; int ps;
ps=int_off(); // disable CPU interrupts
slock(&s->lock); // spin lock
s->value++;
if (s->value <= 0){
p = dequeue(&s->queue);
p->sem = 0;
p->status = READY;
schedule(p);
}
sunlock(&s->lock); // release spin lock
int_on(ps); // enable CPU interrupts
}
Similarly for the mutex structure and mutex operations, which are used for
thread synchronization.
(4). Replace sleep/wakeup with semaphores.
As a synchronization mechanism, sleep()/wakeup() works only for UP but is
unsuited to SMP. This is because an event is just a value, which does not have
an associated memory location to record the occurrence of an event. When an
event occurs, wakeup() simply tries to wake up all processes sleeping on the
event. If no process is sleeping on the event, wakeup() has no effect. This
requires a process to go to sleep first before another process tries to wake it
up later. This sleep-first and wakeup-later order can always be achieved in UP
but not in SMP. In a SMP system, processes run in parallel. It is impossible to
guarantee the process execution order. Therefore, a SMP system cannot use
sleep/wakeup for process synchronization. In the MTX kernel, sleep/wakeup are
used only in process management. It is fairly easy to replace them with P/V on
semaphores. As an example, consider the kernel functions kwait() and kexit().
We can define a semaphore PROC.child = 0 in each PROC structure. When a process
waits for a ZOMBIE child, it uses P(&runnnig->child). Correspondingly, when a
process terminates in kexit(), it uses V(&running->parent->child) to unblock the
parent.
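The difference can be seen in a small single-threaded model. The sketch below
is not MTX code: EVENT, SEM and all function names are assumptions made for
this example, and blocking is only modeled by counters. It replays the case in
which the child runs kexit() before the parent reaches kwait(), first with
sleep/wakeup and then with a semaphore.

```c
#include <assert.h>

/* sleep/wakeup model: an event keeps no record of past occurrences,
   only the number of processes currently sleeping on it */
typedef struct { int sleepers; } EVENT;      /* hypothetical */

void ev_sleep(EVENT *e)                      /* caller goes to sleep */
{
    assert(e != 0);
    e->sleepers++;
}

int ev_wakeup(EVENT *e)                      /* returns number of procs woken */
{
    assert(e != 0);
    int n = e->sleepers;
    e->sleepers = 0;
    return n;
}

/* semaphore model: only the counter, the waiting queue is omitted */
typedef struct { int value; } SEM;           /* hypothetical */

int sem_P(SEM *s) { assert(s != 0); return --s->value < 0; }  /* 1: caller blocks   */
int sem_V(SEM *s) { assert(s != 0); return ++s->value <= 0; } /* 1: a waiter wakes  */

/* Child exits first, parent waits later, using sleep/wakeup.
   Returns the number of processes left sleeping with nobody to wake them. */
int exit_then_wait_with_sleep(void)
{
    EVENT e = { 0 };
    ev_wakeup(&e);           /* kexit(): nobody is sleeping, so no effect      */
    ev_sleep(&e);            /* kwait(): sleeps for an event that already past */
    return e.sleepers;
}

/* Same order with a semaphore. Returns 1 if the parent would block. */
int exit_then_wait_with_semaphore(void)
{
    SEM child = { 0 };
    sem_V(&child);           /* kexit(): value becomes 1, the exit is recorded */
    return sem_P(&child);    /* kwait(): value returns to 0, no blocking       */
}
```

In this model the sleep/wakeup version leaves one process sleeping forever,
while the semaphore version lets the parent continue, because the semaphore
value records the V operation that came first.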
(5). Add semaphores to protect kernel data structures.
In the MTX kernel, device drivers and the file system already use semaphores
for synchronization. Despite this, the MTX kernel is not SMP ready because in
many places it assumes that only one process can execute in the kernel at a
time. As a result, most kernel data structures are not protected. Examples of
such data structures include PROC lists, readyQueue, free memory list, message
buffers, minodes, I/O buffers and device I/O queues. In order to support SMP,
each of these data structures must be protected to ensure that processes can
only access them one at a time. This is the same approach used by many operating
systems when adapting their UP kernels to SMP. The needed modifications can be
classified into three categories.
The first category includes the kernel data structures that are used for
allocation/deallocation of resources, such as
. free PROC/THREAD lists
. free memory list
. pipe structures
. message buffers
. bitmaps for inodes and disk blocks
. in-memory minodes
. open file table entries
. mount table entries
Each of these data structures can be protected by either a spin lock or a lock
semaphore, with the allocation/deallocation algorithms rewritten as critical
regions of the form
allocate(resource)
{
    LOCK(resource_lock);
    // allocate resource from the resource data structure;
    UNLOCK(resource_lock);
    return allocated resource;
}
deallocate(resource)
{
    LOCK(resource_lock);
    // release resource to the resource data structure;
    UNLOCK(resource_lock);
}
where LOCK()/UNLOCK() denote either slock()/sunlock() on a spin lock or P()/V()
on a lock semaphore. For example, to protect the free PROC list, we can define
a lock semaphore free_PROC_list = 1 and modify get_proc()/put_proc() as
PROC *get_proc()
{
    PROC *p;
    LOCK(free_PROC_list);
    // p = remove first PROC from free PROC list;
    UNLOCK(free_PROC_list);
    return p;
}
int put_proc(PROC *p)
{
    LOCK(free_PROC_list);
    // enter p into free PROC list;
    UNLOCK(free_PROC_list);
}
where the operations on the free PROC list are exactly the same as they are in
UP. This category also includes the data structures that a process accesses
without pausing. For example, to protect the process readyQueue, we can define a
spin lock, srQ = 0, and modify the kernel functions that access the readyQueue
as
    slock(&srQ);
    // access readyQueue;
    sunlock(&srQ);
Similarly, we can use locks to protect the superblock and group descriptors of
the file system and implement their updating algorithms as critical regions.
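To make the first category concrete, the get_proc()/put_proc() outline above
can be filled in as follows. This is only a sketch under stated assumptions: a
C11 atomic_flag plays the role of a spin lock named free_PROC_list, the PROC
structure is reduced to two fields, and NPROC, freeList and init_procs() are
names chosen for this example rather than taken from MTX.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NPROC 4                        /* assumption: small table for the demo */

typedef struct proc {                  /* hypothetical, reduced PROC */
    int pid;
    struct proc *next;
} PROC;

static PROC proc[NPROC];
static PROC *freeList;                                  /* free PROC list  */
static atomic_flag free_PROC_list = ATOMIC_FLAG_INIT;   /* its spin lock   */

static void slock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) ; }
static void sunlock(atomic_flag *l) { atomic_flag_clear(l); }

void init_procs(void)                  /* freeList = proc[0] -> ... -> proc[NPROC-1] */
{
    freeList = NULL;
    for (int i = NPROC - 1; i >= 0; i--) {
        proc[i].pid = i;
        proc[i].next = freeList;
        freeList = &proc[i];
    }
}

PROC *get_proc(void)
{
    slock(&free_PROC_list);
    PROC *p = freeList;                /* same list operation as in UP */
    if (p)
        freeList = p->next;
    sunlock(&free_PROC_list);
    return p;                          /* NULL if there is no free PROC */
}

int put_proc(PROC *p)
{
    assert(p != NULL);
    slock(&free_PROC_list);
    p->next = freeList;                /* same list operation as in UP */
    freeList = p;
    sunlock(&free_PROC_list);
    return 0;
}
```

Only the two lock calls differ from a UP version. A spin lock seems a natural
choice here because the critical regions are a few instructions long.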
The second category includes the cases in which a process must acquire a
lock first in order to search a data structure for a needed item. If the item
already exists, the process does not create a new item but may have to wait for
the existing item. If so, it must release the lock to allow for concurrency. If
releasing the lock does not create any race condition, then very little change
is needed to make the UP algorithms also work for SMP. As a specific example,
consider the iget() function, which returns a locked minode. Assume that
minodes_lock is a lock semaphore for all the minodes in memory. We only need to
modify iget() slightly as follows.
MINODE *iget(int dev, int ino)
{
    LOCK(minodes_lock);
    if (needed minode already exists){
        increment minode's refCount by 1;
        UNLOCK(minodes_lock);
        lock minode;        // process may block in minode's lock semaphore
        return minode;
    }
    // needed minode not in memory
    find a free minode;
    set minode refCount to 1;
    lock the minode;        // process does not wait here
    UNLOCK(minodes_lock);
    load inode from disk into minode;
    return minode;
}
Note that if the needed minode does not exist, the process creates a new minode
in the critical region of the minodes_lock. This ensures that every newly
allocated minode is unique.
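The iget() outline above can be turned into a small compilable sketch. The
following makes several assumptions that are not part of MTX: the MINODE
structure is reduced to four fields, a slot with refCount 0 is treated as free,
C11 atomic_flag spin locks stand in for both minodes_lock and the per-minode
lock semaphore, and minodes_init() and iunlock() are helper names introduced
only for this example. Loading the inode from disk is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NMINODE 8                      /* assumption: table size for the demo */

typedef struct minode {                /* hypothetical, reduced MINODE */
    int dev, ino;
    int refCount;                      /* 0 means the slot is free */
    atomic_flag lock;                  /* stands in for the minode's lock semaphore */
} MINODE;

static MINODE minode[NMINODE];
static atomic_flag minodes_lock = ATOMIC_FLAG_INIT;

static void slock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) ; }
static void sunlock(atomic_flag *l) { atomic_flag_clear(l); }

void minodes_init(void)
{
    for (int i = 0; i < NMINODE; i++) {
        minode[i].refCount = 0;
        atomic_flag_clear(&minode[i].lock);
    }
}

MINODE *iget(int dev, int ino)
{
    MINODE *mip, *freep = NULL;

    slock(&minodes_lock);
    for (int i = 0; i < NMINODE; i++) {
        mip = &minode[i];
        if (mip->refCount == 0) {
            if (freep == NULL)
                freep = mip;           /* remember the first free slot */
        } else if (mip->dev == dev && mip->ino == ino) {
            mip->refCount++;           /* counted before the table lock is dropped */
            sunlock(&minodes_lock);
            slock(&mip->lock);         /* may wait for the current holder */
            return mip;
        }
    }
    if (freep != NULL) {               /* needed minode not in memory */
        freep->dev = dev;
        freep->ino = ino;
        freep->refCount = 1;
        slock(&freep->lock);           /* free slot: no waiting here */
    }
    sunlock(&minodes_lock);
    /* a real iget() would now load the inode from disk into freep */
    return freep;                      /* NULL if the table is full */
}

void iunlock(MINODE *mip)              /* release the lock, keep the reference */
{
    assert(mip != NULL);
    sunlock(&mip->lock);
}
```

In this sketch, releasing minodes_lock before waiting for the minode appears
safe because refCount was already incremented inside the critical region, so
the slot cannot be treated as free while the process waits.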
The third category includes the cases in which the UP algorithms must be
modified considerably in order to deal with race conditions caused by the
concurrent executions of processes. As a specific example, consider the I/O
buffer management algorithms of MTX in Chapter 12. All the algorithms assume
that there is only one process executing. This assumption is no longer valid in
SMP. In order to adapt the algorithms to SMP, the simplest approach is to add a
freelist lock to protect the free buffer list and a per-device list lock,
dev.lock, to protect each device list. But this does not solve the problem
entirely.
For example, in getblk(), in order to search for a buffer, a process must lock
the device list first. If it finds the needed buffer existing but the buffer is
already locked, it must release the dev.lock before doing P(bp) to wait for the
buffer. Otherwise, no process can access the locked device list. However, once
the process releases the dev.lock but before it does P(bp) to wait for the
buffer, many things can happen to the buffer in that time gap. For instance, it
may be released as a free buffer (by a process executing on a different
processor), which is grabbed by another process and reassigned to a different
disk block, etc. By the time the process, which found the buffer earlier, waits
for the buffer, the buffer has been changed. If so, the process would be waiting
for the wrong buffer. This kind of race condition does not exist in UP but is
very likely in SMP. Therefore, the UP algorithm must be modified considerably in
order for it to work in SMP. Indeed, Chapter 12 of Bach contains part of such
a buffer management algorithm for System V MP Unix. The algorithm assumes the
following.
1. Free buffers are maintained in a freelist. Assigned buffers are maintained in
hash queues (HQs). An assigned buffer is in a unique HQ but also in the freelist
if it is not in use.
2. The freelist has a lock semaphore=1. Each HQ has a lock semaphore=1 and each
buffer has a lock semaphore=1.
3. The Conditional P operation on semaphores, CP(), is defined as
CP(semaphore){
    if (semaphore value > 0) lock the semaphore and return 1
    else return 0 without locking the semaphore
}
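A possible rendering of CP() in C is sketched below. It assumes the semaphore
layout used earlier in this section (a spin lock plus a value), with a C11
atomic_flag in place of the u16 lock field and the waiting queue left out, so
the V() shown here has no waiters to wake. These simplifications are
assumptions of the example, not the System V implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct semaphore {       /* reduced: waiting queue omitted */
    atomic_flag lock;            /* per-semaphore spin lock */
    int value;
} SEMAPHORE;

static void slock(atomic_flag *l)   { while (atomic_flag_test_and_set(l)) ; }
static void sunlock(atomic_flag *l) { atomic_flag_clear(l); }

/* Conditional P: take the semaphore only if that requires no waiting.
   Returns 1 if the semaphore was acquired, 0 otherwise. */
int CP(SEMAPHORE *s)
{
    int acquired = 0;
    assert(s != NULL);
    slock(&s->lock);
    if (s->value > 0) {
        s->value--;
        acquired = 1;
    }
    sunlock(&s->lock);
    return acquired;
}

/* V for the reduced model: there is no queue, so it only increments. */
void V(SEMAPHORE *s)
{
    assert(s != NULL);
    slock(&s->lock);
    s->value++;
    sunlock(&s->lock);
}
```

Unlike P(), CP() never leaves the value negative and never blocks the caller.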
Thus, while(!CP(semaphore)); is equivalent to a spin lock. The algorithm, as it
appears in Bach, is as follows.
===================== UNIX MP Buffer Algorithm ==========================
BUFFER *mgetblk(dev,blk)             // return a locked buffer for exclusive use
{
    while(buffer not found){
1.      P(HQ);                       // lock HQ of bp=(dev,blk)
2.      if (bp in HQ){
            if (!CP(bp)){            // if failed to lock bp
                V(HQ);               // release HQ lock
                P(bp);               // wait in bpQ
                if (bp changed){
                    V(bp);           // unlock bp
                    continue;        // retry the algorithm
                }
            }
            // locked bp did not change; bp must be in freelist
            while(!CP(freelist));    // spin lock freelist
            remove bp from freelist;
            V(freelist);             // unlock freelist
            V(HQ);                   // release HQ lock
            return bp;
        }
        /******** next case of buffer not in HQ not shown *********/
    }
}
Some specific comments about the algorithm follow.
1. The algorithm uses locks to protect the freelist and hash queues, but it is
essentially the same UP Unix algorithm adapted to MP.
2. In mgetblk(), when a process finds a buffer in the buffer cache but the
buffer is already locked, it releases the HQ lock and waits in the buffer's
semaphore queue. When the process eventually acquires the buffer's lock,
the buffer may have changed, for the reasons mentioned above. If so, the
process must give up the buffer and re-execute the algorithm. This not
only reduces the effectiveness of the buffer cache but also causes excessive
retry loops.
3. Because of the buffer lock, every buffer can only be used by one process at
a time. In a SMP kernel, processes should be able to read from the same
buffer concurrently.
4. The maximal degree of concurrency is the number of HQs but the minimal degree
of concurrency is only 1 due to the freelist bottleneck.
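The "bp changed" test in comment 2 can be illustrated with a single-threaded
replay of the time gap. The sketch below is only a model: the BUFFER structure
is reduced to the (dev, blk) pair, locking is not simulated, and the names
bp_changed() and replay_time_gap() are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

typedef struct buffer {          /* hypothetical, reduced buffer header */
    int dev, blk;                /* disk block currently held in the buffer */
} BUFFER;

/* After waiting for bp, does it still hold the block (dev, blk) that the
   process was looking for? Returns 1 if the buffer was reassigned. */
int bp_changed(const BUFFER *bp, int dev, int blk)
{
    assert(bp != NULL);
    return bp->dev != dev || bp->blk != blk;
}

/* Replays the events of the time gap in program order.
   Returns 1 if the waiting process has to retry the algorithm. */
int replay_time_gap(void)
{
    BUFFER b = { 1, 5 };         /* waiter finds (1,5) in HQ, but it is locked */

    /* waiter: V(HQ), then proceeds toward P(bp)                          */
    /* holder: releases b, so b becomes a free buffer                     */
    /* another process: takes b from the freelist and reassigns it        */
    b.dev = 1;
    b.blk = 9;

    /* waiter: P(bp) finally succeeds, so it must check what it got       */
    return bp_changed(&b, 1, 5);
}
```

Here replay_time_gap() reports that a retry is needed, since the buffer that
was found for block (1,5) holds block (1,9) by the time the waiter locks it.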
(6). Redesign algorithms for SMP.
The MP Unix algorithm shows that simply porting UP algorithms to SMP may
work but the resulting algorithms may not be very efficient. In order to truly
support SMP, some of the algorithms may need to be redesigned completely. In
the following, we shall show a new MP buffer management algorithm that does not
have the above shortcomings.