1. Multiprocessor Systems (MP)

    A multiprocessor system consists of two or more processors that share 
memory and peripheral devices. If the processors have both local and shared 
memories, it is called a nonuniform memory access (NUMA) system. If all 
processors share the same main memory, it is called a uniform memory access (UMA)
system. If the roles of the processors are not identical, e.g. some processors 
can only do I/O and handle interrupts or only some processors can execute the 
operating system functions, it is called an asymmetric MP system. If all the 
processors are functionally identical, it is called a symmetric MP (SMP) system.
With the advent of multi-core processors, SMP has become the most popular form
of MP systems.

SMP-compliant Systems.

    A SMP system requires much more than just a multiple number of processors or
processor cores. The system architecture must provide additional support for
SMP operations. Intel's Multiprocessor Specification defines a SMP-compliant 
system as a PC/AT compatible system with the following features.

(1). Support interrupt routing and inter-processor interrupts: 
    Interrupts from I/O devices can be configured and routed to individual 
processors as needed by the SMP kernel. Processors can interrupt each other by 
inter-processor interrupts (IPIs) for communication and synchronization. In a
SMP-compliant PC/AT system, these are provided by a set of Advanced Programmable
Interrupt Controllers (APICs). A SMP-compliant system consists of one or more 
IOAPICs and a set of local APICs of the individual processors. Together, the 
APICs implement a distributed inter-processor communication protocol, which
supports interrupt routing and IPIs.
 
(2). An extended BIOS, which can detect and build SMP configuration information 
for the operating system to use.

(3). Designate one of the processors as the boot processor (BSP), which executes
the booter when a SMP system starts. All other processors are Application 
Processors (APs), which are disabled initially but can receive IPIs from the 
boot processor. After booting up, all the processors are functionally identical.


2. Start up a SMP system.

    When a SMP system starts, BIOS detects the hardware configuration and
creates structures containing information for the operating system to use. The 
structures include a Floating Pointer Structure (FPS), which is required, and a
MP Configuration Table, which is optional but generally present. If the FPS 
itself is absent, the system is not SMP-compliant. If the configuration table 
does not exist, the operating system may set up the system by using a default 
configuration, which is valid for systems with only 2 processors. The FPS may be
in one of three places, which the OS must search in order: (1) in the first 1KB
of the extended BIOS data area, (2) in the last 1KB of real-mode base memory,
(3) in the BIOS read-only memory space between 0xF0000 and 0xFFFFF. The FPS is 
a 16-byte structure, which begins with a 4-byte signature "_MP_". The next field
is a 4-byte pointer to the MP configuration table. The configuration table is 
divided into three parts: a header, a base section, and an extended section. The
header begins with the four-byte signature "PCMP". It contains OEM information 
and the number of entries in the base section. The base section consists of a 
set of entries that describe processors, system buses, I/O APICs, I/O interrupt
assignments and local APIC interrupt assignments. The first byte of each entry 
is the entry type. For processor entries, the entry length is 20 bytes. All 
other entries are 8 bytes. The OS must parse the configuration table to create
OS specific data structures, such as the APIC ID, version and type of each 
processor, as well as the address of the system's I/O APIC. 
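
The search for the FPS can be sketched in C as follows. This is only a minimal
sketch, not the MTX code: it scans an ordinary byte buffer standing in for one
of the three memory areas, and the function names (mp_sum, find_fps, 
fps_table_addr) are hypothetical. It assumes, as the MP specification requires,
that the FPS lies on a 16-byte boundary and that all 16 bytes of a valid FPS 
add up to 0 modulo 256.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-wise checksum: the bytes of a valid structure sum to 0 (mod 256). */
static uint8_t mp_sum(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;
    while (len--)
        sum = (uint8_t)(sum + *p++);
    return sum;
}

/* Scan a memory area on 16-byte boundaries for a valid FPS.
 * Returns the offset of the FPS within the area, or -1 if not found. */
static long find_fps(const uint8_t *mem, size_t len)
{
    size_t off;
    assert(mem != NULL);
    for (off = 0; off + 16 <= len; off += 16) {
        if (memcmp(mem + off, "_MP_", 4) == 0 && mp_sum(mem + off, 16) == 0)
            return (long)off;
    }
    return -1;
}

/* Bytes 4-7 of the FPS: little-endian address of the MP configuration table. */
static uint32_t fps_table_addr(const uint8_t *fps)
{
    return (uint32_t)fps[4] | (uint32_t)fps[5] << 8 |
           (uint32_t)fps[6] << 16 | (uint32_t)fps[7] << 24;
}
```

A kernel would call find_fps() on each of the three areas in the order given 
above and stop at the first hit. The checksum test guards against a stray 
"_MP_" string that is not a real FPS.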

    As an example, the MTX mp command scans the MP configuration table of a 
SMP-compliant PC. For a VMware virtual machine on a dual-core host PC, it 
generates the following outputs. For other dual-core PCs, the outputs vary but
are similar.

/************ Sample outputs of MTX's mp command ***********/
// Search for MP floating pointer structure 
Search 1KB in extended BIOS data area segment=0x9F80 (not found) 
Search last 1KB of base memory: 0x9FC00 to 0x9FFFF   (not found)
Search BIOS ROM area 0xF0000 to 0xFFFFF              (found)
segment=0xF680 offset=0x1B0 address=0xF69B0          // FPS location 
// MP Floating Pointer Structure
signature = _MP_                  // signature
mp_table_addr=0x9FD70             // MP configuration table address 
0x1  0x4  0x2A  0x0               // 16-byte length,rev,cksum,feature 
0x0  0x0  0x0   0x0               // other feature bytes
//--------- MP Table -----------
signature = PCMP                  // signature
base_len=268 rev=0x4 cksum=0x6B   // byte length, MP version 1.4
0x49  0x4E  0x54  0x45  I N T E   // OEM information 
0x4C  0x20  0x20  0x20  L       
0x34  0x34  0x30  0x42  4 4 0 B 
0x58  0x20  0x20  0x20  X       
0x20  0x20  0x20  0x20          
oemTablePtr=0x0                   // no OEM table 
oemSize=0  entryCount=25          // number of entries=25
localAPICAddr=0xFEE00000          // local APIC address
// processor entries (type 0)
******* CPU #0  ********
0x0  0x0  0x11  0x3               // APICid=0; 0x3=BSP,enabled
cpu_signature=0x6F6 
flags=0xFEBFBFF 
******** CPU #1  ********
0x0  0x1  0x11  0x1               // APICid=1; 0x1=AP, enabled
cpu_signature=0x6F6 
flags=0xFEBFBFF 
// bus entries (type 1)
0x00 PCI                          // bus ID, string
0x02 PCI                          // bus ID, string
0x23 ISA                          // bus ID, string
// IOAPIC entry (type 2)
0x2  0x2  0x11  0x1 
addr=0xFEC00000                   // I/O APIC address
// IOAPIC Interrupts Assignment entries (type 3)
0x23  0x0  0x2  0x0               // ISA IRQ0 routing
//... One line per IRQ ....
// Local APIC Interrupt Assignment entry (type 4)
0x23  0x0  0xFF  0x0              // ISA IRQ0 to Int0 of all local APICs

    The above listing shows that the VMware VM has only one IOAPIC, which is at
the memory-mapped physical address 0xFEC00000 for all the processors. Each 
local APIC is at the memory-mapped address 0xFEE00000 of the corresponding 
processor. Each processor uses the same address to access its own local APIC.
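
Walking the base section of the configuration table can be sketched as below.
This is an illustrative sketch rather than the code of the mp command: the 
names mp_entry_len, mp_walk and struct mp_summary are hypothetical, and the 
byte positions used (APIC ID in byte 1 and CPU flags in byte 3 of a processor
entry, address in bytes 4-7 of an IOAPIC entry) follow the entry layouts of the
MP specification as reflected in the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MP_PROCESSOR = 0, MP_BUS = 1, MP_IOAPIC = 2, MP_IOINT = 3, MP_LINT = 4 };

/* The entry length follows from the type byte: 20 for processors, else 8. */
static size_t mp_entry_len(uint8_t type)
{
    return type == MP_PROCESSOR ? 20 : 8;
}

struct mp_summary {
    int ncpu;               /* number of processor entries */
    int bsp_apic_id;        /* local APIC ID of the BSP, -1 if none seen */
    uint32_t ioapic_addr;   /* address of the (last) I/O APIC entry */
};

/* Walk `count` entries starting at p, the first byte after the table header.
 * Returns the number of bytes consumed, or 0 on an unknown entry type. */
static size_t mp_walk(const uint8_t *p, int count, struct mp_summary *out)
{
    const uint8_t *start = p;
    assert(p != NULL && out != NULL);
    out->ncpu = 0;
    out->bsp_apic_id = -1;
    out->ioapic_addr = 0;
    while (count-- > 0) {
        uint8_t type = p[0];
        if (type > MP_LINT)
            return 0;
        if (type == MP_PROCESSOR) {
            out->ncpu++;
            if (p[3] & 0x02)                /* CPU flags bit 1: this is the BSP */
                out->bsp_apic_id = p[1];
        } else if (type == MP_IOAPIC) {
            out->ioapic_addr = (uint32_t)p[4] | (uint32_t)p[5] << 8 |
                               (uint32_t)p[6] << 16 | (uint32_t)p[7] << 24;
        }
        p += mp_entry_len(type);
    }
    return (size_t)(p - start);
}
```

As a consistency check on the listing: with a 44-byte header, 2 processor 
entries and 23 other entries, the table length is 44 + 2*20 + 23*8 = 268 bytes,
which agrees with base_len=268 and entryCount=25.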

    Upon booting up, a SMP kernel running on the BSP must initialize the system
hardware for SMP operations by the following steps.

(1). Configure IOAPIC to route interrupts to local APICs.
(2). Configure and enable BSP's local APIC. 
(3). Send INIT and STARTUP IPIs to activate other APs. 
(4). Continue to initialize the OS kernel until it is ready to run tasks.

Details of these steps follow.

1. Configure IOAPIC:

(1). Set up IOAPIC interrupt registers:
    Most SMP systems have only one IOAPIC. An IOAPIC has 24 (64-bit) interrupt
registers, which specify how to route the interrupts and map IRQs to interrupt 
vectors. The registers are accessed indirectly through a pair of IOREGSEL and 
IOWIN registers, which are located at 0xFEC00000 and 0xFEC00010, respectively. 
Other IOAPIC registers are denoted by register indices. All registers must be 
accessed by 32-bit reads/writes. To access an IOAPIC register, first select the
register by writing its index to IOREGSEL. Then, read/write a u32 value 
from/to the IOWIN register. For each IOAPIC interrupt register, it takes two 
operations to read/write the two 32-bit halves of the register. For example,
assuming that we have the functions

     select_ioapic(u8 reg);      // write reg  to  IOREGSEL at 0xFEC00000
     write_ioapic(u32 dw);       // write dw   to  IOWIN    at 0xFEC00010

the following code segment sets up the 0_th IOAPIC interrupt register, whose
two halves are at the register indices 0x10 and 0x11.

     select_ioapic(0x10);        // select low 32-bit half of IntReg. 0
     write_ioapic((u32)0x00000900 + (u8)vector); // write to low 32-bit half
     select_ioapic(0x11);        // select high 32-bit half of IntReg. 0
     write_ioapic((u32)0xFF000000); // write to high 32-bit half of IntReg. 0

Similarly for other interrupt registers. Standard assignments of the IOAPIC 
interrupt registers are
          IRQ 0-15 : IOAPIC registers  0-15
          PCI A-D  : IOAPIC registers 16-19
Interrupts from IRQs 0-15 must be set to edge-triggered and active high. 
Interrupts from PCI A-D must be set to level-triggered and active low. For
PCs in protected mode, the first 32 interrupt vectors 0x00-0x1F are reserved. 
Each of the IOAPIC's interrupt registers can be programmed with any of the 
remaining 224 vectors. The top 4 bits of an interrupt vector are also its 
interrupt priority. Higher interrupt vector numbers have higher interrupt 
priorities. The standard PIC interrupt priorities are IRQ 0-2,8-15,3,4,5,6,7, 
where IRQ0 has the highest priority. These must be mapped by the IOAPIC
interrupt registers to vectors that preserve their priorities. There are 16 
different interrupt priorities, 0 to F. Since we cannot use the vectors 
0x00-0x1F, which cover the priorities 0 and 1, this leaves only 14 available 
interrupt priorities. If we reserve some of the vectors, e.g. 0x20-0x2F, for 
special usage, the number of distinct interrupt priorities is even less. Thus, 
some of the IRQs would have to be mapped to vectors with the same priority. As 
an example, if we choose the vectors 0x30 to 0xAF for the (16) PIC IRQs and 
assign 2 IRQs to each priority level, we may assign the IOAPIC's interrupt 
registers as follows.
       PIC IRQs   :    0  1  2  8  9  10  11  12  13  14  15  3  4  5  6  7 
       vectors 0x :   A0 A1 90 91 80  81  70  71  60  61  50 51 40 41 30 31
Similarly for the interrupt registers 16-19, which are used to map the PCI 
interrupts A-D. In addition to interrupt vectors, each IOAPIC interrupt register
must also be programmed with a delivery mode and a destination. The destination
mode can be either physical or logical. The simplest way is to use logical 
destination mode with lowest-priority delivery, which routes each interrupt to 
the APIC with the lowest priority in its task priority register. In that case, 
the IOAPIC interrupt registers can be set to 0xFF00000000000900 + (u8)vector_number.
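
The vector assignment above can also be computed rather than tabulated. The 
following is a small sketch under the assumptions of the example (vectors 0x30 
to 0xAF, two IRQs per priority level); the helper names irq_vector, redir_entry
and redir_index are hypothetical and not part of MTX.

```c
#include <assert.h>
#include <stdint.h>

/* PIC priority order, highest first: IRQ 0-2, 8-15, 3-7. */
static const uint8_t pic_order[16] =
    {0, 1, 2, 8, 9, 10, 11, 12, 13, 14, 15, 3, 4, 5, 6, 7};

/* Vector for a PIC IRQ: two IRQs share each priority level, 0xA down to 0x3. */
static uint8_t irq_vector(unsigned irq)
{
    unsigned rank;
    assert(irq < 16);
    for (rank = 0; pic_order[rank] != irq; rank++)
        ;                                   /* find the IRQ's priority rank */
    return (uint8_t)(((0xA - rank / 2) << 4) | (rank % 2));
}

/* 64-bit interrupt register value: logical destination 0xFF,
 * lowest-priority delivery, as in 0xFF00000000000900 + vector. */
static uint64_t redir_entry(uint8_t vector)
{
    return 0xFF00000000000900ULL + vector;
}

/* Index, as written to IOREGSEL, of the low half of interrupt register n;
 * the high half is at the next index. */
static uint8_t redir_index(unsigned n)
{
    assert(n < 24);
    return (uint8_t)(0x10 + 2 * n);
}
```

For example, irq_vector(8) yields 0x91, so the interrupt register for IRQ8 
would be set to redir_entry(0x91), with its two halves written at the indices
redir_index(8) and redir_index(8)+1.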

(2). Switch to symmetric I/O mode
    In order to route interrupts to local APICs, the system must be switched to
symmetric I/O mode. For PCs that use an Interrupt Mask Control Register (IMCR),
this can be done by
   . out_byte(0x70, 0x22);   // access Interrupt Mask Control Register (IMCR)
   . out_byte(0x01, 0x23);   // force PIC IRQs to IOAPIC   

(3). Disable the 8259 PICs 
    After setting up the IOAPIC, the 8259 PICs must be disabled by
   . out_byte(0xFF, 0x21);   // mask off master 8259 PIC
   . out_byte(0xFF, 0xA1);   // mask off slave  8259 PIC

Although the principle is simple, programming the IOAPIC properly is no easy
task, despite the detailed Intel documentation. It is also worth pointing out that
the full capabilities of the IOAPIC can only be realized in protected mode. 

2. Configure local APICs

    Each processor has a local APIC at the same base address 0xFEE00000. The 
local APIC registers are at offsets from the base address, which can be accessed
directly with 32-bit reads/writes. Some of the local APIC registers and their 
usage are listed below. To enable the local APIC, write 0x10F to the spurious 
interrupt register at 0x0F0, which also sets 0x0F as the vector for spurious 
interrupts. Examples of how to set up other APIC registers will be shown in the 
next section.

   Register               Contents
   ------- ----------------------------------------------    
    0x020 : ID register (unique ID; to be used as CPU ID)    
    0x030 : version register                                 
    0x080 : task priority (as interrupt priority mask)       
    0x0B0 : EOI  (write 0 to signal end-of-interrupt)        
    0x0F0 : spurious interrupt vector (also for enable APIC) 
    0x300 : interrupt command register (generating IPIs)
    0x320 : LVT timer (APIC timer: mask bit, mode and vector)  
    0x350 : LVT LINT0 (INTR input)
    0x360 : LVT LINT1 (NMI input)
    0x370 : LVT error vector table
    0x380 : initial timer count
    0x390 : current timer count
    0x3E0 : timer divider
   ------------------------------------------------------
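
In C, the register table might be captured by a few constants and helpers, as 
sketched below. The names (LAPIC_BASE, lapic_addr, svr_enable, apic_id) are 
hypothetical. The sketch assumes the base address 0xFEE00000 shown in the MP 
table listing and that the APIC ID is held in the top byte of the ID register.

```c
#include <assert.h>
#include <stdint.h>

#define LAPIC_BASE 0xFEE00000u

enum lapic_reg {
    LAPIC_ID    = 0x020,    /* ID register */
    LAPIC_TPR   = 0x080,    /* task priority */
    LAPIC_EOI   = 0x0B0,    /* end-of-interrupt */
    LAPIC_SVR   = 0x0F0,    /* spurious interrupt vector, APIC enable */
    LAPIC_ICR   = 0x300,    /* interrupt command register */
    LAPIC_TIMER = 0x320     /* LVT timer */
};

/* Physical address of a local APIC register; registers are 16 bytes apart. */
static uint32_t lapic_addr(uint32_t reg)
{
    assert(reg < 0x400 && reg % 16 == 0);
    return LAPIC_BASE + reg;
}

/* Spurious interrupt register value: bit 8 enables the APIC,
 * the low byte is the spurious interrupt vector. */
static uint32_t svr_enable(uint8_t spurious_vector)
{
    return 0x100u | spurious_vector;
}

/* Extract the APIC ID from the value read from the ID register. */
static uint8_t apic_id(uint32_t id_reg)
{
    return (uint8_t)(id_reg >> 24);
}
```

With these, svr_enable(0x0F) gives the value 0x10F, which is written to the 
address lapic_addr(LAPIC_SVR) = 0xFEE000F0.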

3. Send INIT and STARTUP IPIs to activate other APs. 
    After enabling the local APIC, the BSP can activate other APs by sending
them INIT and STARTUP IPIs. These can be done by writing to the Interrupt 
Command register (0x300) as follows.

   .write 0x00C4500 to 0x0300; delay();  // issue INIT IPI to all APs
   .write 0x00C4611 to 0x0300; delay();  // issue STARTUP IPI to all APs

where the delay is about 20 msec or more. Each AP wakes up to execute a piece of
"trampoline" code in real mode. The trampoline code must begin at a 4KB boundary
in the 1MB real-mode memory. The location is determined by the vector number in 
the STARTUP IPI: the AP starts at the physical address vector*4096. In the above 
example, the vector value is 0x11, which tells each AP to begin execution from 
the physical address 0x11000, i.e. (0x1000, 0x1000) in real-mode memory. Each AP 
must configure its own local APIC, set up its page table, switch to protected 
mode and then enter the OS kernel. During SMP mode operation, a processor may 
use IPIs to synchronize with other processors, such as to flush their TLBs and
invalidate page table entries, etc.
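
The constants written to the Interrupt Command register can be derived from 
their bit fields. The sketch below is illustrative only; the names icr_low, 
sipi_entry and real_addr are hypothetical. It assumes the usual layout of the 
low 32 bits of the register: vector in bits 0-7, delivery mode in bits 8-10, 
level (assert) in bit 14 and destination shorthand in bits 18-19.

```c
#include <assert.h>
#include <stdint.h>

enum { DM_FIXED = 0, DM_INIT = 5, DM_STARTUP = 6 };   /* delivery modes */

#define ICR_ASSERT        (1u << 14)    /* level = assert */
#define ICR_ALL_BUT_SELF  (3u << 18)    /* shorthand: all excluding self */

/* Low 32 bits of the ICR for an IPI sent to all other processors. */
static uint32_t icr_low(unsigned mode, uint8_t vector)
{
    assert(mode < 8);
    return ICR_ALL_BUT_SELF | ICR_ASSERT | ((uint32_t)mode << 8) | vector;
}

/* Physical address at which an AP starts for a given STARTUP vector. */
static uint32_t sipi_entry(uint8_t vector)
{
    return (uint32_t)vector << 12;      /* vector * 4KB */
}

/* Physical address of a real-mode (segment, offset) pair. */
static uint32_t real_addr(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}
```

Here icr_low(DM_INIT, 0) is 0x000C4500 and icr_low(DM_STARTUP, 0x11) is 
0x000C4611, the two values used above. sipi_entry(0x11) is 0x11000, the same 
physical address as real_addr(0x1000, 0x1000).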

   Once a SMP system boots up, all processors are functionally identical. In a 
SMP system, processes may run in parallel on different processors. A SMP kernel
must be able to support the parallel executions of multiple processes. Thus, all
the kernel data structures must be protected to prevent corruption and race 
conditions. Traditionally, most OS kernels are designed for UP operations
initially. Adapting a UP kernel to SMP is usually done in three stages.

1. Giant Lock Stage:
   Early Linux (kernel 2.0) and FreeBSD were adapted to SMP by using a Giant
Kernel Lock (GKL). In this scheme, the entire kernel is treated as a critical
region and protected by a single global lock, which is usually a spin lock. In 
order to execute in kernel, a process on a processor must acquire the GKL first.
Only the process which holds the GKL can execute in kernel. The GKL is not 
released until the process has completed the kernel mode operation. The main
advantage of this approach is its simplicity. It requires very few changes to
a UP kernel to make it work in a SMP environment. The disadvantage is that, 
while a process executes in kernel, all other processors must either busy-wait 
for the GKL or execute only in user mode. Such a system supports MP but does 
not fully realize the capabilities of a SMP system.

2. Coarse grained locking stage:
   In order to improve concurrency, which includes parallel executions on MPs,
a SMP kernel may use separate locks to protect different subsystems that have
very little interaction, such as the file system and networking subsystem. This
allows for some degree of concurrency but with only a limited improvement in 
performance. For this reason, most SMP operating systems use coarse grained 
locking as a transitory stage.

3. Fine grained locking stage:
   System V, AIX and current versions of Linux (2.6) all use fine grained 
locking for SMP. In this scheme, small locks are used to protect individual 
kernel data structures to improve both concurrency and system response time. 
Although every MP operating system strives for fine grained locking, there is 
no general consensus on what constitutes a fine grained lock. The approaches 
used by almost all SMP kernels are

  . Identify the kernel data structures that need protection. Use locks to 
    implement operations on the data structures as critical regions. Examples of
    such data structures include process scheduling queue, page tables, inodes
    and I/O buffers, etc. This generally leads to concurrency at the individual
    data structure level in the kernel.

  . To further improve concurrency, try to decompose the data structures into 
    smaller units and implement operations on the decomposed units as critical 
    regions. If some of the decomposed parts are logically connected, devise 
    schemes to synchronize the interactions among the various parts.

As the locking grain size decreases, the overhead due to additional locks and 
synchronization increases. The process of reducing the locking grain size 
must eventually stop when it no longer improves concurrency and overall speed. 
The problem of designing a SMP kernel is essentially the same as that of
designing parallel algorithms. Although the principle is simple and well 
understood, the difficulty is how to decompose a data structure that allows for
maximal concurrency yet with minimal interactions among the decomposed parts. So
far, there are no definitive answers. 
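
To make the idea of decomposing a data structure concrete, the following is a 
minimal sketch of a table whose buckets each carry their own spin lock, so that
operations on different buckets do not contend. It is not taken from any of the
kernels mentioned above. It uses C11 atomics, and the names (struct bucket, 
table_add, table_total) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NBUCKET 4

struct bucket {
    atomic_flag lock;   /* one spin lock per bucket, not one for the table */
    int count;          /* data protected by that lock */
};

static void table_init(struct bucket *tab)
{
    size_t i;
    assert(tab != NULL);
    for (i = 0; i < NBUCKET; i++) {
        atomic_flag_clear(&tab[i].lock);
        tab[i].count = 0;
    }
}

static void bucket_lock(struct bucket *b)
{
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ;                               /* spin until the flag was clear */
}

static void bucket_unlock(struct bucket *b)
{
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/* Only the bucket selected by the key is locked. */
static void table_add(struct bucket *tab, unsigned key)
{
    struct bucket *b = &tab[key % NBUCKET];
    bucket_lock(b);
    b->count++;
    bucket_unlock(b);
}

/* An operation spanning all buckets pays for every lock in turn. */
static int table_total(struct bucket *tab)
{
    int sum = 0;
    size_t i;
    for (i = 0; i < NBUCKET; i++) {
        bucket_lock(&tab[i]);
        sum += tab[i].count;
        bucket_unlock(&tab[i]);
    }
    return sum;
}
```

table_add() illustrates the gain from a small grain size, while table_total() 
illustrates the cost: an operation that touches every part must take every 
lock, which is the overhead that eventually limits further decomposition.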

2. Adapting MTX kernel to SMP:
   In order to support SMP, the MTX kernel must be modified but we shall try to
keep the modifications to a minimum. The following describes the changes to the
MTX kernel in detail.

1. Running PROCs:
    In the UP MTX kernel, running is a global pointer to the current running 
PROC. In SMP, a single running pointer is no longer adequate because each CPU 
may be executing a different PROC. With N processors, we may define PROC *run[N]
as pointers to the PROCs that are currently running on the N processors. When a
SMP kernel starts, each processor i runs an initial process, pointed to by
run[i]. Subsequently, run[i] always points at the PROC that is currently running
on the processor i. In order to reference the correct running PROC, we define 
running as a C preprocessor symbol by
             #define running run[cpuid()]
where cpuid() returns the processor ID (0 for CPU0, 1 for CPU1, etc). This 
allows the same running variable to be used in the SMP MTX kernel code without 
any modification. However, the scheme depends on how to maintain the processor 
IDs, which must be unique.

2. Processor IDs: 
    In a SMP system, each processor's local APIC has a unique ID number, which 
can be used as the processor ID. The local APICs are at the same (per-processor)
address 0xFEE00000. In a SMP system with virtual memory, when a processor starts,
it can read the local APIC ID and store it in a location in its own virtual 
address space. During operation, each processor can read the same virtual 
address to find out its ID. However, this scheme does not work for MTX because 
MTX operates in real mode in which all processors share the same address space.
Thus, we need a different scheme to maintain the processor IDs. In MTX, the CPU
register ES is normally unused, so it can be used to carry the processor ID. For
example, when CPU0 starts, we set its ES to 0. When CPU1 starts, we set its ES 
to 1. Then, cpuid() simply returns the CPU's ES register as the unique processor
ID.
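
The effect of the running macro can be demonstrated in ordinary C. In the 
sketch below, the variable fake_es is a hypothetical stand-in for the ES 
register of whichever CPU is executing. In the real kernel cpuid() would read 
ES itself, which portable C cannot express.

```c
#include <assert.h>

#define NCPU 2

typedef struct proc {
    int pid;
} PROC;

static PROC *run[NCPU];     /* run[i] -> PROC running on CPU i */
static unsigned fake_es;    /* stand-in for the current CPU's ES register */

/* Return the ID of the executing CPU (here: the simulated ES value). */
static unsigned cpuid(void)
{
    assert(fake_es < NCPU);
    return fake_es;
}

/* Existing kernel code keeps using `running` unchanged. */
#define running run[cpuid()]
```

Setting fake_es to 1 and assigning to running changes run[1] only, so the same
source line refers to a different PROC on each CPU.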

3. SMP MTX start up sequence: 

(1).  As in UP, SMP MTX begins from the assembly file ts.x, which calls main()
in t.c. While in main(), it first calls init() to initialize the MTX kernel and
creates an initial process P0. P0 calls kfork("/bin/init") to create the init 
process P1. Instead of switching to P1 immediately as in UP, P0 calls smp() in
the smp.c file to set up the PC for SMP operation. Although we only tested SMP 
MTX on PCs with 2 processors (both dual-core PCs and VMware VMs hosted on dual-
core PCs), the same principle and technique should also be applicable to multi-
core PCs with more than 2 processors.

(2). During booting, CPU0 is the boot processor (BSP) and CPU1 is the AP. Only 
the BSP executes the boot code as in the UP system. The AP is initially held in
the reset state, awaiting an IPI to start up. smp() first finds the 
Floating Pointer Structure (FPS). In almost all cases, the FPS is in the BIOS 
read-only memory area between 0xF0000 and 0xFFFFF. For VMware VMs, it is at
(0xF680, 0x1B0). Since the FPS is in the first 1MB of real memory, it can be 
accessed easily in real mode. 
 
(3). Read/write local APIC information:

   The local APIC of every processor is at the memory-mapped address 0xFEE00000.
In order to read/write the local APIC, we would need to switch the CPU to 
protected mode, even if only temporarily. However, doing so would require 
low-level assembly programming as well as some data structures for protected 
mode. Instead, we choose to use INT15-87 of BIOS to access high memory. The MTX 
kernel is loaded at the segment 0x1000. We use the long word at (0x0000,0xF000) 
as the data area to read/write APIC registers by the functions

        u32 get_apic(u16 apic_reg)          // read from an APIC register
        int put_apic(u32 w, u16 apic_reg)   // write to  an APIC register

The function get_apic(u16 apic_reg) first sets up a Global Descriptor Table 
(GDT) in which the source address=0xFEE00000+apic_reg and the destination 
address=0x0000F000. Then it issues an INT15-87 to read the APIC register to 
(0x0000,0xF000). Then, it uses get_word() to retrieve the two 2-byte words and 
assembles them into a long word as the return value. Similarly for 
put_apic(u32 w, u16 apic_reg), which puts the long word to (0x0000, 0xF000) and 
then issues an INT15-87 to write it to the APIC register in high memory. The 
register parameter is a u16 because offsets such as 0x300 do not fit in a byte. 
By changing the high memory address to 0xFEC00000 and 0xFEC00010, these 
functions can also be used to read/write IOAPIC registers.
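
The data that get_apic() must prepare can be sketched as follows. This is not 
the MTX source. It assumes the conventional INT15-87 block-move interface, in 
which the caller passes a 48-byte table of six 8-byte descriptors, with the 
source descriptor at byte offset 16 and the destination descriptor at byte 
offset 24. The helper names (set_desc, build_move_gdt, desc_base, make_long) 
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fill one 8-byte descriptor: 16-bit limit, 32-bit base, read/write data. */
static void set_desc(uint8_t *d, uint32_t base, uint16_t limit)
{
    d[0] = (uint8_t)(limit & 0xFF);
    d[1] = (uint8_t)(limit >> 8);
    d[2] = (uint8_t)(base & 0xFF);
    d[3] = (uint8_t)((base >> 8) & 0xFF);
    d[4] = (uint8_t)((base >> 16) & 0xFF);
    d[5] = 0x93;                            /* present, writable data segment */
    d[6] = 0;
    d[7] = (uint8_t)(base >> 24);           /* base bits 24-31 */
}

/* Build the descriptor table for moving nbytes from src to dst. */
static void build_move_gdt(uint8_t gdt[48], uint32_t src, uint32_t dst,
                           uint16_t nbytes)
{
    assert(nbytes > 0);
    memset(gdt, 0, 48);
    set_desc(gdt + 16, src, (uint16_t)(nbytes - 1));    /* source */
    set_desc(gdt + 24, dst, (uint16_t)(nbytes - 1));    /* destination */
}

/* Read back the 32-bit base address stored in a descriptor. */
static uint32_t desc_base(const uint8_t *d)
{
    return (uint32_t)d[2] | (uint32_t)d[3] << 8 |
           (uint32_t)d[4] << 16 | (uint32_t)d[7] << 24;
}

/* Combine two 16-bit words, as fetched by get_word(), into one long word. */
static uint32_t make_long(uint16_t lo, uint16_t hi)
{
    return (uint32_t)hi << 16 | lo;
}
```

To read the APIC ID register, for example, the table would be built with 
src = 0xFEE00000 + 0x020, dst = 0x0000F000 and nbytes = 4 before issuing the 
BIOS call. For a write, the roles of src and dst are exchanged.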

(4). Enable BSP's local APIC and wakeup AP:
    During booting, the local APIC of the boot processor BSP is already set up 
by BIOS. The BSP is configured to receive all the interrupts from the 8259 PICs.
Other APICs and the IOAPIC are not functional until they are initialized and
enabled. Although we can set up the IOAPIC correctly, it seems that interrupt 
routing can take place only in protected mode. For simplicity, we shall not use
the IOAPIC to route interrupts. Instead, we shall let the BSP continue to handle
all the PIC interrupts just as in a UP MTX kernel. This helps reduce the needed
changes to the MTX kernel. When the BSP boots up, its APIC is disabled. In order
to issue IPIs to wake up the APs, we first enable the BSP's APIC and then send 
INIT/STARTUP IPIs to the APs to activate them, as shown by
the following code segment.

   // write 0x10F to Spurious Interrupt register (0x0F0) to enable local APIC
   put_apic((u32)0x0000010F, 0x00F0);
   // write to IntCommandReg (0x300) to send INIT IPI to all APs except self; 
   put_apic((u32)0x00C4500, 0x0300); delay();
   // issue STARTUP IPI to AP; AP's start up vector=0x11=(0x1000,0x1000)
   put_apic((u32)0x00C4611, 0x0300); delay();

When an AP starts up, it begins execution from the 4KB page at (0x1000,0x1000).
Correspondingly, we add the following code segment to the assembly file ts.x. 
Since the MTX kernel begins at the segment 0x1000, the added code segment is at
the 4KB aligned page (0x1000,0x1000), which is the entry point of the AP when it
starts up.
        .globl _proc1,_procsize,_CPU1
        .org 4096                   ! aligned at 4KB boundary
        mov     ax,#0x1000          ! CPU#1: Upon entry, CS=0x1100
        mov     ds,ax               ! let DS=0x1000, the MTX kernel code segment
        mov     ax,2                ! get MTX DS, saved in the word at offset 2
        mov     ds,ax
        mov     ss,ax               ! SS = DS, all set to MTX's DS
        mov     ax,#1               ! set ES=1 as CPU ID of CPU#1
        mov     es,ax
        mov     sp,#_proc1          ! set CPU1's sp to high end of proc1 
        add     sp,_procsize
        jmpi    _CPU1,0x1000        ! execute CPU1() in MTX kernel 

When an AP starts, it executes a piece of "trampoline" code in real-mode. In the
trampoline code, the AP continues to initialize itself, such as setting up its 
execution environment and configuring its local APIC, etc. Then, it may switch 
to protected mode, set up its own virtual address space and eventually enter 
the OS kernel to run tasks. In the MTX case, the above "trampoline" code sets up
the AP to execute in the MTX kernel space in real mode. Then, it far jumps to 
execute CPU1() in the MTX kernel. The algorithm of CPU1() is

int CPU1()
{
 (1). printf("============= CPU #1 starts ============\n");
      let run[1] point at proc1, which is the initial PROC of CPU#1.
      initialize proc1 with pid = 1234 (distinct from the PIDs 0 to NPROC-1)

 (2). read and display CPU1's local APIC information; use APIC ID as CPU ID;

 (3). configure local APIC timer to generate periodic interrupts with count =
      0x00110000 and vector=0x41. Install CPU1's local timer interrupt handler
      as
         int cpu1_thandler()
         {
            printf("CPU%d APIC timer interrupt ", cpuid()); // optional 
            // schedule tasks on CPU1
            put_apic((u32)0x00000000, 0x00B0);    // write to EOI register 
         }

 (4). enable local APIC (by writing 0x10F to SpuriousIntReg at 0x0F0);
            put_apic((u32)0x0000010F, 0x00F0);

 (5). // wait for BSP's notification to run tasks
      while(!go_smp);  // wait for CPU0 to set go_smp to 1
      printf("============= CPU #1 ready to run tasks ===========\n");

 (6). // enter scheduling loop to run tasks from ready queue
}

(5). Local APIC timer of CPU1:
     Every APIC has a timer, which can be used as a time base of the processor.
The following code segment shows how to set up the APIC timer.

     put_apic((u32)0x00020041, 0x320);  // mode=periodic, unmasked, vector=0x41
     put_apic((u32)0x00110000, 0x380);  // initial count = 0x11 * 2**16
     put_apic((u32)0x00110000, 0x390);  // current count
     put_apic((u32)0x0000000B, 0x3E0);  // divisor = 1

The APIC timer uses the bus clock to decrement the current count register. When
the count reaches 0, it interrupts the processor with an interrupt vector 0x41,
reloads the counter with the initial count register and repeats. Since the bus 
frequency differs among PCs, the APIC's timer count may need to be adjusted in 
order to match the PIC's timer for task scheduling. When running on VMware VMs, 
a count of 0x00110000 yields about 60 interrupts per second, which is the rate 
of the PIC timer. We need CPU1's local timer for the following reason. Since
CPU0 receives and handles all the PIC interrupts, CPU1 does not receive any
PIC interrupts. In order for CPU1 to run tasks, it must be able to examine the 
ready queue and do task scheduling. There are several possible ways to do this.

 .By IPI: CPU0 may issue IPIs to inform CPU1 to start an action. As an example,
the code segment
         put_apic((u32)0xFF000000,0x310);  // write to high 32-bit of ICR
         put_apic((u32)0x000C4042,0x300);  // write to low  32-bit of ICR
sends an IPI to CPU1 with the interrupt vector 0x42.

 .By shared memory: CPU1 can monitor some memory contents that are changed by 
CPU0, but this requires polling by CPU1.

 .By local timer: a local timer can interrupt CPU1 periodically, allowing it to
look for work by itself.
 
Among these, a local timer is the simplest to implement. With a local timer, 
CPU1 can use the same task scheduling algorithm as described in Chapter . When 
a processor is ready to run tasks, its scheduling loop is of the form
            while(1){
               if (readyQueue) 
                   tswitch(running);
               else
                   idle();
            }
When a processor finds no work to do, it idles with interrupts enabled. While 
a processor idles, any interrupt will cause it to wake up to process the 
interrupt and then try to run tasks again. For CPU0, the interrupts include
both the PIC timer and device interrupts. For CPU1, the only interrupts are 
from the local APIC timer. In addition, CPU1 also uses the APIC timer for 
process scheduling.
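
The APIC timer values used in step (5) can be derived as sketched below. The 
helper names lvt_timer, timer_count and timer_rate are hypothetical. The sketch
assumes the usual LVT timer layout (vector in bits 0-7, mask in bit 16, 
periodic mode in bit 17) and treats the timer clock, i.e. the bus clock after 
the divider, as a known input, which on a real PC must be measured.

```c
#include <assert.h>
#include <stdint.h>

#define LVT_MASKED   (1u << 16)
#define LVT_PERIODIC (1u << 17)

/* Value for the LVT timer register at 0x320. */
static uint32_t lvt_timer(int periodic, int masked, uint8_t vector)
{
    assert(vector >= 0x10);     /* vectors 0-15 are not valid APIC vectors */
    return (periodic ? LVT_PERIODIC : 0) | (masked ? LVT_MASKED : 0) | vector;
}

/* Initial count that gives `hz` interrupts per second. */
static uint32_t timer_count(uint32_t timer_hz, uint32_t hz)
{
    assert(hz > 0);
    return timer_hz / hz;
}

/* Interrupt rate produced by a given initial count. */
static uint32_t timer_rate(uint32_t timer_hz, uint32_t count)
{
    assert(count > 0);
    return timer_hz / count;
}
```

lvt_timer(1, 0, 0x41) gives 0x00020041, the value written to 0x320 above. Under
this model, a count of 0x00110000 at 60 interrupts per second corresponds to a 
timer clock of about 66.8 MHz. This figure is only arithmetic on the numbers 
quoted above, not a measured value.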

(6). After setting up the local timer, proc1 running on CPU1 waits for CPU0
to turn on a go_smp flag before entering the scheduling loop.   

(7). After starting up CPU1, P0 running on CPU0 continues to initialize the MTX
kernel, including starting the PIC timer. When the MTX kernel is ready to run 
tasks, P0 sets the go_smp flag to 1, allowing CPU1 to enter its scheduling loop.
The reader may change the timing sequence to run the PROCs on different CPUs. 
For example, if we let P0 wait for the go_smp flag, which is turned on by proc1,
then P1 will start to run on CPU1 first, etc.

(8). Now, MTX is running in the SMP mode. Both CPUs can run tasks from the same
readyQueue.



The SMP MTX kernel:
    In order to support SMP, the UP MTX kernel must be modified. The changes to
the MTX kernel are listed below in order of increasing complexity.

(1). tswitch() in ts.x file:
     When a PROC running on a CPU calls tswitch() to switch process, we must 
know the calling PROC in order to save the task's context into its kstack[ ]. 
So, we modify tswitch() to tswitch(running), passing as parameter the running 
PROC pointer. After saving its context, the PROC calls nextrun(), which selects
the next runnable PROC. In order to protect the readyQueue, we require that a 
process calling tswitch() must acquire the spin lock, srQ, which is released at
the end of tswitch(). In the resume part of tswitch(), we must know the resuming
PROC in order to restore its context. So, we modify nextrun() to return the next
running PROC pointer. Accordingly, we modify the tswitch() code as follows.

! tswitch(PROC *running)
          .globl _nextrun, _srQ 
_tswitch:
          cli
          push   bp
          mov    bp,sp
! push ax,bx,cx,dx,bp,si,di,flag registers as before
          mov    bx, 4[bp]     ! get running pointer
          mov    2[bx], sp     ! save sp into running->ksp
find:     call   _nextrun      ! nextrun() returns pointer to next running PROC
resume:   mov    bx, ax        ! get current running PROC
          mov    sp, 2[bx]     ! restore saved context of current running PROC
! pop flag,di,si,bp,dx,cx,bx,ax as before
          mov    sp,bp
          pop    bp
          mov    ax,#0         ! release spin lock srQ   
          xchg   ax,_srQ
          ret
! -------------- end of tswitch() -------------------

(2). Interrupt entry and exit routines:
     Since CPU0 handles all the PIC interrupts, the interrupt entry and exit 
routines do not need any changes for CPU0. However, PROCs running on CPU1 also 
do syscalls (via int 80) and handle the local timer interrupts. In order to know
which PROC is running and on which CPU, we add the following lines of code to 
both the INTH macro and _ireturn 

         .globl _run            ! run[ ] is an array of PROC pointers on CPUs
          mov ax,es             ! ES = CPUID = 0,1, etc
          shl ax,#1             ! change CPUID to an offset
          mov bx,#_run          ! bx->run[ ]
          add bx,ax             ! bx points at run[CPUID]             
          mov bx,[bx]           ! bx->PROC running on CPU ID

which sets bx to point at the interrupted PROC.

(3). Spin locks and semaphores
     In SMP, spin locks are for CPUs to compete for shared resources of short
duration. In order to access a shared resource, a process running on a CPU must
acquire a spin lock associated with the resource, as in
     u16 spinlock = 0;     // initial value = 0
     slock(&spinlock);     // acquire spinlock
        // access resource
     sunlock(&spinlock);   // release spinlock
After using the resource, the PROC releases the spinlock to allow other CPUs to 
acquire the spinlock. The implementations of slock() and sunlock() are

! ------- implement spin lock slock(&x); -----------------
_slock:
       push  bp
       mov   bp,sp
       push  bx
       mov   bx,4[bp]  ! pointer to spin lock x
spin:  mov   ax,#1
       xchg  ax,[bx]   ! atomic get x and set x=1
       bt    ax,#0
       jc    spin      ! spin if x was already 1
       pop   bx        ! return only if x was 0
       pop   bp
       ret

! sunlock(&x)
_sunlock:
       push  bp
       mov   bp,sp
       push  bx
       mov   bx,4[bp]  ! pointer to spin lock x
       xor   ax,ax     ! AX=0
       xchg  ax,[bx]   ! atomic set x=0
       pop   bx
       pop   bp
       ret
!------------- end of slock()/sunlock() ----------------------
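
For comparison, the same exchange-based spin lock can be written in portable C 
with C11 atomics. This is only an analogue of the assembly code above, not a 
replacement for it in the real-mode MTX kernel. The non-blocking try_slock() is
a hypothetical addition that makes the behavior easy to observe.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef atomic_ushort spinlock_t;       /* 0 = free, 1 = held */

/* One atomic exchange: set the lock to 1 and report whether it was free. */
static int try_slock(spinlock_t *x)
{
    assert(x != NULL);
    return atomic_exchange_explicit(x, 1, memory_order_acquire) == 0;
}

/* Spin until the exchange finds the lock free, like the xchg loop above. */
static void slock(spinlock_t *x)
{
    while (!try_slock(x))
        ;
}

/* Release the lock by atomically storing 0. */
static void sunlock(spinlock_t *x)
{
    atomic_exchange_explicit(x, 0, memory_order_release);
}
```

As in the assembly version, the exchange both reads the old value and sets the
lock in a single atomic step, so two CPUs can never both see the lock as free.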

In the original MTX kernel, semaphore operations are simplified because of the 
UP environment. In order to support SMP, we modify the semaphore structure and 
semaphore operations as follows. 

typedef struct semaphore{
  u16 lock;    // per-semaphore spin lock;
  int value;
  struct proc *queue;
}SEMAPHORE;

where the per-semaphore spin lock is to ensure that every operation on a
semaphore is a critical region. The modified P and V operations are

int P(struct semaphore *s)
{
  PROC *p; int ps;
  ps = int_off();            // disable CPU interrupts 
  slock(&s->lock);           // spin lock the semaphore
  s->value--;
  if (s->value < 0){
     running->status = BLOCK;
     running->sem = s;    
     enqueue(&s->queue, running);
     sunlock(&s->lock);      // release spin lock
     tswitch(running);       // give up CPU
  }
  else
     sunlock(&s->lock);      // release spin lock
  int_on(ps);                // enable CPU interrupts  
}

int V(struct semaphore *s)
{
  PROC *p; int ps;
  ps = int_off();            // disable CPU interrupts
  slock(&s->lock);           // spin lock
  s->value++;
  if (s->value <= 0){
     p = dequeue(&s->queue);
     p->sem = 0;
     p->status = READY;
     schedule(p);
  }
  sunlock(&s->lock);         // release spin lock
  int_on(ps);                // enable CPU interrupts 
}

The mutex structure and mutex operations, which are used for thread 
synchronization, are modified similarly.

(4). Replace sleep/wakeup with semaphores.
     As a synchronization mechanism, sleep()/wakeup() works only for UP but is 
unsuited to SMP. This is because an event is just a value, which does not have 
an associated memory location to record the occurrence of an event. When an 
event occurs, wakeup() simply tries to wake up all processes sleeping on the 
event. If no process is sleeping on the event, wakeup() has no effect. This 
requires a process to go to sleep first before another process tries to wake it
up later. This sleep-first and wakeup-later order can always be achieved in UP
but not in SMP. In a SMP system, processes run in parallel. It is impossible to
guarantee the process execution order. Therefore, a SMP system cannot use 
sleep/wakeup for process synchronization. In the MTX kernel, sleep/wakeup are 
used only in process management. It is fairly easy to replace them with P/V on 
semaphores. As an example, consider the kernel functions kwait() and kexit(). 
We can define a semaphore PROC.child = 0 in each PROC structure. When a process
waits for a ZOMBIE child, it uses P(&running->child). Correspondingly, when a 
process terminates in kexit(), it uses V(&running->parent->child) to unblock the
parent.
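
The difference between the two mechanisms can be illustrated with a small 
single-threaded model. It is a sketch, not MTX code: the model_* functions and 
both structures are hypothetical, and they only track whether a caller would 
block. The case shown is a child that exits on one CPU before its parent starts 
waiting on another.

```c
#include <assert.h>

struct event_model { int sleepers; };   /* processes asleep on the event */
struct sem_model   { int value;    };   /* value < 0: -value processes blocked */

/* wakeup(): wakes every current sleeper; leaves no trace if there is none. */
int model_wakeup(struct event_model *e)
{
    int woken = e->sleepers;
    e->sleepers = 0;
    return woken;
}

/* sleep(): the caller always blocks, since a past wakeup() was not recorded. */
int model_sleep(struct event_model *e)
{
    e->sleepers++;
    return 1;                           /* 1 = caller is now blocked */
}

/* V(): records the event in value; returns 1 if a waiter is unblocked. */
int model_V(struct sem_model *s)
{
    s->value++;
    return s->value <= 0;
}

/* P(): returns 1 if the caller blocks, 0 if an earlier V() is consumed. */
int model_P(struct sem_model *s)
{
    s->value--;
    return s->value < 0;
}

void exit_before_wait_demo(void)
{
    struct event_model ev = { 0 };
    struct sem_model child = { 0 };     /* PROC.child = 0 */

    assert(model_wakeup(&ev) == 0);     /* kexit(): nobody to wake, event lost */
    assert(model_sleep(&ev) == 1);      /* kwait(): parent sleeps with no waker */

    assert(model_V(&child) == 0);       /* kexit(): value becomes 1 */
    assert(model_P(&child) == 0);       /* kwait(): parent does not block */
}
```

With sleep/wakeup the parent is left sleeping, whereas the semaphore value 
remembers the child's V() until the parent's P() arrives.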

(5). Add semaphores to protect kernel data structures.

    In the MTX kernel, device drivers and the file system already use semaphores
for synchronization. Despite these, the MTX kernel is not SMP ready because in
many places it assumes only one process can execute in the kernel. As a result,
most kernel data structures are not protected. Examples of such data structures
include PROC lists, readyQueue, free memory list, message buffers, minodes, I/O
buffers and device I/O queues, etc. In order to support SMP, each of these data
structures must be protected to ensure that processes can only access them one 
at a time. This is the same approach used by many OS when adapting their UP 
kernels to SMP. The needed modifications can be classified into three categories.

   The first category includes the kernel data structures that are used for
allocation/deallocation of resources, such as

    . free PROC/THREAD lists
    . free memory list
    . pipe structures
    . message buffers
    . bitmaps for inodes and disk blocks
    . in-memory minodes
    . open file table entries
    . mount table entries
     
Each of these data structures can be protected by either a spin lock or a lock 
semaphore, with the allocation/deallocation algorithms modified to run as 
critical regions of the form

    allocate(resource)
    {
       LOCK(resource_lock);    
         // allocate resource from the resource data structure;
       UNLOCK(resource_lock);
       return allocated resource;
    }

    deallocate(resource)
    {
       LOCK(resource_lock);
         // release resource to the resource data structure;
       UNLOCK(resource_lock);
    }

where LOCK()/UNLOCK() denote either slock()/sunlock() on a spin lock or P()/V()
on a lock semaphore. For example, to protect the free PROC list, we can define 
a free_PROC_list semaphore = 1 and modify get_proc()/put_proc() as

    PROC *get_proc()
    {  
       PROC *p;
       LOCK(free_PROC_list);
         // p = remove first PROC from free PROC list;
       UNLOCK(free_PROC_list);
       return p;
    }

    int put_proc(PROC *p)
    {
        LOCK(free_PROC_list);
          // enter p into free PROC list;
        UNLOCK(free_PROC_list);
    }
    
where the operations on the free PROC list are exactly the same as they are in 
UP. This category also includes the data structures that a process accesses 
without pausing. For example, to protect the process readyQueue, we can define 
a spin lock, srQ = 0, and modify the kernel functions that access the 
readyQueue as

    slock(&srQ);
      // access readyQueue;
    sunlock(&srQ);

Similarly, we can use locks to protect the superblock and group descriptors of 
the file system and implement their updating algorithms as critical regions. 
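
As a concrete illustration of this first category, the get_proc()/put_proc() 
pattern above might look as follows when the lock is a spin lock. This is a 
minimal sketch under stated assumptions: NPROC, init_procs(), the PROC fields 
and the atomic_flag-based LOCK()/UNLOCK() are all hypothetical stand-ins, not 
the MTX definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NPROC 4                            /* assumed table size for the sketch */

typedef struct proc {
    struct proc *next;
    int pid;
} PROC;

static PROC proc[NPROC];
static PROC *freeList;                     /* head of the free PROC list */
static atomic_flag free_PROC_lock = ATOMIC_FLAG_INIT;

static void LOCK(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;                                  /* spin until the flag was clear */
}

static void UNLOCK(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Link all PROCs into the free list in pid order; call before any other CPU runs. */
void init_procs(void)
{
    freeList = NULL;
    for (int i = NPROC - 1; i >= 0; i--) {
        proc[i].pid = i;
        proc[i].next = freeList;
        freeList = &proc[i];
    }
}

PROC *get_proc(void)
{
    LOCK(&free_PROC_lock);
    PROC *p = freeList;                    /* same list operation as in UP */
    if (p)
        freeList = p->next;
    UNLOCK(&free_PROC_lock);
    if (p)
        p->next = NULL;
    return p;                              /* NULL if no free PROC */
}

void put_proc(PROC *p)
{
    assert(p != NULL);
    LOCK(&free_PROC_lock);
    p->next = freeList;                    /* same list operation as in UP */
    freeList = p;
    UNLOCK(&free_PROC_lock);
}
```

Only the two lines that touch freeList sit inside the critical region, so the 
lock is held for a very short time, which is the case a spin lock is meant for.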

    The second category includes the cases in which a process must acquire a 
lock first in order to search a data structure for a needed item. If the item 
already exists, the process does not create a new item but may have to wait for
the existing item. If so, it must release the lock to allow for concurrency. If 
releasing the lock does not create any race condition, then very little change 
is needed to make the UP algorithms also work for SMP. As a specific example, 
consider the iget() function, which returns a locked minode. Assume that 
minodes_lock is a lock semaphore for all the minodes in memory. We only need to
modify iget() slightly as follows. 

MINODE *iget(int dev, int ino)
{
  LOCK(minodes_lock);
  if (needed minode already exists){
     increment minode's refCount by 1;
     UNLOCK(minodes_lock);
     lock minode;          // process may block in minode's lock semaphore
     return minode;
  }
  // needed minode not in memory
  find a free minode;
  set minode's refCount to 1;
  lock the minode;         // process does not wait here
  UNLOCK(minodes_lock);
  load inode from disk into minode;
  return minode;
}

Note that if the needed minode does not exist, the process creates a new minode
in the critical region of the minodes_lock. This ensures that every newly 
allocated minode is unique. 
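
The uniqueness argument can be checked with a reduced model of the search-or-
create step. The sketch below is an assumption-laden illustration, not the MTX 
code: it keeps only the table lock and refCount, omits the per-minode lock 
semaphore and the disk load, and NMINODE and iput() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NMINODE 4                          /* assumed table size */

typedef struct minode {
    int dev, ino;
    int refCount;                          /* 0 means the slot is free */
} MINODE;

static MINODE minode[NMINODE];
static atomic_flag minodes_lock = ATOMIC_FLAG_INIT;

MINODE *iget(int dev, int ino)
{
    MINODE *mip = NULL, *freep = NULL;

    while (atomic_flag_test_and_set_explicit(&minodes_lock, memory_order_acquire))
        ;                                  /* LOCK(minodes_lock) */
    for (int i = 0; i < NMINODE; i++) {
        MINODE *m = &minode[i];
        if (m->refCount > 0 && m->dev == dev && m->ino == ino) {
            mip = m;                       /* needed minode already exists */
            break;
        }
        if (m->refCount == 0 && freep == NULL)
            freep = m;                     /* remember the first free slot */
    }
    if (mip) {
        mip->refCount++;
    } else if (freep) {                    /* create it inside the critical region */
        freep->dev = dev;
        freep->ino = ino;
        freep->refCount = 1;
        mip = freep;
    }
    atomic_flag_clear_explicit(&minodes_lock, memory_order_release);  /* UNLOCK */
    return mip;                            /* NULL if the table is full */
}

/* Hypothetical counterpart: drop one reference under the same lock. */
void iput(MINODE *mip)
{
    while (atomic_flag_test_and_set_explicit(&minodes_lock, memory_order_acquire))
        ;
    assert(mip->refCount > 0);
    mip->refCount--;
    atomic_flag_clear_explicit(&minodes_lock, memory_order_release);
}
```

Because both the search and the creation happen while minodes_lock is held, two 
processes asking for the same (dev, ino) can never end up with different 
minodes.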

   The third category includes the cases in which the UP algorithms must be
modified considerably in order to deal with race conditions caused by the 
concurrent executions of processes. As a specific example, consider the I/O 
buffer management algorithms of MTX in Chapter 12. All the algorithms assume 
that there is only one process executing. This assumption is no longer valid in
SMP. In order to adapt the algorithms to SMP, the simplest approach is to add a 
freelist lock to protect the free buffer list and a per-device list lock, 
dev.lock, to protect each device list. But this does not solve the problem 
entirely.
For example, in getblk(), in order to search for a buffer, a process must lock 
the device list first. If it finds the needed buffer existing but the buffer is
already locked, it must release the dev.lock before doing P(bp) to wait for the
buffer. Otherwise, no process can access the locked device list. However, once 
the process releases the dev.lock but before it does P(bp) to wait for the 
buffer, many things can happen to the buffer in that time gap. For instance, it
may be released as a free buffer (by a process executing on a different 
processor), then grabbed by another process and reassigned to a different 
disk block, etc. By the time the process that found the buffer earlier waits
for the buffer, the buffer may have been changed. If so, the process would be 
waiting for the wrong buffer. This kind of race condition does not exist in UP 
but is very likely in SMP. Therefore, the UP algorithm must be modified 
considerably in order for it to work in SMP. Indeed, Chapter 12 of Bach contains
part of such a buffer management algorithm for System V MP Unix. The algorithm 
assumes the following.

1. Free buffers are maintained in a freelist. Assigned buffers are maintained in
hash queues (HQs). An assigned buffer is in a unique HQ but also in the freelist
if it is not in use.

2. The freelist has a lock semaphore=1. Each HQ has a lock semaphore=1 and each
buffer has a lock semaphore=1.
   
3. The Conditional P operation on semaphores, CP(), is defined as
      CP(semaphore){
         if (semaphore value > 0) lock the semaphore and return 1
         else return 0 without locking the semaphore
      }   
Thus, while(!CP(semaphore)); is equivalent to a spin lock. The algorithm, as it 
appears in Bach, is as follows.

===================== UNIX MP Buffer Algorithm ==========================
BUFFER *mgetblk(dev,blk)      // return a locked buffer for exclusive use
{
 while(buffer not found){
    1. P(HQ);                 // lock HQ of bp=(dev,blk)
    2. if (bp in HQ){
          if (!CP(bp)){       // if failed to lock bp
             V(HQ);           // release HQ lock
             P(bp);           // wait in bpQ
             if (bp changed){ 
                 V(bp);       // unlock bp
                 continue;    // retry the algorithm
             }
          }
          // locked bp did not change; bp must be in freelist
          while(!CP(freelist));  // spin lock freelist
            remove bp from freelist;
          V(freelist);           // unlock freelist
          V(HQ);                 // release HQ lock       
          return bp;
       }
    /******** next case of buffer not in HQ not shown *********/
 }
}
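
The CP() operation assumed by the algorithm can be sketched in C11 with a 
compare-and-exchange loop. This is one possible realization under stated 
assumptions, not the System V or MTX code: LOCKSEM and spin_P() are 
hypothetical names, and V() here handles only the case with no blocked 
processes, so there is no wait queue.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct { atomic_int value; } LOCKSEM;   /* 1 = free, 0 = held */

/* CP(): take the semaphore only if its value is positive; never blocks. */
int CP(LOCKSEM *s)
{
    int v = atomic_load_explicit(&s->value, memory_order_relaxed);
    while (v > 0) {
        /* try to change value from v to v-1; on failure v is reloaded */
        if (atomic_compare_exchange_weak_explicit(&s->value, &v, v - 1,
                memory_order_acquire, memory_order_relaxed))
            return 1;                           /* locked */
    }
    return 0;                                   /* not locked, value untouched */
}

/* V() without a wait queue: simply give the unit back. */
void V(LOCKSEM *s)
{
    int old = atomic_fetch_add_explicit(&s->value, 1, memory_order_release);
    assert(old >= 0);                           /* no blocked waiters in this sketch */
}

/* while(!CP(s)); behaves like a spin lock. */
void spin_P(LOCKSEM *s)
{
    while (!CP(s))
        ;
}
```

A failed CP() leaves the semaphore value unchanged, which is what allows the 
caller to release the HQ lock and then decide whether to block in P().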

Some specific comments about the algorithm follow.

1. The algorithm uses locks to protect the freelist and hash queues, but it is 
   essentially the same UP Unix algorithm adapted to MP.
 
2. In mgetblk(), when a process finds a buffer in the buffer cache but the 
   buffer is already locked, it releases the HQ lock and waits in the buffer's 
   semaphore queue. When the process eventually acquires the buffer's lock,
   the buffer may have been changed, for the reasons mentioned above. If so, the 
   process must give up the buffer and retry the algorithm. This not only 
   reduces the effectiveness of the buffer cache but also causes excessive 
   process retry loops. 

3. Because of the buffer lock, every buffer can only be used by one process at
   a time. In a SMP kernel, processes should be able to read from the same
   buffer concurrently. 

4. The maximal degree of concurrency is the number of HQs but the minimal degree
   of concurrency is only 1 due to the freelist bottleneck.

(6). Redesign algorithms for SMP.

     The MP Unix algorithm shows that simply porting UP algorithms to SMP may
work but the resulting algorithms may not be very efficient. In order to truly
support SMP, some of the algorithms may need to be redesigned completely. In
the following, we shall show a new MP buffer management algorithm that does not
have the above shortcomings.