Improving software system load balancing using messaging.
Marc Karasek
System Lead Technical Engineer
iVivity Inc.
The paradigm for devices using the PCI bus in a co-processor model has been one of extending
the core functionality of the system. In a typical application the core system has a central CPU
with a companion chip that provides some level of connectivity and functionality to the outside
world. An example of this would be an Intel x86 with a Northbridge/Southbridge chipset. In this
system, the companion chips to the CPU provide connectivity to PCI, AGP, USB, IDE and some
peripherals, such as serial and audio. In order to add more functionality to this system, a PCI
adapter would be added to the system. The CPU would communicate over the PCI bus to the
added peripheral (commonly called a HBA) and send commands and receive data over the PCI
bus. This model isolates the system (CPU + companion chips) from the added peripheral (HBA).
The software in this model likewise encapsulates what executes on the system versus on the
added peripheral: there is a clear, fixed separation between the tasks that run on each side,
which makes it difficult to move functionality from the system to the peripheral or the reverse.
From a system perspective, this is a very rigid arrangement with little or no room for
optimization. You could not, for example, take a processing block running on the peripheral
and move it over to the system, because the peripheral and the core system are viewed as
separate entities rather than as parts of one overall system.
Figure 1 : Mailbox Control Path (the core system's driver and memory exchange command and
control traffic with the peripheral through a mailbox, with data moved by DMA)
Mailboxes have been the method of communication between a host and a co-processor since the
first ISA card was plugged into a PC, and they are still used today, in one form or another, to
communicate with PCI adapters. Most mailboxes are accessed as internal registers on the adapter
and are mapped to the host through one of the PCI BARs (Base Address Registers). When such a
register is written, an interrupt is raised based on the direction of the write (Host->Adapter or
Adapter->Host) and the mailbox is processed. This procedure is not an efficient use of the PCI bus and
does not lend itself to transmitting large amounts of information. A separate DMA engine is
typically used to move data from the peripheral to the system. This implementation generally
requires that the system send one command at a time and wait for a response from the peripheral.
There may be a queue in the driver so the system can post more commands, but they are all
issued in sequential order. This can create a bottleneck: each command must complete before
the next can be sent. It also does not lend itself to a logical separation between tasks on the
system and the driver for the peripheral. All tasks must go through the same command pipeline
in the driver, with no way of logically separating one task's commands from another's. This
means the driver for the peripheral must have knowledge of the upper-layer tasks, resulting in
monolithic code that encompasses both the low-level interface to the peripheral and the
upper-layer system tasks.
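The mailbox handshake described above can be sketched in C. Everything here is illustrative: the register layout, field names, and status encoding are hypothetical, not any real adapter's, and the stand-in interrupt handler lets the sketch run without hardware (in a real driver the register pointer would come from mapping a PCI BAR):

```c
#include <stdint.h>

/* Illustrative layout of a mailbox register file exposed through a PCI
 * BAR; offsets, fields, and encodings are hypothetical. */
struct mailbox_regs {
    volatile uint32_t cmd;       /* command word written by the host    */
    volatile uint32_t status;    /* completion status from the adapter  */
    volatile uint32_t doorbell;  /* writing here raises the adapter IRQ */
};

/* Host side: post one command through the mailbox. */
static void mailbox_post(struct mailbox_regs *mb, uint32_t cmd)
{
    mb->status = 0;   /* clear the previous completion */
    mb->cmd = cmd;
    mb->doorbell = 1; /* register write raises the Host->Adapter interrupt */
}

/* Host side: the sequential pipeline in action -- the host must see
 * this status before it can post the next command. */
static uint32_t mailbox_wait(struct mailbox_regs *mb)
{
    while (mb->status == 0)
        ;             /* spin until the adapter responds */
    return mb->status;
}

/* Stand-in for the adapter firmware's doorbell interrupt handler:
 * acknowledge the doorbell and complete the command with a made-up
 * status code ("done" bit plus the echoed command). */
static void adapter_doorbell_isr(struct mailbox_regs *mb)
{
    if (mb->doorbell) {
        mb->doorbell = 0;
        mb->status = 0x80000000u | mb->cmd;
    }
}
```

Even in this toy form the bottleneck is visible: `mailbox_wait` must return before `mailbox_post` can legally be called again, so every command from every task funnels through one serialized exchange.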
In order to change this paradigm, a better method of modeling the system is needed. The model
should not limit where a specific software function runs, allowing the developer to determine
the optimum software load balance in the system. One such model is to extend a messaging
mechanism over PCI from the peripheral to the system, so that the system appears as an
extension of the peripheral. Under this model, software blocks in the peripheral can be moved
to run on the system with minimal effort. It also allows a more integrated approach to system
software: the peripheral no longer exists as an appendage to the system but as an integral
part of it.
Figure 2 : Messaging System Model (the system and its memory exchange messages directly with
the peripheral, with data moved by DMA)
One method of implementing this approach is the Message Queue Bus (MQB) of the iDiSX
2000 Storage Network Processor (SNP) from iVivity Inc., a unique new approach to
communication across the PCI bus. The MQB is an 8-byte messaging bus architecture for
passing information between processing blocks within the iDiSX 2000 SNP. This messaging bus
is extended outside the device over its PCI-X interface. This allows any processing block within
the device to send a message to an external host over the PCI-X bus. With this model the
peripheral can now view the system as another processing block within the peripheral. It requires
a thin driver on the system to handle the MQB overhead, which exposes a minimal API to the
system: register a callback, send a message, and deregister a callback. Tasks on the system use
this API to register a callback with the driver on one of four possible message queues and to send
messages to any processing block (PB) within the SNP. These processing blocks can be software
tasks running on any of the processors within the SNP (MIPS, ARC) or one of the device’s
hardware acceleration engines. This allows a logical separation of tasks on the system. The
low-level driver is abstracted from the upper-layer tasks; it merely passes messages between the
system and the peripheral, having no knowledge of what the messages contain. Each of the four
hardware queues can send and receive messages asynchronously from and into the iDiSX 2000
SNP, so there is no longer a single sequential pipeline for command and control. The ability to
pass 8-byte messages means more information can be carried in each message: status
information, pointers to data structures in system memory, and so on can all be passed in the
same message.
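The thin driver's register/send/deregister surface can be sketched as follows. All names here (`mqb_register_callback`, `mqb_deliver`, and so on) are hypothetical illustrations rather than iVivity's actual driver API, and the in-memory dispatch stands in for the real PCI-X interrupt path:

```c
#include <stdint.h>
#include <stddef.h>

#define MQB_NUM_QUEUES 4   /* one callback slot per hardware queue */

/* An MQB message is a fixed 8-byte payload: room for a status word
 * plus a handle to a data structure in system memory. */
typedef uint64_t mqb_msg_t;
typedef void (*mqb_callback_t)(mqb_msg_t msg);

static mqb_callback_t mqb_callbacks[MQB_NUM_QUEUES];

/* Register a task's callback on one of the four message queues. */
static int mqb_register_callback(unsigned queue, mqb_callback_t cb)
{
    if (queue >= MQB_NUM_QUEUES || cb == NULL || mqb_callbacks[queue] != NULL)
        return -1;
    mqb_callbacks[queue] = cb;
    return 0;
}

static int mqb_deregister_callback(unsigned queue)
{
    if (queue >= MQB_NUM_QUEUES)
        return -1;
    mqb_callbacks[queue] = NULL;
    return 0;
}

/* Dispatch an inbound message to the task that owns the queue.  The
 * driver never interprets the payload; it only routes it, which is
 * what keeps it abstracted from the upper-layer tasks. */
static int mqb_deliver(unsigned queue, mqb_msg_t msg)
{
    if (queue >= MQB_NUM_QUEUES || mqb_callbacks[queue] == NULL)
        return -1;
    mqb_callbacks[queue](msg);
    return 0;
}

/* Example packing: a 32-bit status and a 32-bit buffer handle fit
 * together in one 8-byte message. */
static mqb_msg_t mqb_pack(uint32_t status, uint32_t handle)
{
    return ((mqb_msg_t)status << 32) | handle;
}

/* Trivial task callback used to demonstrate delivery. */
static mqb_msg_t last_msg;
static void demo_task_cb(mqb_msg_t msg) { last_msg = msg; }
```

Because each queue owns its own callback, two tasks on different queues never share a command pipeline, which is the logical separation the mailbox model could not provide.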
Figure 3 : iDiSX 2000 Message Queue Bus (system tasks 1 through 4 and system memory connect
through the MQB driver to the iDiSX 2000, with data moved by DMA)
Using this model we can now move processing blocks easily between the peripheral and the
system, allowing the developer to better load balance the overall system processing. From a
processing-flow viewpoint, added features can be inserted into the data flow with minimal
effort. This has a positive impact on the ability to add new functionality to a given system,
especially in terms of off-loading processing from the core system to a peripheral. At the
system level, the core system and the peripheral both look like parts of the same system,
rather than one being an extension of the other. It also impacts how embedded systems are
designed. Currently the same model used on the desktop is also used in the embedded space:
the core system and the peripheral are designed as two blocks that communicate over PCI, each
with its own defined set of tasks. The ability to view the whole design, core system and
peripheral, as one overall system, along with the capability to run tasks anywhere, will lead
to better system performance and utilization. Developers can make tradeoffs between how much
front-end processing is required and how much processing is done in the CPU system.
By abstracting the hardware from the application, the iDiSX 2000 SNP can be used in a myriad
of configurations. If the system has a cache, disk arrays, memory, etc. on the backend, an
iSCSI front-end can be added to a storage solution with minimal change to the current software
stack. To this end, iVivity Inc. can provide a sample application that interfaces the Linux
/dev devices to a backend disk array. This sample uses the standard Linux SVM to handle the
backend storage while providing an iSCSI front-end.