Sunday, July 26, 2009

I/O Virtualization with AMD's IOMMU

AMD = Advanced Micro Devices
IOMMU = Input/Output Memory Management Unit

In short, an IOMMU allows guest operating systems to access physical devices on the host directly: you can pass PCI devices on the host through to a virtual machine.

Article: I/O Virtualization and AMD's IOMMU

Author: Steve Apiki


[QUOTE Steve Apiki]

VMMs such as VMware and Xen have shown that you can build efficient virtual machines on x86 without hardware assistance. But working around the virtualization constraints of x86 introduces significant software overhead, which can be eliminated with some architectural improvements at a lower level.

AMD's AMD-V hardware virtualization technology, an AMD64 extension, puts the first hardware pillar under these VMMs by providing, effectively, a super-privileged operating mode in which VMMs can control guest operating systems. We've discussed AMD-V in an earlier article.

The second pillar will be hardware support for virtual I/O. In February, AMD published an I/O virtualization specification that outlines a design for a device called an IOMMU (I/O Memory Management Unit). Implementations of this design will be fielded as part of support chipsets, expected in 2007. When those implementations arrive, VMMs will be able to use IOMMU hardware to provide faster, more direct, and more secure access to physical devices from software running on guest operating systems.

Current VMMs must route I/O requests from guest OS drivers through the VMM, using emulated devices. They do this both to manage access to common memory space and to restrict real device access to kernel mode drivers. AMD's IOMMU design eliminates both of these constraints by providing DMA address translation and permission checking for device reads and writes. With an IOMMU, an unmodified driver in a guest OS can directly access its target device, without the overhead of running through the VMM, and without device emulation.

What's an IOMMU?
An IOMMU manages device access to system memory. It sits between peripheral devices and the host, translating addresses from device requests into system memory addresses and checking appropriate permissions on each access.

Typically, AMD's IOMMU will be deployed as part of a HyperTransport or PCI bridge device. In high-end systems where there may be multiple HyperTransport links between CPU(s) and I/O hubs, there will need to be multiple IOMMUs as well.

Existing AMD64 devices already include a more limited address translation facility, called a GART (Graphics Address Remapping Table), right on chip. The on-chip GART has been used for device address translation in existing systems, and is sometimes itself referred to as an IOMMU (especially in discussions of the Linux kernel), which can lead to confusion between the existing GART and the new IOMMU specification that we're discussing here.

The GART was originally designed to allow graphics chips to read textures directly from system memory, using address translation to gather allocations in system memory into a contiguous region mapped to an address that the graphics device could see. But the GART has also been put to use by Linux kernel programmers to enable legacy 32-bit PCI devices to access regions of system memory outside of their addressable range. This is done by programming the device to work inside the "graphics aperture" memory region controlled by the GART, and then using the GART to translate this address to the real target address, above 4 GB.
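The aperture trick described above can be sketched as a small model. This is a toy illustration in Python, not the real GART programming interface: the aperture base and size, the 4 KB page size, and the names `Gart`, `map_page`, and `translate` are all assumptions made for the example.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


class Gart:
    """Toy model of an aperture-based remapping table (hypothetical names)."""

    def __init__(self, aperture_base, aperture_pages):
        self.base = aperture_base
        self.pages = aperture_pages
        # One slot per aperture page; None means "not mapped".
        self.table = [None] * aperture_pages

    def map_page(self, slot, system_page_addr):
        """Point one aperture page at a page-aligned system address."""
        if system_page_addr % PAGE_SIZE:
            raise ValueError("system address must be page aligned")
        self.table[slot] = system_page_addr

    def translate(self, device_addr):
        """Remap addresses inside the aperture; pass all others through."""
        offset = device_addr - self.base
        if not 0 <= offset < self.pages * PAGE_SIZE:
            return device_addr  # outside the aperture: no remapping
        target = self.table[offset // PAGE_SIZE]
        if target is None:
            raise LookupError("aperture page not mapped")
        return target + offset % PAGE_SIZE


gart = Gart(aperture_base=0xD000_0000, aperture_pages=16)
# A 32-bit device is pointed at aperture slot 3, backed by memory above 4 GB.
gart.map_page(3, 0x1_2345_6000)
dma_target = gart.translate(0xD000_0000 + 3 * PAGE_SIZE + 0x10)
```

Note that the model has no permission flags at all: like the GART in the text, it only translates.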

The new IOMMU can do this trick, too, only without the restrictions of the GART (which, after all, wasn't designed for this purpose). While the GART is limited to working inside the graphics aperture, the IOMMU can translate any address presented by the device to a system address.

More important, the IOMMU provides protection mechanisms that restrict device access to memory, whereas the GART performs translation only. It's the combination of address translation and access protection that makes the IOMMU so valuable for virtualization.

Translation and Protection
With the IOMMU, each device is assigned a protection domain. The protection domain defines the I/O page translations that will be used for each device in the domain, and specifies the read/write permissions for each I/O page. For virtualization, a VMM can place all the devices assigned to a given guest OS in the same protection domain, which creates a consistent set of address translations and access restrictions for every device running under that guest.
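One way to picture the relationship between devices and protection domains is as a lookup from device to a shared domain object. The sketch below is illustrative only: the class names, the string device identifiers, and the layout of `io_pages` are assumptions, not structures taken from AMD's specification.

```python
from dataclasses import dataclass, field


@dataclass
class ProtectionDomain:
    domain_id: int
    # I/O page number -> (system page number, readable, writable)
    io_pages: dict = field(default_factory=dict)


class DeviceTable:
    """Maps a device identifier to its protection domain (hypothetical)."""

    def __init__(self):
        self._domains = {}

    def assign(self, device_id, domain):
        self._domains[device_id] = domain

    def domain_of(self, device_id):
        return self._domains[device_id]


# The VMM creates one domain per guest and fills in its I/O pages.
guest_a = ProtectionDomain(domain_id=1)
guest_a.io_pages[0x10] = (0x8_0010, True, False)  # a read-only page

# Every device handed to that guest is placed in the same domain.
devices = DeviceTable()
for dev in ("nic0", "disk0"):
    devices.assign(dev, guest_a)

shared = devices.domain_of("nic0") is devices.domain_of("disk0")
```

Because both devices point at the same domain object, a change to the domain's page translations is seen by both.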

Page translations are cached by the IOMMU in a TLB (Translation Lookaside Buffer). TLB entries are keyed by protection domain and by device request address. Because the protection domain is part of the cache key, cached addresses in the TLB are shared by all devices in the domain.

The IOMMU determines to what protection domain a device belongs, and then uses that domain and the device request address to look in the TLB. TLB entries contain read/write permission flags as well as the target system address for translation, so if an entry is found in the cache, the permission flags can be used to determine whether access is allowed or not.
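The TLB behavior described above can be modeled with a dictionary keyed by domain and I/O page. This is a minimal sketch under stated assumptions (4 KB pages, a hypothetical `IoTlb` class, made-up device names); a real TLB is a fixed-size hardware cache with an eviction policy, which is omitted here.

```python
PAGE_SHIFT = 12  # assumed 4 KB I/O pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class IoTlb:
    """Toy translation cache keyed by (protection domain, I/O page)."""

    def __init__(self):
        self._entries = {}

    def insert(self, domain_id, io_addr, sys_addr, can_read, can_write):
        key = (domain_id, io_addr >> PAGE_SHIFT)
        self._entries[key] = (sys_addr >> PAGE_SHIFT, can_read, can_write)

    def lookup(self, domain_id, io_addr, write):
        """Return the system address on a hit, or None on a miss."""
        entry = self._entries.get((domain_id, io_addr >> PAGE_SHIFT))
        if entry is None:
            return None  # miss: the caller would walk the I/O page tables
        sys_page, can_read, can_write = entry
        if not (can_write if write else can_read):
            raise PermissionError("access blocked by cached permission flags")
        return (sys_page << PAGE_SHIFT) | (io_addr & PAGE_MASK)


device_domain = {"nic0": 1, "disk0": 1, "usb0": 2}
tlb = IoTlb()
# An entry cached on behalf of nic0 ...
tlb.insert(device_domain["nic0"], 0x4000, 0x9_A000, can_read=True, can_write=False)
# ... is a hit for disk0 (same domain) and a miss for usb0 (different domain).
hit = tlb.lookup(device_domain["disk0"], 0x4123, write=False)
miss = tlb.lookup(device_domain["usb0"], 0x4123, write=False)
```

The device name never appears in the cache key, which is why entries are shared across a domain.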

For addresses that are not in the cache (for a given domain), the IOMMU moves on and looks through the I/O page tables associated with the device. I/O page table entries also contain permission information linked to system addresses. (There are also additional permission flags that can be set at the device level, and on directory entries encountered through the lookup process.)

So, all address translation attempts either end in a successful lookup, in which case appropriate permission flags tell the IOMMU whether to allow or block access, or in an unsuccessful lookup, in which case, naturally, the translation attempt fails. Using the IOMMU, then, the VMM is able to control what system pages should be visible to each device (or group of devices in a protection domain), and to specify the read/write access permissions on each page for each domain. It does this by controlling the I/O page tables that the IOMMU uses for address lookups.
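The three possible outcomes of a lookup (allowed, blocked, or failed) can be sketched with a two-level table walk. Everything here is an assumption made for illustration: the two-level layout, the 9-bit index width, the `Entry` type, and in particular the rule that a directory entry's permissions are ANDed with the leaf's. The article says only that directory entries carry additional permission flags.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12   # assumed 4 KB pages
LEVEL_BITS = 9    # assumed index width per table level
PAGE_MASK = (1 << PAGE_SHIFT) - 1


@dataclass
class Entry:
    target: object  # dict of leaf entries (directory) or system page number (leaf)
    read: bool
    write: bool


class TranslationFault(Exception):
    """No mapping exists for the requested I/O address."""


def walk(root, io_addr, write):
    """Translate io_addr through a two-level I/O page table."""
    page = io_addr >> PAGE_SHIFT
    dir_index = page >> LEVEL_BITS
    leaf_index = page & ((1 << LEVEL_BITS) - 1)

    directory = root.get(dir_index)
    if directory is None:
        raise TranslationFault(hex(io_addr))
    leaf = directory.target.get(leaf_index)
    if leaf is None:
        raise TranslationFault(hex(io_addr))

    # Assumed rule: access is allowed only if every level permits it.
    if write:
        allowed = directory.write and leaf.write
    else:
        allowed = directory.read and leaf.read
    if not allowed:
        raise PermissionError(hex(io_addr))
    return (leaf.target << PAGE_SHIFT) | (io_addr & PAGE_MASK)


# The VMM builds the table: one writable leaf under a read-only directory.
root = {0: Entry({5: Entry(0x8_0005, read=True, write=True)}, read=True, write=False)}
system_addr = walk(root, 0x5010, write=False)
```

In this model the VMM controls what a device can see purely by editing `root`; the device itself never takes part in the decision.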

The IOMMU's twin functions of translation and protection provide a way to operate devices almost completely from user code, without kernel mode drivers. Instead of using trusted drivers to control device access to system memory, the IOMMU can be used to restrict device DMA to memory allocated by a user process. Device memory access is still protected by privileged code, but it's the privileged code that sets up the I/O page tables, not the drivers.

We have to say "almost," above, because interrupt handlers still need to be run in kernel mode. One way to take advantage of the IOMMU would be to create a limited kernel mode driver that contains only the interrupt handlers, and to control the device from user code otherwise.

Direct Access
An IOMMU makes I/O virtualization more efficient by allowing VMMs to directly assign real devices to guest operating systems. It's not possible for a VMM to emulate the translation and protection functions of an IOMMU, because the VMM can't get between kernel-mode drivers running on the guest OS and the underlying hardware. So, in the absence of an IOMMU, VMMs instead present an emulated device to the guest OS. The VMM then translates the guest's requests, ultimately, into requests to the real driver running down on the host OS or on the hypervisor.

With an IOMMU, the VMM sets up the I/O page tables to map guest physical addresses to system physical addresses, sets up a protection domain for the guest OS, and then lets the guest OS proceed as usual. Drivers written for the real device run as part of the guest OS unmodified, unaware of the underlying translations. Guest I/O transactions are isolated from those of other guests by the IOMMU's I/O mapping.
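The isolation between guests can be sketched by giving each guest its own I/O map, so that the same guest physical address used for DMA lands in different system memory. This is a simplified model: the contiguous layout, the base addresses, and the function names `build_io_map` and `dma_address` are assumptions for the example.

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


def build_io_map(guest_pages, system_base):
    """Map guest page n to a system page, laid out contiguously (assumption)."""
    first_system_page = system_base // PAGE_SIZE
    return {gpn: first_system_page + gpn for gpn in range(guest_pages)}


def dma_address(io_map, guest_phys_addr):
    """Translate an address a guest driver programmed into its device."""
    gpn, offset = divmod(guest_phys_addr, PAGE_SIZE)
    if gpn not in io_map:
        raise LookupError("address outside this guest's domain")
    return io_map[gpn] * PAGE_SIZE + offset


# The VMM gives each guest its own map before letting it run.
guest1 = build_io_map(guest_pages=4, system_base=0x1000_0000)
guest2 = build_io_map(guest_pages=4, system_base=0x2000_0000)

# Unmodified drivers in both guests use the same guest physical address.
addr1 = dma_address(guest1, 0x2040)
addr2 = dma_address(guest2, 0x2040)
```

A device in one guest has no entry that reaches the other guest's pages, so a stray or malicious DMA address simply fails to translate.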

The IOMMU doesn't support demand paging of system memory. It can't, because peripherals can't be told to retry an operation, which would be required to deal with page loads. DMA transfers to pages that are not present will simply fail. Since the VMM can't know what pages will be DMA targets, the VMM is required to lock the entire guest in memory in order to work with peripherals through the IOMMU.
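The consequence for the VMM can be shown with a toy memory model in which DMA to a paged-out page fails outright and the only remedy is to pin the whole guest up front. The `GuestMemory` class, `DmaFault` exception, and `dma_write` function are hypothetical names used only for this sketch.

```python
class DmaFault(Exception):
    """A device touched a page that is not resident; there is no retry."""


class GuestMemory:
    """Toy model of a guest's pages as seen by the VMM."""

    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.present = set(range(num_pages))
        self.pinned = set()

    def page_out(self, page):
        """Evict a page unless it is pinned; return whether it was evicted."""
        if page in self.pinned:
            return False
        self.present.discard(page)
        return True

    def pin_all(self):
        """Bring every page in and lock it, as required before device assignment."""
        self.present = set(range(self.num_pages))
        self.pinned = set(range(self.num_pages))


def dma_write(memory, page):
    if page not in memory.present:
        raise DmaFault(f"page {page} not present")
    return True


unpinned = GuestMemory(num_pages=8)
unpinned.page_out(2)          # the host reclaims a page ...
try:
    dma_write(unpinned, 2)    # ... and a later DMA to it fails
    failed = False
except DmaFault:
    failed = True

pinned = GuestMemory(num_pages=8)
pinned.pin_all()
evicted = pinned.page_out(2)  # refused: the page is locked
ok = dma_write(pinned, 2)
```

The model pins every page rather than a subset because, as noted above, the VMM cannot predict which pages a guest driver will choose as DMA targets.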

It's clear that AMD's IOMMU will make a big difference in virtualization overhead for I/O devices, by removing device emulation, by removing layers of translations, and by allowing native drivers to work with devices directly. It will be exciting to see what kind of performance results when this technology gets in the hands of VMM programmers.

Steve Apiki is senior developer at Appropriate Solutions, Inc., a Peterborough, NH consulting firm that builds server-based software solutions for a wide variety of platforms using an equally wide variety of tools. Steve has been writing about software and technology for over 15 years.