
Bringing full GPU support to Windows containers in Kubernetes

For the first time ever, unmodified Windows containers running in Kubernetes can now make use of full GPU acceleration, with individual device allocation, multi-tenancy settings and automatic support for all graphics APIs (including rendering, compute, and video encoding / decoding).

Support for GPU accelerated Windows containers was first introduced at the operating system level in October 2018 with the release of Windows Server 2019 and Windows 10 version 1809, and at the container runtime level with the release of Docker 19.03.0 in July 2019. In the three years that followed, however, GPU support in Windows containers trailed significantly behind that of Linux containers, particularly with respect to graphics API support and Kubernetes support. As a result, many Windows cloud applications that require full GPU acceleration have been forced to rely on virtual machines as their encapsulation and deployment mechanism, thus missing out on the myriad benefits provided by modern container-based deployments and their surrounding ecosystem of orchestration and monitoring frameworks.

This disparity has been particularly noticeable for Unreal Engine cloud applications that make use of Pixel Streaming to stream interactive video content to web browsers, since those applications that are able to target Linux can make full use of containers and Kubernetes thanks to the TensorWorks Scalable Pixel Streaming framework, whilst applications that are only able to target Windows have (until recently) been forced to rely on virtual machines. In June 2022, I set out to implement support for Windows containers in Scalable Pixel Streaming, and in doing so, created a set of Kubernetes device plugins that address all of the major outstanding limitations of GPU accelerated Windows containers and provide broad feature parity with GPU accelerated Linux containers.

After nine months of testing with Scalable Pixel Streaming beta participants and monitoring upstream software releases, I am delighted to announce that the TensorWorks Kubernetes Device Plugins for DirectX are now available on GitHub under the open source MIT License. Although more work is required to provide production-ready support for GPU accelerated Windows containers on managed Kubernetes services offered by public cloud providers, this open source release is the first step towards establishing GPU feature parity between Windows and Linux containers on all Kubernetes distributions, and empowering developers to build and deploy truly modern Windows cloud applications. Read on to learn more about the features provided by the new device plugins, and how you can make use of them both today and in the future.


Existing state of GPU support for Windows containers in Kubernetes

In August 2019, just one month after the release of Docker 19.03.0, developer Anthony Arnaud created the first Kubernetes device plugin for exposing DirectX devices to Windows containers. This device plugin enumerated the GPUs on each Kubernetes worker node by executing a Windows Management Instrumentation (WMI) query to identify all hardware devices with the Win32_VideoController class, which represents devices whose drivers are compliant with the Windows Display Driver Model (WDDM). These devices were then advertised to the kubelet with the resource identifier microsoft.com/directx so they could be allocated to containers that listed that identifier in their resource requests. Once a device had been allocated to a container by the kubelet, the device plugin made use of the newly-introduced Docker 19.03 flag --device class/5B45201D-F2F2-4F3B-85BB-30FF1F953599 to expose the device to the container, as documented by Microsoft’s GPU Acceleration in Windows containers article.
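For illustration, below is a minimal Go sketch of the kind of WMI query the plugin performed, written using the github.com/StackExchange/wmi helper package. This is an approximation for explanatory purposes rather than the original plugin’s actual code, and the struct fields shown are just a representative subset of the Win32_VideoController class properties.

package main

import (
    "fmt"

    "github.com/StackExchange/wmi"
)

// Win32_VideoController mirrors a subset of the properties exposed by the WMI
// class of the same name, which represents WDDM-compliant display devices.
type Win32_VideoController struct {
    Name        string
    PNPDeviceID string
}

func main() {
    var controllers []Win32_VideoController

    // Builds "SELECT Name, PNPDeviceID FROM Win32_VideoController" from the
    // struct definition above and runs it against the local WMI service.
    query := wmi.CreateQuery(&controllers, "")
    if err := wmi.Query(query, &controllers); err != nil {
        panic(err)
    }

    for _, gpu := range controllers {
        // Each device discovered in this manner would then be advertised to
        // the kubelet under the resource identifier microsoft.com/directx.
        fmt.Printf("%s (%s)\n", gpu.Name, gpu.PNPDeviceID)
    }
}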

This DirectX device plugin remained in active development for several months, before code commits ceased in November 2019. The repository’s issue tracker remained active and conversations continued until November 2020, after which all activity fell away and the repository was ultimately archived on the 3rd of November 2021. During this time, the plugin appears to have seen some traction with early adopters, and a container image for the plugin was hosted on the e2eteam Docker Hub organisation alongside the container images for the end-to-end testing scripts maintained by the Kubernetes SIG Windows special interest group. (Interestingly, I can’t actually see any references to this image in the history of the testing scripts repository, but the image was pulled over 100,000 times, so it was clearly in active use as part of a test suite somewhere. It was also mirrored to the new container registry hosted on Microsoft Azure when the images migrated in November 2020, and a new tag was even pushed for Windows Server version 2004.) I also suspect that this plugin may have been the basis for the (presumably modified) device plugin used by Alibaba Cloud to enable support for GPU accelerated Windows containers in Alibaba Cloud Container Service for Kubernetes (ACK), particularly since that plugin has not received a new release since development of the original plugin ceased. This is purely speculation on my part though, and since the source code for the Alibaba Cloud device plugin does not appear to be publicly available, I have no way of confirming its relation (if any) to the original device plugin.

Despite a promising start, the functionality of this first DirectX device plugin was unfortunately constrained by the inherent limitations of GPU support in Windows containers at the time:

  • No support for individual device allocation: back in 2019, the only documented mechanism for exposing DirectX devices to Windows containers was to specify the GUID_DEVINTERFACE_DISPLAY_ADAPTER device interface class GUID when assigning a device to a container. Because this is a device interface class, it represents all devices that support that interface, rather than individual device instances. The practical upshot of this distinction is that all GPUs that are present on the host system are exposed to every container that is launched with the device interface class GUID, making it impossible to assign individual devices to each container. The DirectX device plugin accounted for this fact by exposing three environment variables to each container that describe which GPU was intended to be allocated to that container. However, since applications running inside the container could still see all of the GPUs from the host system, it was up to individual applications to read the values of those environment variables and select the appropriate device when making DirectX API calls, resulting in incorrect behaviour when running unmodified applications that are unaware of the environment variables. The only way to ensure unmodified applications behaved as expected was to restrict usage to Kubernetes worker nodes with only a single GPU each.

  • No support for non-DirectX APIs: to this day, Microsoft’s documentation clearly states that DirectX, and frameworks built atop DirectX, are the only supported APIs when running GPU accelerated Windows containers. Back in 2019, the only publicly available evidence to contradict this assertion was a brief article tucked away in the Windows drivers documentation, and GPU vendors had presumably only just begun to implement and test these new container features with their device drivers. When my colleagues and I at TensorWorks later began testing non-DirectX APIs in Windows containers in 2021, we found a number of outstanding driver bugs that were then reported to device vendors and fixed. During the majority of the time period in which the first DirectX device plugin was in active development, many non-DirectX APIs genuinely were non-functional inside GPU accelerated Windows containers, and there was nothing the plugin could do to address this limitation.

When I set out to implement support for Windows containers in Scalable Pixel Streaming in June 2022, the only existing Kubernetes device plugin for DirectX was already unmaintained, and rapidly falling out of date with respect to newer Kubernetes releases. No one else appeared to be making any publicly visible efforts to overhaul the existing device plugin or create a new one, and the barrier to entry for developers seeking to run GPU accelerated Windows containers on Kubernetes was extremely high. It was clear that a new device plugin was needed to address this gap, and that such a plugin should be designed from the ground-up to take advantage of container features introduced in newer versions of Windows to remedy the limitations that had constrained its predecessor.

Design goals

When designing a new Kubernetes device plugin for DirectX, I had the following goals in mind:

  • Allocate individual devices to containers: the ability to allocate individual GPUs to containers and expose only the allocated GPU to applications running in any given container is critical for optimising scalability and cost-effectiveness when running in managed Kubernetes environments where the worker nodes are virtual machines. In these environments, the ability to use VM instances with multiple GPUs per worker node reduces the frequency with which new VMs need to be created in response to increases in workload demand, thus reducing scheduling latency (the amount of time it takes for the Kubernetes cluster to schedule a new Pod on a worker node once the Pod has been created) and improving overall responsiveness. This can lead to cost savings by reducing the extent to which a given cluster needs to be overprovisioned to meet responsiveness targets during periods of high demand, and also by making it possible to select the VM instance type or SKU that provides the best value for money with respect to a given application’s resource requirements. The ability to allocate individual GPUs to containers is also a prerequisite to running the device plugin on large bare metal worker nodes with many GPUs, which is common in on-premises datacentres.

  • Automatically enable support for non-DirectX APIs: the ability to use non-DirectX APIs is of course a prerequisite for running applications that rely on these APIs. The motivating example is the Unreal Engine’s Pixel Streaming functionality, which relies on vendor-specific hardware video encoding APIs and cannot run correctly without them. Although TensorWorks discovered that it is possible to enable support for additional graphics APIs through modifications to the container image entrypoint (which is discussed in the section on non-DirectX APIs below), it is important that the device plugin enables support for these additional APIs automatically. This eliminates the need for custom entrypoint scripts and makes it possible to run unmodified container images, further reducing the initial barrier to entry for developers who are getting started running GPU accelerated Windows containers on Kubernetes.

  • Support GPU sharing for multi-tenancy scenarios: even when considering the scalability and cost savings provided by the individual device allocation feature, GPUs remain relatively expensive when compared to other cloud resources. The ability to share a single GPU between multiple containers facilitates greater deployment densities, and also makes it possible to run multiple application instances on powerful GPUs that would otherwise be under-utilised if they were used to run a single application instance. Multi-tenancy functionality can also compound the benefits provided by the use of worker nodes with multiple GPUs, yielding even greater improvements to responsiveness and cost-effectiveness when SKUs are carefully selected to suit the resource requirements of a given application.

  • Support the widest possible variety of use cases: although the new device plugin was intended primarily for use by Scalable Pixel Streaming, it was important that it be generally applicable to other use cases as well, in order to provide the maximum benefit for the broader development community. This goal resulted in the initial design for a singular Kubernetes device plugin being expanded into a design for a pair of matching device plugins that reflect the two Microsoft driver models for DirectX hardware accelerators. This enables the use of a broader variety of devices and facilitates potential future use cases involving hardware other than GPUs, such as dedicated machine learning accelerators.

The four sections below discuss the background information and research that was necessary in order to adequately satisfy each of these design goals. The Architecture and implementation section then discusses how these findings influenced the design and implementation of the Kubernetes Device Plugins for DirectX.

Goal 1: Allocate individual devices to containers

Software components that influence device support

In order to understand the limitations that previously prevented individual devices from being allocated to Windows containers, it is necessary to first take a brief look at the stack of software components that are used to run containers under Windows. The precise set of user mode components varies depending on which container runtime is used, but the operating system components always remain the same:

Figure 1: The software stacks used when running Windows containers with containerd and Docker, the two container runtimes that currently support Windows containers.

Operating system components:

  • Windows kernel: the Windows kernel provides the low-level isolation and resource management features required to run process-isolated Windows containers. The two main features of interest are “Silos”, which make it possible to partition the kernel’s Object Manager namespace into multiple isolated views of the underlying system (roughly analogous to Linux namespaces), and Job Objects, which provide functionality for managing and restricting the resources available to groups or trees of processes (roughly analogous to Linux cgroups). These low-level details are discussed at length in the DockerCon 2016 presentation Windows Server & Docker: The Internals Behind Bringing Docker & Containers to Windows.

  • Windows Host Compute Service (HCS): the Windows kernel does not expose direct access to its system calls, and instead abstracts them behind system services that make kernel requests on behalf of user mode applications. The service responsible for abstracting access to features for both Windows containers and Hyper-V VMs is the Host Compute Service (HCS). Unlike many existing Windows system services that expose the data structures for their public APIs through C struct objects or COM components, the HCS instead exposes the majority of its data structures through an OpenAPI JSON schema, and its C API operates primarily on JSON strings for both input and output data. This approach makes it easier for Microsoft to introduce new container features with each release of Windows without breaking compatibility with existing code that uses the HCS API, since new minor schema versions are backwards compatible with their predecessors, and applications must explicitly opt in to use a new major schema version with breaking API changes. Applications can also use checks provided by wrapper libraries to detect the highest schema version supported by the operating system at runtime (such as this check provided by the hcsshim wrapper), and automatically adapt their behaviour to reflect the features supported by the version of Windows under which they are running.

    It is important to note that when the HCS was first introduced, the OpenAPI JSON schema was not made publicly available and the C API was not documented. Instead, Microsoft encouraged developers to make use of the hcsshim Go wrapper library or the dotnet-computevirtualization C# wrapper library, both of which included code that had been generated using the private OpenAPI schema. It wasn’t until August 2020 that the HCS section was added to Microsoft’s virtualisation documentation and the OpenAPI JSON schema was finally made public. Even today, only schema versions 2.0 and newer have been publicly released, and the older 1.x schema remains private. This is no doubt to discourage use of the HCS 1.0 schema, which is deprecated and suffered from significant limitations that were subsequently addressed by the 2.0 schema when it was introduced in Windows Server 2019 and Windows 10 version 1809. The 2.0 schema is also the minimum version required to run Kubernetes Pods, so it is the 2.x versions that are of interest when discussing the development of the Kubernetes Device Plugins for DirectX.

Components used when running Windows containers with containerd:

  • containerd-shim-runhcs-v1: this is the containerd shim implementation that is used for running Windows containers. The shim code is part of the hcsshim repository, and it makes use of internal functionality that is not exposed as part of the public API of hcsshim. Like all containerd shim implementations, this shim is responsible for communicating with the underlying runtime (in this case the HCS) to manage individual containers, which includes but is not limited to: starting / stopping / pausing / resuming containers, handling container I/O streams, executing commands in containers, publishing container lifecycle events, and reporting statistics. For the full list of responsibilities, see the protobuf service definition for the containerd runtime v2 shim API.

    It is worth noting that the hcsshim repository also includes a tool called runhcs, which is a fork of the runc command line tool used to run containers under Linux, and is designed to act as a drop-in replacement for compatibility with applications that are designed to communicate directly with runc. Older versions of containerd used to use runhcs under Windows, and Microsoft’s Container platform tools on Windows article still includes a diagram depicting this. When the containerd runtime v2 API was introduced to provide a consistent interface for shim implementations, new shim code was introduced in the containerd repository to replace runhcs. That code was subsequently reimplemented in the hcsshim repository as the containerd-shim-runhcs-v1 implementation that is used today, and the old implementation was removed, along with the last remaining support for runhcs. As of today, runhcs appears to be entirely unused and considered deprecated, and in a February 2021 issue comment on the hcsshim repository, Microsoft employee Kathryn Baldauf stated that runhcs is slated for eventual removal.

  • containerd: containerd was originally split out from the Docker codebase in 2016 to expose low-level container management functionality to a broader variety of client tools through its gRPC API. Since then, containerd has grown to become one of the most widely used container runtimes in Kubernetes environments, and has been the default container runtime for Windows nodes in most cloud providers’ managed Kubernetes offerings since the removal of support for Dockershim in Kubernetes 1.24. Although containerd manages the lifecycle of containers by communicating with the containerd-shim-runhcs-v1 shim, it still interacts directly with the HCS for other tasks, such as managing Windows container image filesystem snapshots.

    Until recently, released versions of containerd did not support exposing devices such as GPUs to Windows containers. Support was added by developer Paul Hampson (one of the co-maintainers of my ue4-docker project) in pull request #6618, which was merged into the main branch in commit d4641e1 back in March 2022. The first release to include this code is containerd version 1.7.0, which had just been released as of the time of writing. In addition to plumbing through support for devices, the code introduced support for a generic IDType://ID device string syntax to complement the existing class/GUID syntax. Since containerd uses the HCS 2.x schema through its communication with containerd-shim-runhcs-v1, the permitted values for the IDType and ID fields used to specify devices should be limited only by the underlying operating system and the specific version of the HCS schema that it supports.

  • ctr and nerdctl: these are frontend command-line tools that communicate with containerd to run containers. ctr ships with containerd, and is designed primarily for testing and debugging containerd itself. nerdctl is designed to be a more user-friendly frontend to containerd, and acts as a drop-in replacement for the Docker client by exposing a compatible set of commands and arguments. ctr received support for exposing devices to Windows containers in the same pull request as containerd itself, and nerdctl is slated to receive support in pull request #2079.

Components used when running Windows containers with Docker:

  • Docker daemon: the Docker daemon is the server component of the Docker container runtime, which contains all of the logic involved in actually running containers. Under Linux, the Docker daemon has communicated with containerd to manage container lifecycles ever since containerd was split out into a separate codebase, but under Windows it currently interacts directly with the HCS, evidently still using the original 1.0 schema. The first HCS code was added to the Docker daemon back in July 2015, prior to the official introduction of Windows containers in Windows Server 2016, and long before support for running Windows containers was implemented in containerd. When support was eventually added to containerd, the Windows container code in the two projects proceeded to grow in parallel, diverging significantly with respect to feature support as new Kubernetes-specific functionality such as HostProcess containers was added to containerd.

    There are ongoing efforts being undertaken by the community to converge these two disparate codebases and update the Docker daemon to run containers using containerd under Windows in the same manner as it does under Linux. Newer versions of the Docker daemon include experimental support for switching to containerd as the underlying runtime under Windows, but this is not yet the default behaviour and can only be enabled by separately installing containerd and passing the path of its named pipe to the Docker daemon at startup. With regards to device support, Paul Hampson added support for the generic IDType://ID device string syntax in April 2022 and this change is present in Docker 23.0.0 and newer, but the limitations of the HCS 1.0 schema still prevent the use of IDType values other than class and vpci-class-guid, which specify device interface class GUIDs. When the Docker daemon is configured to use containerd as the underlying runtime, the same values should be permitted as when using containerd directly.

  • Docker client: this is the frontend command-line tool that communicates with the Docker daemon to run containers. It passes device strings directly to the Docker daemon without parsing or modifying them, so device support is dictated only by the daemon itself.

Device support in the HCS schema

As discussed above, the generic IDType://ID device string syntax can be used to specify which devices will be exposed to containers that are run with containerd 1.7.0 or newer, or Docker 23.0.0 or newer. There are two main factors that determine the permitted values for these fields:

  • The maximum version of the Host Compute Service (HCS) schema that is supported by both the operating system and the container runtime.

  • Whether a given container is running in process isolation mode or Hyper-V isolation mode. Since Kubernetes does not currently support Hyper-V isolated Windows containers, only values that function correctly with process-isolated Windows containers were considered when developing the Kubernetes Device Plugins for DirectX.

Within the containerd codebase, the IDType and ID values are mapped to the corresponding fields in the WindowsDevice structure from the Open Container Initiative (OCI) Runtime Specification. An array of these objects is included in the Windows structure, which defines configuration options that are specific to Windows containers. This in turn is used by the top-level Spec structure, which specifies all of the configuration options for a given container. It is this OCI Spec object that is passed from containerd to hcsshim so it can be translated to HCS schema structures.
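To make the mapping concrete, below is a minimal Go sketch that builds the relevant fragment of an OCI Spec by hand using the published runtime-spec Go types. It is an illustration of the data structures involved rather than containerd’s actual parsing code, and the PCIe Location Path shown is the same example value that appears later in this article.

package main

import (
    "encoding/json"
    "fmt"

    specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
    // The Windows-specific portion of an OCI runtime Spec describing a single
    // assigned device, as containerd would populate it after parsing the
    // device string "vpci-location-path://PCIROOT(0)#PCI(1B00)".
    windows := &specs.Windows{
        Devices: []specs.WindowsDevice{
            {
                IDType: "vpci-location-path",
                ID:     "PCIROOT(0)#PCI(1B00)",
            },
        },
    }

    // Print the fragment as JSON to show the fields that hcsshim subsequently
    // translates into HCS schema structures.
    encoded, err := json.MarshalIndent(windows, "", "  ")
    if err != nil {
        panic(err)
    }
    fmt.Println(string(encoded))
}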

When dealing with process-isolated Windows containers, the function in the hcsshim codebase that is responsible for translating OCI Spec objects to HCS schema structures is hcsoci.createWindowsContainerDocument(). The HCS schema object that it produces is (unsurprisingly) the Container structure, which represents the configuration options for a process-isolated Windows container. The OCI Windows.Devices field is mapped to the HCS schema Container.AssignedDevices field, and the OCI WindowsDevice structure is mapped to the HCS schema Device structure. These device objects are translated by the function hcsoci.parseAssignedDevices(). The OCI WindowsDevice.IDType field is translated to a member of the HCS schema DeviceType enum and mapped to the HCS schema Device.Type field, and the OCI WindowsDevice.ID field is mapped to different HCS schema fields depending on the value of IDType.

With the HCS schema Device structure definition and the accompanying translation function at hand, it becomes immediately clear that the following IDType values can be used with process-isolated Windows containers:

  • “class” and “vpci-class-guid” are mapped to Device.Type as the HCS schema DeviceType.ClassGuid enum value, which indicates that the ID value represents a device interface class GUID. The ID value is mapped to the HCS schema Device.InterfaceClassGuid field. As seen previously, this exposes all devices that support that interface to the container. (It is interesting to note that the translation function actually leaves the Device.Type field empty in the relevant code paths here, but since the description of the Device.InterfaceClassGuid field explicitly states that it is “only used when Type is ClassGuid”, we can infer that DeviceType.ClassGuid is the default value for the Device.Type field when a value has not been explicitly specified.)

  • “vpci-location-path” is mapped to Device.Type as the DeviceType.DeviceInstance enum value, which indicates that the ID value represents an individual device instance. The ID value is mapped to the HCS schema Device.LocationPath field, which represents the physical PCIe Location Path of a device. PCIe location path values are unique for each individual device in a system, and only ever change when a device is physically moved to a different PCIe slot. Based on the available information, it stands to reason that this should expose an individual device to the container.

Elsewhere in the hcsshim codebase, the cri-containerd test suite confirms that these are the only values applicable to process-isolated Windows containers (and also demonstrates the existence of gpu:// syntax for Linux containers and vpci:// syntax for Hyper-V isolated Windows containers, although a set of constants in another file suggests that vpci:// is actually the legacy syntax for vpci-instance-id://, in much the same way that class:// is considered legacy syntax for vpci-class-guid://). The full set of supported ID types is summarised below for easier comprehension at a glance:

Device string syntax:     class/GUID, class://GUID, vpci-class-guid://GUID
HCS schema Device values: .Type = "ClassGuid", .InterfaceClassGuid = <GUID>
Intended use:             Exposes all devices that support the specified device interface class to the container.

Device string syntax:     vpci-location-path://PATH
HCS schema Device values: .Type = "DeviceInstance", .LocationPath = <PATH>
Intended use:             Exposes the individual device with the specified PCIe Location Path to the container.
Table 1: The full set of IDType://ID device string values that are supported when running process-isolated Windows containers, and the manner in which they are interpreted by hcsshim and the Windows Host Compute Service (HCS).
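To illustrate the second row of the table, the fragment of the HCS 2.x document that hcsshim would submit for an individually assigned device might look roughly like the following. This is a hand-written sketch based on the documented field names, not output captured from a real system, and it omits all of the other fields that a complete container document requires:

{
  "SchemaVersion": { "Major": 2, "Minor": 2 },
  "Container": {
    "AssignedDevices": [
      {
        "Type": "DeviceInstance",
        "LocationPath": "PCIROOT(0)#PCI(1B00)"
      }
    ]
  }
}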

Environments that support individual device allocation

We now know that the appropriate device string syntax to expose individual GPUs to process-isolated Windows containers is vpci-location-path://PATH. The next step is to determine which environments support this syntax. According to the HCS JSON Schema Reference, the Device.LocationPath field was first introduced in HCS schema version 2.2. The Schema Version Map indicates that the HCS 2.2 schema was introduced with the Windows 10 SDK, version 1809 (10.0.17763.0). Somewhat confusingly, the version map lists Windows SDK releases for some schema versions and Windows OS build numbers for others. I can think of two potential reasons for this:

  • Option A: Windows SDK releases are specified for schema versions that were introduced in Windows kernel builds that are shared by both a client and server version of the operating system (e.g. Windows Server, version 1903 and Windows 10, version 1903), or in some cases, by a client version and two server versions (e.g. Windows Server 2019, Windows Server, version 1809, and Windows 10, version 1809) and are used as a more concise alternative to listing multiple versions of Windows.

  • Option B: Windows SDK releases are used to indicate schema versions that require an update to the HCS and its API but not to the Windows kernel itself, and are therefore compatible with older Windows OS builds so long as they are fully patched to include an updated version of the HCS. I consider this unlikely, given that the HCS C API remains stable and that the schema’s OpenAPI JSON specification is not included in the Windows SDK, so there is little reason for new schema versions to necessitate building against a newer version of the Windows SDK. I would have dismissed this explanation out of hand, were it not for the peculiarities of the hcsshim schema compatibility detection logic that are described below.

As previously mentioned, hcsshim provides a function to detect the highest schema version supported by the operating system at runtime. However, at the time of writing, this detection function only makes a distinction between schema versions 1.0, 2.1 and 2.5. The version checking logic indicates that all versions of Windows support the HCS 1.0 schema, that all versions of Windows since Windows Server 2019 (codename “Redstone 5” or “RS5”) support the HCS 2.1 schema, and that all versions of Windows since Windows Server 2022 support the HCS 2.5 schema. Although a comment in the code indicates that the check for the HCS 2.1 schema also includes version 2.0, it is unclear whether the HCS 2.1 schema check is also intended to act as a catch-all for HCS schema versions 2.2 through 2.4. If this is indeed the intent, then it would indicate that all versions of Windows since Windows Server 2019 also support these newer schema versions, which would be consistent with the “Option B” interpretation of the Windows SDK release numbers that are listed as the requirement for those schema versions. It is important to understand whether this is the case, since the answer determines which versions of Windows Server support allocating individual devices to containers.
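At the level of the HCS C API, this question can be put to the service itself, which will report the schema versions it supports. Below is a minimal Go sketch of such a query; it is only an illustration (assuming the documented HcsGetServiceProperties() signature and a "Basic" property query, with error handling abbreviated), not the actual implementation of the tool described next.

package main

import (
    "fmt"
    "unsafe"

    "golang.org/x/sys/windows"
)

var (
    computecore             = windows.NewLazySystemDLL("computecore.dll")
    hcsGetServiceProperties = computecore.NewProc("HcsGetServiceProperties")
)

func main() {
    // Ask the Host Compute Service for its basic service properties.
    query, _ := windows.UTF16PtrFromString(`{"PropertyTypes":["Basic"]}`)

    var result *uint16
    hr, _, _ := hcsGetServiceProperties.Call(
        uintptr(unsafe.Pointer(query)),
        uintptr(unsafe.Pointer(&result)),
    )
    if hr != 0 {
        panic(fmt.Sprintf("HcsGetServiceProperties failed with HRESULT 0x%08X", hr))
    }
    defer windows.LocalFree(windows.Handle(uintptr(unsafe.Pointer(result))))

    // The result is a JSON document whose BasicInformation.SupportedSchemaVersions
    // array lists the maximum supported minor version for each major version.
    fmt.Println(windows.UTF16PtrToString(result))
}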

To determine the answer empirically, I created a small tool called query-hcs-capabilities within the source tree of the Kubernetes Device Plugins for DirectX. The tool uses the HcsGetServiceProperties() function to query the HCS and retrieve the list of supported schema versions. (Note that although the BasicInformation.SupportedSchemaVersions result is an array, it will only contain one element per schema major version number, which indicates the maximum supported minor version number for that schema series.) This information is printed to standard output, along with the Windows OS build number and descriptive product string. Here are the outputs when running the tool under different versions of Windows, sorted in ascending order of OS build number:

  • Windows Server 2019 (LTSC, currently supported and receiving security updates):

    Operating system version:
    Microsoft Windows Server 2019 Datacenter, version 1809 (OS build 10.0.17763.4010)
      
    Supported HCS schema versions:
    - 2.1
    
  • Windows Server, version 1903 (SAC, support ended on the 8th of December 2020):

    Operating system version:
    Microsoft Windows Server Datacenter, version 1903 (OS build 10.0.18362.1256)
      
    Supported HCS schema versions:
    - 2.2
    
  • Windows Server, version 1909 (SAC, support ended on the 11th of May 2021):

    Operating system version:
    Microsoft Windows Server Datacenter, version 1909 (OS build 10.0.18363.1556)
      
    Supported HCS schema versions:
    - 2.2
    
  • Windows Server, version 2004 (SAC, support ended on the 14th of December 2021):

    Operating system version:
    Microsoft Windows Server Datacenter, version 2004 (OS build 10.0.19041.1415)
      
    Supported HCS schema versions:
    - 2.3
    
  • Windows Server, version 20H2 (SAC, support ended on the 9th of August 2022):

    Operating system version:
    Microsoft Windows Server Datacenter, version 20H2 (OS build 10.0.19042.1889)
      
    Supported HCS schema versions:
    - 2.3
    
  • Windows 10, version 22H2 (client OS, included purely as a matter of interest):

    Operating system version:
    Microsoft Windows 10 Pro, version 22H2 (OS build 10.0.19045.2604)
      
    Supported HCS schema versions:
    - 2.3
    
  • Windows Server 2022 (LTSC, currently supported and receiving security updates):

    Operating system version:
    Microsoft Windows Server 2022 Datacenter, version 21H2 (OS build 10.0.20348.1547)
      
    Supported HCS schema versions:
    - 2.5
    
  • Windows 11, version 22H2 (client OS, included purely as a matter of interest):

    Operating system version:
    Microsoft Windows 11 Pro, version 22H2 (OS build 10.0.22621.1344)
      
    Supported HCS schema versions:
    - 2.7
    

These results demonstrate the following:

  • The “Option A” interpretation of the Windows SDK release numbers listed in the HCS schema version map is indeed correct. New HCS schema versions require new builds of the Windows kernel, and any Windows SDK release number listed as a requirement for a given schema version indicates that a version of Windows with the corresponding OS build number is required. This makes sense, for both the reasons discussed above and also because the primary focus of the (now discontinued) Semi-Annual Channel (SAC) release series for Windows Server was to introduce new container features.

  • Of the Windows Server Long-Term Servicing Channel (LTSC) releases that are currently supported by Microsoft and receiving security updates, only Windows Server 2022 supports the HCS 2.2 schema features required in order to allocate individual devices to containers using the vpci-location-path://PATH device string syntax. All of the SAC releases also support this schema version, but they are no longer receiving security updates from Microsoft and should not be used. The latest client versions of Windows support this schema version as well, but client versions of Windows are typically not available on public cloud platforms for licensing reasons, and thus are not typically supported for use as worker nodes by managed Kubernetes services.

  • The HCS schema documentation on Microsoft’s website appears to be slightly out of date, since Windows 11 evidently supports the HCS 2.7 schema, but at the time of writing, the version map only lists schema versions up to and including 2.6.

  • The schema version compatibility check in hcsshim is not comprehensive and does not fully represent the requirements of HCS schema versions 2.2 through 2.4. The code in hcsoci.parseAssignedDevices() that uses newer schema fields is gated only behind a check for HCS 2.1 schema support, which allows these fields to be used erroneously under Windows Server 2019 and submitted to the HCS. Attempting to run a container using the vpci-location-path://PATH device string syntax under Windows Server 2019 confirms this, and displays the expected error message when the HCS rejects the fields that it does not recognise:

    > ctr run --device "vpci-location-path://PCIROOT(0)#PCI(1E00)" "mcr.microsoft.com/windows/nanoserver:ltsc2019" testing cmd
      
    ctr: failed to create shim task: hcs::CreateComputeSystem testing: The virtual machine or container JSON document is invalid.: unknown
    

Windows bugs that affect individual device allocation

We have now determined that Windows Server 2022 or newer is required to expose individual GPUs to process-isolated Windows containers that are running in public cloud environments. The documentation for the HCS schema Device structure and its accompanying DeviceType enum indicate clearly that individual devices should be allocated to a container when hcsshim translates vpci-location-path://PATH device strings into the corresponding schema values.

There’s just one problem. It doesn’t actually function as intended when the device in question is a GPU.

Allocating individual devices works correctly for other types of hardware, such as COM ports. Below is the output of the chgport command when run directly on a Windows Server 2022 host system that has multiple COM ports:

> chgport

AUX = \DosDevices\COM1
COM1 = \Device\Serial0
COM3 = \Device\Serial1
COM4 = \Device\Serial2

When an individual COM port is allocated to a process-isolated Windows container using the vpci-location-path://PATH device string syntax, the chgport command running inside the container only sees the individual COM port that was exposed, as expected:

> ctr run --device "vpci-location-path://ACPI(_SB_)#ACPI(PCI0)#ACPI(ISA_)#ACPI(COM1)" "mcr.microsoft.com/windows/servercore:ltsc2022" testing chgport

AUX = \DosDevices\COM1
COM1 = \Device\Serial0

The behaviour is very different when the allocated device is a GPU. The example below makes use of the test-device-discovery-cpp tool, which is included as part of the Kubernetes Device Plugins for DirectX. This is the output when the tool is run directly on a Windows Server 2022 host system that has multiple GPUs (note that the output is truncated to two of the four GPUs for brevity):

> test-device-discovery-cpp.exe

DirectX device discovery library version 0.0.1
Discovered 4 devices.

[Device 0 details]

PnP Hardware ID:     PCI\VEN_10DE&DEV_1EB8&SUBSYS_12A210DE&REV_A1\3&13C0B0C5&1&D8
DX Adapter LUID:     7048070
Description:         NVIDIA Tesla T4
Driver Registry Key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e968-e325-11ce-bfc1-08002be10318}\0004
DriverStore Path:    C:\Windows\System32\DriverStore\FileRepository\nvgridsw_aws.inf_amd64_1b50ef32b191cff1
LocationPath:        PCIROOT(0)#PCI(1B00)
Vendor:              NVIDIA
Is Integrated:       false
Is Detachable:       false
Supports Display:    true
Supports Compute:    true

8 Additional System32 runtime files:
    nvcudadebugger.dll => nvcudadebugger.dll
    nvcuda_loader64.dll => nvcuda.dll
    nvcuvid64.dll => nvcuvid.dll
    nvEncodeAPI64.dll => nvEncodeAPI64.dll
    nvapi64.dll => nvapi64.dll
    nvml_loader.dll => nvml.dll
    OpenCL64.dll => OpenCL.dll
    vulkan-1-x64.dll => vulkan-1.dll

6 Additional SysWOW64 runtime files:
    nvcuda_loader32.dll => nvcuda.dll
    nvcuvid32.dll => nvcuvid.dll
    nvEncodeAPI.dll => nvEncodeAPI.dll
    nvapi.dll => nvapi.dll
    OpenCL32.dll => OpenCL.dll
    vulkan-1-x86.dll => vulkan-1.dll

[Device 1 details]

PnP Hardware ID:     PCI\VEN_10DE&DEV_1EB8&SUBSYS_12A210DE&REV_A1\3&13C0B0C5&1&E0
DX Adapter LUID:     7098126
Description:         NVIDIA Tesla T4
Driver Registry Key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e968-e325-11ce-bfc1-08002be10318}\0005
DriverStore Path:    C:\Windows\System32\DriverStore\FileRepository\nvgridsw_aws.inf_amd64_1b50ef32b191cff1
LocationPath:        PCIROOT(0)#PCI(1C00)
Vendor:              NVIDIA
Is Integrated:       false
Is Detachable:       false
Supports Display:    true
Supports Compute:    true

8 Additional System32 runtime files:
    nvcudadebugger.dll => nvcudadebugger.dll
    nvcuda_loader64.dll => nvcuda.dll
    nvcuvid64.dll => nvcuvid.dll
    nvEncodeAPI64.dll => nvEncodeAPI64.dll
    nvapi64.dll => nvapi64.dll
    nvml_loader.dll => nvml.dll
    OpenCL64.dll => OpenCL.dll
    vulkan-1-x64.dll => vulkan-1.dll

6 Additional SysWOW64 runtime files:
    nvcuda_loader32.dll => nvcuda.dll
    nvcuvid32.dll => nvcuvid.dll
    nvEncodeAPI.dll => nvEncodeAPI.dll
    nvapi.dll => nvapi.dll
    OpenCL32.dll => OpenCL.dll
    vulkan-1-x86.dll => vulkan-1.dll

When an individual GPU is allocated to a process-isolated Windows container using the vpci-location-path://PATH device string syntax, the test-device-discovery-cpp command running inside the container sees all of the GPUs that are present on the host system (note that the output is again truncated to two of the four GPUs, and verbose logging lines have also been removed for brevity):

> ctr run --device "vpci-location-path://PCIROOT(0)#PCI(1B00)" "index.docker.io/tensorworks/example-device-discovery:0.0.1" testing

DirectX device discovery library version 0.0.1
Discovered 4 devices.

[Device 0 details]

PnP Hardware ID:     PCI\VEN_10DE&DEV_1EB8&SUBSYS_12A210DE&REV_A1\3&13C0B0C5&1&D8
DX Adapter LUID:     7048070
Description:         NVIDIA Tesla T4
Driver Registry Key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e968-e325-11ce-bfc1-08002be10318}\0004
DriverStore Path:    C:\Windows\System32\HostDriverStore\filerepository\nvgridsw_aws.inf_amd64_1b50ef32b191cff1
LocationPath:        PCIROOT(0)#PCI(1B00)
Vendor:              NVIDIA
Is Integrated:       false
Is Detachable:       false
Supports Display:    true
Supports Compute:    true

0 Additional System32 runtime files:

0 Additional SysWOW64 runtime files:

[Device 1 details]

PnP Hardware ID:     PCI\VEN_10DE&DEV_1EB8&SUBSYS_12A210DE&REV_A1\3&13C0B0C5&1&E0
DX Adapter LUID:     7098126
Description:         NVIDIA Tesla T4
Driver Registry Key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e968-e325-11ce-bfc1-08002be10318}\0005
DriverStore Path:    C:\Windows\System32\HostDriverStore\filerepository\nvgridsw_aws.inf_amd64_1b50ef32b191cff1
LocationPath:        PCIROOT(0)#PCI(1C00)
Vendor:              NVIDIA
Is Integrated:       false
Is Detachable:       false
Supports Display:    true
Supports Compute:    true

0 Additional System32 runtime files:

0 Additional SysWOW64 runtime files:

This is clearly a bug in either the HCS or the Windows kernel itself, and I have reported it to Microsoft. Although I can only speculate as to the underlying cause, it is notable that the DirectX Graphics Kernel (located at \Device\DxgKrnl in the Object Manager namespace, and discussed in greater detail in the section below) is mounted into each container that accesses one or more GPUs. My understanding is that this component is ultimately responsible for DirectX device enumeration, so it is possible that there is a bug in the enumeration logic whereby it simply fails to filter devices based on the caller’s Silo.

With the pertinent background information established, it is clear that the vpci-location-path://PATH device string syntax is the correct choice for use in the Kubernetes Device Plugins for DirectX. Although clusters will be limited to worker nodes with only a single GPU each until the relevant bug is fixed by Microsoft, the use of this device string syntax ensures that the desired behaviour will take effect immediately after updating to a patched version of Windows Server that includes the bugfix. Since none of the SAC releases of Windows Server are still supported, this choice of device string syntax also means that Windows Server 2022 is the minimum version requirement for the Kubernetes Device Plugins for DirectX.

Goal 2: Automatically enable support for non-DirectX APIs

As mentioned previously, the GPU Acceleration in Windows containers article from the Microsoft documentation states that DirectX, and frameworks built atop DirectX, are the only supported graphics APIs when running GPU accelerated Windows containers. Back in 2021, my colleagues at TensorWorks and I conducted an investigation into this limitation and discovered that it is actually possible to enable support for additional graphics APIs inside process-isolated Windows containers. These findings were documented in the July 2021 article Enabling vendor-specific graphics APIs in Windows containers on the Unreal Containers community hub website.

The next three sections recap the relevant background information regarding the Windows Display Driver Model (WDDM) and the way it functions inside GPU accelerated Windows containers. The material in these sections is reproduced verbatim from Enabling vendor-specific graphics APIs in Windows containers, with only minor formatting changes to ensure consistency with the surrounding article and to resolve ambiguities that arose when transplanting the text to a new context. The section titles are prefixed with “[Recap]” to distinguish them at a glance from sections which contain original material. If you have already read Enabling vendor-specific graphics APIs in Windows containers, feel free to skip ahead to the section Enabling non-DirectX APIs for unmodified containers, where the surrounding article’s original material resumes.

[Recap] WDDM Architecture

In order to understand how GPUs are accessed from inside Windows containers, it is first necessary to understand how graphics devices are accessed under Windows in general. Since Windows 8, all graphics drivers must conform to the Windows Display Driver Model (WDDM), which defines how graphics APIs and Windows system services interact with driver code. An overview of the architecture of WDDM is depicted in Figure 2:


Figure 2: Architecture of the Windows Display Driver Model (WDDM) as it pertains to Windows containers. Adapted from the WDDM architecture diagram in the X.Org Developers Conference 2020 presentation “WSL Graphics Architecture” (slides, video) and from the architecture diagram on the Windows Display Driver Model (WDDM) Architecture page of the Windows Driver Documentation, which is Copyright © Microsoft Corporation and is licensed under a Creative Commons Attribution 4.0 International License.

The key components depicted in Figure 2 are as follows:

  • Direct3D API: this is the public Direct3D graphics API, which forms part of the DirectX family of APIs and is designed for use by user-facing applications such as the Unreal Engine. The DirectX header files that define this API are available in the DirectX-Headers repository on GitHub.

  • Other graphics APIs: all other graphics APIs which are not based on DirectX, such as OpenGL, OpenCL, Vulkan, etc. For a list of common graphics APIs and their relation to Unreal Engine use cases, see the graphics APIs section of the GPU acceleration page from the Unreal Containers community hub documentation.

  • WDDM Interface: this is the set of low-level WDDM APIs that expose kernel graphics functionality to userspace code. The API functions are prefixed with D3DKMT and are defined in the header d3dkmthk.h, which is available as part of the Windows Driver Kit (WDK) and also as part of the Windows 10 SDK for Windows 10 version 2004 and newer.

  • DirectX Graphics Kernel: this is the component of the Windows kernel that implements the underlying functionality exposed by the API functions of the WDDM Interface. The graphics kernel is responsible for ensuring GPU resources can be shared between multiple userspace processes, and performs tasks such as GPU scheduling and memory management. The code for this component is located in the file dxgkrnl.sys and in accompanying files with the prefix dxgmms*.sys (e.g. dxgmms1.sys, dxgmms2.sys, etc.)

  • Kernel Mode Driver (KMD): this is the component of the graphics driver that runs in kernel mode and communicates directly with hardware devices on behalf of the DirectX Graphics Kernel.

  • User Mode Driver (UMD): this is the component of the graphics driver that runs in userspace and provides driver functionality for use by the Direct3D API. The Direct3D API interacts with the low-level WDDM Interface on behalf of the User Mode Driver, so the driver need not interact with it directly.

  • Installable Client Driver (ICD): this refers to components of the graphics driver that run in userspace and provide driver functionality for use by all other graphics APIs which are not based on DirectX. There may be a distinct Installable Client Driver for each supported graphics API. Unlike the Direct3D User Mode Driver, an Installable Client Driver directly interacts with the low-level WDDM Interface.

The components depicted in green boxes in the diagram (Kernel Mode Driver, User Mode Driver, Installable Client Driver) are provided by the graphics driver from the hardware vendor. The mechanisms by which these components are located and loaded at runtime are of particular interest when considering GPU access inside Windows containers.

[Recap] Loading User Mode Drivers (UMDs) and Installable Client Drivers (ICDs)

Although the Kernel Mode Driver component of a graphics driver is loaded automatically by Windows when initialising a GPU, the accompanying User Mode Driver and any Installable Client Drivers need to be located by each new process that requests access to the graphics device. To do so, applications call the D3DKMTQueryAdapterInfo API function from the low-level WDDM Interface to query the filesystem locations of userspace driver components. The WDDM Interface will then consult the Windows Registry to locate the relevant paths inside the system’s Driver Store, which acts as the central repository for driver installation files. The Direct3D API performs this call in order to automatically locate the User Mode Driver, as do the runtimes for other vendor-neutral APIs such as OpenGL and OpenCL in order to locate the appropriate Installable Client Driver.

It is important to note that the WDDM Interface contains specific logic to handle cases where the calling process is running inside a Windows container or inside a virtual machine that is accessing the host system’s GPU through WDDM GPU Paravirtualization (GPU-PV), and will adjust the returned information accordingly. This is why Microsoft recommends that applications always call the D3DKMTQueryAdapterInfo function instead of querying the registry or the filesystem directly, since bypassing the adjustment logic provided by the WDDM Interface will lead to incorrect results inside containers or virtual machines and prevent User Mode Drivers and Installable Client Drivers from being loaded correctly.

[Recap] Accessing GPUs inside containers

Processes running inside process-isolated Windows containers interact with the host system kernel in exactly the same manner as processes running directly on the host. As such, processes interact with the WDDM Interface as usual to communicate with the DirectX Graphics Kernel and the Kernel Mode Driver. The key difference is in how User Mode Drivers and Installable Client Drivers are loaded.

Windows container images include their own set of driver packages in the system Driver Store, which is distinct from the host system’s Driver Store. When a GPU accelerated Windows container is created by the Windows Host Compute Service (HCS), the host system’s Driver Store is automatically mounted in the container’s filesystem under the path C:\Windows\System32\HostDriverStore. Filesystem paths returned by calls to the D3DKMTQueryAdapterInfo function will be automatically adjusted by the WDDM Interface to reference the mount path for the host system’s Driver Store instead of the container’s own Driver Store, which ensures User Mode Drivers and Installable Client Drivers will be loaded from the correct location.

Although the automatic adjustment functionality provided by the WDDM Interface ensures the Direct3D API functions correctly inside GPU accelerated Windows containers, an additional step is required in order to support other graphics APIs. The runtime libraries for these APIs must be available for user applications to load before the runtime library can then locate and load the appropriate Installable Client Driver. Most applications expect the runtime libraries to be located in the Windows system directory under C:\Windows\System32 and in some cases will even refuse to load them from any other location for security reasons.

Windows provides a mechanism for drivers to specify additional runtime libraries that should be automatically copied to the Windows system directory, in the form of the CopyToVmOverwrite and CopyToVmWhenNewer registry keys and their SysWOW64 counterparts. Recent graphics drivers from AMD, Intel and NVIDIA all ship with registry entries to copy the runtime libraries for their supported vendor-neutral graphics APIs, along with a subset of their supported vendor-specific graphics APIs. Unfortunately, as of the time of writing Enabling vendor-specific graphics APIs in Windows containers, the automatic file copy feature is only supported for GPU-PV and so process-isolated Windows containers must manually copy the files for runtime libraries to the Windows system directory at startup.

Enabling non-DirectX APIs for unmodified containers

Back when Enabling vendor-specific graphics APIs in Windows containers was written, Kubernetes 1.24 had only just recently been released and many existing Kubernetes clusters still used Docker as the container runtime for their Windows worker nodes. One of the notable limitations of the HCS 1.0 schema that Docker uses is that only entire directories can be bind-mounted into Windows containers, not individual files. This makes it impossible to mount individual DLL files into C:\Windows\System32 when running containers with Docker, and so an alternative mechanism for enabling non-DirectX APIs was devised to ensure compatibility with Docker in both local development environments and in Kubernetes clusters.

The solution presented in Enabling vendor-specific graphics APIs in Windows containers was a custom entrypoint script that searches through the host Driver Store mounted at C:\Windows\System32\HostDriverStore and attempts to determine the location of the DLL files for each specific hardware vendor’s drivers. This was ultimately just a workaround for the underlying limitations of the HCS 1.0 schema, but it functioned well enough as a proof of concept, and the entrypoint script was subsequently integrated into the official Windows runtime base image that ships with Unreal Engine 4.27 and newer. However, this approach suffers from a number of serious limitations:

  • Extremely brittle: the entrypoint script relies on hard-coded DLL filenames and path patterns in order to function. This means its functionality can break at any time if a GPU vendor changes the naming scheme for their driver files, or if Microsoft changes the path under which the host system’s Driver Store is bind-mounted. In a March 2022 issue comment on the containerd repository, Microsoft employee Kathryn Baldauf advised against relying on this path, since it is an internal implementation detail of the HCS rather than a guaranteed and documented behaviour.

  • Requires manual updates: because the DLL filenames and path patterns are hard-coded in the entrypoint script, they must be manually updated any time a GPU vendor changes the naming scheme for their driver files, or new files and patterns need to be added (e.g. for additional dependencies for existing DLL files, for new APIs, or for new GPU vendors). This necessitates rebuilding any container images that use the entrypoint script, and re-deploying any containers using those images.

  • Requires modified container images: container images must explicitly include the entrypoint script in their files and also configure it as the container’s entrypoint. This requirement cascades down from base images to their derived images, and if a derived image configures a different entrypoint without awareness of the consequences then the functionality of the entrypoint script will never be triggered. When combined with the need to rebuild images to reflect any updates to the entrypoint script as discussed in the dot point above, this significantly increases the maintenance burden for Windows container images that require access to non-DirectX APIs.

In the ten months since Kubernetes 1.24 was released and the removal of Dockershim support was first enforced, all major public cloud providers appear to have updated their managed Kubernetes offerings to use containerd as the container runtime for Windows worker nodes. Ongoing improvements to Windows container support in the ctr and nerdctl frontends for containerd, combined with community-made tools such as Markus Lippert’s containerd Windows installer, have also made the use of containerd for running Windows containers in local development environments more accessible and viable than ever before. By dropping the previous requirement of compatibility with Docker, the Kubernetes Device Plugins for DirectX can adopt the far simpler and more elegant solution of mounting individual DLL files directly into C:\Windows\System32. This approach addresses all of the limitations of the previous entrypoint script and provides the following benefits:

  • Flexible and automatic: by directly querying the CopyToVmOverwrite and CopyToVmWhenNewer registry keys and their SysWOW64 counterparts, the list of DLL files that need to be mounted into each container is provided by the GPU drivers themselves, and will always reflect the specific version of the drivers running on the host system. Hard-coded lists no longer need to be maintained manually for each specific hardware vendor, and updated lists will be retrieved automatically whenever the GPU drivers on a host system are upgraded. GPUs from all hardware vendors will work automatically and immediately, so long as their device drivers comply with Microsoft’s requirements to populate the relevant registry keys, and the vendor has performed the relevant QA to ensure that their drivers function correctly inside a container.

  • Easily extensible: the DLL lists that have been automatically retrieved from the registry can also be supplemented with user-defined lists to enable additional APIs or functionality. (As an example, by default the Kubernetes Device Plugins for DirectX will append nvidia-smi.exe to the list of DLLs for NVIDIA GPUs, to facilitate use of this utility from inside containers.) Unlike the hard-coded DLL lists of the old entrypoint script, these user-defined lists are by necessity accessed on the host system, so they can be updated at any time and will take effect for newly-started containers without the need to modify any container images.

  • Works with unmodified container images: because all of the logic to mount DLL files into C:\Windows\System32 is performed on the host system by the Kubernetes Device Plugins for DirectX, there is no need for container images to incorporate additional files or configure specific entrypoints. Non-DirectX APIs will be enabled automatically for all unmodified container images, incurring no maintenance overheads whatsoever for developers.
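
To sketch how this host-side logic maps onto the Kubernetes device plugin API, the simplified example below builds a ContainerAllocateResponse that exposes a single device and bind-mounts its driver-provided runtime DLLs into C:\Windows\System32. The file paths and PCIe location path are illustrative placeholders, and the way the vpci-location-path string is plumbed through to containerd is glossed over here; the real plugins perform this work with considerably more validation.

    // allocate_sketch.go: a simplified sketch of building a device plugin allocation
    // response that mounts individual runtime DLLs into C:\Windows\System32. The
    // paths and the PCIe location path are illustrative placeholders.
    package main

    import (
        "fmt"
        "path/filepath"

        pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
    )

    // buildResponse exposes one DirectX device to a container and bind-mounts the
    // runtime files that its driver registered via the CopyToVm* registry keys.
    func buildResponse(pciLocationPath string, runtimeFiles []string) *pluginapi.ContainerAllocateResponse {
        response := &pluginapi.ContainerAllocateResponse{}

        // Expose the device itself. (Exactly how the vpci-location-path string is
        // passed down to containerd is simplified here; treat this as an assumption.)
        response.Devices = append(response.Devices, &pluginapi.DeviceSpec{
            HostPath: "vpci-location-path://" + pciLocationPath,
        })

        // Mount each runtime DLL from the host Driver Store into the system directory.
        for _, hostPath := range runtimeFiles {
            response.Mounts = append(response.Mounts, &pluginapi.Mount{
                HostPath:      hostPath,
                ContainerPath: `C:\Windows\System32\` + filepath.Base(hostPath),
                ReadOnly:      true,
            })
        }

        return response
    }

    func main() {
        resp := buildResponse(
            "PCIROOT(0)#PCI(0200)", // placeholder PCIe location path
            []string{`C:\Windows\System32\DriverStore\FileRepository\example.inf_amd64\exampleapi64.dll`},
        )
        fmt.Printf("%d device(s), %d mount(s)\n", len(resp.Devices), len(resp.Mounts))
    }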

Goal 3: Support GPU sharing for multi-tenancy scenarios

Unlike GPU partitioning technologies such as NVIDIA’s Multi-Instance GPU (MIG) that carve up device resources into discrete virtual GPUs for allocation to different virtual machines or containers, GPU sharing techniques for containers running graphical applications tend to rely on a simpler time-sharing model. In this model, a given GPU is exposed to multiple containers simultaneously, and the applications running in those containers simply share GPU resources in the same manner as applications running directly on the underlying host system would. The GPU driver’s scheduling functionality is responsible for managing concurrent execution of code from multiple applications, and there are no limits in place to prevent applications from competing for resources such as VRAM or interfering with each other due to excessive resource use.

The NVIDIA device plugin for Kubernetes introduced support for GPU sharing in June 2022, and the NVIDIA documentation refers to the time-sharing model as “Time-Slicing”. The way it works is extremely simple, and is directly transferable to other Kubernetes device plugins that wish to support multi-tenancy scenarios:

  • Each physical device is advertised to the kubelet as multiple virtual devices, or “replicas”. Each set of replicas refers to the same underlying GPU, and the number of replicas per GPU can be configured by the user. (Note that the use of replicas is necessary because the kubelet has no native concept of device sharing, and Kubernetes does not permit containers to specify fractional values when requesting extended resources.)

  • When a replica is assigned to a container by the kubelet, the device plugin identifies which physical device the replica refers to, and exposes that GPU to the container. When multiple containers are assigned different replicas that refer to the same underlying device, they all receive access to the same device. If a given container requests multiple replicas, those replicas may refer to the same GPU (in which case fewer containers will end up sharing that physical device) or different GPUs (in which case the container will receive access to multiple physical devices).

To provide behaviour consistent with the existing NVIDIA device plugin under Linux, I opted to support an identical GPU sharing system in the Kubernetes Device Plugins for DirectX, whereby a user-configured number of replicas representing each physical device can be advertised to the kubelet.
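
The mechanism can be illustrated with a short sketch: each physical GPU is expanded into a configurable number of virtual devices whose IDs encode the underlying device, and at allocation time each replica ID is mapped back to that physical device. The ID format and helper names below are illustrative rather than taken from the actual plugin code.

    // replicas_sketch.go: a simplified illustration of time-slicing style GPU sharing
    // in a Kubernetes device plugin. ID formats and helper names are illustrative
    // only and do not mirror the real plugin code.
    package main

    import (
        "fmt"
        "strings"

        pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
    )

    // advertise expands each physical GPU into `replicas` virtual devices, since the
    // kubelet has no native notion of sharing and extended resources cannot be fractional.
    func advertise(physicalIDs []string, replicas int) []*pluginapi.Device {
        var devices []*pluginapi.Device
        for _, id := range physicalIDs {
            for i := 0; i < replicas; i++ {
                devices = append(devices, &pluginapi.Device{
                    ID:     fmt.Sprintf("%s::replica-%d", id, i),
                    Health: pluginapi.Healthy,
                })
            }
        }
        return devices
    }

    // physicalDevice maps a replica ID assigned by the kubelet back to the
    // underlying GPU that should actually be exposed to the container.
    func physicalDevice(replicaID string) string {
        return strings.SplitN(replicaID, "::", 2)[0]
    }

    func main() {
        devices := advertise([]string{"GPU-0", "GPU-1"}, 4)
        fmt.Printf("advertising %d virtual devices\n", len(devices)) // 8 replicas for 2 GPUs
        fmt.Println("replica", devices[5].ID, "maps to", physicalDevice(devices[5].ID))
    }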

Goal 4: Support the widest possible variety of use cases

When considering the full breadth of use cases that Microsoft supports for DirectX, it is important to understand how both the DirectX API itself and its supporting Windows components have evolved over time to reflect these use cases. Prior to Direct3D 11, the DirectX API was primarily focused on multimedia use cases, and reflected the most common uses of GPUs at the time. Direct3D 11 was the first version of the API to introduce support for compute shaders (under the moniker “DirectCompute”), and reflected the growing popularity of general-purpose computing on GPUs (GPGPU), which had relied primarily on the OpenCL and NVIDIA CUDA APIs up until that point (and typically still does under non-Windows platforms).

DirectX 12 took further steps to support compute-oriented use cases, defining the Direct3D 12 Core 1.0 Feature Level to represent support for compute-only functionality, and introducing the new DirectML API for machine learning, which builds upon the functionality defined in the Core 1.0 feature level. This new feature level is of particular interest, because it allows applications to use Direct3D to interact with not just traditional GPUs, but also with compute-only devices such as dedicated machine learning accelerators. To further facilitate this use case, Microsoft introduced a new driver model called the Microsoft Compute Driver Model (MCDM), which represents a subset of the existing Windows Display Driver Model (WDDM) in much the same way that the Core 1.0 feature level represents a subset of the full Direct3D 12 feature set.

Microsoft also introduced the new DXCore device enumeration API in Windows Server, version 2004 and Windows 10, version 2004. The DXCore API makes it easy to distinguish between legacy Direct3D 11 devices, compute-only Direct3D 12 devices, and fully-featured Direct3D 12 devices during enumeration, without needing to dig into the low-level DXGI APIs to query device capabilities. Other features of interest are the ability to easily determine whether a given device is an integrated GPU (iGPU) or a discrete GPU (dGPU), and to determine whether a given device is removable (e.g. an external GPU connected via Thunderbolt).

To reflect the full set of use cases that DirectX supports, I decided to expand the original single-plugin design of the Kubernetes Device Plugins for DirectX to encompass two separate device plugins: one plugin for fully-featured Direct3D 12 devices that comply with WDDM, and another plugin for compute-only Direct3D 12 devices that comply with MCDM. This design decision was made significantly easier by the fact that DXCore makes it trivial to distinguish between these types of devices, so only the device enumeration logic need differ between the two plugins and all other code can be shared.
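
At a very high level, that structure can be pictured as in the sketch below: the two plugins share a single core and differ only in the resource name they advertise and the class of device they ask the enumeration layer to return. The type, constant, and field names here are hypothetical and do not mirror the real codebase.

    // plugin_split_sketch.go: a high-level illustration of sharing one plugin core
    // between the WDDM and MCDM device plugins. All names here are hypothetical.
    package main

    import "fmt"

    // DeviceClass selects which kind of Direct3D 12 device a plugin enumerates.
    type DeviceClass int

    const (
        // FullWDDM selects fully-featured devices that support display and compute.
        FullWDDM DeviceClass = iota
        // ComputeOnlyMCDM selects compute-only Direct3D 12 Core devices.
        ComputeOnlyMCDM
    )

    // PluginConfig is the only per-plugin variation: a resource name plus a device class.
    type PluginConfig struct {
        ResourceName string
        Class        DeviceClass
    }

    func main() {
        configs := []PluginConfig{
            {ResourceName: "directx.microsoft.com/display", Class: FullWDDM},
            {ResourceName: "directx.microsoft.com/compute", Class: ComputeOnlyMCDM},
        }

        // In the real plugins, a shared core would enumerate devices matching each
        // class (via the Device Discovery Library) and serve the kubelet's gRPC API.
        for _, cfg := range configs {
            fmt.Printf("%s advertises device class %d\n", cfg.ResourceName, cfg.Class)
        }
    }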

Architecture and implementation

The design choices discussed in the previous sections established the following minimum requirements for the Kubernetes Device Plugins for DirectX:

  • containerd 1.7.0 or newer: this is the earliest version of containerd that supports exposing devices to Windows containers, and containerd itself is the only container runtime that currently supports both allocating individual devices to Windows containers and bind-mounting individual files in Windows containers.

  • Windows Server 2022 or newer: of the Windows Server releases that are currently supported by Microsoft and receiving security updates, Windows Server 2022 is the earliest version that supports both the HCS 2.2 schema features required in order to allocate individual devices to containers using the vpci-location-path://PATH device string syntax, and the DXCore device enumeration API.

My initial intent was to implement the device plugins entirely in Go, since the ability to leverage the Kubernetes deviceplugin package significantly simplifies the process of writing Kubernetes device plugins. However, it quickly became apparent that the Windows APIs that were chosen in the previous sections would be far cleaner to access from native C++ code. In particular, DXCore was chosen for device enumeration due to its ability to quickly surface information relevant to the selection of WDDM and MCDM compliant devices. Since devices enumerated by DXCore do not include complete information about the underlying physical hardware, it is also necessary to perform a Windows Management Instrumentation (WMI) query to retrieve the PCIe Location Path and identify the registry keys for the relevant device driver. Both DXCore itself and the COM API for WMI are designed to be used with a Win32 API language projection, and at the time of writing, there is no language projection for Go.

As a result, I decided to implement the core device discovery functionality in a native DLL, leveraging the C++/WinRT language projection for the Win32 API. I also made use of the RAII object lifecycle wrappers and error handling functionality from the Windows Implementation Libraries (WIL) to simplify interactions with the raw C Windows APIs. The rest of the Kubernetes Device Plugins for DirectX codebase is still written in Go, so the native DLL is designed to expose a simple C API that is then consumed by a Go language binding. The complete architecture is depicted below in Figure 3.

Figure 3: The software architecture of the Kubernetes Device Plugins for DirectX.

The architecture includes the following executables:

  • WDDM Device Plugin: the Kubernetes device plugin that provides access to fully-featured Direct3D 12 devices that comply with the full Windows Display Driver Model (WDDM), under the resource name directx.microsoft.com/display. The entrypoint for this executable just calls the common device plugin code, specifying its resource name along with device filtering flags to enumerate only DirectX devices that support both compute and display features.

  • MCDM Device Plugin: the Kubernetes device plugin that provides access to compute-only Direct3D 12 Core devices that comply with the Microsoft Compute Driver Model (MCDM), under the resource name directx.microsoft.com/compute. The entrypoint for this executable just calls the common device plugin code, specifying its resource name along with device filtering flags to enumerate only those DirectX devices that support compute functionality alone.

  • gen-device-mounts: a command-line tool that generates flags for mounting devices and their accompanying DLL files into standalone containers without needing to deploy them to a Kubernetes cluster. This is designed to be used when developing and testing container images, and provides several different options for selecting which DirectX devices are exposed to the container. In the initial release of the Kubernetes Device Plugins for DirectX, gen-device-mounts only supports generating flags for the ctr frontend for containerd, but support for nerdctl will also be added once that frontend releases a version that supports exposing devices to Windows containers. When Docker eventually migrates to the HCS 2.x schema (e.g. by using containerd under Windows in the same manner that it does under Linux), support will also be added for generating flags to pass to the docker run command.

  • C++ Testing Tools: the test-device-discovery-cpp executable is a simple test program that uses the Device Discovery Library’s C/C++ API to enumerate DirectX devices. Although its primary purpose is to test the library’s device discovery functionality, it is also useful as a utility to view information about the DirectX devices that are present on a given host system.

  • Go Testing Tools: the test-device-discovery-go executable is a simple test program that uses the Device Discovery Library’s Go language bindings to enumerate DirectX devices. Much like its C++ counterpart, it was created primarily to test the library’s Go language bindings, but also serves as a useful enumeration utility.

The architecture includes the following libraries and Go packages:

  • Device Discovery Library: the native C++/WinRT shared library that interacts with DXCore, WMI and the DirectX kernel API to enumerate DirectX adapters and retrieve information about the underlying Plug and Play (PnP) hardware device for each adapter. The Device Discovery Library is also responsible for querying Windows registry information for device drivers in order to determine which runtime files need to be mounted into containers to make use of non-DirectX APIs. The library exposes a pure C API and an accompanying set of C++ language bindings.

  • Device Discovery Library Go Language Bindings: this Go package provides bindings for the Device Discovery Library’s C API so it can be consumed by other Go code. It also includes basic functionality for manipulating device objects once they have been enumerated. (A simplified sketch of this style of cgo binding follows the list below.)

  • Mount Generation: this Go package provides functionality for generating container mounts for DirectX devices and their accompanying runtime files. It is used by both of the Kubernetes device plugins, as well as the gen-device-mounts tool.

  • Common Plugin Code: this Go package contains all of the common code for the two Kubernetes device plugins. This includes parsing configuration options for the plugins from both environment variables and YAML files, enumerating DirectX devices and refreshing the device list whenever changes are detected by DXCore, and the logic for the main gRPC device plugin service that communicates with the kubelet.
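
As mentioned above, the Go language bindings wrap the native library's C API via cgo. The fragment below shows the general shape of such a binding; the inline C functions are merely stand-ins for the real DLL, whose actual function names and signatures differ.

    // discovery_binding_sketch.go: the general shape of a cgo binding over a native
    // device discovery library. The inline C functions below are stand-ins for the
    // real DLL's C API, whose actual names and signatures differ.
    package main

    /*
    #include <stdlib.h>
    #include <string.h>

    // Stand-in for the native library: the real Device Discovery Library is a
    // C++/WinRT DLL that queries DXCore and WMI and exposes a pure C API.
    static int dd_device_count(void) { return 1; }

    static char* dd_device_name(int index) {
        const char* name = "Example DirectX Adapter";
        char* copy = (char*)malloc(strlen(name) + 1);
        strcpy(copy, name);
        (void)index;
        return copy;
    }
    */
    import "C"

    import (
        "fmt"
        "unsafe"
    )

    // DeviceNames converts the C API's results into Go strings, freeing the
    // native allocations once they have been copied.
    func DeviceNames() []string {
        count := int(C.dd_device_count())
        names := make([]string, 0, count)
        for i := 0; i < count; i++ {
            cname := C.dd_device_name(C.int(i))
            names = append(names, C.GoString(cname))
            C.free(unsafe.Pointer(cname))
        }
        return names
    }

    func main() {
        for _, name := range DeviceNames() {
            fmt.Println(name)
        }
    }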

Using the new device plugins today

Back when the Kubernetes Device Plugins for DirectX were first developed, it was necessary to use a pre-release version of containerd or manually backport changes to a stable version in order to use the plugins. Many managed Kubernetes services from public cloud providers had yet to implement support for Windows Server 2022 at that time, so it was also necessary to implement workarounds to allow Windows Server 2022 worker nodes to join clusters. Although these limitations were acceptable within the context of the Scalable Pixel Streaming beta program (where TensorWorks is able to completely customise the Kubernetes clusters that are deployed, and the beta status of the SPS framework itself makes it suited to the use of pre-release software), we did not consider it reasonable to expect other developers to rely on such an experimental combination of modifications to standard Kubernetes cluster deployments.

We have spent the past nine months carefully monitoring the progress of both the containerd 1.7.0 release and the rollout of support for Windows Server 2022 in managed Kubernetes services, in order to time the public release of the Kubernetes Device Plugins for DirectX for the moment when awkward configuration hacks and pre-release software are no longer necessary. That day has finally come, and it is now relatively simple to use the device plugins in a variety of different environments. See the instructions for your chosen environment below to get up and running with the Kubernetes Device Plugins for DirectX today.

Managed Kubernetes services in the public cloud

Not yet ready for production use in managed environments!

Although the Kubernetes Device Plugins for DirectX are fully functional in and of themselves, there is still integration work to be done to provide a production-ready experience when using the plugins with managed Kubernetes services offered by public cloud providers. Of particular note is the fact that cluster autoscaling will not yet respond correctly to requests for DirectX devices. For more details on the integration work that is planned, see the section The road ahead.

Now that containerd 1.7.0 has been released, you can expect to see most managed Kubernetes services implementing support for it in future versions of their offerings. In the meantime, you can use the scripts in the cloud/aws subdirectory of the Kubernetes Device Plugins for DirectX repository to deploy an Amazon Elastic Kubernetes Service (EKS) cluster with a custom worker node VM image that includes both containerd 1.7.0 and the NVIDIA GPU drivers. EKS is one of the few managed Kubernetes services that supports the use of custom VM images for worker nodes, which makes it possible to try out the Kubernetes Device Plugins for DirectX today, without needing to create a self-managed Kubernetes cluster.

Once your chosen Kubernetes service provides worker node images that include containerd 1.7.0, you can deploy the Kubernetes Device Plugins for DirectX like so:

  1. Provision a Kubernetes cluster running Kubernetes 1.24 or newer, with at least one Windows Server 2022 node pool. Be sure to use a VM instance type for the node pool that includes a GPU.

  2. Select a worker node VM image for the node pool that includes GPU drivers if one is available, or deploy a HostProcess DaemonSet with a script to automatically install the appropriate GPU drivers when each worker node starts. TensorWorks intends to work with public cloud providers to develop driver installation scripts as part of our ongoing roadmap to provide production-ready support for the Kubernetes Device Plugins for DirectX on managed Kubernetes services.

  3. Apply the YAML files from the deployments subdirectory of the Kubernetes Device Plugins for DirectX repository to deploy the HostProcess DaemonSets for both the WDDM Device Plugin and the MCDM Device Plugin.

  4. Once the device plugins have been deployed, your Kubernetes cluster will be ready to run GPU accelerated Windows containers!
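
With the device plugins deployed, workloads request GPUs by naming the plugins' extended resources in their resource limits. The sketch below constructs such a pod specification using the Kubernetes Go client types and prints it as JSON; the same request is normally written directly in a YAML manifest, and the pod and container names shown here are purely illustrative.

    // podspec_sketch.go: constructs a pod whose Windows container requests a single
    // fully-featured DirectX device via the WDDM plugin's extended resource name.
    // The pod and container names are illustrative; the same request is normally
    // expressed directly in a YAML manifest.
    package main

    import (
        "encoding/json"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
        pod := corev1.Pod{
            TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
            ObjectMeta: metav1.ObjectMeta{Name: "directx-device-discovery"},
            Spec: corev1.PodSpec{
                // Ensure the pod is scheduled onto a Windows worker node.
                NodeSelector: map[string]string{"kubernetes.io/os": "windows"},
                Containers: []corev1.Container{{
                    Name:  "example",
                    Image: "index.docker.io/tensorworks/example-device-discovery:0.0.1",
                    Resources: corev1.ResourceRequirements{
                        Limits: corev1.ResourceList{
                            // Request one WDDM device; use directx.microsoft.com/compute
                            // to request a compute-only MCDM device instead.
                            "directx.microsoft.com/display": resource.MustParse("1"),
                        },
                    },
                }},
                RestartPolicy: corev1.RestartPolicyNever,
            },
        }

        manifest, _ := json.MarshalIndent(pod, "", "  ")
        fmt.Println(string(manifest))
    }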

Self-managed and on-premises Kubernetes clusters

Only use worker nodes with one GPU each!

Until Microsoft fixes the relevant bug in Windows itself, any container that has a DirectX device allocated to it will be able to see all devices that are present on the underlying host system. In the meantime, be sure to use Kubernetes worker nodes with only a single GPU each, in order to ensure correct behaviour.

Using virtual machine worker nodes

If you are deploying a self-managed Kubernetes cluster, whether in the cloud or in an on-premises datacentre, that uses virtual machines as its worker nodes (e.g. by using the Kubernetes Cluster API), then you will need to do the following:

  1. Create a Windows Server 2022 VM image that includes containerd 1.7.0, along with the appropriate GPU drivers for the hardware that will be used. (Alternatively, you may opt to omit GPU drivers from the VM image, and instead use a HostProcess DaemonSet to install device drivers on demand, as described in the section above.) The Kubernetes Image Builder project provides code that can be used to create custom worker node VM images for various public cloud platforms, including code for building Windows images that can be deployed with the Kubernetes Cluster API. Even if the code from the Image Builder project is not immediately suitable for your intended deployment environment, it may at least act as a helpful starting point.

  2. Provision a Kubernetes cluster running Kubernetes 1.24 or newer, with one or more Windows worker nodes running the custom VM image that you created in the previous step. There are example YAML files available for deploying Windows worker nodes with some Kubernetes Cluster API providers, such as this script for Azure. For other providers, you may need to consult the documentation to determine how to deploy Windows worker nodes with custom VM images.

  3. Apply the YAML files from the deployments subdirectory of the Kubernetes Device Plugins for DirectX repository to deploy the HostProcess DaemonSets for both the WDDM Device Plugin and the MCDM Device Plugin.

  4. Once the device plugins have been deployed, your Kubernetes cluster will be ready to run GPU accelerated Windows containers!

Using bare metal worker nodes

If you are running an on-premises Kubernetes cluster with bare metal machines acting as worker nodes, then you will need to ensure each Windows worker node has the following components installed:

  • Windows Server 2022
  • containerd 1.7.0
  • Device drivers for any GPUs

Once your worker nodes are configured, simply deploy the HostProcess DaemonSets for both the WDDM Device Plugin and the MCDM Device Plugin as described above.

Development and testing without Kubernetes

You can use the gen-device-mounts tool included with the Kubernetes Device Plugins for DirectX to run standalone Windows containers with full GPU acceleration outside of a Kubernetes cluster. It is important to note that this currently requires Windows Server 2022, but once containerd pull request #8137 has been merged and included in a future containerd 1.7.x release, it should also be possible to run GPU accelerated containers using Windows 11 Pro. To get started using the gen-device-mounts tool, you will need to do the following:

  1. Install containerd 1.7.0. The easiest way to do this is to use Markus Lippert’s containerd Windows installer. Simply download the installer from the GitHub repository’s releases page and run the following command:

     containerd-installer.exe --containerd-version 1.7.0 --cni-plugin-version 0.3.0
    

    You may need to restart your machine after the first step of the installation process, since Windows requires a restart when installing the Windows containers operating system feature. If the containers and Hyper-V features are already installed then a restart will not be required.

  2. Add the directory C:\Program Files\containerd to your system’s PATH environment variable. This will allow you to run the ctr containerd frontend from anywhere on the system.

  3. Download the binaries for the latest release of the Kubernetes Device Plugins for DirectX from the GitHub releases page and extract the files from the ZIP archive.

  4. Verify that your system’s DirectX devices are detected correctly:

     test-device-discovery-cpp.exe
    

    You should see details listed for at least one GPU. If no devices are listed then you will need to ensure you have installed the appropriate device drivers.

  5. Pull the test container image:

     ctr images pull "index.docker.io/tensorworks/example-device-discovery:0.0.1"
    
  6. Use the gen-device-mounts tool to run a container with the test image, exposing all GPUs that are present on the host system:

     gen-device-mounts.exe --all --run "index.docker.io/tensorworks/example-device-discovery:0.0.1" testing
    

    The tool will print the ctr command that it has generated, and then run it as a child process. You should see a container run the test-device-discovery-cpp.exe tool and list details for all of the DirectX devices that are present on the host system. Note that the output will not include any lists of runtime DLL files, since this information can only be determined when the command is run directly on the host system, and runtime file enumeration is skipped when running inside a container.

The road ahead

Now that the Kubernetes Device Plugins for DirectX have been released to the public and their dependencies are widely available, the next step is to provide a production-ready experience for GPU accelerated Windows containers in all Kubernetes cloud environments. TensorWorks intends to collaborate on an ongoing basis with Microsoft, the Kubernetes development community, and all major public cloud providers to complete the necessary integration work and ensure that future support for GPU accelerated Windows containers becomes as seamless as existing support is today for GPU accelerated Linux containers.

The planned work includes, but is not limited to:

  • Assisting Microsoft in testing any bugfixes for the Windows bug that exposes all GPUs to each container even when an individual GPU has been specified.

  • Adding support to the Kubernetes Cluster Autoscaler for autoscaling node pools in response to resource requests for DirectX devices. Specific support will need to be implemented for each public cloud provider, since the autoscaler uses information about each provider’s VM instance type SKUs to identify which node pools should be scaled to satisfy a request for a given resource type.

  • Developing device driver installation scripts that can be deployed to Windows worker nodes as HostProcess DaemonSets, to facilitate on-demand driver installation akin to that provided for NVIDIA GPUs on Linux worker nodes by the NVIDIA GPU Operator today. Due to the variety of disparate mechanisms that currently exist for installing Windows GPU drivers across different public cloud platforms, it is likely that specific scripts will need to be developed for each cloud platform.

  • Providing feedback to cloud providers regarding any additions or modifications to documentation, web interfaces, or command-line tooling that surfaces configuration options or information related to provisioning managed Kubernetes clusters with support for GPU accelerated Windows containers.

  • Assisting Microsoft, hardware vendors, or public cloud providers in testing the Kubernetes Device Plugins for DirectX with any new compute-only DirectX devices that comply with the Microsoft Compute Driver Model (MCDM), such as dedicated machine learning accelerators.

Additionally, if Microsoft ever decides that they would like to adopt the Kubernetes Device Plugins for DirectX as an officially-supported Microsoft technology, TensorWorks pledges its willingness to donate the project and its associated IP to either the Microsoft Corporation itself, or to an open governance body of its choice. The only stipulation attached to this offer is that the project must remain freely available and licensed under a permissive open source license, such as the current MIT License.

Personal remarks

Creating the Kubernetes Device Plugins for DirectX was a fantastic learning experience for me, as both a developer and a user of Windows containers. The opportunity to dig down into the internals of the Windows container tooling ecosystem was fascinating, as was the chance to make use of new APIs that I had never interacted with before. I had a lot of fun creating this project, and I am tremendously excited to see what developers do with the new capabilities that it enables. It is now possible to create and deploy truly modern GPU accelerated Windows applications in the cloud, on devices from any hardware vendor. As support for GPU accelerated Windows containers matures throughout the cloud tooling ecosystem, developers will be empowered to create ever more ambitious applications. The possibilities are exhilarating.
