As part of an ongoing collaboration with Epic Games and Amazon Web Services (AWS), TensorWorks has been investigating the feasibility of migrating Unreal Engine cloud workloads from Windows to Linux using the Wine software compatibility layer. Unreal Engine workloads that target Windows devices or game consoles cannot run natively under Linux because the SDKs for these target platforms only support running under a Windows host platform. In cloud deployment scenarios, this reliance on Windows increases costs due to operating system licensing fees, and introduces operational overheads since the available ecosystem of cloud native tooling typically features far more mature support for Linux than for Windows (with many tools and technologies lacking support for Windows entirely). The use of Wine makes it possible to run these workloads under Linux, providing Unreal Engine licensees with significant cost savings and access to the operational benefits afforded by modern orchestration and autoscaling frameworks built upon technologies such as Linux containers.
The investigation into Unreal Engine deployment under Wine has yielded a number of promising findings thus far, along with patches that expand or refine the functionality of the Wine compatibility layer itself. The most interesting of these patches relate to memory management and error handling. The patches aim to fix bugs in Wine and also address underlying differences in how Windows and Linux approach memory allocation, ensuring that various assumptions made by Unreal Engine code under Windows will hold true under Wine as well. This ultimately provides more predictable behaviour and reliable reporting of errors when running in Linux environments. Although the overarching investigation is still ongoing, the results to date when running Unreal Engine asset cooking workloads under Wine have proven to be extremely successful. As such, the improvements discussed in the sections below now enable a whole class of Unreal Engine cloud workloads to be migrated from Windows to Linux, affording developers cost savings without sacrificing correctness or reliability.
Contents
- Unreal Engine workload considerations
- Challenges and solutions
- Get started using the Wine patches today
Unreal Engine workload considerations
Memory-based scaling logic for asset cooking
In Unreal Engine terminology, cooking refers to the process of converting asset source data into the appropriate format for use on a given target platform. Cooking assets is one of the key steps in packaging Unreal Engine projects for testing or distribution, and therefore represents a major component of common CI/CD workloads for Unreal Engine developers. Asset cooking workloads are also orchestrated at scale in the cloud by Epic Games as part of the backend that powers Unreal Editor for Fortnite (UEFN). Since the cooking process typically needs to leverage the SDK for a given target platform in order to convert assets to that platform’s specific formats, these workloads inherit the host platform requirements of the SDKs being used. Cooking assets for Windows devices, game consoles, or Apple platforms using the Windows Metal Shader Compiler therefore requires a Windows host, and these cooking workloads can only run under Linux through the use of Wine.
Cooking assets can be an extremely memory-intensive process, and the Unreal Engine’s cooking logic relies on accurate information about memory usage and available system memory to efficiently schedule cook tasks during parallel processing. In particular, compilation of shader code is performed by a pool of child processes that run the engine’s ShaderCompileWorker (SCW) executable. The cooking logic feeds shader compilation jobs to the pool of processes from a queue, and dynamically scales the number of processes up and down based on available memory. When the SCW processes are using very little memory, the pool is scaled up, allowing more shaders to be compiled in parallel. Conversely, when memory usage is high, the pool is scaled down to reduce the level of concurrency and ensure the system does not run out of memory. The implementation of this scaling logic can be seen in the source code for the FShaderCompileThreadRunnable class (note that you will need to be logged in to GitHub and have linked your GitHub account to an Epic Games account in order to view the Unreal Engine source code).
To ensure correct cooking behaviour, the operating system must report accurate values for both the quantity of available system memory and for the memory usage of individual processes. If available system memory is under-reported or the memory usage of SCW processes is over-reported then the pool of shader compilation processes will not be scaled up as much as it should be, leading to reduced concurrency and longer overall cook times. If available system memory is over-reported or the memory usage of SCW processes is under-reported then the pool of shader compilation processes may be scaled up too far and system memory may be exhausted, leading to an Out Of Memory (OOM) error that results in one or more processes crashing and the overall cooking workload failing.
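The effect of inaccurate inputs on this kind of scaling decision can be illustrated with a minimal sketch. This is not the engine’s actual algorithm (which lives in the FShaderCompileThreadRunnable class); the function name, its parameters, and the headroom heuristic are all hypothetical, chosen only to show how under-reported available memory or over-reported worker memory shrinks the pool.

```python
def target_worker_count(available_bytes, per_worker_bytes, current_workers,
                        reserve_bytes=0, min_workers=1, max_workers=16):
    """Hypothetical heuristic: grow or shrink the pool to fit reported memory."""
    per_worker_bytes = max(per_worker_bytes, 1)
    headroom = available_bytes - reserve_bytes
    # Floor division rounds toward negative infinity, so a memory deficit
    # removes enough workers to cover it, while a surplus only adds workers
    # that fit completely.
    delta = headroom // per_worker_bytes
    return max(min_workers, min(max_workers, current_workers + delta))
```

With 8 GiB reported as available and four workers using 1 GiB each, this heuristic would grow the pool to twelve workers. If the same workers were over-reported as using 2 GiB each, the pool would only grow to eight, which mirrors the reduced concurrency described above.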
Error reporting and Out Of Memory (OOM) handling
Since cloud workloads typically run in an automated manner without user intervention, it is important that any failures are detected and correctly reported so they can be investigated and addressed. Robust error detection and reporting functionality is provided throughout key Unreal Engine components, including the ShaderCompileWorker (SCW) executable mentioned in the section above, Unreal Editor (which is best known for providing the engine’s primary graphical interface but is also the executable responsible for performing asset cooking and running command-line utilities known as “commandlets” in automation scenarios), and various accompanying C# tools such as Unreal Automation Tool (UAT) and UnrealBuildTool (UBT). Unreal Engine cloud workloads may involve the creation of a complex tree of child processes that run these executables. For example, an asset cooking workload might launch UAT as the top-level process, then UAT would run the Editor to perform a cook, and the Editor would then run multiple SCW child processes to compile shaders.
If any process in the tree of child processes fails, this will be detected by its immediate parent process and propagated up the tree to the top-level process so the failure of the overarching workload can be reported. If a process in the tree crashes, the Unreal Engine’s error handling code may also launch the Crash Report Client executable, which is responsible for collecting information about the crash and dumping the memory of the crashed process to disk in a small crash dump file known as a minidump file. The Crash Report Client can automatically transmit the resulting minidump file to a preconfigured server, or the file can be collected by some other means. (Note that in some scenarios the minidump file may instead be generated by the crashing process itself and then transmitted by the Crash Report Client, but the outcome is ultimately the same.) Once the minidump file has been retrieved, it can be analysed locally in a Windows debugger to diagnose the cause of the crash. This workflow is particularly useful in cloud deployment scenarios, since the nature of the cloud environment may vary significantly over time as different versions of Windows (or Wine) are deployed via different orchestration mechanisms. These details are completely transparent to developers when debugging crashes, since they can simply download a minidump file and open it in a debugger under their local Windows development environment.
Unreal Engine also provides special error handling logic for Out Of Memory (OOM) errors under Windows. When the engine starts, it pre-allocates a 32 MiB buffer of memory known as the “Backup OOM Pool”, which remains unused during normal operation. When a memory allocation request fails due to the operating system reporting that there is insufficient memory to satisfy the request, the engine’s OOM handler logic frees the Backup OOM Pool to ensure there is sufficient memory available to generate a minidump file. The fatal error is then reported as usual, ensuring OOM errors are reported reliably in the same manner as any other crash. The logic for the allocation and deallocation of the Backup OOM Pool can be found in the FGenericPlatformMemory::
To ensure reliable error reporting behaviour, the operating system must report both child process failures and memory allocation failures in the manner that the Unreal Engine’s error handling code expects, and generation of minidump files must function correctly. If failures are reported in a manner that Unreal Engine code is not expecting, or minidump files are not generated with the appropriate information, then the ability to identify and diagnose failures in Unreal Engine cloud workloads will be significantly compromised, undermining both the reliability of those workloads and also broader developer confidence in cloud deployment strategies for Unreal Engine.
Maintaining optimal performance
The broad scope and technical sophistication of Unreal Engine is directly reflected in the size and complexity of its codebase. As a result of this complexity, most Unreal Engine workloads commonly deployed to the cloud are computationally intensive and make full use of available hardware resources such as CPU, memory, storage and network bandwidth. Tuning cloud deployments to optimise performance and balance throughput requirements against resource costs is an art unto itself, and some Unreal Engine workloads may also be subject to additional demands such as internal development velocity targets that rely on rapid CI/CD pipeline completion times, or minimum performance levels guaranteed to customers as part of a Service-Level Agreement (SLA).
Given the challenges inherent in optimising the performance of Unreal Engine workloads (regardless of whether deployed in the cloud or on-premises), the operating system must avoid introducing any additional, unnecessary overheads. Poor optimisation of the underlying operating system functions that Unreal Engine code relies upon will only increase the resource requirements necessary to meet performance targets, therefore increasing the complexity of balancing those requirements against cost considerations.
Challenges and solutions
Misleading exit codes for abnormal process termination
When a Windows application runs under Linux with Wine, a native Linux process is created to host each Windows process. Similarly, each Windows thread in a given process runs directly as a native Linux thread inside its host process. State data for every Windows process and thread is maintained by the Wine server, the component of Wine that provides roughly equivalent functionality to the Windows kernel. The Wine server is a separate daemon process that maintains a persistent socket connection with each process and thread to facilitate communication. These connections are used by threads for making remote procedure calls (RPCs) to the Wine server in order to notify it of updated state information or to request that it carry out a given operation on the thread’s behalf. The Wine server also monitors the status of each persistent socket connection to detect disconnection events and respond accordingly.
During our investigation, it was discovered that Wine would report an exit code of zero for a given Windows process if its underlying Linux process was terminated from outside of Wine. (This can happen if a user manually terminates a process, or if another application terminates it programmatically. One example of the latter is the Linux Out Of Memory (OOM) killer, which is discussed in greater detail in the section Differences in memory overcommit behaviour under Windows and Linux below.) This value is misleading, since established convention dictates that a process exit code of zero indicates success whilst a non-zero value indicates failure. The Unreal Engine’s error handling logic assumes that this convention is followed, so failed child processes that report an exit code of zero to their immediate parent process will be assumed to have succeeded. Since no error will be detected or propagated up the tree to the top-level process, the remaining processes will continue running as normal until they inevitably encounter invalid state or other problems caused by the silent failure of the previous child process. These subsequent errors will be detected and reported as normal, but the reports will typically relate to an entirely different part of the codebase than the point where the original failure occurred, making it far more difficult to identify and diagnose the genuine cause.
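The exit code convention that parent processes rely on can be demonstrated with a small sketch on the native Linux side. This is not Wine or Unreal Engine code; the helper names are hypothetical, and the example assumes a POSIX system, where Python reports a child killed by a signal as a negative return code.

```python
import signal
import subprocess
import sys

def run_child(kill=False):
    """Launch a child process, optionally killing it from outside, and return its exit status."""
    code = "import time; time.sleep(30)" if kill else "pass"
    child = subprocess.Popen([sys.executable, "-c", code])
    if kill:
        # Terminate the child externally, much as the OOM killer would.
        child.send_signal(signal.SIGKILL)
    return child.wait()

def succeeded(returncode):
    """Apply the convention: zero means success, anything else means failure."""
    return returncode == 0
```

A parent that checks `succeeded(run_child(kill=True))` sees a failure because the status is non-zero. If the status were reported as zero instead, as Wine previously did, the same check would wrongly conclude that the child had completed its work.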
The cause of these misleading process exit codes was an oversight in the Wine server logic that monitors the persistent socket connection for each process to detect disconnection events. The code was not correctly checking the current state of the process at the time of the disconnect, and was therefore unable to distinguish between expected disconnects during normal process shutdown and abnormal disconnects caused by unexpected termination of the underlying Linux process. The fix was to add a check to correctly distinguish between these scenarios, ensuring abnormal termination is treated as such and results in a non-zero process exit code. This patch was submitted upstream to the Wine project in merge request 3908 and was subsequently merged in commit 2dfeb87f, which is present in Wine 9.11 and subsequent releases.
Minidump file size and debugging experience discrepancies
Overview of minidump files
Windows minidump files are designed to be extremely small by default, containing only the information necessary to diagnose a given crash in a debugger. The default settings include the stack for each thread, along with a small amount of memory around the CPU instruction that each thread was executing at the time of the crash. The rest of the application’s memory is excluded, as are the contents of any loaded modules (DLL and EXE files) that are mapped into memory. It is typically intended that the debugger will load any relevant DLL or EXE files (along with their accompanying PDB files) from disk when opening the minidump, these files having been either supplied by the developer who opened the minidump or else retrieved automatically from a preconfigured symbol store. The latter is particularly common when loading DLL and PDB files for system libraries, and Microsoft operates a public symbol server that provides these files for Windows system DLLs (along with DLL and PDB files for a variety of other Microsoft software).
In addition to providing default options that produce extremely small minidump files, the MiniDumpWriteDump() Windows API function gives applications a great deal of control in overriding the defaults and specifying exactly what information will be included or excluded in a given dump file. Various flags are provided to include or exclude each supported type of information (along with a callback system that allows applications to specify these options on a granular basis for each thread and each loaded module), and a flag is also available that includes the application’s full memory in the dump, producing extremely large files that are typically in the order of many gigabytes for complex applications. In contrast to “normal” minidump files generated with the default MiniDumpNormal flag, “full memory” minidumps generated with the MiniDumpWithFullMemory flag do include all loaded modules that are mapped into memory, so a debugger does not need to load any DLL or EXE files from disk when opening such minidump files. However, PDB files will still need to be provided by the developer or else retrieved from a symbol store in order to provide full debug symbols.
By default, Unreal Engine applications produce minidumps using the Windows API default options when a crash is encountered. The engine also supports producing full memory minidumps for crashes, which can be enabled by specifying the -fullcrashdump command-line flag. The code for parsing this flag can be found in the FGenericCrashContext::MiniDumpWriteDump()
can be found in the WriteMinidump() function in an anonymous C++ namespace in the file WindowsPlatformCrashContext.cpp (note that you will need to be logged in to GitHub and have linked your GitHub account to an Epic Games account in order to view the Unreal Engine source code).
Differences between minidumps generated under Windows and Wine
During our investigation, we tested the creation of both normal and full memory minidump files with Unreal Engine applications under Wine, and it was discovered that normal minidump files generated for 64-bit applications were significantly larger than the equivalent dumps created under Windows. We also observed that these files took hours to load when opened in the Visual Studio debugger, although full memory minidumps loaded within a matter of seconds as expected. After examining the Wine codebase and making an educated guess as to the cause, we developed a small suite of tools to analyse and compare minidump files, and a comparison of dump files generated by Windows and Wine confirmed our initial diagnosis.
The cause of the large file sizes for normal minidumps was logic in Wine that included unwind information for each function in every loaded module, in the form of RUNTIME_FUNCTION structures that Windows uses for exception handling under the x64 architecture. These structures exist for every function that allocates stack space or calls another function, and the enormous size of the Unreal Engine codebase represented a pathological case where hundreds of thousands of RUNTIME_FUNCTION structures were included in the generated minidump files. Because this data is non-contiguous in memory, individual memory ranges were added to the minidump file for each structure, and the sheer number of memory ranges was causing the Visual Studio debugger to perform poorly when loading the dump files. This unwind information is precomputed at compile-time and is stored in the module that contains the functions to be unwound. As mentioned previously, normal minidump files do not ordinarily include the contents of loaded modules that are mapped into memory (aside from the small amount of memory surrounding the instruction pointer for each thread), and so this unwind information is not present in normal minidump files generated by Windows. The fix was to exclude it from normal minidump files generated by Wine as well.
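A symptom like this can be checked by counting the memory ranges recorded in a dump file. The sketch below is not the tooling used in the investigation; it is a minimal, hypothetical reader written against the publicly documented minidump layout (MINIDUMP_HEADER, MINIDUMP_DIRECTORY and MINIDUMP_MEMORY_LIST), and the function name is an assumption made for illustration.

```python
import struct

MINIDUMP_SIGNATURE = 0x504D444D  # the bytes "MDMP" read as a little-endian integer
MEMORY_LIST_STREAM = 5           # MINIDUMP_STREAM_TYPE value for MemoryListStream

def count_memory_ranges(data: bytes) -> int:
    """Return the number of memory ranges in a minidump's MemoryListStream."""
    # MINIDUMP_HEADER: Signature, Version, NumberOfStreams, StreamDirectoryRva,
    # CheckSum and TimeDateStamp (all 32-bit), followed by 64-bit Flags.
    signature, _version, stream_count, directory_rva, _checksum, _timestamp, _flags = \
        struct.unpack_from("<IIIIIIQ", data, 0)
    if signature != MINIDUMP_SIGNATURE:
        raise ValueError("not a minidump file")
    for index in range(stream_count):
        # MINIDUMP_DIRECTORY: StreamType, then a location descriptor (DataSize, Rva).
        stream_type, _size, rva = struct.unpack_from("<III", data, directory_rva + index * 12)
        if stream_type == MEMORY_LIST_STREAM:
            # MINIDUMP_MEMORY_LIST begins with a 32-bit NumberOfMemoryRanges field.
            (range_count,) = struct.unpack_from("<I", data, rva)
            return range_count
    return 0
```

Comparing the result for a dump generated under Windows with one generated under Wine would make a difference of hundreds of thousands of ranges immediately visible. Full memory dumps store their memory in a different stream type (Memory64ListStream), which this sketch does not handle.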
We reported this issue on the Wine bug tracker in bug 55798 and Eric Pouech from CodeWeavers fixed it in commit 819d8692, which is present in Wine 9.5 and subsequent releases. Eric also kindly provided patches to backport all minidump-related improvements from Wine 9.13 (including this bugfix) to the stable Wine 9.0 release that was used in our investigation. With these patches in place, we observed file sizes extremely similar to dump files produced by Windows, and loading times in the Visual Studio debugger became near-instant as expected. We also used our suite of tools to confirm that the minidump files generated by Wine now contained equivalent information to their Windows counterparts, with the exception of a few minor pieces of information that do not meaningfully impact the debugging experience.
With the file size and loading time discrepancies resolved, the only remaining difference in the debugging experience when loading minidumps generated under Wine rather than Windows was the lack of automatic symbol file loading. When opening a minidump file generated under Windows, debuggers such as the Visual Studio debugger or WinDbg will automatically download the DLL and PDB files for Windows system libraries from the Microsoft public symbol server. However, there is no default symbol store that provides the files for the Wine system libraries, so developers need to manually provide the debugger with copies of the Wine system DLL files. In addition, building Wine with the default MinGW-w64 compiler toolchain does not produce PDB files for the system libraries, since the toolchain does not support the PDB file format. To rectify this, we built our patched version of Wine 9.0 using the LLVM-MinGW compiler toolchain, which does support generating PDB files. We then uploaded the DLL and PDB files for the Wine system libraries to our own symbol store and configured the Visual Studio debugger to use it as a source of symbol files. With this configuration in place, the debugger was able to automatically retrieve the DLL and PDB files without further manual intervention, and the overall debugging experience was identical to the experience of loading an equivalent minidump file generated under Windows.
Memory usage reporting for Linux containers
Unreal Engine uses an internal function named FWindowsPlatformMemory::
Wine’s implementation of GlobalMemoryStatusEx() calls the function NtQuerySystemInformation() to gather system performance information, and the Wine implementation of that function in turn calls an internal Wine function named get_GlobalMemoryStatusEx()
under Windows, making it an ideal implementation choice under Linux. However, during our investigation we encountered limitations that make this choice problematic when running inside of Linux containers.
Querying cgroup memory usage
Reading Linux system memory information from the /proc//proc/meminfo
are the same regardless of whether it is read from within a container or directly from the host, and always represent the state of the underlying host system rather than that of the container itself. This is an important distinction because containers can be configured with resource limits, including limits for memory and swap. The resource usage of processes running inside the container is strictly bound by the configured limits. The underlying mechanism which facilitates these resource limits under Linux is called Control Groups, more commonly known as cgroups.
For applications running inside of a container, querying memory utilisation information for the underlying host system is not typically useful. These values do not reflect the configured memory limit for the container, or how much of the system’s current memory usage is attributed to the container’s cgroup. This is not helpful when we need to determine how much memory is available to processes running inside the container. Fortunately, Linux provides a dedicated set of interface files located under the /sys/fs/cgroup/ directory which can be queried to determine the current resource utilisation and limits for a given cgroup. These interface files provide a wide variety of information, including memory information.
Values provided by the cgroup version 2 interface files that are of particular relevance to measuring memory usage and limits include:
- memory.current: The current total amount of physical memory in use by the cgroup, in bytes. This does not include swap.
- memory.max: The hard memory limit applied to the container, in bytes. For details about how Linux handles situations where this value is exceeded, see the section Memory overcommit under Windows and Linux below.
- memory.swap.current: The current total amount of swap space in use by the cgroup, in bytes.
- memory.swap.max: The hard swap limit applied to the container, in bytes. Memory pages cannot be swapped out once this limit is reached.
For cgroups version 1, the relevant values are:
- memory.usage_in_bytes: The current total amount of physical memory in use by the cgroup, in bytes. This does not include swap.
- memory.limit_in_bytes: The hard memory limit applied to the container, in bytes. For details about how Linux handles situations where this value is exceeded, see the section Memory overcommit under Windows and Linux below.
- memory.memsw.usage_in_bytes: The current total amount of combined physical memory and swap space in use by the cgroup, in bytes.
- memory.memsw.limit_in_bytes: The hard limit for physical memory and swap combined, in bytes.
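Reading these interface files directly is straightforward, as the following minimal sketch shows. It is illustrative only: the function name and the idea of passing the cgroup directory as a parameter are assumptions, and it reads the files by hand rather than using a helper library.

```python
from pathlib import Path

# Interface file names for each cgroup version, as listed above.
CGROUP_FILES = {
    2: {"usage": "memory.current", "limit": "memory.max"},
    1: {"usage": "memory.usage_in_bytes", "limit": "memory.limit_in_bytes"},
}

def read_cgroup_memory(cgroup_dir, version=2):
    """Return (usage_bytes, limit_bytes); the limit is None when cgroup v2 reports no limit."""
    names = CGROUP_FILES[version]
    base = Path(cgroup_dir)
    usage = int((base / names["usage"]).read_text().strip())
    raw_limit = (base / names["limit"]).read_text().strip()
    # cgroup v2 writes the literal string "max" when no limit is configured.
    limit = None if raw_limit == "max" else int(raw_limit)
    return usage, limit
```

A caller would typically pass the directory for the current process’s cgroup, for example `read_cgroup_memory("/sys/fs/cgroup")` inside a container using cgroups version 2. Under version 1 an unlimited cgroup reports a very large number rather than a keyword, which this sketch does not special-case.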
To ensure the Wine implementation of GlobalMemoryStatusEx() reports correct information with respect to a container’s memory limit and usage, we had to make Wine “cgroup aware” so that it could use the appropriate cgroup values when querying memory statistics. We patched the get_/proc/meminfo
in the case where the process is not in a cgroup or is part of a cgroup with no configured memory limits. The patched code makes use of the open source libcgroup library to interact with the underlying cgroup interface files. To ensure performance overheads are kept to a minimum, we cache values that are not expected to change during execution, such as the memory limit and swap limit. Additionally, we cache the host system’s MemTotal and SwapTotal values from /proc/meminfo so we only have to read the file once during initialisation. With these optimisations in place, subsequent queries only need to update the current values for physical memory usage, swap usage, and inactive page cache size.
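The caching of host totals can be sketched as follows. This is not the patched Wine code (which is written in C); both function names are hypothetical, and the sketch only illustrates the idea of parsing MemTotal and SwapTotal once and reusing the result on later queries.

```python
import functools

def parse_meminfo_totals(text):
    """Extract MemTotal and SwapTotal (in bytes) from /proc/meminfo contents."""
    totals = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("MemTotal", "SwapTotal"):
            # Values are reported in units of 1024 bytes, e.g. "MemTotal:  16384000 kB".
            totals[key] = int(rest.split()[0]) * 1024
    return totals

@functools.lru_cache(maxsize=None)
def host_totals(path="/proc/meminfo"):
    """Read and parse the file on the first call; later calls reuse the cached result."""
    with open(path) as handle:
        return parse_meminfo_totals(handle.read())
```

Only values that cannot change while the process is running are suitable for this treatment, which is why current usage figures are excluded from the cache.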
Accounting for kernel page cache usage
The Linux kernel accelerates disk I/O operations through the use of an in-memory caching mechanism known as the page cache. The page cache maintains a collection of pages containing data that has been read from files stored on disk. Whenever a process reads data from a file, the required regions are copied into memory and the pages containing that data are then retained even once they are no longer required by the process that performed the initial read operation (at which point they are referred to as “inactive” pages). When another process subsequently attempts to read the same regions of the same file, the operating system can simply re-use the cached memory pages rather than needing to retrieve the data again from disk. This significantly improves the performance of file read operations, since disk I/O is typically orders of magnitude slower than accessing RAM. Write operations to files on disk are cached in a similar manner so that they can be batched and flushed to the underlying disk asynchronously. (It is worth noting that this mechanism is not unique to Linux, and other operating system kernels also perform equivalent caching to accelerate disk I/O operations. Under Windows, cached pages are tracked in structures named the “standby list” and the “modified list”. For more details of the Windows implementation, see the section “Page frame number database” in Chapter 5 of the book Windows Internals 7th edition, Part 1.)
Inactive pages that are stored in the page cache are considered to be reclaimable. If a reclaimable page is unmodified (or “clean”), it can simply be freed at any time since the data can be retrieved from the associated file on disk if it is needed again later. If a reclaimable page is modified (or “dirty”), it must first be written to the underlying storage device before it can be freed. If the system is not under any memory pressure, the page cache will continue to grow to make use of any free memory, since leaving large quantities of RAM entirely unused represents a wasted opportunity to maximise system performance. Once the page cache has reached its maximum size, reclaimable pages will be used to satisfy any memory allocation requests from applications, and the page cache will shrink accordingly. (Incidentally, the visibility of page cache memory use is often a source of confusion for many new Linux users, leading to the rise of websites such as the amusingly-titled Linux ate my RAM! that explains how the page cache works and assures users that its memory use does not warrant concern.)
During our investigation, we realised that our cgroup memory usage calculations needed to explicitly account for the presence of the kernel page cache. The value reported by the memory.current cgroup interface file incorporates page cache use associated with the container, including both active and inactive pages. Our code was querying this interface and using its value unmodified in calculations. The inadvertent inclusion of reclaimable pages in our reported values was resulting in a misleading representation of available memory for the container, since these pages will be reclaimed and allocated to applications as described above. This problem would be further compounded over time as the page cache usage for a given container continued to grow, increasing the discrepancy in reported memory available to applications.
To address this, we adjusted our cgroup memory reporting logic to exclude reclaimable pages from the current memory usage value. Reclaimable pages associated with the container are reported by the inactive_ field of the memory.stat cgroup interface file. Subtracting this value when calculating the available memory for the cgroup results in a number which far more accurately represents the quantity of memory that is available for allocation by processes running in the container.
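The adjustment amounts to simple arithmetic, sketched below. The function names are hypothetical, and the default field name `inactive_file` is an assumption made for this example (the text above names the field only partially), so it is exposed as a parameter rather than hard-coded.

```python
def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup memory.stat file into a dict."""
    return {key: int(value) for key, value in
            (line.split() for line in text.splitlines() if line.strip())}

def available_bytes(limit, current, memory_stat_text, field="inactive_file"):
    """Estimate memory available to the cgroup, treating inactive cache pages as free."""
    reclaimable = parse_memory_stat(memory_stat_text).get(field, 0)
    # Usage that cannot simply be reclaimed by the kernel on demand.
    in_use = max(current - reclaimable, 0)
    return max(limit - in_use, 0)
```

For a cgroup with a limit of 1000 bytes, a current usage of 800 bytes and 500 bytes of inactive page cache, this yields 700 bytes available instead of the 200 bytes that the unadjusted calculation would report.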
Suitability for inclusion in upstream Wine
Despite the benefits afforded by modifying GlobalMemoryStatusEx() to report memory values that reflect the container environment a process might be running in, these changes were ultimately deemed unsuitable for inclusion in the upstream Wine project. The reason for this is that Wine strives to ensure “bug-for-bug” compatibility with Windows and the existing limitations are actually consistent with Windows behaviour. Under Windows, GlobalMemoryStatusEx() is not container-aware, and calling this API function from inside a process-isolated Windows container will report memory values that reflect the underlying host system rather than the container itself.
To mitigate this limitation when running inside of a Windows container, Unreal Engine includes specific logic to query the memory limits and memory use for the Job Object associated with the container. (Note that you will need to be logged in to GitHub and have linked your GitHub account to an Epic Games account in order to view the Unreal Engine source code.) However, this workaround is not possible when running under Wine, as the required Job Object API functions are not implemented. Implementing those API functions correctly would have required significantly more engineering effort than simply modifying the behaviour of the GlobalMemoryStatusEx() function, and was outside the scope of our investigation.
Since they have not been submitted for inclusion in the upstream Wine project, the cgroup memory reporting patches are instead maintained out-of-tree. For details on how to obtain these patches, see the section Get started using the Wine patches today below.
Approximate memory usage reporting for individual processes
As discussed in the section Memory-based scaling logic for asset cooking above, the Unreal Engine’s asset cooking code includes logic that distributes shader compilation jobs to a pool of ShaderCompileWorker (SCW) child processes that are scaled based on memory use. For this to function correctly, accurate information about the memory use of the SCW processes is required. When running under Windows, Unreal Engine uses a Job Object to track and query the aggregate memory usage of the pool of processes, but this approach does not work under Wine due to the absence of the relevant Job Object API functions as mentioned in the section Suitability for inclusion in upstream Wine above. When Unreal Engine detects that it is running under Wine, the code falls back to calling the GetProcessMemoryInfo() function to query the memory usage of each individual SCW process in the pool. The detection logic can be seen in the FShaderCompileThreadRunnable constructor and the alternate code paths that it triggers can be seen in the function FShaderCompileThreadRunnable::
The Wine implementation of GetProcessMemoryInfo() calls the function NtQueryInformationProcess() to gather memory information about the target process, and the Wine implementation of that function takes two different code paths depending on whether the target is the current process. If the current process is being queried then the code calls an internal Wine function named fill_fill_
and the request handler read process memory information from the /proc/PID/status interface file (where PID is the Linux process ID of the target process).
For performance reasons, the memory statistics reported by /proc/PID/status are updated by the Linux kernel in an asynchronous manner. As a result, the values represent an approximation of the current memory use for the process rather than a precise measurement. During our investigation, we observed that memory usage was often over-reported for the ShaderCompileWorker (SCW) child processes. This presented a problem, since over-reporting SCW memory usage means the pool of shader compilation processes will not be scaled up as much as it should be, leading to reduced concurrency and longer overall cook times as discussed in the section Memory-based scaling logic for asset cooking above. The problem is also compounded as the number of processes increases, since the larger the pool of SCW processes, the greater the cumulative discrepancy becomes between the actual and reported memory use.
To address this issue, we modified Wine’s implementation of GetProcessMemoryInfo() to instead read memory information from the /proc/PID/smaps_rollup interface file. This file is an aggregated variant of the /proc/PID/smaps interface file, which provides very accurate information for each of a process’s virtual memory mappings. However, this accuracy comes at a performance cost due to the need to iterate over each of the memory mappings in the process and sum their sizes rather than simply reading a pre-cached value. Fortunately, this greater level of accuracy is only required by the SCW pool scaling logic, which runs within a dedicated thread named Shader… Only calls to GetProcessMemoryInfo() originating from the named shader compilation thread will trigger a read from /proc/PID/smaps_rollup. Calls originating from all other threads will instead fall back to the default behaviour of reading the approximate values provided by /proc/PID/status.
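The difference between the two interface files can be illustrated with a minimal Python sketch. It assumes that the fields of interest are VmRSS in /proc/PID/status and Rss in /proc/PID/smaps_rollup; the exact fields read by the patched Wine code are not stated in this section, and the helper names parse_kb_field() and read_rss_kb() are hypothetical.

```python
import re


def parse_kb_field(text, key):
    """Return the value (in kB) of the first 'Key:   N kB' line, or None."""
    match = re.search(rf"^{re.escape(key)}:\s+(\d+)\s+kB", text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_kb(pid="self", precise=False):
    """Read the resident memory of a process in kB.

    precise=False reads the cheap, approximate VmRSS value from
    /proc/PID/status. precise=True reads the Rss value from
    /proc/PID/smaps_rollup, which the kernel computes by walking
    the memory mappings of the process (assumed field names).
    """
    name, key = ("smaps_rollup", "Rss") if precise else ("status", "VmRSS")
    with open(f"/proc/{pid}/{name}") as f:
        return parse_kb_field(f.read(), key)
```

Calling read_rss_kb(precise=True) only from the thread that needs accurate numbers, and read_rss_kb() everywhere else, mirrors the trade-off between accuracy and cost described above.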
Due to the Unreal Engine-specific nature of these changes, such as checking for a specific thread name when determining whether or not to use smaps, they have not been submitted for inclusion in the upstream Wine project and are instead maintained out-of-tree. For details on how to obtain these patches, see the section Get started using the Wine patches today below.
Differences in memory overcommit behaviour under Windows and Linux
Understanding memory states under Windows and Linux
Under both Windows and Linux, anonymous private memory that is allocated by user applications (i.e. memory that is not explicitly mapped to a file on disk or shared with other processes) can broadly exist in one of three states (partially overlapping with the three documented page states in the Windows documentation):
-
Reserved: when allocating memory, applications can advise the operating system that the allocated pages are simply “reserved” and will not be accessed in their current state. The operating system will reserve the addresses in the requested range of the application’s virtual address space so they cannot be used by other allocations, and will treat any access to the reserved pages as an error. The application may subsequently request that the operating system update the state of some or all of these pages to make them accessible, which can be useful for allocating sparse data structures or for implementing security and debugging mechanisms that detect invalid access to memory. Alternatively, the application may leave the range in a reserved state so the addresses can be used for specialised pointer arithmetic or as sentinel pointer values. Since reserved memory cannot result in actual memory use without its state first being updated, reserved memory does not impact system-wide memory use from the perspective of the operating system’s internal memory accounting mechanisms. Under Windows, applications can reserve memory by passing the MEM_RESERVE flag to the VirtualAlloc() function. Under Linux, applications can reserve memory by passing the PROT_NONE memory protection flag to the mmap() function. Although it is also possible to reserve memory that is marked as writeable by passing both the PROT_WRITE memory protection flag and the MAP_NORESERVE flag, this latter flag is ignored if the system is configured to disable memory overcommit (see below for details of this Linux-specific configuration option), so the recommended way to reserve memory under Linux is by specifying the PROT_NONE memory protection flag.
-
Committed: when applications allocate memory that is intended for actual use (i.e. they do not mark the memory as reserved), or they update the state of previously reserved memory to indicate that it will now be used, that memory is considered by the operating system to be “committed” memory and is tracked as such by the system’s internal memory accounting mechanisms. Committed memory is named as such to reflect the fact that the operating system has made a commitment to fulfil access requests to this memory, although the nature of that commitment differs significantly between Windows and the default settings used by the Linux kernel. As discussed in the dot point below, committed memory may or may not have a physical backing and therefore may or may not contribute to system-wide physical memory use, but it always counts towards the system-wide committed memory use (referred to as the “commit charge” under Windows). Committed memory use is tracked by both Windows and Linux, but the two operating systems take very different approaches to imposing limits on committed memory, as discussed below. Under Windows, applications can commit memory by passing the MEM_COMMIT flag to the VirtualAlloc() function, and can subsequently decommit memory to return it to a reserved state by passing the MEM_DECOMMIT flag to the VirtualFree() function. Under Linux, applications can commit memory by passing the PROT_WRITE memory protection flag when calling the mmap() function (making sure to omit the MAP_NORESERVE flag to guarantee consistent behaviour regardless of system configuration), or by passing the PROT_WRITE memory protection flag to the mprotect() function. Although it is possible to decommit memory by subsequently calling mprotect() without the PROT_WRITE memory protection flag, this only works under very specific circumstances, and so the recommended way to decommit memory under Linux is to instead pass the MADV_DONTNEED flag to the madvise() function.
-
Physically backed: when applications allocate committed memory, the operating system does not immediately establish a mapping to a physical backing store such as RAM or non-volatile storage on disk (i.e. the page file under Windows and swap space under Linux). Instead, the operating system waits until the application first accesses a given page of committed memory, and then establishes its mapping to a physical backing in an on-demand manner (known as “demand paging”). Once a page of committed memory has been accessed at least once, it will be backed by physical storage and will count towards system-wide physical memory use. This remains true regardless of how many times a given page of memory may migrate between different physical storage devices (e.g. if paged out to disk after a period of inactivity or subsequently paged back into RAM as the result of being accessed), since it will always retain some sort of physical backing until it is decommitted or deallocated. Applications do not need to explicitly call an API function to establish physical backing for memory under Windows or Linux since the memory need only be accessed to trigger demand paging, but applications can optionally call the VirtualLock() function under Windows and the mlock() function under Linux to force a given region of memory to remain resident in RAM.
The semantics of shared memory and memory-mapped files are more nuanced. Under both Windows and Linux, named shared memory objects can be created and mapped into the memory of processes in the same manner as files. Named shared memory objects can be created by passing INVALID_
to the CreateFileMapping() function under Windows, and by calling the shm_open() function under Linux. Once named shared memory objects have been created, they can be mapped into memory by calling the MapViewOfFile() function under Windows and the mmap() function under Linux. These same functions can also be called to map files from disk into memory, by providing a handle (under Windows) or file descriptor (under Linux) that represents an open file rather than a named shared memory object. The impact that this mapping has on committed memory varies depending on the scenario:
-
If a file on disk is being mapped into memory then the memory protection flags specified by the application will determine whether mapped pages contribute towards system-wide committed memory use. Pages that are directly read from or written to the mapped file will not be treated as committed memory, since the pages used by the operating system to cache these read and write operations will be reclaimable memory provided by the standby and modified lists under Windows, and the kernel page cache under Linux. (For more details on reclaimable memory, see the section Accounting for kernel page cache usage above.) If pages are mapped with copy-on-write semantics by passing the FILE_MAP_COPY flag to MapViewOfFile() under Windows and passing both MAP_PRIVATE and PROT_WRITE to mmap() under Linux, then any writes to mapped pages will result in a private copy being created on behalf of the process and committed. Under Windows, copy-on-write pages are counted against the commit charge pre-emptively when a file is mapped into memory, to ensure that there will be sufficient space available in a physical backing store if the application subsequently writes to all of the mapped pages. The same is true under Linux unless the MAP_NORESERVE flag was passed to mmap(), in which case copy-on-write pages will be committed on demand when they are accessed.
-
If a named shared memory object is being mapped into memory then the application can control whether the mapped pages are reserved or committed. Under Windows, applications can pass either the SEC_RESERVE flag or SEC_COMMIT flag (the default if otherwise unspecified) when calling the CreateFileMapping() function, and this value will determine whether MapViewOfFile() reserves or commits the mapped pages. Passing the SEC_COMMIT flag to the CreateFileMapping() function also pre-emptively increases the system commit charge to ensure sufficient space is available in a physical backing store before applications map a view of the shared memory object into their address space. It is worth noting that passing the FILE_MAP_COPY flag to MapViewOfFile() will result in the mapped pages counting against the system commit charge even if the SEC_RESERVE flag was specified when calling CreateFileMapping(), due to the pre-emptive accounting logic for copy-on-write pages discussed in the dot point above. Under Linux, applications can either pass or omit the MAP_NORESERVE flag when calling the mmap() function and mapped pages will be reserved or committed as described previously. It is worth noting that when creating named shared memory objects with the shm_open() function (and similarly when creating anonymous memory-backed files using the memfd_create() function) the memory object will initially be created with a length of zero, so applications will need to set the size by calling the ftruncate() function prior to mapping the memory into their address space. The call to ftruncate() will not commit the pages for the named shared memory object, but writing to the file descriptor will, as will mapping the pages without the MAP_NORESERVE flag as described above.
Memory overcommit under Windows and Linux
The distinction between the memory states described in the section above becomes relevant when considering the different approaches that Windows and Linux implement when enforcing limits on the amount of memory that can be committed. Windows unconditionally enforces a strict limit on committed memory, referred to as the system “commit limit”, which is equal to the total amount of physical backing space that exists across RAM and all configured page files. If an application makes an allocation request for committed memory that would exceed the commit limit then Windows denies the request and the VirtualAlloc() function sets the calling thread’s last error code to ERROR_
. (The limit is also enforced pre-emptively when mapping copy-on-write pages into memory, and MapViewOfFile() will similarly fail if permitting the mapping to proceed would exceed the commit limit.) The system commit limit is enforced regardless of the amount of memory that is physically backed, so it is possible for memory allocation requests to fail even if there is available physical memory. However, due to the pre-emptive nature of the commit limit enforcement, it is not possible for demand paging to fail when establishing the mapping to a physical backing for a given page. As such, applications running under Windows are responsible for handling Out Of Memory (OOM) errors when allocating memory, but can subsequently rely on the guarantee that Windows will fulfil its commitment to provide all memory that has been successfully committed.
Under Linux, the default behaviour is to permit applications to commit memory far in excess of the commit limit, also known as “overcommitting” memory. This behaviour is configurable via the vm.overcommit_memory sysctl, and the kernel supports three overcommit handling modes that are all implemented by the internal kernel function __vm_enough_memory():
-
Heuristic overcommit: in this mode, overcommitting memory is permitted, but individual allocation requests are checked to verify that the requested size does not exceed the total combined size of the system’s physical backing stores (i.e. RAM and swap space). This heuristic helps to guard against individual allocation requests that are blatantly incorrect, but still permits the system-wide committed memory use to exceed the available physical backing space. This has been the default overcommit handling mode in the mainline Linux kernel since commit 502bff0685, which is present in kernel version 2.5.30 and subsequent releases. (It is worth noting that although the implementation of the heuristic check has changed during the subsequent 22 years, the purpose and description of this mode remain unchanged.)
-
Always overcommit: in this mode, overcommitting memory is permitted unconditionally and no further checks are carried out by __vm_enough_memory() when considering individual allocation requests. Callers such as mmap() will still enforce address space limits imposed by system-wide resource limits, but this is functionally unrelated to overcommitting memory and is true for all overcommit handling modes.
-
Never overcommit: in this mode, overcommitting memory is not permitted, and allocation requests will be denied if they would exceed the configured commit limit. Unlike Windows, Linux makes the commit limit fully configurable via the vm.overcommit_ratio and vm.overcommit_kbytes sysctls, which allow system administrators to configure either a limit based on percentage of system RAM or an absolute limit, respectively. In addition to enforcing the configured commit limit, a small amount of memory is reserved to help maintain smooth system operation in the event that the commit limit is reached. As mentioned previously, the mmap() function also ignores the MAP_NORESERVE flag when it detects that memory overcommit is disabled, so applications must use the PROT_NONE memory protection flag in order to reserve memory. This overcommit handling mode was introduced to the mainline Linux kernel in commit 502bff0685, which is present in kernel version 2.5.30 and subsequent releases.
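As a rough illustration of how these settings relate to one another, the following Python sketch maps the numeric values of the vm.overcommit_memory sysctl to the three modes and computes the commit limit that applies in the "never overcommit" mode. The formula follows the kernel's overcommit accounting documentation (swap space is added to either the RAM percentage or the absolute value); the function and parameter names are hypothetical, and the default ratio of 50 is the kernel's default for vm.overcommit_ratio.

```python
# Numeric values accepted by the vm.overcommit_memory sysctl.
OVERCOMMIT_MODES = {0: "heuristic", 1: "always", 2: "never"}


def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio=50,
                    overcommit_kbytes=0, hugetlb_kb=0):
    """Approximate the commit limit enforced in 'never overcommit' mode.

    A non-zero vm.overcommit_kbytes takes precedence over
    vm.overcommit_ratio. Swap space is added in both cases.
    """
    if overcommit_kbytes:
        allowed = overcommit_kbytes
    else:
        allowed = (mem_total_kb - hugetlb_kb) * overcommit_ratio // 100
    return allowed + swap_total_kb


def read_sysctl_int(name):
    """Read an integer sysctl such as 'vm.overcommit_memory' via /proc/sys."""
    with open("/proc/sys/" + name.replace(".", "/")) as f:
        return int(f.read())
```

On a live system, the kernel also reports the resulting value directly as the CommitLimit field in /proc/meminfo, which is a useful cross-check for a calculation like this one.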
When the Linux kernel is configured to use a mode in which overcommitting memory is permitted, applications have no guarantee that the kernel can actually provide all of the memory that has been successfully committed. Unlike under Windows, where Out Of Memory (OOM) errors are reported pre-emptively to applications by way of failed allocation requests, OOM errors under Linux manifest at the point when memory is accessed and the kernel attempts to establish a mapping to a physical backing. When this occurs, the kernel itself handles the error rather than reporting it to the application that triggered physical memory exhaustion. The kernel attempts to resolve the issue by invoking the Linux Out Of Memory (OOM) Killer, a process that examines all non-essential user processes and selects candidates for termination. The OOM Killer will identify processes with high memory use and terminate them one-by-one by sending a SIGKILL signal until enough memory is freed for the system to continue normal operation. Since the SIGKILL signal cannot be caught or handled by applications, there is no opportunity to react to the OOM error or meaningfully report it; OOM-killed applications are simply terminated immediately.
This behaviour also holds true for Windows applications running under Wine, which presents a problem for applications that rely on the memory accounting guarantees provided by Windows. In particular, the special OOM error handling logic implemented by Unreal Engine will not function correctly under Wine if the underlying Linux kernel is configured to permit overcommitting memory, since memory commit operations will always succeed and the OOM handler code will never be triggered. Worse still, if the process is forcibly terminated by the Linux OOM Killer then the regular crash reporting code will never run either, so the engine will be unable to generate and submit a crash report. (In a best-case scenario, the immediate parent process may detect and report the failure if the OOM was triggered by a child process, but the generated report will contain no indication that an OOM error is to blame, or any details about the state of the child process at the moment in time when the OOM state was triggered.) This significantly compromises the ability to identify and diagnose OOM failures in Unreal Engine cloud workloads, undermining both reliability and developer confidence as discussed in the section Error reporting and Out Of Memory (OOM) handling above.
The obvious remedy for this problem would be to simply configure the Linux kernel to disable memory overcommit entirely and enforce strict commit limits. However, this is undesirable when deploying Wine inside Linux containers, since the vm.overcommit_memory sysctl is a kernel-wide setting that impacts the entire underlying host system and all container workloads, with no available mechanism to isolate its effects to a given container. It is of course possible to schedule all cloud workloads that require overcommit prevention together on the same underlying host systems during deployment to minimise this impact, as some articles recommend, but this still affects the orchestration software running on the host system and any other applications deployed in sidecar containers. It also impacts local testing, since the local environment will need to be configured to disable memory overcommit and this may interfere with other applications running on the same local host system. This can be particularly undesirable for developers who are running Windows and using WSL2 for local testing of Linux containers, since WSL2 runs a single virtual machine and separates distros using Linux namespaces, so any kernel-wide configuration options such as the overcommit handling mode will affect all applications running under all WSL2 distros on a given Windows host machine.
Disabling memory overcommit at the kernel level may also lead to other undesirable side effects. In our testing, we observed that some native Linux applications exhibited unusual behaviour when overcommit was disabled, likely due to their code having been written based on the assumption that all memory allocation requests should unconditionally succeed. (This practice is apparently fairly common, and critics of the Linux kernel’s choice of default overcommit handling mode often cite the prevalence of this assumption in application and library code as a key concern.) Since it is not possible to rely on the Linux kernel’s native support for preventing memory overcommit without the potential for introducing unwanted side effects, the only remaining option during our investigation was to implement overcommit prevention in userspace code, both in native Linux code and within the Windows API implementation in Wine itself.
Emulating Windows overcommit prevention logic under Wine
During our investigation, we considered a number of potential designs for emulating Windows overcommit prevention behaviour under Wine. The primary design constraint was the need to keep performance overheads to a minimum without sacrificing accuracy. Since the Linux kernel tracks committed memory use regardless of the configured overcommit handling mode, it would have been desirable to query this value and avoid the need to introduce any additional bookkeeping. However, committed memory use appears to be tracked only at a system-wide level, and there is no documented mechanism provided to query the commit charge for a cgroup. This precludes the use of the kernel-tracked commit value when attempting to determine the total amount of committed memory use for a Linux container. We also briefly considered implementing comprehensive accounting for committed memory use in userspace, aggregating values from both Wine and native Linux code. However, the engineering effort that would have been required to implement this in a reliable and performant manner was outside the scope of our investigation.
Ultimately, we identified only one set of kernel-tracked values that could satisfy all of our requirements regarding accuracy, performance, and cgroup granularity: the cgroup interface files that we had previously used for reporting container memory use, as discussed in the section Querying cgroup memory usage above. Since there is no value for committed memory use, we instead opted to reuse our computed value for physically-backed memory use (which takes into account factors such as reclaimable memory to calculate an accurate number, as discussed in the section Accounting for kernel page cache usage above). Given that this value tracks memory with a physical backing rather than committed memory, the only way to guarantee that the full commit charge is accounted for is to ensure that a mapping is established to a physical backing store for all committed memory immediately after it is first committed. This provides reliable and performant accounting for all committed memory within the container, at the cost of completely negating the benefits afforded by demand paging. This compromise was deemed acceptable for Unreal Engine asset cooking workloads and we observed no negative impact in our testing, but we expect that this approach would likely be unsuitable for many other types of cloud workloads.
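The kind of check that this design implies can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the cgroup v2 interface files memory.current, memory.max and memory.stat, and it treats the inactive_file entry of memory.stat as the reclaimable portion purely for demonstration. The actual calculation used by the patches is described in the earlier sections and is not reproduced here, and all function names are hypothetical.

```python
def parse_memory_stat(text):
    """Parse the 'key value' lines of a cgroup v2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    return stats


def parse_limit(text):
    """Parse memory.max, which holds either a byte count or 'max'."""
    text = text.strip()
    return None if text == "max" else int(text)


def commit_fits(request, current, limit, reclaimable):
    """Return True if committing `request` bytes stays within the limit.

    `current` is the value of memory.current, `limit` is the parsed
    memory.max (None when unlimited), and `reclaimable` is the portion
    of `current` that the kernel could reclaim (assumed: inactive_file).
    """
    if limit is None:
        return True
    used = max(current - reclaimable, 0)
    return used + request <= limit
```

Because the check relies on physically-backed memory use rather than a true commit charge, it is only meaningful if every committed page is faulted in immediately after it is committed, which is the compromise described above.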
Three potential options were considered for establishing a physical backing for each committed page: calling the mlock() or mlockall() functions to force the page to be resident in RAM, manually accessing the contents of the page, or passing the MADV_POPULATE_READ or MADV_POPULATE_WRITE flag (as appropriate for the protection flags for the given page) to the madvise() function:
-
Passing the MCL_CURRENT and MCL_FUTURE flags to the mlockall() function would have the desirable property of requiring only a single call per process that would guarantee that all subsequent allocations become physically backed without further intervention. This would also ensure that copy-on-write pages are pre-emptively copied, since mlock() and mlockall() both call the internal kernel function __mm_populate(), and this in turn calls the internal kernel function populate_vma_page_range(), which specifically performs a write fault when populating writable pages to ensure copy-on-write is triggered if applicable. However, the use of this approach inside a container requires that the underlying container runtime grants the container additional capabilities which are not granted by default. This requirement was deemed unsuitable, since we wanted to keep the number of required container configuration parameters to a minimum, particularly in light of the other unavoidable required parameters discussed in the section Preventing native Linux code from triggering OOMs below. This approach also has the potentially undesirable side effect of preventing committed memory from subsequently being paged out to swap space.
-
Manually accessing the committed pages of memory requires intervention every time memory is committed, but has the benefit that it will work in any environment without requiring special configuration and will not prevent pages from subsequently being swapped out. Manually reading data from each page that can be read, and writing that data back to each page that can be written, would establish a physical backing for each page whilst ensuring that any copy-on-write operations are triggered pre-emptively without altering the contents of the pages being accessed. However, this approach has the downside of potentially interfering with the state of CPU data caches, since all pages are actually accessed as they are faulted in. For large ranges of memory that might not otherwise be accessed in their entirety, this could potentially lead to other useful data being evicted from one or more CPU caches, necessitating that the data is fetched again from the physical backing store when it is next accessed and thus incurring a performance penalty.
-
The Linux-specific MADV_POPULATE_READ and MADV_POPULATE_WRITE flags supported by the madvise() function provide equivalent functionality to manually accessing memory pages, but with the important distinction that the contents of the pages are not actually accessed, therefore reducing the chance of inadvertently interfering with the state of CPU data caches. This was the approach that we ultimately selected, since it will work in any environment and provides the desired copy-on-write behaviour without incurring unnecessary performance overheads. This provides an optimal mechanism to establish a physical backing for all pages that have not been configured with the PROT_NONE memory protection flag, which is precisely what we want since the PROT_NONE memory protection flag indicates reserved memory rather than committed memory, as discussed in the section Understanding memory states under Windows and Linux above.
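The second and third options can be compared with a short Python sketch. It is a minimal illustration rather than the code used in the Wine patches: it tries madvise() with MADV_POPULATE_WRITE on a writable anonymous mapping and, if the kernel or Python build does not support it, falls back to touching one byte per page by hand. The numeric constant is taken from the Linux asm-generic headers (the flag requires kernel 5.14 or newer), and the function name populate_writable() is hypothetical.

```python
import mmap

# Value of MADV_POPULATE_WRITE in Linux's asm-generic/mman-common.h.
# Python's mmap module does not necessarily export this constant.
MADV_POPULATE_WRITE = 23


def populate_writable(mapping):
    """Fault in every page of a writable mapping.

    Returns "madvise" if the kernel populated the pages for us, or
    "touch" if we fell back to accessing each page manually.
    """
    try:
        mapping.madvise(MADV_POPULATE_WRITE)
        return "madvise"
    except (AttributeError, OSError, ValueError):
        # Fallback: read one byte from each page and write it back, which
        # triggers a write fault without changing the page contents.
        for offset in range(0, len(mapping), mmap.PAGESIZE):
            mapping[offset] = mapping[offset]
        return "touch"
```

Either path leaves the contents of the mapping unchanged, but only the madvise() path avoids pulling the data of each page through the CPU caches.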
With the fundamental approach identified, the next step was to implement overcommit prevention logic within the Wine codebase. This required modifications to each of the Windows API functions that commit memory:
-
VirtualAlloc() and its variations: these functions were the most straightforward to patch. The Wine implementations of VirtualAlloc(), VirtualAllocEx() and VirtualAlloc2() all call NtAllocateVirtualMemory(), which in turn calls an internal Wine function named allocate_virtual_memory(). We added a check to allocate_virtual_memory() just after parameter validation but immediately prior to entering the critical section, which only runs when the MEM_COMMIT flag has been specified. The check queries the current memory use for the container and compares it to the memory limit for the container (if one is configured) to determine whether there is sufficient available memory to satisfy the commit request. If there is insufficient memory then the function returns STATUS_NO_MEMORY (which will subsequently be translated to the ERROR_NOT_ENOUGH_MEMORY error code), otherwise execution proceeds as normal. We added a second check that runs when the function has succeeded, which faults in each of the newly-committed pages of memory if the MEM_COMMIT flag was specified. This ensures the pages have a physical backing and will be correctly accounted for in subsequent checks that verify the availability of sufficient memory to satisfy commit requests.
-
CreateFileMapping(): this function was more complex to patch, since it involves code paths for creating both file-backed mappings and named shared memory objects. The Wine implementation of CreateFileMappingA() simply converts its string parameter from a narrow string to a UTF-16 wide-character string and then calls CreateFileMappingW(), which in turn calls NtCreateSection(). The implementation of NtCreateSection() performs an RPC call to the Wine server (see the section Misleading exit codes for abnormal process termination above for additional details on this component of Wine), which is processed by the create_mapping request handler. The handler implementation calls an internal Wine function named create_mapping(), which has two code paths that handle mappings for regular files and named shared memory objects, respectively. Prior to these code paths diverging, an object is created to represent the mapping by calling an internal Wine function named create_named_object(), which in turn calls either alloc_object() or create_object(). This latter function also calls alloc_object(), along with another function called alloc_name(). Both of these functions call an internal function named mem_alloc() to allocate memory, which can increase the commit charge. We patched this function to add a check that verifies that there is sufficient memory available to satisfy the commit request. There was no need for us to add logic to fault in the newly-allocated pages immediately after this, since mem_alloc() calls the native Linux malloc() function to perform the allocation and the pages will instead be faulted in by the mechanism discussed in the section Handling native Linux memory allocations below. After this point, the code paths diverge based on the type of mapping being created:
-
The code path for regular files can only increase the commit charge if creating a shared mapping for an executable image (DLL or EXE file). When creating a mapping for an image, it calls an internal Wine function named get_image_params() to parse the image headers and populate the mapping metadata relevant to image files. Once the headers have been parsed, the function calls another internal Wine function named build_shared_mapping() to create a temporary file that will back the mapping and copy data into it from the executable image file. This function allocates memory by calling alloc_object() (which we previously patched as described above) and also by directly calling the Linux malloc() function. We added a check immediately prior to the direct malloc() call that verifies that there is sufficient memory available to satisfy the commit request. There was no need for us to add logic to fault in the newly-allocated pages immediately after this, since that will be handled by the native Linux memory allocation accounting mechanisms described in the section Handling native Linux memory allocations below. In addition to directly allocating memory, the build_shared_mapping() function also creates a temporary file by calling an internal Wine function named create_temp_file(). The creation of this temporary file can increase the commit charge if it is created under a filesystem that is backed by memory rather than disk, such as ramfs or tmpfs. This is actually the same function that is called by the code path for named shared memory objects, so at this point the two code paths for CreateFileMapping() converge once more and both can be handled by the checks that are discussed in the dot point below.
-
The code path for named shared memory objects always has the potential to increase the commit charge, but whether it actually does so is dependent on the configuration of the underlying container environment or host system. In this code path, the line of interest is the one that populates the Unix file descriptor with the return value of the internal Wine function named create_temp_file(). This function is responsible for creating a temporary file that will provide the backing for the named shared memory object. The temporary file may be created in one of two possible locations: either the directory that the Wine server uses for storing files related to the current server instance (such as the master socket file), or the configuration directory (which is either the Wine prefix or $HOME/.wine). The server directory is the preferred option and this is typically a location under /tmp (except under Android, but this is not relevant to cloud deployment scenarios), and this directory may or may not be backed by memory depending on how the container’s mount options are configured. (For example, containers run with Docker can use tmpfs mounts to back /tmp, whilst Kubernetes Pods can use an emptyDir volume with the emptyDir.medium field set to "Memory".) Once the temporary file has been created it is resized by calling an internal Wine function named grow_file(), which specifically writes to the file prior to calling the ftruncate() function. As discussed in the section Understanding memory states under Windows and Linux above, ftruncate() will not commit memory when operating on memory-backed files, but writing to the file will commit memory. As a result, in scenarios where the temporary file is created in a directory that is backed by memory, the pages for the named shared memory object that it represents will be committed regardless of whether the SEC_RESERVE or SEC_COMMIT flag was specified when calling CreateFileMapping(). To account for this, we added a check immediately prior to the call to grow_file() which runs regardless of the input flags. The check first determines whether the temporary file is backed by memory, and if this is the case then it verifies that sufficient memory is available to satisfy the commit request. There was no need for us to add logic to fault in the newly-allocated pages immediately after this, since the write operation will ensure that they are already accounted for in any subsequent checks that verify the availability of sufficient memory to satisfy commit requests.
-
-
MapViewOfFile() and its variations: this function was also more complex to patch, since it involves separate code paths for mapping views of executable images (DLL and EXE files) and views of other files or named shared memory objects. The Wine implementations of MapViewOfFile(), MapViewOfFileEx() and MapViewOfFileFromApp() all call NtMapViewOfSection(), whilst MapViewOfFile3() calls NtMapViewOfSectionEx(). (Wine 9.0 does not include an implementation of the MapViewOfFile2() function, and the various Numa-suffixed versions of this family of functions currently ignore any specified NUMA node preference and simply call their non-suffixed counterparts.) Both NtMapViewOfSection() and NtMapViewOfSectionEx() in turn call an internal Wine function named virtual_map_section() which implements the common mapping logic. This is where the code paths diverge:
-
The code path for mapping executable images calls an internal Wine function named load_builtin() to determine whether a Wine built-in DLL or EXE file should be loaded or if a native DLL or EXE file should be mapped instead. In the case where a native DLL or EXE file is loaded, the internal Wine function virtual_map_image() is called to perform the mapping. In the case where a Wine built-in DLL or EXE file is loaded, load_builtin() calls another internal Wine function named find_builtin_dll() to locate the appropriate file. Since Wine built-in DLL and EXE files may actually be implemented as native Linux .so files, this function can call either open_builtin_pe_file() or open_builtin_so_file() to load the file, depending on the type:
-
In the case of Windows PE files (DLL or EXE), open_builtin_pe_file() calls an internal Wine function named open_dll_file(). This in turn calls NtCreateSection() (whose memory commit operations are handled by the patches for CreateFileMapping() described above) and then an internal function named virtual_map_builtin_module(), which ultimately calls virtual_map_image() and converges with the code path for loading native DLL and EXE files. This function calls an internal Wine function named map_image_view(), which in turn calls map_view(). This function allocates memory by calling an internal Wine function named anon_mmap_alloc(), which can increase the commit charge if the allocated memory is writable. We patched this function to add a check that verifies that there is sufficient memory available to satisfy the commit request if the PROT_WRITE memory protection flag was specified. There was no need for us to add logic to fault in the newly-allocated pages immediately after this, since this function calls the native Linux mmap() function to perform the allocation and the pages will instead be faulted in by the mechanism discussed in the section Handling native Linux memory allocations below. In addition to directly allocating memory, map_view() also calls three internal Wine functions that establish different types of mappings, named map_fixed_area(), map_reserved_area() and map_free_area(), the last of which in turn calls another internal function named try_map_free_area(). All of these functions ultimately allocate memory by calling one or both of the internal Wine functions anon_mmap_tryfixed() and anon_mmap_fixed(). Much like the anon_mmap_alloc() function discussed above, these functions call the native Linux mmap() function and can increase the commit charge if the allocated memory is writable. We added the same check to these functions that we added to the non-fixed variant, and similarly the newly-allocated pages will be faulted in by the mechanism discussed in the section Handling native Linux memory allocations below. After all of these memory allocations have completed, the virtual_map_image() function performs an RPC call to the Wine server (see the section Misleading exit codes for abnormal process termination above for additional details on this component of Wine), which is processed by the map_image_view request handler. The request handler allocates memory by calling the internal Wine function mem_alloc(), which we previously patched as described in the dot point discussing CreateFileMapping() above.
-
In the case of Linux .so files, open_builtin_so_file() calls dlopen_dll(), which in turn calls internal Wine functions named map_so_dll() and virtual_create_builtin_view(). The map_so_dll() function allocates memory by calling the internal Wine function named anon_mmap_fixed(), which we previously patched as discussed above. The virtual_create_builtin_view() function calls an internal Wine function named create_view(), which in turn calls two functions that allocate memory, alloc_pages_vprot() and alloc_view(). Both of these functions call the internal Wine function named anon_mmap_alloc() to perform the memory allocation, which we previously patched as discussed above. After all of this is complete, the virtual_create_builtin_view() function performs an RPC call to the Wine server (see the section Misleading exit codes for abnormal process termination above for additional details on this component of Wine), which is processed by the map_builtin_view request handler. Much like the map_image_view request handler discussed above, this request handler allocates memory by calling the internal Wine function mem_alloc(), which we previously patched as described in the dot point discussing CreateFileMapping() above.
-
-
The remaining code path handles all other types of mappings. In a similar manner to VirtualAlloc(), we added a check immediately prior to entering the critical section that only runs if the view is being mapped with copy-on-write semantics. Just like the checks described above, it verifies that there is sufficient available memory to accommodate the copy-on-write mapping, and returns an error if there is not, matching the behaviour under Windows. We added a second check that runs once the function has succeeded, which faults in each of the newly-committed pages of memory if copy-on-write behaviour was requested. This pre-emptively triggers the copy-on-write operations and ensures the pages will be correctly accounted for in subsequent checks.
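The memory-backed check described for named shared memory objects above can be illustrated with a short Python sketch. This is not the code from the Wine patches (which are written in C and may detect memory-backed filesystems differently). The function names filesystem_type() and commit_allowed() are hypothetical, and the sketch assumes the mount table is supplied as text in the format used by /proc/mounts.

```python
import os

# Filesystem types whose file contents live in memory rather than on disk.
MEMORY_BACKED_FS = {"tmpfs", "ramfs"}


def filesystem_type(path, mounts_text):
    """Return the filesystem type of the deepest mount point containing path.

    mounts_text is assumed to follow the /proc/mounts layout:
    "<device> <mount point> <fs type> <options> ...", one mount per line.
    """
    path = os.path.normpath(path).rstrip("/") + "/"
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Later entries win ties, since they shadow earlier mounts.
        if path.startswith(prefix) and len(mount_point) >= best_len:
            best_len, best_type = len(mount_point), fs_type
    return best_type


def commit_allowed(path, size, available_bytes, mounts_text):
    """Hypothetical check: only memory-backed files count against the limit."""
    if filesystem_type(path, mounts_text) not in MEMORY_BACKED_FS:
        return True
    return size <= available_bytes
```

In a real container the mount table could be obtained by reading /proc/mounts, and available_bytes would come from the difference between the commit limit and the current commit charge.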
Handling native Linux memory allocations
With our Wine patches in place to handle memory commit operations originating from the Windows API, the next step was to implement a similar mechanism to account for allocations made by native Linux code. This is necessary to ensure that the commit charge value used in our overcommit prevention checks correctly includes any committed memory that is allocated by the native Linux components of the Wine codebase or by other native Linux applications running inside the container. The goal for native Linux code was the same as for Windows code running under Wine: to insert additional logic that intervenes whenever an application performs a memory commit operation, and immediately establishes a physical backing for the newly-committed pages. However, accomplishing this goal under Linux is far more complex and nuanced than it is under Wine, due to key differences in the underlying architecture of Windows and Linux.
Windows does not publicly expose the set of system calls that the kernel provides for servicing requests from user applications. Instead, all system calls are abstracted away behind Windows API functions, and applications are expected to call these functions to interact with services provided by the kernel. Historically, Microsoft has not provided any guarantees of system call stability between different releases of Windows, although this stance was subsequently reversed and efforts to stabilise the underlying ABI were undertaken to improve Windows container version compatibility as part of the release of Windows Server 2022. As a result of these factors, most Windows applications traditionally use the Windows API for all low-level system interactions and do not make any direct system calls. This makes it easy to identify memory management requests by intercepting or modifying the relevant Windows API calls, as discussed in the Emulating Windows overcommit prevention logic under Wine section above. (It is worth noting that this does not necessarily hold true for all Windows applications in the wild, and there has been a reported increase in Windows applications using direct system calls in recent years. Most notably, Wine itself has had to contend with the issue of how to efficiently emulate these direct system calls, which resulted in the addition of Linux kernel functionality to selectively redirect system calls to userspace handlers in commit 1446e1d, which is present in kernel version 5.11.0 and subsequent releases.)
Under Linux, the opposite is true. The Linux kernel publicly exposes a set of system calls for direct use by userspace system libraries and applications, and these system calls remain largely stable between kernel releases. Individual applications can make direct system calls if desired, although in practice most applications will typically use the abstractions provided by the programming language in which they are written. Applications written in C and C++ typically use the system call wrapper functions provided by the C standard library, although there are multiple implementations of this library available. The GNU C Library (glibc) is shipped as a standard component of many Linux distributions, but alternative implementations such as the lightweight musl libc are growing in popularity. (Most notably, musl is the libc implementation shipped with Alpine Linux, a distribution that has proven quite popular as a compact container base image.) Although glibc typically cannot be statically linked into applications without library features breaking, musl and many other libc implementations can be statically linked, making the system call wrapper functions largely inseparable from the application code that calls them. The runtimes for other programming languages may choose to wrap the relevant functions from the system’s libc, or may choose to implement their own syscall wrapper functions. As a result of the sheer variety of options available, there is no single set of function calls that can be used to identify memory management requests for all applications as there is under Windows. Instead, the only point at which all of these implementations converge is the system call interface itself, as execution transitions from userspace into kernel mode.
Unfortunately, the implementation of a mechanism that identifies all possible memory allocation requests by intercepting the relevant system calls in order to then fault in newly-committed pages was outside the scope of our investigation. Since we could not identify memory allocations from all possible Linux applications, we instead settled for intercepting calls to the memory allocation functions in the system libc, since this will address the overwhelming majority of traditional Linux applications running under a distribution that uses a dynamically-linked glibc. This compromise allowed us to disregard a great deal of complexity and instead focus on simply modifying the behaviour of a small set of function calls in a similar manner to Windows. Unlike our Wine patches, in which we directly added code to the implementation of the relevant Windows API functions, we did not wish to modify the libc implementation itself. Instead, our intent was to simply intercept calls to the relevant functions and then trigger our logic to fault in the new pages if we detected that the call had successfully committed memory. This approach has the benefit of providing compatibility with different versions of glibc as shipped by various Linux distributions, and also with distributions that ship dynamically-linked libc implementations that provide the same interface as glibc with respect to the functions being intercepted.
Fortunately, Linux makes it trivial to intercept system library function calls and replace them at runtime with alternative implementations, thanks to a dynamic linker feature known as symbol interposition (referred to as preemptable symbols in Chapter 4 of the System V Application Binary Interface specification). When a process loads multiple dynamic libraries that contain a symbol with the same name then the linker will resolve the version of the symbol from the first library, regardless of which library provided the symbol at the time the application was compiled. The dynamic linker provides a mechanism to inject arbitrary libraries into a process at the start of the search order, making it easy to leverage symbol interposition behaviour to replace system library functions. Additional libraries can be specified via either an environment variable called LD_ or a configuration file named /etc/, and if libraries are specified via both sources then the libraries from the environment variable will be injected before those from the configuration file. Once an alternative implementation of a system library function has been interposed in its place, the new implementation can pass the RTLD_NEXT flag to the dlsym() function to resolve the address of the “real” version of the function and call it if desired.
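As a rough illustration of the environment variable route, the Python sketch below builds an environment for a child process with one extra library placed at the front of the preload list. It assumes the variable in question is the dynamic linker's LD_PRELOAD (which accepts a list separated by colons or spaces). The helper name preload_env() and the library paths are hypothetical.

```python
import os


def preload_env(library_path, base_env=None):
    """Return a copy of the environment with library_path first in LD_PRELOAD.

    Assumption: injection uses the dynamic linker's LD_PRELOAD variable.
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Normalise both permitted separators, and avoid listing the library twice.
    others = [p for p in existing.replace(":", " ").split() if p != library_path]
    env["LD_PRELOAD"] = ":".join([library_path] + others)
    return env
```

A child process could then be started with something like subprocess.run(command, env=preload_env("/path/to/library.so")), so that symbols from the injected library take precedence over those from the system libraries.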
We created a small shared library (which we dubbed the “memory shim”) containing alternative implementations for libc library functions that allocate or commit memory, including malloc(), mmap() and mprotect(). (Note that we chose not to provide an alternative implementation for the brk() function, since this is rarely used outside of the implementation of malloc() and we already handle calls to that function directly.) Each of these implementations simply calls the real version of the function and then checks whether the call represented a memory commit operation. If it did, and the call succeeded, then our implementation faults in the newly-committed pages using the same logic as the code in our Wine patch, as described in the Emulating Windows overcommit prevention logic under Wine section above. This memory shim library can be injected into all of a container’s native Linux processes via the LD_ environment variable or the /etc/ configuration file to ensure that all memory committed by each process is physically backed and therefore accounted for when Wine queries the commit charge for the container.
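The "call the real function, then fault in the pages" pattern can be sketched in Python using the standard mmap module. This is only an illustration of the idea and not the memory shim itself, which is a native shared library. The names fault_in() and shim_mmap() are hypothetical, and the sketch treats any writable private anonymous mapping as a commit operation.

```python
import mmap


def fault_in(mapping):
    """Touch one byte in every page so the kernel backs it with physical memory."""
    touched = 0
    for offset in range(0, len(mapping), mmap.PAGESIZE):
        # Writing back the existing value preserves the contents while still
        # forcing a write fault on the page.
        mapping[offset] = mapping[offset]
        touched += 1
    return touched


def shim_mmap(length, prot):
    """Hypothetical wrapper: perform the real allocation, then fault in if writable."""
    mapping = mmap.mmap(
        -1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, prot=prot
    )
    touched = fault_in(mapping) if prot & mmap.PROT_WRITE else 0
    return mapping, touched
```

Read-only mappings are left untouched here because they cannot be written to and so do not increase the commit charge in the sense used above.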
Preventing native Linux code from triggering OOMs
The memory shim library described in the section Handling native Linux memory allocations above ensures that memory committed by native Linux processes is accounted for in the calculated commit charge in the same manner as memory committed by Windows processes running under Wine, but it does not enforce overcommit prevention for native Linux code or reject commit requests that would exceed the commit limit. This is a deliberate choice, since enforcing overcommit prevention may cause unusual behaviour when running native Linux applications that have been designed with the assumption that memory allocation requests will always succeed, as discussed in the section Memory overcommit under Windows and Linux above. However, the lack of commit limit enforcement for native Linux processes invites the risk that a native Linux allocation (even a small one, depending on the current commit charge) will exhaust the container’s available memory and trigger the Linux Out Of Memory (OOM) killer, undermining the entire purpose of the overcommit prevention logic added to Wine. To achieve the overarching goal of robust error reporting in OOM situations without introducing adverse side effects, this risk needs to be mitigated in a manner that ensures correct behaviour for both Windows applications and native Linux applications.
During our investigation, we endeavoured to devise solutions to this problem that would satisfy both the technical constraints of our container environment and our desire for simple configuration during both local testing and cloud deployment as previously mentioned during the discussion of the mlockall() function in the section Emulating Windows overcommit prevention logic under Wine above. However, the available technical mechanisms at our disposal ultimately necessitated an approach that requires external configuration parameters when running the container. More frustratingly, the required configuration parameters are not fully supported by Kubernetes (a proposal for memory QoS functionality implements something similar, but that proposal is currently stalled), so this approach will only work when running containers locally with Docker or orchestrating containers in the cloud with tools such as the Docker driver provided by HashiCorp Nomad. Nonetheless, it is the best solution that we could devise that functions correctly within the current technical constraints arising from memory reporting limitations inside Linux containers and the decision to track commit charge via the use of physically backed memory.
As previously stated, the Linux OOM killer will be triggered if a container exceeds its maximum memory limit and memory overcommit is enabled. However, cgroups actually support multiple types of memory limits, which are used for a variety of purposes. The following memory limits are supported by cgroups version 2:
-
memory.min: specifies the minimum amount of memory that is guaranteed to the cgroup at all times. Memory under this limit will never be reclaimed, and if insufficient memory is available to guarantee this limit then the OOM killer will be triggered.
-
memory.low: specifies a best-effort memory guarantee for the cgroup. Memory under this limit will only be reclaimed if there is no reclaimable memory available elsewhere, and if the cgroup’s memory usage exceeds this limit then it will become subject to reclaim pressure proportional to the overage.
-
memory.high: specifies a throttle limit for the cgroup. If the cgroup’s memory usage exceeds this limit then processes will be throttled and subject to heavy reclaim pressure, but the OOM killer will not be invoked.
-
memory.max: specifies the hard memory limit for the cgroup. If the cgroup’s memory usage exceeds this limit and memory overcommit is enabled then the OOM killer will be invoked. If memory overcommit is disabled then allocation requests that exceed this limit will be denied.
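For reference, these four limits are exposed as files in the cgroup's directory, each holding either a byte count or the literal string max when no limit is set. The Python sketch below reads them. It assumes the container's own cgroup is visible at /sys/fs/cgroup (typical for cgroups version 2 with a private cgroup namespace, but not guaranteed), and the function names are hypothetical.

```python
from pathlib import Path

LIMIT_FILES = ("memory.min", "memory.low", "memory.high", "memory.max")


def parse_limit(text):
    """Parse a cgroup v2 limit file: a byte count, or 'max' meaning unlimited."""
    value = text.strip()
    return None if value == "max" else int(value)


def read_limits(cgroup_dir="/sys/fs/cgroup"):
    """Return whichever limits exist under cgroup_dir (assumed path)."""
    limits = {}
    for name in LIMIT_FILES:
        path = Path(cgroup_dir) / name
        # The root cgroup does not expose these files, so skip missing ones.
        if path.is_file():
            limits[name] = parse_limit(path.read_text())
    return limits
```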
Our selected approach to guarding against OOMs triggered by native Linux memory allocations was to establish two separate limits for the container: a “soft limit” used as the commit limit by the overcommit prevention logic in Wine, and a “hard limit” that specifies when the OOM killer will be invoked. Maintaining a delta between these two limits can provide a safety net that absorbs momentary spikes from native Linux memory allocations and ensures the OOM killer is not invoked before Windows code has an opportunity to detect the overcommit due to a failed allocation request and formally report an OOM error. Docker supports configuring the memory.max hard limit by specifying the -m or --memory flags, and supports configuring the memory.low best-effort memory guarantee by specifying the --memory- flag, so this value is the best available option for use as the soft limit. The only code change we needed to make in order to facilitate this dual-limit approach was to modify the cgroup memory reporting code discussed in the section Querying cgroup memory usage above and add support for optionally reporting the soft limit value as the container’s total memory commit limit rather than the default hard limit value. With this in place, the use of this safety net is entirely a matter of configuration.
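The dual-limit selection can be summarised with a small Python sketch. It is a simplified model rather than the patched Wine code: the function names are hypothetical, limits are plain byte counts, and a soft limit of zero is assumed to mean that no soft limit has been configured.

```python
def commit_limit(hard_limit, soft_limit, use_soft_limit):
    """Pick the limit that overcommit prevention should enforce."""
    if use_soft_limit and soft_limit > 0:
        # Never report more than the hard limit, even if misconfigured.
        return min(soft_limit, hard_limit)
    return hard_limit


def commit_allowed(commit_charge, request, limit):
    """A commit request succeeds only if it fits under the chosen limit."""
    return commit_charge + request <= limit
```

With an 8 GiB hard limit and a 6 GiB soft limit, a Windows allocation that would push the commit charge past 6 GiB is rejected, leaving 2 GiB of headroom for native Linux allocations before the OOM killer becomes a risk.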
The optimal size for the delta between the hard limit and the soft limit needs to be determined on a workload-by-workload basis, through trial and error testing. The goal of this testing should be to determine the smallest buffer size that prevents the OOM killer from being triggered for a given workload, thus providing the desired behaviour whilst minimising wastage from unused memory that is inaccessible to Windows code. Once the optimal delta has been determined then it simply needs to be configured via the appropriate Docker flags or their Nomad equivalent, and Unreal Engine cloud workloads should correctly and reliably report OOM errors under Wine in the same manner that they would when deployed under Windows.
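Once a delta has been chosen, translating it into container settings is simple arithmetic. The Python sketch below derives the two values and formats them as Docker command-line arguments. It assumes that the soft limit is set with Docker's --memory-reservation flag and that both flags accept a plain byte count. The function name is hypothetical.

```python
def docker_memory_flags(hard_limit_bytes, delta_bytes):
    """Return docker run arguments for a hard limit and a derived soft limit.

    Assumption: --memory-reservation is the flag used for the soft limit.
    """
    if not 0 < delta_bytes < hard_limit_bytes:
        raise ValueError("delta must be positive and smaller than the hard limit")
    soft_limit_bytes = hard_limit_bytes - delta_bytes
    return [
        "--memory", str(hard_limit_bytes),
        "--memory-reservation", str(soft_limit_bytes),
    ]
```

For example, a hard limit of 8 GiB with a 1 GiB delta yields a 7 GiB soft limit, and the resulting list can be spliced into a docker run invocation ahead of the image name.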
Due to the complex nature of the overcommit prevention patches and their dependency on both external container configuration parameters and the out-of-tree cgroup memory reporting patches, they have not been submitted for inclusion in the upstream Wine project and are instead maintained out-of-tree. For details on how to obtain these patches, see the section Get started using the Wine patches today below.
Get started using the Wine patches today
All of the Wine patches discussed in the sections above, both those that have been submitted for inclusion in the upstream Wine project and those that are maintained out-of-tree, are available to try today! Epic Games has released a repository providing resources to help Unreal Engine licensees get up and running using Wine to deploy cloud workloads, including patches, Dockerfiles, and supporting documentation. These initial resources are designed to facilitate Unreal Engine asset cooking workloads, but any future investigations into additional types of workloads may lead to the development and release of additional resources.
The Wine resources repository is available now on GitHub at: https://github.com/EpicGamesExt/WineResources.