An open architecture for scalable Pixel Streaming

Posted by Adam Rehn and Luke Bermingham on 12 February 2021

Tags: Pixel Streaming, Reports, Unreal Engine, WebRTC

Overview

The Unreal Engine’s WebRTC-based Pixel Streaming system enables a wide variety of innovative new uses of the Unreal Engine. However, the reference implementation provided by Epic Games does not include a fully-featured backend server architecture for deploying Pixel Streaming at scale, leaving Engine licensees to develop their own backend systems if they wish to deploy large-scale solutions built on the technology. This report proposes a backend architecture for scalable Pixel Streaming that is open, extensible and unopinionated.

1. Background
2. Proposed architecture
3. Implementation roadmap
Acknowledgements
References

1. Background

In the sections that follow, we provide background information about both the technologies relevant to deploying Pixel Streaming applications at scale and the reference implementation of Pixel Streaming itself. This background information provides the context for our proposed architecture for scalable Pixel Streaming, which is presented in Section 2.

1.1. Streaming protocols

1.1.1. WebRTC

WebRTC is a complex suite of protocols and accompanying Application Programming Interfaces (APIs) that provide functionality for real-time multimedia streaming and communication over the web. WebRTC supports the transmission of both media streams (audio and/or video) and arbitrary binary data through a feature called “data channels” ¹. WebRTC was originally developed by Google. Its high-level browser APIs are defined in the WebRTC 1.0 W3C Proposed Recommendation ² and the underlying implementation details are defined in a series of IETF RFCs and Drafts, including an overview of the suite ³, audio processing requirements ⁴, data channel establishment and definition ¹ ⁵, security considerations ⁶, and underlying transport details ⁷ ⁸. WebRTC is specifically designed for real-time communication with minimal latency and includes signal processing features to smooth out the effects of unstable network connections, although the constraints discussed in Section 1.2.4 limit the extent to which it can scale to large numbers of concurrent users without incurring significant infrastructure costs. As such, WebRTC is best suited to use cases where true real-time communication is required but extremely high levels of scalability are not necessary.

WebRTC is built atop a stack of lower-level protocols and technologies, most of which use UDP as the underlying transport protocol for performance reasons. ⁹ The key protocols used for real-time communication are the Secure Real-time Transport Protocol (SRTP) ¹⁰ for media transmission and the Stream Control Transmission Protocol (SCTP) ¹¹ for sending messages over data channels. All data is encrypted using Datagram Transport Layer Security (DTLS) ¹² and this encryption is mandatory. For an excellent visual representation of this protocol stack, see Figure 18-3 from the Creative Commons licensed online textbook High Performance Browser Networking ⁹, which cannot be directly reproduced here due to the non-commercial clause of its license terms and the funded nature of this report.

WebRTC connections are fundamentally peer-to-peer in nature and require the use of robust connection establishment procedures to account for the variety of potential network topologies which may separate any given pair of peers. ² WebRTC uses the Interactive Connectivity Establishment (ICE) protocol ¹³ for establishing peer-to-peer connections, which in turn uses both Session Traversal Utilities for NAT (STUN) ¹⁴ and (if necessary) Traversal Using Relays around NAT (TURN) ¹⁵ to ensure connectivity can be maintained in the presence of network topologies that employ Network Address Translation (NAT) or similar technologies and would otherwise preclude direct peer-to-peer communication. At a high level, ICE functions by gathering and testing a list of connection “candidates” to determine the optimal connection mechanism for communication with a peer, automatically querying STUN and TURN servers as necessary. ¹³ However, before ICE candidate exchange can begin, peers must first notify one another of their intent to establish a WebRTC session and negotiate the parameters of this session. This initial “signalling” phase is of particular interest because the WebRTC standard defines the data format required for the exchange of session parameters and ICE candidates but not the transport mechanism or protocol. ³

Session negotiation between WebRTC peers requires the exchange of an “offer” and corresponding “answer” using the Session Description Protocol (SDP) ¹⁶, followed by ICE candidate exchange. The SDP offer and answer specify session parameters including the number and type of media and data channels, codec information for audio and video streams, bandwidth information and encryption keys. ¹⁶ The WebRTC standard expects that both SDP offers/answers and ICE candidates will be exchanged via application-specific protocols, and explicitly avoids imposing any requirements on these signalling mechanisms in the interest of interoperability with existing systems. ⁹ In most materials discussing WebRTC architecture, this role is filled by a software entity referred to as a “signalling server”, which implements a protocol defined by the application at hand. ¹⁷ This terminology is used by the reference implementation of Pixel Streaming discussed in Section 1.4 and so we remain consistent with this use when discussing our proposed scalable architecture for Pixel Streaming in Section 2.
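The ordering of the signalling phase described above can be illustrated with a minimal sketch. Because the WebRTC standard leaves the transport and message envelope to the application, everything here is an assumption made for illustration: the `SignallingRelay` class, the JSON envelope, the message type names and the peer identifiers are all hypothetical, and the payload strings are placeholders rather than real SDP or ICE candidates.

```python
import json
from dataclasses import dataclass, field

# Hypothetical envelope types. WebRTC defines the SDP and ICE candidate
# payloads, but not the message envelope or the transport that carries them.
RELAYED_TYPES = {"offer", "answer", "iceCandidate"}


@dataclass
class SignallingRelay:
    """Toy in-memory stand-in for a signalling server."""

    inboxes: dict = field(default_factory=dict)

    def register(self, peer_id):
        self.inboxes[peer_id] = []

    def send(self, sender, recipient, raw):
        message = json.loads(raw)
        if message.get("type") not in RELAYED_TYPES:
            raise ValueError(f"unsupported message type: {message.get('type')!r}")
        if recipient not in self.inboxes:
            raise KeyError(f"unknown peer: {recipient}")
        # The relay only routes the payload; it never interprets the SDP.
        self.inboxes[recipient].append({"from": sender, **message})


relay = SignallingRelay()
relay.register("peer_a")
relay.register("peer_b")

# Offer, then answer, then ICE candidates. Payload strings are placeholders.
relay.send("peer_a", "peer_b", json.dumps({"type": "offer", "sdp": "<offer SDP>"}))
relay.send("peer_b", "peer_a", json.dumps({"type": "answer", "sdp": "<answer SDP>"}))
relay.send("peer_b", "peer_a", json.dumps({"type": "iceCandidate", "candidate": "<candidate>"}))
relay.send("peer_a", "peer_b", json.dumps({"type": "iceCandidate", "candidate": "<candidate>"}))

received_by_a = [m["type"] for m in relay.inboxes["peer_a"]]
received_by_b = [m["type"] for m in relay.inboxes["peer_b"]]
```

A real signalling server would carry these messages over a network transport and would track peer connectivity, but the routing role is the same: the server forwards session descriptions and candidates without needing to understand them.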

For details on common architectures used to scale WebRTC sessions to multiple participants, see Section 1.2.

1.1.2. HTTP Live Streaming (HLS)

HTTP Live Streaming (HLS) is a video streaming protocol developed by Apple and defined in IETF RFC 8216 ¹⁸. Unlike WebRTC, HLS is built atop the standard HTTP protocol and transfers data in the same manner as serving regular files from a web server. This introduces latency not present in WebRTC due to the use of TCP rather than UDP as the underlying transport protocol, but provides wide compatibility with existing software and allows content providers to leverage traditional web infrastructure such as a Content Delivery Network (CDN) to provide high levels of scalability. A set of low-latency extensions to HLS ¹⁹ (known as Low-Latency HLS or LL-HLS) form a key part of the proposed draft of the 2nd Edition of the standard ²⁰ and aim to reduce latency levels as much as possible, although not to the levels possible with WebRTC. As such, HLS is most suited to use cases where extremely high levels of scalability are required but true real-time communication is not necessary.

At a high level, HLS works by encoding the source media stream at multiple quality levels (each known as a “variant stream” ¹⁸) and segmenting each variant stream into a series of discrete chunks (known as “segments”.) The individual segment files are accompanied by playlist files that provide index information about the available variant streams and segments. Once generated by the server, the segment and playlist files can then be distributed to clients via any standard web server. Clients retrieve and parse the playlist data in order to determine the URIs of the appropriate segment files, and then retrieve the relevant segments of the most appropriate variant stream for the current network conditions. The process of automatically serving the most appropriate quality level for the available network conditions is often referred to as adaptive bitrate streaming ²¹, although the term is not used in the text of the HLS standard.
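To make the relationship between playlists and variant streams concrete, the following sketch generates a simple multivariant playlist and mimics the client-side choice of variant. The `#EXTM3U` and `#EXT-X-STREAM-INF` tags and their `BANDWIDTH` and `RESOLUTION` attributes come from the HLS specification, but the bitrates, resolutions, URIs and the selection rule are assumptions chosen for illustration. Real players use more sophisticated heuristics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    bandwidth: int   # bits per second advertised for this variant stream
    resolution: str  # e.g. "1280x720"
    uri: str         # hypothetical location of the variant's media playlist


# Illustrative values only; not taken from any real deployment.
VARIANTS = [
    Variant(800_000, "640x360", "low/index.m3u8"),
    Variant(2_500_000, "1280x720", "mid/index.m3u8"),
    Variant(6_000_000, "1920x1080", "high/index.m3u8"),
]


def master_playlist(variants):
    """Build the playlist that indexes the available variant streams."""
    lines = ["#EXTM3U"]
    for v in variants:
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={v.bandwidth},RESOLUTION={v.resolution}")
        lines.append(v.uri)
    return "\n".join(lines) + "\n"


def choose_variant(variants, measured_bps):
    """Pick the highest-bandwidth variant that fits, else the lowest one."""
    fitting = [v for v in variants if v.bandwidth <= measured_bps]
    if fitting:
        return max(fitting, key=lambda v: v.bandwidth)
    return min(variants, key=lambda v: v.bandwidth)


print(master_playlist(VARIANTS))
print(choose_variant(VARIANTS, 3_000_000).uri)
```

Note that the selection logic runs on the client: the server only publishes static files, which is what allows HLS content to be distributed by any standard web server or CDN.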

The use of multiple pre-encoded variants of the source media stream is conceptually identical to the WebRTC technique known as simulcast, although the responsibility of selecting an appropriate stream falls to a Selective Forwarding Unit when using WebRTC (see Section 1.2.3), rather than to the client itself as is the case for HLS. It is worth noting that these approaches based on multiple independent stream variants are distinct from the layer-based scalable video coding techniques discussed in Section 1.3, which are expected to greatly improve the efficiency and flexibility of all streaming protocols that adopt them.

1.1.3. Dynamic Adaptive Streaming over HTTP (DASH)

Dynamic Adaptive Streaming over HTTP (DASH, also referred to as MPEG-DASH) is a streaming protocol standard developed by the Moving Picture Experts Group (MPEG) and defined in ISO/IEC 23009-1 ²². Conceptually, DASH functions in the same manner as HLS, and features similar extensions for low-latency operation. The primary difference between the two is that DASH is a ratified international standard, whereas HLS is controlled by Apple. DASH also permits the use of any arbitrary codec for audio and video streams ²², whereas HLS mandates the use of either H.264 or HEVC (H.265) for video streams and AAC for audio streams. ²³

From the perspective of the scalable Pixel Streaming usage scenario discussed in Section 2.3.3, we consider HLS and DASH to be interchangeable, and recommend the use of both protocols to ensure compatibility with the widest array of client web browsers and devices. This is because DASH is built upon the Media Source Extensions (MSE) functionality supported by all modern web browsers with the exception of Apple Safari on mobile platforms ²⁴, which instead feature native support for HLS ²³. Supporting both protocols simultaneously has been relatively straightforward since Apple introduced support for fragmented MP4 files in HLS ²⁵, since both the HLS and DASH playlists can refer to the same set of encoded segment files.

1.2. WebRTC architectures for multiple participants

As stated in Section 1.1.1, WebRTC connections are fundamentally peer-to-peer in nature and the underlying protocols are concerned purely with direct communication between peers. Applications wishing to support sessions involving multiple participants must compose these sessions from multiple WebRTC peer connections, often with the introduction of additional servers to act as peer endpoints. The sections that follow discuss common approaches to composing these sessions.

1.2.1. Peer-to-peer Mesh

The simplest option for composing a multi-party WebRTC session is to have every peer maintain a connection to every other peer. This mesh topology does not require any additional servers, but results in N(N−1)/2 connections, which is on the order of N² (where N is the number of participants.) This rapidly becomes inefficient with respect to both bandwidth use and computational requirements on each peer, since every peer must encode and transmit their outbound stream to all other peers whilst simultaneously receiving and decoding all inbound streams. ⁹ As such, this architecture is rarely used in real-world applications supporting multiple participants. ²⁶

1.2.2. Multipoint Conferencing Unit (MCU)

A Multipoint Conferencing Unit (MCU) is a server which aggregates media streams from participants and composes a single view from these streams which is then transmitted to all peers. ²⁷ In architectures that make use of an MCU, each peer connects directly to the MCU itself, resulting in only N connections (where N is the number of participants) but extremely high computational costs for the MCU itself due to the need to decode multiple streams and then re-encode the composited view for each participant. Note that this architecture is designed primarily for many-to-many videoconferencing applications and provides limited benefit in one-to-many use cases involving only a single media source per session, such as Pixel Streaming.
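The difference in connection counts between a mesh and a server-based topology can be checked with a few lines of arithmetic. The sketch below assumes a many-to-many session in which every mesh peer connects to every other peer exactly once. The function names are our own.

```python
def mesh_connections(n):
    """Every pair of peers shares one connection: N(N-1)/2."""
    return n * (n - 1) // 2


def mesh_outbound_streams_per_peer(n):
    """Each mesh peer sends its stream to every other peer."""
    return n - 1


def mcu_connections(n):
    """Each peer maintains a single connection to the MCU."""
    return n


for n in (2, 5, 10, 50):
    print(n, mesh_connections(n), mesh_outbound_streams_per_peer(n), mcu_connections(n))
```

For 50 participants the mesh requires 1225 connections compared to 50 for the MCU, although the MCU pays for this reduction with the decoding and re-encoding work described above.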

1.2.3. Selective Forwarding Unit (SFU)

A Selective Forwarding Unit (SFU) is a server which intelligently routes media streams between participants. ²⁷ Unlike an MCU, an SFU does not mix or aggregate media streams, but is purely concerned with receiving stream data from peers and delivering it to its intended recipient peers, optionally subsetting the data to adapt to the network conditions of each recipient. ²⁸ There are multiple strategies for bandwidth adaptation that can be implemented when using an SFU:

Naïve: prior to widespread support for the strategies discussed below, SFUs did not have any available mechanism for subsetting routed data streams and so bandwidth adaptation was purely the responsibility of the clients participating in the session. In this scenario, clients encode a single stream with a simple bitrate target, and typically opt to raise or lower the target bitrate based on the detected bandwidth constraints of the other participants. ²⁷
Simulcast: WebRTC simulcast is conceptually identical to the concept of “variant streams” in HLS, whereby the encoder of a source media stream generates multiple variants of the stream at different bitrates or resolutions and sends these variants to the SFU. The SFU can then select which variant of the stream to transmit to each recipient based on their current network conditions. ²⁸
Scalable coding: in a scenario where peers are generating and consuming media streams encoded with one of the scalable video coding techniques discussed in Section 1.3, the SFU can unpack the encoded layers and deliver to each peer only the subset of layers which will fit within its current bandwidth constraints. ²⁷
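As a rough illustration of the simulcast strategy, the sketch below shows an SFU choosing a different variant of the same source stream for each recipient. The variant labels, bitrates, recipient names and the selection rule are all assumptions made for this example. A production SFU would base its decisions on continuously updated bandwidth estimates.

```python
# Hypothetical simulcast variants sent by a single source, keyed by label,
# with the bitrate (bits per second) that each variant requires.
VARIANT_BITRATES = {"low": 500_000, "mid": 1_500_000, "high": 4_000_000}


def select_variant(available_bps):
    """Return the best variant that fits, falling back to the smallest."""
    fitting = {k: v for k, v in VARIANT_BITRATES.items() if v <= available_bps}
    if not fitting:
        return min(VARIANT_BITRATES, key=VARIANT_BITRATES.get)
    return max(fitting, key=fitting.get)


def forwarding_table(recipient_bandwidth):
    """Map each recipient to the variant the SFU would forward to it."""
    return {peer: select_variant(bps) for peer, bps in recipient_bandwidth.items()}


table = forwarding_table({"desktop": 10_000_000, "laptop": 2_000_000, "phone": 300_000})
print(table)
```

The key point is that the selection happens on the SFU rather than on the sender or the recipient, so the source only needs to encode a fixed set of variants regardless of how many peers are watching.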

The information in the paragraph below does not constitute legal advice.

The authors of this report are not lawyers and the information presented below is provided as general guidance only. For questions regarding patent licensing requirements for your specific circumstances we strongly recommend that you seek professional legal advice.

US company Vidyo, Inc. holds a variety of patents related to the use of the SFU architecture, particularly in combination with scalable video coding techniques. ²⁹ ³⁰ Vidyo has submitted Intellectual Property Rights (IPR) disclosures to the IETF for drafts and RFCs that make use of technologies covered by these patents, most of which concern RTP payload formats that facilitate interoperability between SFUs and scalable video coding techniques for different codecs. The IPR disclosure for each draft or RFC grants a license to implementors of the standard described by that document, which permits the use of the relevant Vidyo patents so long as the use is necessary for implementation of the standard and implementors agree not to assert their own patent claims against Vidyo or its affiliates. Vidyo is also a current Promoter Member of the Alliance for Open Media ³¹, which means the Alliance’s patent policy applies to any Vidyo patents referenced in materials to which the company has contributed. ³² Specifically, the RTP payload format specification for AV1 was drafted by the AV1 Real-Time Communications Subgroup ³³, which is co-chaired by Vidyo’s chief scientist Dr Alex Eleftheriadis. ³⁴

1.2.4. Scalability constraints

Each of the WebRTC architectures discussed in the previous sections features inherent limitations with respect to scaling to extremely large numbers of concurrent users:

Mesh: this architecture becomes intractable at scale from both a computational and bandwidth perspective, because these resource requirements scale on the order of N² (where N is the number of participants) for many-to-many use cases. ⁹ Although resource use scales only linearly for one-to-many use cases due to the absence of additional media streams, the need to encode the source media stream for every other participant will inevitably result in resource contention for the peer performing the encoding. This is particularly problematic if hardware accelerated video encoding is used (as is the case for Pixel Streaming) since hardware video encoders often enforce a limit on the maximum number of concurrent encoding sessions. ³⁵ This can be mitigated by only encoding the stream once and transmitting the same encoded data to all participants, but targeting a single bitrate for all participants introduces additional complexities with respect to balancing the needs of peers with disparate network connections and device constraints. ²⁶
MCU: this architecture features an inherently high computational cost for each server instance due to heavy decoding and encoding requirements, and so becomes extremely expensive to operate at scale. ²⁷ For one-to-many use cases, this architecture simply shifts the burden of encoding multiple streams from the peer generating the source media to the MCU itself. This may be advantageous if the MCU has access to greater (or cheaper) computational resources than the peer, but this benefit must be weighed against the additional latency introduced by encoding the stream twice (once by the peer and once by the MCU.) ²⁸ The financial complexity of operating this architecture at scale is exacerbated when using hardware accelerated video encoding, since GPU resources are typically more expensive than CPU resources but also more performant with respect to video encoding, meaning cost-to-performance tradeoffs must be carefully balanced. ³⁶
SFU: when combined with either simulcast or scalable video encoding techniques, this architecture scales in a performant and cost-effective manner. ²⁷ For many-to-many use cases, the limiting factor when scaling to large numbers of participants will typically be the bandwidth and computational resources required for each peer to receive and decode the media streams from all other peers. ²⁸ For one-to-many use cases, the limiting factor will typically be the bandwidth available to the SFU for transmitting streams to recipient peers, but this limitation can be mitigated by introducing additional SFUs to act as “relay servers” and distribute the system’s total bandwidth requirements across multiple servers. ³⁷ However, this approach still requires the deployment of an increasing number of relay servers connected by high-bandwidth networks as the number of participants grows, ultimately resembling the topology of a traditional CDN at extreme scales. For use cases requiring millions of globally distributed participants observing the same stream, we suspect that the use of either a hosted WebRTC platform or HLS/DASH leveraging an existing CDN will likely be the most cost-effective approach for operators who do not already maintain their own CDN-like infrastructure.
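The bandwidth limitation for one-to-many SFU deployments can be estimated with simple arithmetic. The sketch below assumes a single tier of relay servers, a uniform bitrate for every viewer and a fixed egress capacity per relay. All of the figures are hypothetical and the model ignores the traffic between the source and the relays.

```python
def total_egress_bps(viewers, stream_bps):
    """Aggregate outbound bandwidth needed to serve every viewer."""
    return viewers * stream_bps


def relays_needed(viewers, stream_bps, relay_capacity_bps):
    """Minimum number of relays, each limited to a fixed egress capacity."""
    egress = total_egress_bps(viewers, stream_bps)
    return -(-egress // relay_capacity_bps)  # ceiling division on integers


# Hypothetical figures: a 5 Mbps stream and relays with 10 Gbps of egress.
for viewers in (1_000, 10_000, 1_000_000):
    print(viewers, relays_needed(viewers, 5_000_000, 10_000_000_000))
```

Under these assumptions the relay count grows linearly with the audience, which is why very large audiences push the topology towards something resembling a CDN.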

1.3. Scalable video coding techniques

In contrast to approaches such as HLS/DASH and WebRTC simulcast which make use of multiple independent variants of a stream to facilitate quality adaptation, scalable video coding techniques allow media sources to encode and transmit a single stream which encapsulates multiple levels of quality. This is typically achieved by using a layered encoding approach, whereby each layer represents an incremental improvement to quality and depends on all of the quality layers beneath it. ³⁸ As a result, parts of the bitstream can be selectively discarded during transmission to adapt to bandwidth constraints and the recipient can still correctly decode the stream at a reduced level of quality. ³⁹

There are three common types of scalability supported by scalable video coding techniques:

Temporal scalability: a temporally scalable bitstream represents the media at different framerates, allowing frames from higher framerate layers to be discarded and the stream decoded at a lower framerate. ⁴⁰
Spatial scalability: a spatially scalable bitstream represents the media at different resolutions, allowing the layers for higher resolutions to be discarded and the stream decoded at a lower resolution. ³⁸
SNR scalability: an SNR scalable bitstream represents the media at the same resolution but with different levels of fidelity or signal-to-noise ratio (SNR), allowing the layers for higher fidelity levels to be discarded and the stream decoded at a lower level of fidelity. ³⁹

In addition to these three common scalability types, some video codecs support hybrid schemes which combine two or more types of scalability. ⁴⁰
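The way in which layers can be discarded without breaking decodability can be sketched as a simple filter. The example below assumes that each packet is tagged with a spatial layer index and a temporal layer index, and that every layer depends only on the layers beneath it. The `Packet` type and the layer assignment are hypothetical and do not correspond to the bitstream format of any particular codec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    frame: int
    spatial_id: int   # 0 is the base (lowest) resolution layer
    temporal_id: int  # 0 is the base (lowest) framerate layer


def forward_subset(packets, max_spatial, max_temporal):
    """Keep only packets at or below the target layers.

    Because each layer depends only on the layers beneath it, the
    remaining packets still form a decodable, lower-quality stream.
    """
    return [
        p for p in packets
        if p.spatial_id <= max_spatial and p.temporal_id <= max_temporal
    ]


# Four frames, two spatial layers; odd frames belong to temporal layer 1.
stream = [Packet(f, s, f % 2) for f in range(4) for s in range(2)]

full = forward_subset(stream, max_spatial=1, max_temporal=1)
half_framerate = forward_subset(stream, max_spatial=1, max_temporal=0)
base_only = forward_subset(stream, max_spatial=0, max_temporal=0)
```

Here `half_framerate` demonstrates temporal scalability alone, whilst `base_only` combines temporal and spatial scalability in the manner of the hybrid schemes mentioned above.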

1.3.1. H.264/AVC Annex G: Scalable Video Coding (SVC)

Support for scalable video coding is formally defined in Annex G of the official H.264 standard. ⁴¹ Temporal, spatial and SNR scalability are all supported, as well as any combination of the three. ³⁹ Scalability is facilitated by extensions to the H.264 bitstream format and a set of new scalable profiles, called “Scalable Baseline”, “Scalable Constrained Baseline”, “Scalable High”, “Scalable Constrained High” and “Scalable High Intra”. ⁴¹

It is worth noting that Annex G represents the third update to the H.264 standard and the changes introduced to the bitstream format are not backwards compatible with older decoders. As a result, it is often helpful to differentiate between the bitstream format used by the profiles included in the original standard and that used by the new scalable profiles. The terms “H.264/AVC” and “H.264/SVC” are typically used when referring to the original H.264 bitstream format and that of its scalable profiles, respectively. ⁴²

1.3.2. H.265/HEVC Annex H: Scalable High Efficiency Video Coding (SHVC)

Although the original version of the H.265/HEVC standard already included support for temporal scalability, the second version introduced a set of scalability extensions that added support for both spatial scalability and SNR scalability, along with additional scalability modes for bit depth and colour gamut. ⁴³ The common functionality used by the extensions introduced in the second version of the official H.265 standard is formally defined in Annex F and the scalability extensions themselves are formally defined in Annex H. A number of new scalable profiles are defined, including “Scalable Main”, “Scalable Main 10”, “Scalable Monochrome”, “Scalable Monochrome 12” and “Scalable Monochrome 16”. ⁴⁴

1.3.3. VP8 temporal scalability and VP9/AV1 spatio-temporal scalability

The VP8 codec supports temporal scalability through the use of features inherited from its direct predecessor, VP7. The key feature which enables temporal scalability is the inclusion of multiple alternate reference frame buffers, known as “last frame”, “golden frame” and “alternate reference frame” (the first two of which were present in VP7 and the last of which was added in VP8.) ⁴⁵ ⁴⁶ Any regular video frame may choose which of these reference buffers to update or use as a point of reference instead of always simply referring to the frame immediately preceding it. Intelligent use of this flexibility in reference buffer update and selection behaviour allows for the creation of frames with a hierarchical dependency structure, effectively forming temporal layers. ⁴⁷
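One way to picture a hierarchical dependency structure is to assign each frame to a temporal layer using a repeating dyadic pattern, in which each additional layer doubles the framerate. The sketch below is an illustration of that general idea only. It is an assumption on our part rather than a description of the pattern used by any specific VP8 encoder, and it does not model which reference buffers each frame updates or references.

```python
def temporal_layer(frame_index, num_layers):
    """Assign a frame to a temporal layer using a dyadic hierarchy.

    Layer 0 holds every 2^(num_layers-1)-th frame, and each higher
    layer holds the frames that fall halfway between those beneath it.
    """
    period = 1 << (num_layers - 1)
    pos = frame_index % period
    if pos == 0:
        return 0
    trailing_zeros = (pos & -pos).bit_length() - 1
    return num_layers - 1 - trailing_zeros


def frames_up_to_layer(frame_count, num_layers, max_layer):
    """List the frames that remain when higher layers are discarded."""
    return [
        i for i in range(frame_count)
        if temporal_layer(i, num_layers) <= max_layer
    ]


pattern = [temporal_layer(i, 3) for i in range(8)]
print(pattern)
print(frames_up_to_layer(8, 3, 1))
print(frames_up_to_layer(8, 3, 0))
```

With three layers, discarding the top layer halves the framerate and discarding the top two layers quarters it, provided that frames in lower layers never reference frames in higher layers.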

VP9 builds directly upon the scalability features of VP8 and introduces additional flexibility in how frames are encoded. Instead of only three reference frame buffers, the codec supports a total of eight reference frame buffers and any given frame may update and/or reference up to three of these buffers. ⁴⁸ Frames are also permitted to be of a different resolution to the buffers that they reference (so long as the ratio between the two resolutions falls within a predefined range) and four different entropy contexts are available to be updated or referenced by any given frame, in the same manner as reference frame buffers. This increased flexibility facilitates temporal, spatial and SNR scalability, along with any combination of the three. ⁴⁸ It is worth noting that the VP9 bitstream does not feature an explicit structure for representing SNR layers. Instead, SNR layers are simply represented by spatial layers with no change in resolution. ⁴⁹

AV1 expands even further upon the scalability features of VP9 and introduces explicit metadata fields to describe the scalability structures present in a given bitstream. Sections 6.7.5 and 6.7.6 of the official AV1 bitstream specification describe these metadata structures, which include the number of spatial layers, the ratio of the resolutions represented by the spatial layers, the number of temporal layers and whether layers depend upon one another. ⁵⁰ There are no fields to explicitly describe SNR layers, since these are represented by spatial layers without any resolution change in the same manner as VP9. ³³

1.3.4. WebRTC support

The official WebRTC 1.0 standard does not include support for scalable video coding techniques, but an extension to add support has been proposed for inclusion in the next version of the WebRTC standard, currently referred to as “WebRTC Next Version (WebRTC-NV)”. ⁵¹ ⁵² The proposed extension relies on support for scalability metadata in the underlying SRTP protocol and the types of scalability supported vary based on the RTP payload format for any given codec:

The RTP payload format for H.264/SVC supports temporal, spatial and SNR scalability. ⁴²
The RTP payload format for H.265/HEVC supports temporal scalability only. ⁵³
The RTP payload format for VP8 supports temporal scalability only. ⁵⁴
The RTP payload format for VP9 supports both temporal and spatial scalability (including SNR scalability represented by spatial layers.) ⁴⁹
The RTP payload format for AV1 supports both temporal and spatial scalability (including SNR scalability represented by spatial layers.) ³³

1.3.5. GPU video encoder support

Although real-time video encoding can be performed by software encoders and dedicated hardware such as field-programmable gate arrays (FPGAs) ⁵⁵, GPU accelerated video encoding is of particular interest when considering use cases such as Pixel Streaming which require low latency compression of input data sourced from the GPU’s own framebuffer. ⁵⁶ Codec support varies between GPU vendors, as does support for scalable video coding techniques in combination with any given codec:

The NVIDIA NVENC encoder supports temporal scalability for H.264, but does not appear to support scalable encoding for H.265/HEVC. ⁵⁷ NVENC does not currently support hardware accelerated encoding of VP8, VP9 or AV1. ³⁵
The AMD Advanced Media Framework (AMF) encoder supports temporal scalability for H.264 ⁵⁸ and for H.265/HEVC ⁵⁹. AMF does not currently support hardware accelerated encoding of VP8, VP9 or AV1. ⁶⁰
The Intel Quick Sync encoder supports temporal scalability for H.264 ⁶¹, H.265/HEVC ⁶² and for VP9 ⁶³. Hardware accelerated encoding of VP9 requires an Ice Lake or newer Intel CPU. ⁶⁴

1.4. Pixel Streaming reference implementation

1.4.1. Design and intent

Pixel Streaming is a real-time streaming solution that Epic Games introduced in Unreal Engine version 4.23, which allows the audio and video output of an Unreal Engine application to be streamed to a web browser via WebRTC and for control signals to be transmitted back to the application over WebRTC data channels. ⁵⁶ The reference implementation of Pixel Streaming that is distributed with the Unreal Engine source code consists of a plugin which implements all of the necessary functionality inside the Engine itself, accompanied by a set of server applications to facilitate deploying Pixel Streaming in a cloud environment. ⁶⁵ The Pixel Streaming plugin is responsible for capturing the Unreal Engine’s audio and video output and encoding an H.264 video stream that can be transmitted to clients connected via WebRTC. In the original Unreal Engine 4.23 implementation, the Pixel Streaming plugin was paired with a proxy application that was responsible for implementing WebRTC support and communicating with both clients and the bundled server applications. In Unreal Engine 4.24 this proxy application was merged into the Pixel Streaming plugin itself, which now communicates directly with WebRTC peers and with its accompanying server applications. ⁶⁶

The servers accompanying the Pixel Streaming plugin are implemented as Node.js JavaScript applications and include the following:

“Cirrus” signalling and web server: this server acts as both the WebRTC signalling server (as defined in Section 1.1.1) for the Pixel Streaming system and optionally also as a traditional web server which serves the static files for the browser-based client. The signalling server portion of the application communicates with the Pixel Streaming plugin and the client browser via the WebSocket protocol ⁶⁷ and coordinates the lifecycle of WebRTC sessions, using the persistent WebSocket connections to monitor the connectivity of both the Pixel Streaming plugin and the client browser. ⁶⁸ One instance of the Cirrus server is expected to be run for each instance of a Pixel Streaming application ⁶⁹ and each server instance is designed to facilitate the establishment of direct connections between its corresponding Pixel Streaming application instance and client web browsers. The Cirrus server is hardcoded to establish these direct connections and does not contain any functionality to facilitate the use of an intermediate server such as a Multipoint Conferencing Unit (MCU) or a Selective Forwarding Unit (SFU), limiting the reference implementation of Pixel Streaming to a pure peer-to-peer WebRTC architecture. ⁶⁸
Matchmaking server: this server provides optional load balancing functionality for the Pixel Streaming system, monitoring the availability of a pool of registered Cirrus servers and directing new clients to the first available server when they attempt to establish a connection. The matchmaking server communicates with Cirrus server instances via TCP connections and requires that each Cirrus instance both register its presence and also provide status updates regarding its availability to receive client connections. ⁶⁸ The matchmaking server will not direct more than a single client to a given Cirrus server instance at any time and will prompt clients to retry if there are no Cirrus instances currently available and idle. The fact that the matchmaking server is hardcoded to expect an existing pool of Cirrus servers (and thus Pixel Streaming application instances) instead of creating and destroying these on demand is one of the primary factors limiting the ability to dynamically scale deployments of the reference implementation of Pixel Streaming. ⁶⁹
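The matchmaking behaviour described above can be summarised in a short sketch. This is not the code of the reference matchmaking server (which is a Node.js application) and the class, method names and addresses below are hypothetical. The sketch also simplifies status tracking by marking a server as busy at the moment a client is directed to it, rather than waiting for a status update from the Cirrus instance.

```python
class Matchmaker:
    """Toy model of a fixed pool of signalling servers, one client each."""

    def __init__(self):
        # Maps server address to True when idle. Dicts preserve insertion
        # order, so iteration yields servers in registration order.
        self._idle = {}

    def register(self, address):
        """Called when a signalling server announces its presence."""
        self._idle[address] = True

    def update_status(self, address, idle):
        """Called when a registered server reports its availability."""
        if address not in self._idle:
            raise KeyError(f"unregistered server: {address}")
        self._idle[address] = idle

    def assign(self):
        """Return the first idle server, or None to signal 'retry later'."""
        for address, idle in self._idle.items():
            if idle:
                self._idle[address] = False
                return address
        return None


mm = Matchmaker()
mm.register("10.0.0.1:80")
mm.register("10.0.0.2:80")

first = mm.assign()
second = mm.assign()
third = mm.assign()              # pool exhausted, client must retry
mm.update_status("10.0.0.1:80", True)
fourth = mm.assign()
```

Because the pool only changes when servers register themselves, the number of concurrent clients is capped by the number of instances that were started ahead of time, which is the scaling limitation noted above.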

It is important to note that the limitations discussed in this section are not design flaws in the Pixel Streaming system itself but rather reflect the intended purpose of the reference implementation code. The reference implementation is explicitly designed to provide a starting point upon which developers will build their own solutions, and it is expected that developers will modify or replace the logic in both the Cirrus server and the matchmaking server to suit their specific requirements. ⁶⁹

1.4.2. Platform support

At the time of writing, the reference implementation of the Pixel Streaming plugin supports only Windows 8 and Windows 10. ⁷⁰ TensorWorks maintains a fork of the reference implementation called “Pixel Streaming for Linux” which adds support for Linux ⁷¹ and is currently in the process of attempting to merge the relevant changes into the upstream reference implementation code. Once this merge is complete, Pixel Streaming for Linux will be deprecated in favour of the upstream reference implementation.

1.4.3. Handling of connected peers

Neither the Pixel Streaming plugin itself nor the Cirrus server makes any distinction between WebRTC peers that connect to the Pixel Streaming application. As a result, all connected peers are able to control the Pixel Streaming application, potentially producing clashing inputs. This lack of distinction is hardcoded in the reference implementation, and if different peers require different client interfaces or different levels of control then Epic Games recommends serving customised client code to different classes of peers so that these differences can be enforced in the browser. ⁶⁹

1.4.4. Adaptive bitrate control behaviour

The Pixel Streaming plugin is designed to automatically increase or decrease both the encoded video framerate and bitrate in response to changing network conditions that limit the receiving bandwidth of connected WebRTC peers. In much the same way that the Pixel Streaming plugin does not differentiate between connected peers for the purposes of distinguishing input (as discussed above in Section 1.4.3), the adaptive bitrate control code does not differentiate between peers and will directly update encoder parameters based on variations in the network conditions of any connected peer. This behaviour is compounded by the fact that the Pixel Streaming plugin uses only a single hardware accelerated video encoding session and transmits only a single encoded video stream to all connected peers. ⁶⁸ As such, the presence of even a single WebRTC peer with poor network conditions will result in poor video quality for all connected peers.
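The consequence of sharing one encoder between all peers can be demonstrated with a small model. The sketch below is not derived from the plugin's source code. The class names, peer names and bitrates are hypothetical, and the update rule (the most recent bandwidth estimate from any peer is applied directly to the encoder) is an assumption that captures the lack of per-peer differentiation described above.

```python
class SharedEncoder:
    """Stands in for the single hardware accelerated encoding session."""

    def __init__(self, initial_bps):
        self.target_bps = initial_bps


class Streamer:
    """Toy model of a streamer that serves every peer from one encoder."""

    def __init__(self, initial_bps):
        self.encoder = SharedEncoder(initial_bps)
        self.peers = set()

    def connect(self, peer):
        self.peers.add(peer)

    def on_bandwidth_estimate(self, peer, estimated_bps):
        # No per-peer state is kept: an estimate from any connected peer
        # updates the parameters of the one shared encoder.
        if peer not in self.peers:
            raise KeyError(f"unknown peer: {peer}")
        self.encoder.target_bps = estimated_bps

    def bitrate_sent_to(self, peer):
        # Every peer receives the same encoded stream.
        return self.encoder.target_bps


streamer = Streamer(initial_bps=8_000_000)
streamer.connect("fibre_user")
streamer.connect("mobile_user")

streamer.on_bandwidth_estimate("mobile_user", 400_000)
print(streamer.bitrate_sent_to("fibre_user"))
```

In this model the peer on a fast connection receives the bitrate dictated by the peer on a slow connection, which mirrors the degradation described in the paragraph above.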

2. Proposed architecture

In the sections that follow, we present our proposed backend architecture for deploying Pixel Streaming applications at scale. The design of our proposed architecture is informed by both the background information discussed in Section 1 and our key goals for such an architecture, which are discussed in the section below.

2.1. Goals

The purpose of our proposed backend architecture is to empower Unreal Engine developers to deploy solutions built on Pixel Streaming at unprecedented scale and to make the technology more accessible to the growing Unreal Engine user community across all industry verticals. In order to achieve this purpose, we believe it is critical that our architecture address the following fundamental goals:

Open: the architecture must provide open access to use and contribution by all Unreal Engine developers. The architecture’s design and component APIs must be publicly documented, and the implementation of the architecture must be free and open source (FOSS) software that does not depend upon any components which are proprietary or require the payment of patent royalties.
Extensible and modular: the architecture must allow developers to easily swap out, add or extend components without requiring modification of the source code for existing components. Developers must be able to fully control the behaviour of the backend once deployed and must be free to build other frameworks and solutions on top of it.
Unopinionated: the architecture must not force implementation or network topology decisions upon developers which limit their ability to control or extend the environment in which the backend is deployed. Aside from technology choices necessary to provide core functionality and maintain compatibility with the reference implementation of Pixel Streaming (i.e. the network protocols used to implement the public APIs of core components) the architecture must not force developers to adopt specific technologies or deployment approaches.

2.2. Definitions

As discussed in Section 1.4.3, the reference implementation of Pixel Streaming does not distinguish between connected WebRTC peers in any meaningful manner. However, many real-world uses of Pixel Streaming require a clear delineation of client roles for different peers. Explicit consideration of these roles is fundamental to identifying and discussing the differing technical requirements of various use cases, and so we define the following terms to describe the roles that connected WebRTC peers commonly adopt in the context of a deployed Pixel Streaming application:

Controlling Peer: the single WebRTC peer that is currently in full control of a given instance of a Pixel Streaming application. This peer is an active participant and can send keyboard/mouse input and control signals to the application. In some use cases the role of Controlling Peer may migrate between different WebRTC peers, but there is only ever one Controlling Peer at any given time.
Observing Peer: a WebRTC peer that views the video output of a given instance of a Pixel Streaming application, but is not the Controlling Peer. This peer is a passive spectator and cannot send any input to the application without the authorisation of the Controlling Peer, although in some use cases Observing Peers may be able to communicate with one another and with the Controlling Peer. There may be any number of Observing Peers connected to a single instance of a Pixel Streaming application at a given time, or even none at all.

Note that these two roles are conceptual constructs which we define purely for the purpose of classifying common usage scenarios for Pixel Streaming applications. The extensible nature of our proposed architecture allows developers to implement security models which define any number of arbitrary roles for representing the specific permissions which are granted to individual clients.
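
The two roles can be expressed as a small piece of bookkeeping per application instance. The sketch below is purely illustrative: the `PeerRoles` class and its method names are hypothetical, and it assumes that the first peer to join becomes the Controlling Peer and that only the current Controlling Peer may hand control to another peer. A real security model could make different choices.

```python
from enum import Enum
from typing import Optional


class Role(Enum):
    CONTROLLING = "controlling"
    OBSERVING = "observing"


class PeerRoles:
    """Tracks the roles of the peers connected to one application instance."""

    def __init__(self) -> None:
        self._controller: Optional[str] = None
        self._observers: set[str] = set()

    def join(self, peer_id: str) -> Role:
        # Assumption: the first peer to join becomes the Controlling Peer.
        if self._controller is None or self._controller == peer_id:
            self._controller = peer_id
            return Role.CONTROLLING
        self._observers.add(peer_id)
        return Role.OBSERVING

    def may_send_input(self, peer_id: str) -> bool:
        # There is only ever one Controlling Peer at any given time.
        return peer_id == self._controller

    def transfer_control(self, requested_by: str, new_controller: str) -> None:
        # Only the current Controlling Peer may hand control to an observer.
        if requested_by != self._controller:
            raise PermissionError("only the Controlling Peer can transfer control")
        if new_controller not in self._observers:
            raise ValueError("control can only move to a connected Observing Peer")
        self._observers.remove(new_controller)
        self._observers.add(requested_by)
        self._controller = new_controller


roles = PeerRoles()
roles.join("alice")  # becomes the Controlling Peer
roles.join("bob")    # becomes an Observing Peer
roles.transfer_control(requested_by="alice", new_controller="bob")
print(roles.may_send_input("alice"), roles.may_send_input("bob"))  # False True
```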

2.3. Usage scenarios

There is an extremely wide variety of potential use cases for applications built upon the Pixel Streaming system, each with differing technical requirements. As discussed in the previous section, we believe the key factor which drives these differing technical requirements is the manner in which WebRTC peers are permitted to interact with the Pixel Streaming application, for which we defined the two roles Controlling Peer and Observing Peer. Based on these roles, we have grouped potential use cases into three common scenarios, distinguished by the expected composition of connected peers. The sections below examine each of these scenarios in turn, discussing the technical requirements of each scenario and providing examples of potential use cases to which the scenario applies.

2.3.1. Scenario 1: Controlling Peer only

The most common scenario for Pixel Streaming use cases is one in which there exists a one-to-one relationship between connected clients and instances of the Pixel Streaming application. In this scenario, every peer is a Controlling Peer, and there is no requirement for peers to observe the video output from instances controlled by other peers. Examples of use cases to which this scenario applies include product configurators, real estate walkthroughs and solo interactive experiences.

This scenario features the simplest technical requirements, and can be serviced by simply ensuring there exists a Pixel Streaming application instance for each new client as it connects, and then establishing a direct peer-to-peer WebRTC connection between the client and the Pixel Streaming application instance, as depicted in the diagram below.

Usage Scenario 1: each instance of the Pixel Streaming application services only a single Controlling Peer. Note that this diagram depicts the scenario from the perspective of only a single instance of the Pixel Streaming application for the sake of clarity.
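
As a minimal sketch of the one-to-one relationship in this scenario, the allocator below ensures that each connecting client is paired with its own application instance. The `OneToOneAllocator` class and the `launch_instance` callable are hypothetical names used for illustration; how an instance is actually started (a container, a virtual machine, a child process) is left to the caller.

```python
import itertools
from typing import Callable


class OneToOneAllocator:
    """Pairs every connected client with a dedicated application instance."""

    def __init__(self, launch_instance: Callable[[], str]) -> None:
        # launch_instance is a hypothetical hook that starts an instance
        # and returns an identifier for it.
        self._launch_instance = launch_instance
        self._instance_for_client: dict[str, str] = {}

    def connect(self, client_id: str) -> str:
        # Start a new instance the first time a client connects.
        if client_id not in self._instance_for_client:
            self._instance_for_client[client_id] = self._launch_instance()
        return self._instance_for_client[client_id]

    def disconnect(self, client_id: str) -> None:
        # A real implementation would also stop or recycle the instance here.
        self._instance_for_client.pop(client_id, None)


counter = itertools.count(1)
allocator = OneToOneAllocator(lambda: f"instance-{next(counter)}")
print(allocator.connect("client-a"))  # instance-1
print(allocator.connect("client-b"))  # instance-2
print(allocator.connect("client-a"))  # instance-1 (same client, same instance)
```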

2.3.2. Scenario 2: Controlling Peer + small number of Observing Peers

The second most common scenario for Pixel Streaming use cases is one in which there exists a many-to-one relationship between connected clients and instances of the Pixel Streaming application. In this scenario, each instance of the Pixel Streaming application is controlled by a single Controlling Peer and its video output is observed by a small to moderate number of Observing Peers. Some use cases may require that the role of Controlling Peer be able to migrate between the connected peers, whilst other use cases may prefer that control remains with a single Controlling Peer for the duration of the application session and is never transferred. Examples of use cases to which this scenario applies include collaborative product development and review, team presentations and interactive training systems.

Although it would be possible to service this scenario using only peer-to-peer WebRTC connections, the control and video quality limitations of the Pixel Streaming plugin discussed in Section 1.4.3 and Section 1.4.4 would make such a solution sub-optimal with respect to both security and user experience. Instead, this scenario is best served by the introduction of a Selective Forwarding Unit (SFU) (described in Section 1.2.3) to distribute the video stream from the Pixel Streaming plugin to the connected peers and to relay the control signals to the Pixel Streaming plugin from the Controlling Peer. To ensure optimal video quality for all peers in the presence of one or more peers with a poor network connection, the Pixel Streaming plugin should also be modified to use one of the more advanced bandwidth adaptation techniques supported by modern SFU implementations. Due to the lack of widespread support for scalable video coding techniques in GPU accelerated video encoders (as discussed in Section 1.3.5), we currently recommend the use of WebRTC simulcast when implementing bandwidth adaptation functionality for this scenario. This is depicted in the diagram below.

Usage Scenario 2: each instance of the Pixel Streaming application services both a single Controlling Peer and a small number of Observing Peers, with the option to migrate control between the peers. Note that this diagram depicts the scenario from the perspective of only a single instance of the Pixel Streaming application for the sake of clarity.
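
The benefit of simulcast in this scenario is that the SFU can choose a stream quality for each peer independently. The sketch below illustrates that selection step only. The layer names and bitrates are hypothetical values chosen for the example, and `select_layer` is not the API of any particular SFU implementation.

```python
from typing import NamedTuple


class Layer(NamedTuple):
    name: str
    bitrate_kbps: int


# Hypothetical simulcast encodings produced for a single application instance.
LAYERS = [Layer("low", 500), Layer("medium", 1_500), Layer("high", 5_000)]


def select_layer(bandwidth_kbps: int, layers: list[Layer]) -> Layer:
    """Pick the highest-bitrate layer that fits within a peer's bandwidth."""
    ordered = sorted(layers, key=lambda layer: layer.bitrate_kbps)
    chosen = ordered[0]  # fall back to the lowest layer if nothing fits
    for layer in ordered:
        if layer.bitrate_kbps <= bandwidth_kbps:
            chosen = layer
    return chosen


for peer, bandwidth in {"office": 20_000, "home": 2_000, "mobile": 400}.items():
    print(peer, select_layer(bandwidth, LAYERS).name)
# office high
# home medium
# mobile low
```

Because the choice is made per peer, a poorly connected peer receives the lowest layer without affecting the quality delivered to anyone else.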

2.3.3. Scenario 3: Controlling Peer + large number of Observing Peers

The least common scenario for Pixel Streaming use cases is one in which there exists a many-to-one relationship between connected clients and instances of the Pixel Streaming application and there is a requirement that the system scale to an extremely large number of geographically distributed Observing Peers. This scenario is very similar to the scenario described in the section above, but the requirement for extreme levels of scalability introduces unique technical challenges that must be addressed, due to the scalability limitations of common WebRTC architectures discussed in Section 1.2.4. Examples of use cases to which this scenario applies include live broadcast of events or presentations, game streaming and e-sports.

Although we understand that it should be feasible to service this scenario with a set of SFUs acting as distributed relay servers, we stand by our assertion in Section 1.2.4 that the use of either a hosted WebRTC platform or HLS/DASH leveraging an existing CDN will likely be the most cost-effective approach for most developers. The flexible and unopinionated nature of our proposed architecture should facilitate the use of either of these options as desired. For implementations which integrate HLS/DASH, an additional encoder will be required to generate the segment files from the received WebRTC stream, which is depicted in the diagram below.

Usage Scenario 3: each instance of the Pixel Streaming application services both a single Controlling Peer and a large number of Observing Peers, with no option to migrate control between the peers due to the distinct streaming technologies involved. Note that this diagram depicts the scenario from the perspective of only a single instance of the Pixel Streaming application for the sake of clarity.

2.4. Architecture

2.4.1. Overview

Informed by the key goals set out in Section 2.1 and the usage scenarios enumerated in the previous sections, we propose the following backend architecture for deploying Pixel Streaming applications at scale, depicted in the diagram below:

Our proposed backend architecture for deploying Pixel Streaming applications at scale. Note that this diagram omits components from Usage Scenario 3 (Controlling Peer + large number of Observing Peers) for the sake of clarity.

Our proposed architecture takes design cues from the high-level architecture of the Kubernetes container orchestration system, whereby components which make decisions about the state of the system are delineated into a conceptual “control plane” and the state of all other components is managed by the control plane components. ⁷² This is not to suggest that our architecture requires the use of Kubernetes, however. In keeping with our goal of remaining unopinionated, the control plane components are designed around a plugin architecture which allows developers to provide implementations that leverage any desired mechanism for managing the state of the other components in the system. Plugins might choose to leverage an existing orchestration solution such as Kubernetes, to implement a custom orchestration system, or even to forgo orchestration entirely and leverage a matchmaking system that coordinates existing Pixel Streaming application instances in the same manner as the matchmaking server provided in the reference implementation of Pixel Streaming.

It is worth noting that our proposed architecture is also unopinionated with respect to the following functionality:

Serving static files to clients
Selection of STUN/TURN servers for ICE
Load-balancing new client connections between multiple signalling server instances

As such, the architecture does not include components to provide this functionality and developers are left to utilise their preferred solutions for these components. For example, developers might leverage load balancing functionality specific to the cloud provider(s) hosting their compute resources, leverage CDNs for serving static files, deploy STUN/TURN servers and web servers alongside the control plane components, or even implement control plane plugins that manage servers in the same manner as other components such as Pixel Streaming application instances.

The sections that follow discuss the control plane components and managed components of our proposed architecture in more detail, and contrast our design with the servers provided by Epic Games in the reference implementation of Pixel Streaming.

2.4.2. Control plane components

The control plane encapsulates the components which make decisions about the state of the system as a whole and are responsible for managing the state of all other components. The control plane consists of three key components:

Signalling Server: the signalling server is the core of the control plane and the component through which all control signals flow. Much like the Cirrus signalling server from the reference implementation of Pixel Streaming, our signalling server provides signalling functionality via WebSocket communication for establishing WebRTC sessions and exchanging ICE candidates, and monitors the connectivity of WebRTC peers. This functionality is expanded slightly to support a wider variety of peers (e.g. connecting Pixel Streaming application instances to SFUs and SFUs to clients), but is otherwise conceptually identical to its reference implementation counterpart. As mentioned in the previous section, however, our signalling server does not act as a web server to serve static files to clients, since our proposed architecture defers the implementation of this functionality to developers in order to remain unopinionated. The signalling server also does not contain any logic for authenticating WebRTC peers, for determining the routing between internal and external WebRTC peers, or for selecting the Pixel Streaming application instances to which external peers will be connected. Instead, it delegates these responsibilities to plugins, allowing developers to customise its behaviour without needing to modify the source code of the signalling server itself. Plugins are implemented as network services that expose functionality via the gRPC remote procedure call framework. ⁷³ This ensures maximum flexibility, since plugins can be implemented in any of the programming languages supported by gRPC and can even be replicated and load-balanced for high availability deployments. This approach of using gRPC services to implement plugins is quite common amongst popular cloud native software, including Kubernetes ⁷⁴⁷⁵, containerd ⁷⁶, and HashiCorp tools such as Packer ⁷⁷ and Terraform ⁷⁸.
Authentication Plugin: the Authentication Plugin is a gRPC service that provides functionality for authenticating and identifying users, as well as determining their security permissions with respect to how they can control and interact with the Pixel Streaming application. Implementations of this plugin might communicate with database servers, leverage external identity providers, or accept dummy credentials for development and testing purposes. Developers are also free to define the security model employed by implementations of the plugin, which might range from simply associating permissions directly with user accounts all the way through to sophisticated Role-Based Access Control (RBAC) mechanisms. The Authentication Plugin provides its functionality to both the signalling server and the Instance Manager Plugin.
Instance Manager Plugin: the Instance Manager Plugin is a gRPC service that provides functionality for managing Pixel Streaming application instances (along with any accompanying components such as SFUs), for selecting the instance that each new client will connect to, and for determining the routing paths between internal and external WebRTC peers (e.g. establishing a connection between a Pixel Streaming application instance and an SFU, and then connections between the SFU and external clients). Implementations of this plugin might leverage existing orchestration solutions such as Kubernetes, spin up virtual machines using tools such as HashiCorp Terraform, spin up local containers or child processes for development and testing purposes, or even perform matchmaking based on an existing pool of available Pixel Streaming application instances. The Instance Manager Plugin can also communicate with the Authentication Plugin in the event that specific identity information is required in order to correctly manage or select Pixel Streaming application instances.
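
The division of responsibilities between the three control plane components can be sketched as a pair of interfaces and a signalling server that delegates to them. This is an illustration only and makes several assumptions: the plugins are shown as in-process Python objects rather than gRPC services, and every name here (`Identity`, `authenticate`, `select_instance`, `handle_new_peer`, `DummyAuth`, `StaticPool`) is hypothetical rather than part of a published API.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Identity:
    user_id: str
    permissions: frozenset[str]


class AuthenticationPlugin(Protocol):
    def authenticate(self, credentials: dict[str, str]) -> Optional[Identity]: ...


class InstanceManagerPlugin(Protocol):
    def select_instance(self, identity: Identity) -> str: ...


class SignallingServer:
    """Contains no authentication or selection logic; it only delegates."""

    def __init__(self, auth: AuthenticationPlugin, instances: InstanceManagerPlugin) -> None:
        self._auth = auth
        self._instances = instances

    def handle_new_peer(self, credentials: dict[str, str]) -> Optional[str]:
        identity = self._auth.authenticate(credentials)
        if identity is None:
            return None  # peer rejected by the Authentication Plugin
        return self._instances.select_instance(identity)


class DummyAuth:
    """Development-only plugin: accepts any non-empty user name."""

    def authenticate(self, credentials: dict[str, str]) -> Optional[Identity]:
        user = credentials.get("user")
        if not user:
            return None
        return Identity(user, frozenset({"control"}))


class StaticPool:
    """Matchmaking-style plugin: round-robins over existing instances."""

    def __init__(self, instance_ids: list[str]) -> None:
        self._ids = list(instance_ids)
        self._next = 0

    def select_instance(self, identity: Identity) -> str:
        chosen = self._ids[self._next % len(self._ids)]
        self._next += 1
        return chosen


server = SignallingServer(DummyAuth(), StaticPool(["instance-a", "instance-b"]))
print(server.handle_new_peer({"user": "alice"}))  # instance-a
print(server.handle_new_peer({}))                 # None
```

Swapping `StaticPool` for a plugin that talks to an orchestration system, or `DummyAuth` for one that calls an external identity provider, would require no change to `SignallingServer`, which is the property the plugin design is intended to provide.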

2.4.3. Managed components

All components outside of the control plane are managed by the Instance Manager Plugin, and so their definition and lifecycle will vary between plugin implementations. Managed components should typically fall into two categories:

Pixel Streaming application instances: the actual instances of the Pixel Streaming application to which WebRTC peers will connect. Whether the Instance Manager Plugin controls the lifecycle of these instances or simply monitors existing instances in order to coordinate communication is defined by the plugin implementation.
Supporting components: ancillary components that accompany the Pixel Streaming application instances and support their functionality. This might include WebRTC servers such as SFUs, web servers for serving static files to clients, logging or monitoring services, etc. Whether the Instance Manager Plugin controls the lifecycle of these components or simply monitors existing components in order to coordinate communication is defined by the plugin implementation.

2.4.4. Comparison to reference implementation components

The core functionality provided by the servers from the reference implementation of Pixel Streaming and our proposed backend architecture is deliberately identical, since both are backends for the Pixel Streaming system and our proposed architecture aims to act as a drop-in replacement for the reference servers. The key point of distinction which enables our proposed architecture to scale is the careful placement of responsibilities within each component to ensure that any component in the system can be transparently replaced without the need to modify other components, allowing developers to easily define and reuse logic which leverages the capabilities of any desired external system. Scalability will be achieved through the development of plugin implementations (both by TensorWorks and the Unreal Engine developer community) that build upon the ever-expanding ecosystem of cloud technologies designed to rapidly and reliably deploy applications at scale.

The table below summarises the placement of key responsibilities in the reference implementation of Pixel Streaming and our proposed architecture:

Responsibility	Reference implementation	Proposed architecture
WebRTC signalling	Cirrus signalling server	Signalling Server
Serving static files	Cirrus signalling server	User-defined
Authentication	Cirrus signalling server	Authentication Plugin
Selecting application instances	Matchmaking server	Instance Manager Plugin

Placement of key responsibilities in components from the reference implementation of Pixel Streaming and our proposed backend architecture.

One of the most important design decisions we made in our proposed architecture was to invert the control flow for selecting the Pixel Streaming application instances that new WebRTC peers will connect to. In the reference implementation, the matchmaking server maintains active TCP connections to a pool of signalling server instances and load-balances new WebRTC connections between idle instances. The signalling server instances are responsible for implementing authentication (albeit only for serving static files, not for protecting actual WebRTC access) and establishing the WebRTC connection between peers and their respective Pixel Streaming application instances. ⁶⁸ This design has the effect of enforcing a one-to-one relationship between signalling server instances and Pixel Streaming application instances, and enforcing data flow and topology choices that may have implications for load-balancing at scale. By inverting this control and allowing the signalling server to delegate directly to plugins for all key logic including authentication and Pixel Streaming application instance selection, we avoid these limitations and give developers complete freedom to define and control how the backend behaves once deployed.

2.5. Recommendations

Although our proposed architecture is unopinionated and gives developers the freedom to integrate their preferred technologies and deployment approaches, we acknowledge that there is still value in providing guidance which developers can take into consideration when making these choices. We therefore propose the following recommendations based on our experience with Pixel Streaming and cloud deployments of Unreal Engine applications:

Use containers: the use of containers to encapsulate packaged Unreal Engine projects provides a number of benefits, including portability with respect to the host environment within which Pixel Streaming application instances are run, reduced overheads (and thus higher deployment density) when compared to encapsulating instances in virtual machines, and compatibility with the growing ecosystem of container-oriented tooling and infrastructure, all without sacrificing performance when Linux containers are used ⁷⁹. Our work porting Pixel Streaming to Linux was primarily motivated by the desire to make these benefits readily available to developers by allowing Pixel Streaming applications to run inside GPU accelerated Linux containers. ⁷¹
Use a container orchestration system: container orchestration systems such as Kubernetes make it easy to deploy a dynamic number of containers across a cluster of worker nodes, and managed Kubernetes services from public cloud providers will typically also handle dynamic resizing of the cluster itself in response to workload fluctuations. ⁸⁰⁸¹⁸² This reduces both operational burden and cost when compared to manually managing infrastructure and scheduling workloads.

3. Implementation roadmap

Our intention is to implement and maintain not only the core components of our proposed backend architecture, but also a suite of plugin implementations and supporting components that will allow developers to quickly and easily get started deploying Pixel Streaming applications at scale. The following development phases are currently planned:

Phase 1: Foundation. In this phase, the core signalling server will be implemented, along with a no-op implementation of the Authentication Plugin and a simple implementation of the Instance Manager Plugin that spins up local container instances. This will provide the foundational components required in order to develop and test the backend architecture locally.
Phase 2: Kubernetes and OAuth. In this phase, an implementation of the Instance Manager Plugin will be developed that leverages the Kubernetes orchestration system to create and schedule container instances within a cluster. Conceptually, this implementation will behave similarly to the “Pod Broker” component in Google’s proposed architecture for a Kubernetes-based application streaming solution, which addresses similar use cases to our proposed architecture but does so in a heavily opinionated manner that relies on a number of specific Google Cloud services. ⁸³ An implementation of the Authentication Plugin will also be developed during this phase which leverages the OAuth 2.0 authorisation standard ⁸⁴ to access external identity providers.
Phase 3: Frontend and Extras. In this phase, we will develop supporting components that are not required in order to use our proposed architecture but which make it easier for developers to get started building and deploying applications. The most important of these components will be an extensible client implementation for use in web browsers that connect to the signalling server. Other components developed during this phase will be identified through testing during the previous phases and through feedback from the Unreal Engine developer community.
Phase 4: Integration Suites. In this phase, we will work with public cloud providers to develop solutions which integrate our proposed architecture with their specific platforms and services. These turnkey solutions will allow developers to deploy Pixel Streaming applications at scale without the need to perform extensive setup or maintenance activities. Note that unlike all of the components developed during previous phases, these integrated solutions will likely be commercial offerings which generate revenue to help fund the continued development and maintenance of the open source components.

Acknowledgements

The authors would like to thank the team at Epic Games for providing feedback on our proposed architecture during the development of this report.