PlayStation 2 GS emulation – the final frontier of Vulkan compute emulation


As you may or may not know, I wrote paraLLEl-RDP back in 2020. It aimed at implementing the N64 RDP in Vulkan compute: lightning fast and extremely accurate, with the added benefit of up-scaling on top. I’m quite happy with how it turned out. Of course, the extreme accuracy was possible because Angrylion served as the reference, so I could aim for bit-exactness against that implementation.

Since then, there’s been the lingering idea of doing the same thing, but for PlayStation 2. Until now, there’s really only been one implementation in town, GSdx, which has remained the state of the art for 20 years.

paraLLEl-GS is actually not the first compute implementation of the PS2 GS. An attempt was made back in 2014 for OpenCL as far as I recall, but it was never completed. At least, I cannot find it in the current upstream repo anymore.

The argument for doing compute shader raster on PS2 is certainly weaker than on N64. Angrylion was, and is, extremely slow, and N64 is extremely sensitive to accuracy, to the point where hardware acceleration with graphics APIs is not possible without serious compromises. PCSX2 on the other hand has a well-optimized software renderer and a pretty solid graphics-based renderer, but that doesn’t mean there aren’t issues. The software renderer does not support up-scaling, for example, and there are a myriad of bugs and glitches with the graphics-based renderer, especially with up-scaling. As we’ll see, the PS2 GS is quite the nightmare to emulate in its own way.

My main motivation here is basically “because I can”. I already had a project lying around that did “generic” compute shader rasterization. I figured that maybe we could retro-fit this to support PS2 rendering.

I didn’t work on this project alone. My colleague, Runar Heyer, helped out a great deal in the beginning to get this started, doing all the leg-work to study the PS2 from various resources, doing the initial prototype implementation and fleshing out the Vulkan GLSL to emulate PS2 shading. Eventually, we hit some serious roadblocks in debugging various games, and the project was put on ice for a while since I was too drained from dealing with horrible D3D12 game debugging day in and day out. The last months haven’t been a constant fire fight, so I’ve finally had the mental energy to finish it.

My understanding of the GS is mostly based on what Runar figured out, and what I’ve seen by debugging games. The GSdx software renderer does not seem to be hardware bit-accurate, so we were constantly second-guessing things when trying to compare output. This caused a major problem when we had the idea of writing detailed tests that compared directly against the GSdx software renderer, and the test-driven approach fell flat very quickly. As a result, paraLLEl-GS isn’t really aiming for bit-accuracy against hardware, but it tries hard to avoid obvious accuracy issues at the very least.

Basic GS overview

Again, this is based on my understanding, and it might not be correct. 😀

GS is a pixel-pushing monster

The GS is infamous for its insane fill-rate and bandwidth. It could push over a billion pixels per second (in theory at least) back in 2000, which was nuts. While the VRAM is quite small (4 MiB), it was designed to be continuously streamed into using the various DMA engines.

Given the extreme fill-rate requirements, we have to design our renderer accordingly.

GS pixel pipeline is very basic, but quirky

In many ways, the GS is actually simpler than the N64 RDP. It has a single texture and a single cycle combiner, where N64 had a two stage combiner + two stage blender. Whatever AA support is there is extremely basic as well, where N64 is delightfully esoteric. The parts of the pixel pipeline that are painful to implement with traditional graphics APIs are:

Blending goes beyond 1.0

Inherited from PS1, 0x80 is treated as 1.0, and it can go all the way up to 0xff (almost 2). Shifting by 7 is easier than dividing by 255, I suppose. I’ve seen some extremely ugly workarounds in PCSX2 that try to get around this, since UNORM formats cannot support it as is. Textures are similar, in that alpha > 1.0 is representable.

There is wrapping logic that can be used for when colors or alpha go above 0xFF.
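To make the 0x80-as-1.0 convention concrete, here is a minimal C++ sketch of one 8-bit channel of a blend. The function name and the A/B/C/D parameterization are illustrative assumptions, not code from paraLLEl-GS; the point is only that the multiply is followed by a shift by 7, and that the result is either clamped or wrapped.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: one channel of a blend of the form ((a - b) * c >> 7) + d.
// c is the blend factor, where 0x80 means 1.0 and 0xff is almost 2.0.
inline uint8_t gs_blend_channel(int a, int b, int c, int d, bool clamp_result)
{
    // Arithmetic right shift of negative values is assumed here
    // (it is guaranteed from C++20 onwards).
    int value = (((a - b) * c) >> 7) + d;

    if (clamp_result)
        return uint8_t(std::clamp(value, 0, 0xff)); // saturate to 8 bits

    return uint8_t(value & 0xff); // wrap-around mode keeps the low 8 bits
}
```

With a factor of 0x80 the source passes through unchanged, while a factor of 0xff on a bright source overflows 8 bits, which is exactly the case a plain UNORM blend cannot express.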

Destination alpha testing

The destination alpha can be used as a pseudo-stencil of sorts, and this is extremely painful without programmable blending. I suspect this was added for PS1 compatibility, since PS1 also had this strange feature.

Conditional blending

Based on the alpha, it’s possible to conditionally disable blending. Quite awkward without programmable blending … This is another PS1 compat feature. With PS1, it can be emulated by rendering every primitive twice with state changes in-between, but this quickly gets impractical with PS2.

Alpha correction

Before alpha is written out, it’s possible to OR in the MSB, essentially forcing alpha to at least 1. It is not equivalent to alphaToOne however, since it’s a bit-wise OR of the MSB.
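A tiny sketch of why this differs from alphaToOne. The function name is made up for illustration; it only assumes what the paragraph above states, namely that the MSB (0x80, i.e. 1.0) is OR-ed into the alpha value before the write.

```cpp
#include <cstdint>

// Hypothetical helper: OR the MSB into alpha when alpha correction is enabled.
// The lower 7 bits survive, so the result is ">= 1.0", not "exactly 1.0".
inline uint8_t alpha_correct(uint8_t alpha, bool enable)
{
    return enable ? uint8_t(alpha | 0x80u) : alpha;
}
```

An input of 0x10 becomes 0x90 rather than 0x80, which is what an alphaToOne-style override would have produced.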

Alpha test can partially discard

A fun thing alpha tests can do is to partially discard. E.g. you can discard just color, but keep the depth write. Quite nutty.

AA1 – coverage-to-alpha – can control depth write per pixel

This is kinda awkward. The only anti-aliasing the PS2 has is AA1, which is a coverage-to-alpha feature. Supposedly, less than 100% coverage should disable depth writes (and blending is enabled), but the GSdx software renderer behavior here is extremely puzzling. I don’t really understand it yet.

32-bit fixed-point Z

I’ve still yet to see any games actually using this, but technically, it has D32_UINT support. Fun! From what I could grasp, the GSdx software renderer implements this with FP64 (one of the many reasons I refuse to believe GSdx is bit-accurate), but FP64 is completely impractical on GPUs. When I have to, I’ll implement this with fixed-point math. 24-bit and 16-bit Z should be fine with FP32 interpolation, I think.

Pray you have programmable blending

If you’re on a pure TBDR GPU most of this is quite doable, but immediate mode desktop GPUs quickly degenerate into ROV or per-pixel barriers after every primitive to emulate programmable blending, both of which are horrible for performance. Of course, with compute we can make our own TBDR to bypass all this. 🙂

D3D9-style raster rules

Primitives are fortunately provided in a plain form in screen-space. No awkward N64 edge equations here. The VU1 unit is supposed to do transforms and clipping, and emit various per-vertex attributes:

X/Y: 12.4 unsigned fixed-point

Z: 24-bit or 32-bit uint

FOG: 8-bit uint

RGBA: 8-bit, for per-vertex lighting

STQ: For perspective correct texturing with normalized coordinates. Q = 1 / w, S = s * Q, T = t * Q. Apparently the lower 8 bits of the mantissa are clipped away, so bfloat24? Q can be negative, which is always fun. No idea how this interacts with Inf and NaN …

UV: For non-perspective correct texturing. 12.4 fixed-point un-normalized.

  • Triangles are top-left raster, just like modern GPUs.
  • Pixel center is on integer coordinate, just like D3D9. (This is a common design mistake that D3D10+ and GL/Vulkan avoid).
  • Lines use Bresenham’s algorithm, which is not really feasible to upscale, so we have to fudge it with a rect or parallelogram.
  • Points snap to the nearest pixel. Unsure which rounding is used, though … There is no interpolation ala gl_PointCoord.
  • Sprites are simple quads with two coordinates. STQ or UV can be interpolated and it seems to assume non-rotated coordinates. To support rotation, you’d need 3 coordinates to disambiguate.

All of this can be implemented fairly easily in normal graphics APIs, as long as we don’t consider upscaling. We have to rely on implementation details in GL and Vulkan, since these APIs don’t technically guarantee top-left raster rules.

Since X/Y is unsigned, there is an XY offset that can be applied to center the viewport where you want. This means the effective range of X/Y is +/- 4k pixels, a healthy guard band for 640×448 resolutions.
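As a rough illustration of the 12.4 format and the XY offset, here is a small C++ sketch. The helper names are hypothetical, and it assumes the offset is simply subtracted from the vertex coordinate in 1/16 pixel units.

```cpp
#include <cstdint>

// Hypothetical helpers for 12.4 unsigned fixed-point coordinates.
// Assumes the input is non-negative and below 4096 pixels.
inline uint16_t encode_12_4(float pixels)
{
    return uint16_t(pixels * 16.0f + 0.5f); // 4 fractional bits, round to nearest
}

// Window-space coordinate in pixels after applying the XY offset.
// Both inputs are unsigned 12.4, the difference is signed.
inline float window_coordinate(uint16_t xy, uint16_t offset)
{
    int delta = int(xy) - int(offset);
    return float(delta) / 16.0f;
}
```

With both values spanning 0 to just under 4096 pixels, the difference covers just under -4096 to +4096 pixels, which is the guard band mentioned above.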

Vertex queue

The GS feels very much like old school OpenGL 1.0 with glVertex3f and friends. It even supports TRIANGLE_FAN! Amazing … RGBA, STQ and various registers are set, and every XYZ register write forms a vertex “kick” which latches vertex state and advances the queue. An XYZ register write may also be a drawing kick, which draws a primitive if the vertex queue is sufficiently filled. The vertex queue is managed differently depending on the topology. The semantics here seem to be fairly straightforward: strip primitives shift the queue by one, and list primitives clear the queue. Triangle fans keep the first element in the queue.
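The queue semantics can be sketched in a few lines of C++. This is a simplified model for triangle topologies only; the class and its names are hypothetical, and it assumes that a non-drawing kick still advances the queue in the same way as a drawing kick.

```cpp
#include <array>
#include <cstdint>
#include <optional>

enum class Topology { TriangleList, TriangleStrip, TriangleFan };

struct Triangle { std::array<uint32_t, 3> v; };

// Hypothetical model of the vertex queue. Vertices are plain indices here.
class VertexQueue
{
public:
    explicit VertexQueue(Topology topology) : topology_(topology) {}

    // Models one XYZ register write. Returns a triangle only when the queue
    // is full and the write is a drawing kick.
    std::optional<Triangle> kick(uint32_t vertex, bool draw_kick)
    {
        queue_[count_++] = vertex;
        if (count_ < 3)
            return std::nullopt;

        Triangle tri{{queue_[0], queue_[1], queue_[2]}};

        switch (topology_)
        {
        case Topology::TriangleList: // lists clear the queue
            count_ = 0;
            break;
        case Topology::TriangleStrip: // strips shift the queue by one
            queue_[0] = queue_[1];
            queue_[1] = queue_[2];
            count_ = 2;
            break;
        case Topology::TriangleFan: // fans keep the first element
            queue_[1] = queue_[2];
            count_ = 2;
            break;
        }

        if (!draw_kick)
            return std::nullopt;
        return tri;
    }

private:
    Topology topology_;
    std::array<uint32_t, 3> queue_{};
    unsigned count_ = 0;
};
```

Feeding vertices 0, 1, 2, 3 into a strip yields (0, 1, 2) and (1, 2, 3), while the same input to a fan yields (0, 1, 2) and (0, 2, 3).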

Fun swizzling formats

A clever idea is that while rendering to 24-bit color or 24-bit depth, there are 8 bits left unused in the MSBs. You can place textures there, because why not. The 8H, 4HL and 4HH formats support 8-bit and 4-bit palettes nicely.

Pixel coordinates on PS2 are arranged into “pages”, which are 8 KiB, then subdivided into 32 blocks, and then the smaller blocks are swizzled into a layout that fits well with a DDA-style renderer. E.g. for 32-bit RGBA, a page is 64×32 pixels, and 32 8×8 blocks are Z-order swizzled into that page.
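For the 32-bit case described above, the page and block lookup might be sketched as follows. The function names are hypothetical, the exact bit interleave is my assumption of what “Z-order” means for an 8×4 grid of blocks, and the per-pixel layout inside each 8×8 block is deliberately left out.

```cpp
#include <cstdint>

constexpr uint32_t PAGE_WIDTH_CT32 = 64;  // pixels
constexpr uint32_t PAGE_HEIGHT_CT32 = 32; // pixels
constexpr uint32_t BLOCKS_PER_PAGE = 32;  // 8 columns x 4 rows of 8x8 blocks

// Z-order (Morton) index of the 8x8 block containing pixel (x, y) within its page.
inline uint32_t block_in_page_ct32(uint32_t x, uint32_t y)
{
    uint32_t bx = (x % PAGE_WIDTH_CT32) / 8;  // 0..7, three bits
    uint32_t by = (y % PAGE_HEIGHT_CT32) / 8; // 0..3, two bits
    // Interleave as x0, y0, x1, y1, x2 from LSB to MSB.
    return (bx & 1u) | ((by & 1u) << 1) | ((bx & 2u) << 1) |
           ((by & 2u) << 2) | ((bx & 4u) << 2);
}

// Block address (in 256-byte blocks) relative to the buffer base,
// for a buffer that is width_in_pages pages wide.
inline uint32_t block_address_ct32(uint32_t x, uint32_t y, uint32_t width_in_pages)
{
    uint32_t page = (y / PAGE_HEIGHT_CT32) * width_in_pages + (x / PAGE_WIDTH_CT32);
    return page * BLOCKS_PER_PAGE + block_in_page_ct32(x, y);
}
```

Note that 32 blocks of 8×8 pixels at 4 bytes each add up to the 8 KiB page size, so each block is 256 bytes.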

Framebuffer cache and texture cache

There is a dedicated cache for framebuffer rendering and textures, one page’s worth. Games often abuse this to perform feedback loops, where they render on top of the pixels being sampled from. This is the root cause of extreme pain. N64 avoided this problem by having explicit copies into TMEM (and not really having the bandwidth to do elaborate feedback effects), and other consoles rendered to embedded SRAM (ala a tiler GPU), so these feedbacks aren’t as painful, but the GS is complete YOLO. Dealing with this gracefully is probably the biggest challenge. Combined with the PS2 being a bandwidth monster, developers knew how to take advantage of copious blending and blurring passes …

Texturing

Texturing on the GS is both very familiar, and arcane.

On the plus side, the texel center is at half-pixel, just like modern APIs. It seems like it has 4-bit sub-texel precision instead of 8, however. This is easily solved with some rounding. It also seems to have floor-rounding instead of nearest-rounding for bi-linear.

The bi-linear filter is a normal bi-linear. No weird 3-point N64 filter here.
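As a hedged sketch of what “4-bit sub-texel precision with floor rounding” could look like for one axis: the structure and function below are illustrative assumptions, not the actual shader code, and they take an un-normalized 12.4 coordinate as input.

```cpp
// Hypothetical bilinear tap along one axis.
struct BilinearTap
{
    int texel0; // left / top texel index
    int texel1; // right / bottom texel index
    int weight; // weight of texel1 in sixteenths, 0..15
};

// coord_12_4 is an un-normalized texture coordinate with 4 fractional bits.
inline BilinearTap bilinear_tap(int coord_12_4)
{
    // Texel centers sit at half-texel offsets, so shift by half a texel (8/16).
    int shifted = coord_12_4 - 8;
    // Floor to the base texel. Arithmetic shift of negatives is assumed.
    int base = shifted >> 4;
    // The remaining 4 bits are the filter weight, no further rounding.
    return { base, base + 1, shifted & 15 };
}
```

Sampling exactly at a texel center (e.g. 1.5 texels, or 24 in 12.4) gives a weight of zero, so the filter returns that texel unmodified.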

On the more weird side, there are two special addressing modes.

REGION_CLAMP supports an arbitrary clamp inside a texture atlas (wouldn’t this be nice in normal graphics APIs? :D). It also works with REPEAT, so you can have REPEAT semantics on the border, but then clamp slightly into the next “wrap”. This is trivial to emulate.

REGION_REPEAT is … worse. Here we can have custom bit-wise computation per coordinate. So something like u’ = (u & MASK) | FIX. This is done per-coordinate in bi-linear filtering, which is … painful, but solvable. This is another weird PS1 feature that was likely inherited for compatibility. At least on PS1, there was no bi-linear filtering to complicate things 🙂
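Both modes are simple when written out per integer texel coordinate. The helpers below are a hypothetical sketch; the key point is that for bi-linear filtering they must be applied to each of the two neighbouring texel indices separately, not to the filtered coordinate.

```cpp
#include <algorithm>
#include <utility>

// Hypothetical helper: clamp a texel index into [min_u, max_u].
inline int region_clamp(int u, int min_u, int max_u)
{
    return std::min(std::max(u, min_u), max_u);
}

// Hypothetical helper: u' = (u & MASK) | FIX.
inline int region_repeat(int u, int mask, int fix)
{
    return (u & mask) | fix;
}

// For bi-linear filtering, both taps are remapped independently.
inline std::pair<int, int> region_repeat_taps(int texel0, int mask, int fix)
{
    return { region_repeat(texel0, mask, fix), region_repeat(texel0 + 1, mask, fix) };
}
```

With MASK = 0x0f and FIX = 0x40, neighbouring texels 15 and 16 map to 0x4f and 0x40, so the second tap wraps back to the start of the 16-texel region.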

Mip-mapping is also somewhat curious. Rather than relying on derivatives, the log2 of the interpolated Q factor, along with some scaling factors, is used to compute the LOD. This is quite clever, but I haven’t really seen any games use it. The down-side is that triangle setup becomes rather complex if you want to account for correct tri-linear filtering, and it cannot support e.g. anisotropic filtering, but this is 2000, who cares! Not relying on derivatives is a huge boon for the compute implementation.
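A floating-point sketch of a derivative-free LOD. The exact formula is an assumption on my part: a log2 of 1/|Q|, scaled by a power of two (called `l` here) and biased by a constant (called `k` here). The real hardware works in fixed point and its rounding is not modelled.

```cpp
#include <cmath>

// Hypothetical LOD computation: (log2(1 / |q|) * 2^l) + k.
// q is the interpolated Q = 1 / w, l and k are per-texture scaling factors.
inline float compute_lod(float q, int l, float k)
{
    float log2_w = std::log2(1.0f / std::fabs(q));
    return std::ldexp(log2_w, l) + k; // ldexp multiplies by 2^l
}
```

Since Q is 1/w, the LOD grows with the log of the distance, which is how a single interpolated value can stand in for screen-space derivatives.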

Formats are always “normalized” to RGBA8_UNORM. The 5551 format is expanded to 8888 without bit-replication. There is no RGBA4444 format.
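Expansion without bit-replication means the 5-bit channels are simply shifted up, so full intensity becomes 0xf8 rather than 0xff. A small sketch follows; the function name is hypothetical, and how the 1-bit alpha maps to 8 bits is treated as an assumption by passing the two possible alpha values in as parameters.

```cpp
#include <cstdint>

// Hypothetical helper: expand a 5551 texel to packed RGBA8 (R in the low byte).
// alpha0 / alpha1 are the 8-bit alpha values chosen for A = 0 and A = 1.
inline uint32_t expand_5551(uint16_t c, uint8_t alpha0, uint8_t alpha1)
{
    uint32_t r = uint32_t(c & 0x1f) << 3;         // low 3 bits stay zero
    uint32_t g = uint32_t((c >> 5) & 0x1f) << 3;
    uint32_t b = uint32_t((c >> 10) & 0x1f) << 3;
    uint32_t a = (c & 0x8000) ? alpha1 : alpha0;
    return r | (g << 8) | (b << 16) | (a << 24);
}
```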

It’s quite feasible to implement the texturing with plain bindless.

CLUT

This is a 1 KiB cache that holds the current palette. There is an explicit copy step from VRAM into that CLUT cache before it can be used. Why hello there, N64 TMEM!

The CLUT is organized such that it can hold one full 256 color palette in 32-bit colors. On the other end, it can hold 32 palettes of 16 colors at 16 bpp.

TEXFLUSH

There is an explicit command that functions like a “sync and invalidate texture cache”. In the beginning I was hoping to rely on this to guide the hazard tracking, but oh how naive I was. In the end, I simply had to ignore TEXFLUSH. Basically, there are two styles of caching we could take with GS.

With “maximal” caching, we can assume that frame buffer caches and texture caches are infinitely large. The only time a hazard needs to be considered is after an explicit flush. This … breaks hard. Either games forget to use TEXFLUSH (because it happened to work on real hardware), or they TEXFLUSH way too much.

With “minimal” caching, we assume there is no caching and hazards are tracked directly. Some edge case handling is considered for feedback loops.

I went with “minimal”, and I believe GSdx did too.

Poking registers with style – GIF

The way to interact with the GS hardware is through the GIF, which is basically a unit that reads data and pokes the correct hardware registers. At the start of a GIF packet, there is a header which configures which registers should be written to, and how many “loops” there are. This maps perfectly to mesh rendering. We can consider something like one “loop” being:

  • Write RGBA vertex color
  • Write texture coordinate
  • Write position with draw kick

And if we have 300 vertices to render, we’d use 300 loops. State registers can be poked through the Address + Data pair, which just encodes target register + 64-bit payload. It’s possible to render this way too of course, but it’s just inefficient.
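The loop structure can be modelled with a very small C++ sketch. This is not the real GIF tag bit layout; the struct, its fields, and the 64-bit-per-entry payload are simplifying assumptions meant only to show how a register list is replayed once per loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of a GIF packet.
struct GifPacketModel
{
    std::vector<uint8_t> registers; // target register for each slot of one loop
    uint32_t num_loops;             // how many times the register list repeats
    std::vector<uint64_t> payload;  // num_loops * registers.size() values
};

// Replays the packet, calling write_register(reg, value) for each entry.
// Returns the number of register writes performed.
template <typename WriteRegister>
std::size_t dispatch_packet(const GifPacketModel &packet, WriteRegister &&write_register)
{
    std::size_t cursor = 0;
    for (uint32_t loop = 0; loop < packet.num_loops; loop++)
    {
        for (uint8_t reg : packet.registers)
        {
            if (cursor >= packet.payload.size())
                return cursor; // truncated packet, stop early
            write_register(reg, packet.payload[cursor++]);
        }
    }
    return cursor;
}
```

A mesh with 300 vertices and three registers per vertex would be one header followed by 900 payload entries, with the last register in each loop acting as the draw kick.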

Textures are uploaded through the same mechanism. Various state registers are written to set up transfer destinations, formats, etc, and a special register is nudged to transfer 64 bits at a time to VRAM.

Hello Trongle – GS

If you missed the brain-dead simplicity of OpenGL 1.0, this is the API for you! 😀

For testing purposes, I added a tool to generate a .gs dump format that PCSX2 can consume. This is handy for comparing implementation behavior.

First, we program the frame buffer and scissor:

TESTBits test = {};
test.ZTE = TESTBits::ZTE_ENABLED;
test.ZTST = TESTBits::ZTST_GREATER; // Inverse Z, LESS is not supported.
iface.write_register(RegisterAddr::TEST_1, test);

FRAMEBits frame = {};
frame.FBP = 0x0 / PAGE_ALIGNMENT_BYTES;
frame.PSM = PSMCT32;
frame.FBW = 640 / BUFFER_WIDTH_SCALE;
iface.write_register(RegisterAddr::FRAME_1, frame);

ZBUFBits zbuf = {};
zbuf.ZMSK = 0; // Enable Z-write
zbuf.ZBP = 0x118000 / PAGE_ALIGNMENT_BYTES;
iface.write_register(RegisterAddr::ZBUF_1, zbuf);

SCISSORBits scissor = {};
scissor.SCAX0 = 0;
scissor.SCAY0 = 0;
scissor.SCAX1 = 640 - 1;
scissor.SCAY1 = 448 - 1;
iface.write_register(RegisterAddr::SCISSOR_1, scissor);

Then we nudge some registers to draw:

struct Vertex
{
    PackedRGBAQBits rgbaq;
    PackedXYZBits xyz;
} vertices[3] = {};

for (auto &vert : vertices)
{
   vert.rgbaq.A = 0x80;
   vert.xyz.Z = 1;
}

vertices[0].rgbaq.R = 0xff;
vertices[1].rgbaq.G = 0xff;
vertices[2].rgbaq.B = 0xff;

vertices[0].xyz.X = p0.x << SUBPIXEL_BITS;
vertices[0].xyz.Y = p0.y << SUBPIXEL_BITS;
vertices[1].xyz.X = p1.x << SUBPIXEL_BITS;
vertices[1].xyz.Y = p1.y << SUBPIXEL_BITS;
vertices[2].xyz.X = p2.x << SUBPIXEL_BITS;
vertices[2].xyz.Y = p2.y << SUBPIXEL_BITS;

PRIMBits prim = {};
prim.TME = 0; // Turn off texturing.
prim.IIP = 1; // Interpolate RGBA (Gouraud shading)
prim.PRIM = int(PRIMType::TriangleList);

static const GIFAddr addr[] = { GIFAddr::RGBAQ, GIFAddr::XYZ2 };
constexpr uint32_t num_registers = sizeof(addr) / sizeof(addr[0]);
constexpr uint32_t num_loops = sizeof(vertices) / sizeof(vertices[0]);
iface.write_packed(prim, addr, num_registers, num_loops, vertices);

This draws a triangle. We provide coordinates directly in screen-space.

And finally, we need to program the CRTC. Most of this is just copy-pasta from whatever games tend to do.

auto &priv = iface.get_priv_register_state();

priv.pmode.EN1 = 1;
priv.pmode.EN2 = 0;
priv.pmode.CRTMD = 1;
priv.pmode.MMOD = PMODEBits::MMOD_ALPHA_ALP;
priv.smode1.CMOD = SMODE1Bits::CMOD_NTSC;
priv.smode1.LC = SMODE1Bits::LC_ANALOG;
priv.bgcolor.R = 0x0;
priv.bgcolor.G = 0x0;
priv.bgcolor.B = 0x0;
priv.pmode.SLBG = PMODEBits::SLBG_ALPHA_BLEND_BG;
priv.pmode.ALP = 0xff;
priv.smode2.INT = 1;

priv.dispfb1.FBP = 0;
priv.dispfb1.FBW = 640 / BUFFER_WIDTH_SCALE;
priv.dispfb1.PSM = PSMCT32;
priv.dispfb1.DBX = 0;
priv.dispfb1.DBY = 0;
priv.display1.DX = 636; // Magic values that center the screen.
priv.display1.DY = 50; // Magic values that center the screen.
priv.display1.MAGH = 3; // scaling factor = MAGH + 1 = 4 -> 640 px wide.
priv.display1.MAGV = 0;
priv.display1.DW = 640 * 4 - 1;
priv.display1.DH = 448 - 1;

dump.write_vsync(0, iface);
dump.write_vsync(1, iface);

When the GS is dumped, we can load it up in PCSX2 and voila:

And here’s the same .gs dump played through parallel-gs-replayer with RenderDoc. For debugging, I’ve spent a lot of time making it reasonably convenient. The images are debug storage images where I can store before and after color, depth, debug values for interpolants, depth testing state, etc. It’s super handy for narrowing down problem cases. The render pass can be split into chunks of 1 or more triangles as needed.

To add some textures, and flex the capabilities of the CRTC a bit, we can try uploading a texture:

int w, h, chan;
auto *buf = stbi_load("/tmp/test.png", &w, &h, &chan, 4);
iface.write_image_upload(0x300000, PSMCT32, w, h, buf,
                         w * h * sizeof(uint32_t));
stbi_image_free(buf);

TEX0Bits tex0 = {};
tex0.PSM = PSMCT32;
tex0.TBP0 = 0x300000 / BLOCK_ALIGNMENT_BYTES;
tex0.TBW = (w + BUFFER_WIDTH_SCALE - 1) / BUFFER_WIDTH_SCALE;
tex0.TW = Util::floor_log2(w - 1) + 1;
tex0.TH = Util::floor_log2(h - 1) + 1;
tex0.TFX = COMBINER_DECAL;
tex0.TCC = 1; // Use texture alpha as blend alpha
iface.write_register(RegisterAddr::TEX0_1, tex0);

TEX1Bits tex1 = {};
tex1.MMIN = TEX1Bits::LINEAR;
tex1.MMAG = TEX1Bits::LINEAR;
iface.write_register(RegisterAddr::TEX1_1, tex1);

CLAMPBits clamp = {};
clamp.WMS = CLAMPBits::REGION_CLAMP;
clamp.WMT = CLAMPBits::REGION_CLAMP;
clamp.MINU = 0;
clamp.MAXU = w - 1;
clamp.MINV = 0;
clamp.MAXV = h - 1;
iface.write_register(RegisterAddr::CLAMP_1, clamp);

While PS2 requires POT sizes for textures, REGION_CLAMP is handy for NPOT. Super useful for texture atlases.
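The TW / TH computation in the snippet above rounds the texture size up to the next power of two and stores its log2. A self-contained sketch of that arithmetic is below; `floor_log2` here is my own stand-in for the `Util::floor_log2` helper, whose actual implementation is not shown in this post.

```cpp
#include <cstdint>

// Stand-in for a floor(log2(v)) helper. Returns 0 for v == 0 or v == 1.
inline uint32_t floor_log2(uint32_t v)
{
    uint32_t result = 0;
    while (v >>= 1)
        result++;
    return result;
}

// log2 of the smallest power-of-two size that can hold `size` texels.
inline uint32_t texture_size_log2(uint32_t size)
{
    return size <= 1 ? 0 : floor_log2(size - 1) + 1;
}
```

A 256×179 image therefore ends up as a 256×256 texture (TW = 8, TH = 8), with REGION_CLAMP restricting sampling to the 179 rows that actually contain data.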

struct Vertex
{
    PackedUVBits uv;
    PackedXYZBits xyz;
} vertices[2] = {};

for (auto &vert : vertices)
    vert.xyz.Z = 1;

vertices[0].xyz.X = p0.x << SUBPIXEL_BITS;
vertices[0].xyz.Y = p0.y << SUBPIXEL_BITS;
vertices[1].xyz.X = p1.x << SUBPIXEL_BITS;
vertices[1].xyz.Y = p1.y << SUBPIXEL_BITS;
vertices[1].uv.U = w << SUBPIXEL_BITS;
vertices[1].uv.V = h << SUBPIXEL_BITS;

PRIMBits prim = {};
prim.TME = 1; // Turn on texturing.
prim.IIP = 0;
prim.FST = 1; // Use unnormalized coordinates.
prim.PRIM = int(PRIMType::Sprite);

static const GIFAddr addr[] = { GIFAddr::UV, GIFAddr::XYZ2 };
constexpr uint32_t num_registers = sizeof(addr) / sizeof(addr[0]);
constexpr uint32_t num_loops = sizeof(vertices) / sizeof(vertices[0]);
iface.write_packed(prim, addr, num_registers, num_loops, vertices);

Here we render a sprite with un-normalized coordinates.

Finally, we use the CRTC to do blending against white background.

priv.pmode.EN1 = 1;
priv.pmode.EN2 = 0;
priv.pmode.CRTMD = 1;
priv.pmode.MMOD = PMODEBits::MMOD_ALPHA_CIRCUIT1;
priv.smode1.CMOD = SMODE1Bits::CMOD_NTSC;
priv.smode1.LC = SMODE1Bits::LC_ANALOG;
priv.bgcolor.R = 0xff;
priv.bgcolor.G = 0xff;
priv.bgcolor.B = 0xff;
priv.pmode.SLBG = PMODEBits::SLBG_ALPHA_BLEND_BG;
priv.smode2.INT = 1;

priv.dispfb1.FBP = 0;
priv.dispfb1.FBW = 640 / BUFFER_WIDTH_SCALE;
priv.dispfb1.PSM = PSMCT32;
priv.dispfb1.DBX = 0;
priv.dispfb1.DBY = 0;
priv.display1.DX = 636; // Magic values that center the screen.
priv.display1.DY = 50; // Magic values that center the screen.
priv.display1.MAGH = 3; // scaling factor = MAGH + 1 = 4 -> 640 px wide.
priv.display1.MAGV = 0;
priv.display1.DW = 640 * 4 - 1;
priv.display1.DH = 448 - 1;

Glorious 256×179 logo 😀

Implementation details

The rendering pipeline

Before we get into the page tracker, it’s useful to define a rendering pipeline where synchronization is implied between each stage.

  • Synchronize CPU copy of VRAM to GPU. This is mostly unused, but happens for save state load, or similar
  • Upload data to VRAM (or perform local-to-local copy)
  • Update CLUT cache from VRAM
  • Unswizzle VRAM into VkImages that can be sampled directly, and handle palettes as needed, sampling from CLUT cache
  • Perform rendering
  • Synchronize GPU copy of VRAM back to CPU. This will be useful for readbacks. Then the CPU should be able to unswizzle directly from a HOST_CACHED_BIT buffer as needed

This pipeline matches what we expect a game to do over and over:

  • Upload texture to VRAM
  • Upload palette to VRAM
  • Update CLUT cache
  • Draw with texture
    • Trigger unswizzle from VRAM into VkImage if needed
    • Begin building a “render pass”, a batch of primitives

When there are no backwards hazards here, we can happily keep batching and defer any synchronization. This is critical to get any performance out of this style of renderer.

Some common hazards here include:

Copy to VRAM which was already written by copy

This is often a false positive, but we cannot track per-byte. This becomes a simple copy barrier and we move on.

Copy to VRAM where a texture was sampled from, or CLUT cache read from

Since the GS has a tiny 4 MiB VRAM, it’s very common that textures are continuously streamed in, sampled from, and thrown away. When this is detected, we have to submit all VRAM copy work and all texture unswizzle work, and then begin a new batch. Primitive batches are not disrupted.

This means we’ll often see:

  • Copy xN
  • Barrier
  • Unswizzle xN
  • Barrier
  • Copy xN
  • Barrier
  • Unswizzle xN
  • Barrier
  • Rendering

Sample texture that was rendered to

Similar, but here we need to flush out everything. This basically breaks the render pass and we start another one. Too many of these is problematic for performance, obviously.

Copy to VRAM where rendering happened

Basically the same as sampling textures, this is a full flush.

Other hazards are ignored, since they are implicitly handled by our pipeline.
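The four hazards above boil down to three levels of flushing. A compact way to express that mapping is sketched below; the enum and function names are hypothetical and only summarize the prose, they are not taken from the paraLLEl-GS source.

```cpp
// Hypothetical classification of the hazards described above.
enum class Hazard
{
    CopyOverCopy,              // copy to VRAM already written by a copy
    CopyOverTextureOrClutRead, // copy to VRAM a texture or the CLUT cache read from
    SampleRenderedTexture,     // sample a texture that was rendered to
    CopyOverRendering          // copy to VRAM where rendering happened
};

enum class FlushLevel
{
    CopyBarrier,      // barrier between copies, keep batching
    CopyAndUnswizzle, // submit copies and unswizzles, primitive batch continues
    FullFlush         // end the render pass and submit everything
};

constexpr FlushLevel flush_for(Hazard hazard)
{
    switch (hazard)
    {
    case Hazard::CopyOverCopy:
        return FlushLevel::CopyBarrier;
    case Hazard::CopyOverTextureOrClutRead:
        return FlushLevel::CopyAndUnswizzle;
    case Hazard::SampleRenderedTexture:
    case Hazard::CopyOverRendering:
        return FlushLevel::FullFlush;
    }
    return FlushLevel::FullFlush; // be conservative for anything unexpected
}
```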

Page tracker

Arguably, the hardest part of GS emulation is dealing with hazards. VRAM is read and written to with reckless abandon and any potential read-after-write or write-after-write hazard needs to be dealt with. We cannot rely on any game doing this for us, since the PS2 GS just deals with sync implicitly, and TEXFLUSH is the only real command games will use (or forget to use).

Tracking per byte is ridiculous, so my solution is to first subdivide the 4 MiB VRAM into pages. A page is the unit for frame buffers and depth buffers, so it is the most meaningful place to start.

PageState

On page granularity, we track:

  • Pending frame buffer write?
  • Pending frame buffer read? (read-only depth)

Textures and VRAM copies have 256 byte alignment, and to avoid a ton of false positives, we need to track on a per-block basis. There are 32 blocks per page, so a u32 bit-mask is okay.

  • VRAM copy writes
  • VRAM copy reads
  • Pending read into CLUT cache or VkImage
  • Blocks which have been clobbered by any write. On the next texture cache invalidate, throw away images that overlap

As mentioned earlier, there are also cases where you can render to 24-bit color, while sampling from the upper 8 bits without hazard. We have to optimize for that case too, so there is also:

  • A write mask for framebuffers
  • A read mask for textures

In the example above, the FB write mask is 0xffffff and the texture cache mask is 0xff000000. No overlap, no invalidate 😀
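A reduced C++ sketch of the per-page state and the overlap test. The struct layout and names are hypothetical and much smaller than what a real tracker needs; it only shows the two ideas from this section, a 32-bit block mask per page and a per-pixel bit mask for the 24-bit color / upper 8-bit texture case.

```cpp
#include <cstdint>

constexpr uint32_t BLOCK_SIZE = 256;     // bytes
constexpr uint32_t BLOCKS_PER_PAGE = 32; // 8 KiB page

// Bit-mask of the 256-byte blocks touched by [offset, offset + size) in one page.
// Assumes size > 0 and that the range does not cross the page boundary.
inline uint32_t block_mask_in_page(uint32_t offset, uint32_t size)
{
    uint32_t first = offset / BLOCK_SIZE;
    uint32_t last = (offset + size - 1) / BLOCK_SIZE;
    uint32_t count = last - first + 1;
    uint32_t bits = count >= BLOCKS_PER_PAGE ? 0xffffffffu : ((1u << count) - 1u);
    return bits << first;
}

// Hypothetical, heavily reduced page state.
struct PageState
{
    bool pending_fb_write = false;
    uint32_t fb_write_bits = 0;     // which bits of each 32-bit pixel are written
    uint32_t copy_write_blocks = 0; // blocks with pending VRAM copy writes
};

// True if reading a texture from this page must wait for earlier writes.
inline bool texture_read_hazard(const PageState &page, uint32_t read_blocks, uint32_t read_bits)
{
    if (page.copy_write_blocks & read_blocks)
        return true;
    return page.pending_fb_write && (page.fb_write_bits & read_bits) != 0;
}
```

With an FB write mask of 0x00ffffff and a texture read mask of 0xff000000 the masks do not intersect, so the read proceeds without a flush.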

For host access, there are also timeline semaphore values per page. These values state which sync point to wait for if the host wants mapped read or mapped write access. Mapped write access may require more sync than mapped read if there are pending reads on that page.

Caching textures

Every page contains a list of VkImages which have been associated with it. When a page’s textures have been invalidated, the image is destroyed and must be unswizzled again from VRAM.

There is a one-to-many relationship between textures and pages. A texture may span more than one page, and it’s enough that only one page is clobbered before the texture is invalidated.

Overall, there are a lot of micro-details here, but the important thing to note is that conservative and simple tracking will not work on PS2 games. Tracking at a 256 byte block level and considering write/read masks is critical.

Special cases

There are various scenarios where we may get false positives due to how textures work. Since textures are POT sized, it’s fairly common for e.g. a 512×448 texture of a render target to be programmed as a 512×512 texture. The unused region should ideally be clamped out with REGION_CLAMP, but most games don’t. A render target might occupy those unused pages. As long as the game’s UV coordinates don’t extend into the unused red zone, there are no hazards, but this is very painful to track. We would have to analyze every single primitive to detect if it’s sampling into the red zone.

As a workaround, we ignore any potential hazard in that red zone, and just pray that a game isn’t somehow relying on ridiculous spooky-action-at-a-distance hazards to work in the game’s favor.

There are spicier special cases, especially with texture sampling feedback, but that will be for later.

Updating CLUT in a batched way

Since we want to batch texture uploads, we have to batch CLUT uploads too. To make this work, we have 1024 copies of CLUT, a ring buffer of snapshots.

One workgroup loops through the updates and writes them to an SSBO. I did a similar thing for N64 RDP’s TMEM update, where TMEM was instanced. Fortunately, CLUT update is far simpler than TMEM update.

shared uint tmp_clut[512];

// ...

// Copy from previous instance to allow a
// CLUT entry to be partially overwritten and used later
uint read_index = registers.read_index * CLUT_SIZE_16;
tmp_clut[gl_LocalInvocationIndex] =
    uint(clut16.data[read_index]);
tmp_clut[gl_LocalInvocationIndex + 256u] =
    uint(clut16.data[read_index + 256u]);
barrier();

for (uint i = 0; i < registers.clut_count; i++)
{
  // ...
  if (active_lane)
  {
    // update tmp_clut. If 256 color, all threads participate.
    // 16 color update is a partial update.
  }

  // Flush current CLUT state to SSBO.
  barrier();
  clut16.data[gl_LocalInvocationIndex + clut.instance * CLUT_SIZE_16] =
    uint16_t(tmp_clut[gl_LocalInvocationIndex]);
  clut16.data[gl_LocalInvocationIndex + clut.instance * CLUT_SIZE_16 + 256u] =
    uint16_t(tmp_clut[gl_LocalInvocationIndex + 256u]);
  barrier();
}

One potential optimization is that for 256 color / 32 bpp updates, we can parallelize the CLUT update, since nothing from previous iterations will be preserved, but the CLUT update time is tiny anyway.
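The same idea can be shown as a CPU-side C++ model, which may be easier to follow than the shader. The class below is a hypothetical illustration of a snapshot ring buffer: each update copies the previous snapshot, overwrites a range, and becomes the instance that later draws refer to. The instance count matches the 1024 mentioned above; everything else is an assumption.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t CLUT_ENTRIES_16 = 512; // 1 KiB viewed as 16-bit entries
constexpr uint32_t CLUT_INSTANCES = 1024; // ring buffer of snapshots

using ClutSnapshot = std::array<uint16_t, CLUT_ENTRIES_16>;

// Hypothetical CPU model of the instanced CLUT.
class ClutRing
{
public:
    ClutRing() : snapshots_(CLUT_INSTANCES) {} // all snapshots start zeroed

    // Applies a (possibly partial) update starting at entry `first` and
    // returns the index of the new snapshot.
    uint32_t update(uint32_t first, const std::vector<uint16_t> &entries)
    {
        uint32_t next = (current_ + 1) % CLUT_INSTANCES;
        // Start from the previous state so untouched entries remain valid.
        snapshots_[next] = snapshots_[current_];
        for (std::size_t i = 0; i < entries.size() && first + i < CLUT_ENTRIES_16; i++)
            snapshots_[next][first + i] = entries[i];
        current_ = next;
        return next;
    }

    const ClutSnapshot &snapshot(uint32_t index) const
    {
        return snapshots_[index % CLUT_INSTANCES];
    }

private:
    std::vector<ClutSnapshot> snapshots_;
    uint32_t current_ = 0;
};
```

Because every snapshot is complete, a texture unswizzle only needs to remember a single instance index, and earlier snapshots stay intact for work that was batched before the update.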

Unswizzling textures from VRAM

Since this is Vulkan, we can just allocate a new VkImage, suballocate it from VkDeviceMemory and blast it with a compute shader.

Using Vulkan’s specialization constants, we specialize the texture format, and all the swizzling logic becomes straightforward code.

REGION_REPEAT shenanigans are also resolved here, so that the ubershader doesn’t have to consider that case and do manual bilinear filtering.

Even for render targets, we roundtrip through the VRAM SSBO. There is not really a point in going to the length of trying to forward render targets into textures. Way too many bugs to squash and edge cases to think about.

Triangle setup and binning

Like paraLLEl-RDP, paraLLEl-GS is a tile-based renderer. Before binning can happen, we need triangle setup. As inputs, we provide attributes in three arrays.

Position
struct VertexPosition
{
  ivec2 pos;
  float z;     // TODO: Should be uint for 32-bit Z.
  int padding; // Free real-estate?
};
Per-vertex attributes
struct VertexAttribute
{
  vec2 st;
  float q;
  uint rgba; // unpackUnorm4x8
  float fog; // overkill, but would be padding anyway
  u16vec2 uv;
};
Per-primitive attributes
struct PrimitiveAttribute
{
  i16vec4 bb; // Scissor
  // Index into state UBO, as well as misc state bits.
  uint state;
  // Texture state which should be scalarized. Affects code paths.
  // Also holds the texture index (for bindless).
  uint tex;
  // Texture state like lod scaling factors, etc.
  // Does not affect code paths.
  uint tex2;
  uint alpha; // AFIX / AREF
  uint fbmsk;
  uint fogcol;
};

For rasterization, we have a straight forward barycentric-based rasterizer. It is heavily inspired by https://fgiesen.wordpress.com/2011/07/06/a-trip-through-the-graphics-pipeline-2011-part-6/, which in turn is based on A Parallel Algorithm for Polygon Rasterization (Pineda, 1988) and describes the “standard” way to write a rasterizer with parallel hardware. Of course, the PS2 GS is DDA, i.e. a scanline rasterizer, but in practice, this is just a question of nudging ULPs of precision, and since I’m not aware of a bit-exact description of the GS’s DDA, this is fine. paraLLEl-RDP implements the raw DDA form for example. It’s certainly possible if we have to.

As an extension to a straight-forward triangle rasterizer, I also need to support parallelograms. This is used to implement wide-lines and sprites. Especially wide-line is kinda questionable, but I’m not sure it’s possible to fully solve up-scaling + Bresenham in the general case. At least I haven’t run into a case where this really matters.

Evaluating coverage and barycentric I/J turns into something like this:

bool evaluate_coverage_single(PrimitiveSetup setup,
  bool parallelogram, 
  ivec2 parallelogram_offset,
  ivec2 coord, inout float i, inout float j)
{
  int a = idot3(setup.a, coord);
  int b = idot3(setup.b, coord);
  int c = idot3(setup.c, coord);

  precise float i_result = float(b) * setup.inv_area + setup.error_i;
  precise float j_result = float(c) * setup.inv_area + setup.error_j;
  i = i_result;
  j = j_result;

  if (parallelogram && a < 0)
  {
    b += a + parallelogram_offset.x;
    c += a + parallelogram_offset.y;
    a = 0;
  }

  return all(greaterThanEqual(ivec3(a, b, c), ivec3(0)));
}
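For readers who haven't seen the edge-function formulation before, here is a minimal CPU-side sketch in C++ of the same idea: three edge functions give coverage, and two of them divided by the triangle area give the barycentric weights. This is a textbook version under my own assumptions, with made-up names, plain 64-bit integers, no downsampled edge equations, no error_i / error_j terms, no parallelogram case and no top-left tie-break rule. It is not the paraLLEl-GS implementation.

```cpp
#include <cstdint>

struct Vec2i
{
    int32_t x, y;
};

struct Coverage
{
    bool inside;
    float i, j; // Barycentric weights of the second and third vertex.
};

// Twice the signed area of the triangle (a, b, p).
inline int64_t edge_function(Vec2i a, Vec2i b, Vec2i p)
{
    return int64_t(b.x - a.x) * int64_t(p.y - a.y) -
           int64_t(b.y - a.y) * int64_t(p.x - a.x);
}

// Hypothetical reference helper. Edges are treated as inclusive,
// so shared edges would be double-covered without a tie-break rule.
inline Coverage evaluate_coverage_reference(Vec2i v0, Vec2i v1, Vec2i v2, Vec2i p)
{
    int64_t area = edge_function(v0, v1, v2);
    if (area == 0)
        return { false, 0.0f, 0.0f }; // Degenerate triangle.

    int64_t w0 = edge_function(v1, v2, p);
    int64_t w1 = edge_function(v2, v0, p);
    int64_t w2 = edge_function(v0, v1, p);

    // Normalize winding so that "inside" always means non-negative weights.
    if (area < 0)
    {
        area = -area;
        w0 = -w0;
        w1 = -w1;
        w2 = -w2;
    }

    const float inv_area = 1.0f / float(area);
    return { w0 >= 0 && w1 >= 0 && w2 >= 0,
             float(w1) * inv_area, float(w2) * inv_area };
}
```

An attribute is then interpolated as a0 + i * (a1 - a0) + j * (a2 - a0), which is why only two weights need to be carried around.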

inv_area is computed in a custom fixed-point RCP, which is ~24.0 bit accurate. Using the standard GPU RCP would be bad since it’s just ~22.5 bit accurate and not consistent across implementations. There is no reason to skimp on reproducibility and accuracy, since we’re not doing work per-pixel.

error_i and error_j terms are caused by the downsampling of the edge equations and tie-break rules. As a side effect of the GS’s [-4k, +4k] pixel range, the range of the cross-product requires 33-bit in signed integers. By downsampling a bit, we can get 32-bit integer math to work just fine with 8 sub-pixel accuracy for super-sampling / multi-sampling. Theoretically, this means our upper up-sampling limit is 8×8, but that’s ridiculous anyway, so we’re good here.

The parallelogram offsets are very small numbers meant to nudge the tie-break rules in our favor as needed. The exact details of the implementation escape me. I wrote that code years ago. It’s not very hard to derive however.

Every primitive gets a struct of transformed attributes as well. This is only read if we actually end up shading a primitive, so it’s important to keep this separate to avoid polluting caches with too much garbage.

struct TransformedAttributes
{
  vec4 stqf0;
  vec4 stqf1;
  vec4 stqf2;
  uint rgba0;
  uint rgba1;
  uint rgba2;
  uint padding;
  vec4 st_bb;
};

Using I/J like this will lead to small inaccuracies when interpolating primitives which expect to land exactly on the top-left corner of a texel with NEAREST filtering. To combat this, a tiny epsilon offset is used when snapping texture coordinates. Very YOLO, but what can you do. As far as I know, hardware behavior is sub-texel floor, not sub-texel round.

precise vec2 uv_1 = uv * scale_1;

// Want a soft-floor here, not round behavior.
const float UV_EPSILON_PRE_SNAP = 1.0 / 16.0;
// We need to bias less than 1 / 512th texel, so that linear filter will RTE to correct subpixel.
// This is a 1 / 1024th pixel bias to counter-act any non-POT inv_scale_1 causing a round-down event.
const float UV_EPSILON_POST_SNAP = 16.0 / 1024.0;

if (sampler_clamp_s)
  uv_1.x = texture_clamp(uv_1.x, region_coords.xz, LOD_1);
if (sampler_clamp_t)
  uv_1.y = texture_clamp(uv_1.y, region_coords.yw, LOD_1);

// Avoid micro-precision issues with UV and flooring + nearest.
// Exact rounding on hardware is somewhat unclear.
// SotC requires exact rounding precision and is hit particularly bad.
// If the epsilon is too high, then FF X save screen is screwed over,
// so ... uh, ye.
// We likely need a more principled approach that is actually HW accurate in fixed point.
uv_1 = (floor(uv_1 * 16.0 + UV_EPSILON_PRE_SNAP) + UV_EPSILON_POST_SNAP) * inv_scale_1 * 0.0625;
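To see why the pre-snap epsilon matters, here is a tiny C++ sketch of just the soft-floor step. It leaves out the post-snap bias, the clamping and the inv_scale_1 normalization from the shader, and the function name is made up for illustration.

```cpp
#include <cmath>

// Hypothetical helper: snap a texel-space coordinate to 1/16th texel
// using a floor that is biased by pre_snap_epsilon (given in 1/16th texel units).
inline float snap_uv_soft_floor(float uv_texels, float pre_snap_epsilon)
{
    return std::floor(uv_texels * 16.0f + pre_snap_epsilon) * 0.0625f;
}
```

A coordinate that was meant to be exactly 3.0 but came out of interpolation as 2.9999 snaps to 2.9375 with a plain floor, so NEAREST picks texel 2. With the 1/16 bias it snaps to 3.0 and NEAREST picks texel 3, while a coordinate that sits comfortably inside a sub-texel step, such as 3.5, is left alone.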

Binning

This is mostly uninteresting. Every NxN pixel block gets an array of u16 primitive indices to shade. This makes the maximum number of primitives per render pass 64k, but that’s enough for PS2 games. Most games I’ve seen so far tend to be between 10k and 30k primitives for the “main” render pass. I haven’t tested the real juggernauts of primitive grunt yet, but even so, having to do a little bit of incremental rendering isn’t a big deal.

NxN is usually 32×32, but it can be dynamically changed depending on how heavy the geometry load is. For large resolutions and high primitive counts, the binning and memory cost is unacceptable if the tile size is just 16×16 for example. One subgroup is responsible for iterating through all primitives in a block.
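As a rough illustration of the data structure, here is a CPU-side C++ sketch of coarse binning. It is only a sketch under my own assumptions: it bins by bounding box alone, runs serially instead of one subgroup per block, and the names are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BoundingBox
{
    int x0, y0, x1, y1; // Inclusive pixel bounds.
};

// Hypothetical helper: returns one list of u16 primitive indices per tile,
// in submission order. Tiles are stored row by row.
std::vector<std::vector<uint16_t>> bin_primitives(
    const std::vector<BoundingBox> &prims, int width, int height, int tile_size)
{
    // u16 indices cap a render pass at 64k primitives.
    assert(prims.size() <= 65536 && tile_size > 0);

    const int tiles_x = (width + tile_size - 1) / tile_size;
    const int tiles_y = (height + tile_size - 1) / tile_size;
    std::vector<std::vector<uint16_t>> bins(std::size_t(tiles_x) * std::size_t(tiles_y));

    for (std::size_t i = 0; i < prims.size(); i++)
    {
        const BoundingBox &bb = prims[i];
        // Skip primitives that are entirely off-screen.
        if (bb.x1 < 0 || bb.y1 < 0 || bb.x0 >= width || bb.y0 >= height)
            continue;

        const int tx0 = std::max(bb.x0, 0) / tile_size;
        const int ty0 = std::max(bb.y0, 0) / tile_size;
        const int tx1 = std::min(bb.x1, width - 1) / tile_size;
        const int ty1 = std::min(bb.y1, height - 1) / tile_size;

        for (int ty = ty0; ty <= ty1; ty++)
            for (int tx = tx0; tx <= tx1; tx++)
                bins[std::size_t(ty) * tiles_x + tx].push_back(uint16_t(i));
    }

    return bins;
}
```

Halving the tile size quadruples the number of lists and roughly quadruples how often a large primitive gets appended, which is the memory and binning cost trade-off mentioned above.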

Since binning and triangle setup is state-less, triangle-setup and binning for back-to-back passes are batched up nicely to avoid lots of silly barriers.

The ubershader

A key difference between N64 and PS2 is fill-rate and per-pixel complexity. For N64, the ideal approach is to specialize the rasterizing shader, write out per-pixel color + depth + coverage + etc, then merge that data in a much simpler ubershader that only needs to consider depth and blend state rather than full texturing state and combiner state. This is very bandwidth intensive on the GPU, but the alternative is the slowest ubershader written by man. We’re saved by the fact that N64 fill-rate is abysmal. Check out this video by Kaze to see how horrible it is.

The GS is a quite different beast. Fill-rate is very high, and per-pixel complexity is fairly low, so a pure ubershader is viable. We can also rely on bindless this time around too, so texturing complexity becomes a fraction of what I had to deal with on N64.

Fine-grained binning

Every tile is 4×4, 4×8 and 8×8 for subgroup sizes 16, 32 and 64 respectively. For super-sampling it’s even smaller (it’s 4×4 / 4×8 / 8×8 in the higher resolution domain instead).

In the outer loop, we pull in up to SubgroupSize’s worth of primitives, and bin them in parallel.

for (int i = 0; i < tile.coarse_primitive_count;
     i += int(gl_SubgroupSize))
{
  int prim_index = i + int(gl_SubgroupInvocationID);
  bool is_last_iteration = i + int(gl_SubgroupSize) >= 
                           tile.coarse_primitive_count;

  // Bin primitives to tile.
  bool binned_to_tile = false;
  uint bin_primitive_index;
  if (prim_index < tile.coarse_primitive_count)
  {
    bin_primitive_index = 
      uint(coarse_primitive_list.data[
           tile.coarse_primitive_list_offset + prim_index]);
    binned_to_tile = primitive_intersects_tile(bin_primitive_index);
  }

  // Iterate per binned primitive, do per pixel work now.
  // Scalar loop.
  uvec4 work_ballot = subgroupBallot(binned_to_tile);

In the inner loop, we can do a scalarized loop which checks coverage per-pixel, one primitive at a time.

// Scalar data
uint bit = subgroupBallotFindLSB(work_ballot);

if (gl_SubgroupSize == 64)
{
  if (bit >= 32)
    work_ballot.y &= work_ballot.y - 1;
  else
    work_ballot.x &= work_ballot.x - 1;
}
else
{
  work_ballot.x &= work_ballot.x - 1;
}

shade_primitive_index = subgroupShuffle(bin_primitive_index, bit);
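The ballot walk above is the classic "find lowest set bit, then clear it" loop. Here is a small C++ sketch of the same pattern on a 64-bit mask, purely as an illustration. The function name is made up, and the shader works on the uvec4 ballot halves instead of a single 64-bit integer.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the lane indices set in the ballot,
// lowest first, which is the order the scalar loop visits primitives in.
std::vector<unsigned> lanes_from_ballot(uint64_t ballot)
{
    std::vector<unsigned> lanes;
    while (ballot != 0)
    {
        // Portable stand-in for subgroupBallotFindLSB().
        unsigned bit = 0;
        while (((ballot >> bit) & 1u) == 0)
            bit++;

        lanes.push_back(bit);

        // x & (x - 1) clears the lowest set bit.
        ballot &= ballot - 1;
    }
    return lanes;
}
```

Visiting lanes from the lowest bit upwards keeps primitives in submission order, which matters since blending and Z-writes are order dependent.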

Early Z

We can take advantage of early-Z testing of course, but we have to be careful if there are rasterized pixels we haven’t resolved yet, and there are Z-writes in flight. In this case we have to defer to late Z to perform the test.

// We might have to remove opaque flag.
bool pending_z_write_can_affect_result =
  (pixel.request.z_test || !pixel.request.z_write) &&
  pending_shade_request.z_write;

if (pending_z_write_can_affect_result)
{
  // Demote the pixel to late-Z,
  // it's no longer opaque and we cannot discard earlier pixels.
  // We need to somehow observe the previous results.
  pixel.opaque = false;
}

Deferred on-tile shading

Since we’re an uber-shader, all pixels are “on-chip”, i.e. in registers, so we can take advantage of culling pixels that won’t be visible anyway. The basic idea here is that after rasterization, if a pixel is considered opaque, it will simply replace the shading request that exists for that framebuffer coordinate. It won’t be visible at all anyway.

Lazy pixel shading

We only need to perform shading when we really have to, i.e., we’re shading a pixel that depends on the previous pixel’s results. This can happen for e.g. alpha test (if test fails, we preserve existing data), color write masks, or of course, alpha blending.

If our pixel remains opaque, we can just kill the pending pixel shade request. Very nice indeed. The gain here wasn’t as amazing as I had hoped since PS2 games love blending, but it helps culling out a lot of shading work.

if (pixel.request.coverage > 0)
{
  need_flush = !pixel.opaque && pending_shade_request.coverage > 0;

  // If there is no hazard, we can overwrite the pending pixel.
  // If not, defer the update until we run a loop iteration.
  if (!need_flush)
  {
    set_pending_shade_request(pixel.request, shade_primitive_index);
    pixel.request.coverage = 0;
    pixel.request.z_write = false;
  }
}
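To get a feel for how much work this saves, here is a toy C++ model of a single pixel. It only tracks whether each incoming request is opaque, ignores Z-testing and the fact that flushes are really decided subgroup-wide, and uses made-up names.

```cpp
#include <vector>

struct PixelRequest
{
    bool opaque; // True if the request does not depend on the previous result.
};

// Hypothetical helper: counts how many requests actually have to be shaded
// for one framebuffer coordinate when shading is deferred.
int count_shade_invocations(const std::vector<PixelRequest> &requests)
{
    int shaded = 0;
    bool pending = false;

    for (const PixelRequest &request : requests)
    {
        // A dependent request forces the pending one to be resolved first.
        if (pending && !request.opaque)
            shaded++;

        // Either we replaced the pending request (opaque case),
        // or we flushed it and queued the new one.
        pending = true;
    }

    // Whatever is still pending at the end of the tile must be shaded.
    if (pending)
        shaded++;

    return shaded;
}
```

Three opaque layers followed by one blended layer cost two shade invocations instead of four, while a stack of blended layers gains nothing, which matches the observation that blend-heavy PS2 games limit the win.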

If we have flushes that need to happen, we do so if one pixel needs it. It’s just as fast to resolve all pixels anyway.

// Scalar branch
if (subgroupAny(need_flush))
{
  shade_resolve();
  if (has_work && pixel.request.coverage > 0)
    set_pending_shade_request(pixel.request, shade_primitive_index);
}

The resolve is a straight forward waterfall loop that stays in uniform control flow to be well defined on devices without maximal reconvergence support.

while (subgroupAny(has_work))
{
  if (has_work)
  {
    uint state_index =
      subgroupBroadcastFirst(pending_shade_request.state);
    uint tex = subgroupBroadcastFirst(prim_tex);
    if (state_index == pending_shade_request.state && prim_tex == tex)
    {
      has_work = false;
      shade_resolve(pending_primitive_index, state_index, tex);
    }
  }
}

This scalarization ensures that all branches on things like alpha test mode, blend modes, etc, are purely scalar, and GPUs like that. Scalarizing on the texture index is technically not that critical, but it means we end up hitting the same branches for filtering modes, UBOs for scaling factors are loaded uniformly, etc.
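If the waterfall loop is new to you, here is a serial C++ simulation of what the subgroup does. Each entry in the input is one lane's state key, and each pass picks the first lane that still has work, then retires every lane with the same key. The names are made up and a single key stands in for the state index and texture index pair.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: returns, per loop iteration, the lanes that were
// resolved together with a uniform state key.
std::vector<std::vector<int>> waterfall_batches(const std::vector<uint32_t> &lane_state)
{
    const int lanes = int(lane_state.size());
    std::vector<bool> has_work(lane_state.size(), true);
    std::vector<std::vector<int>> batches;

    for (;;)
    {
        // Stand-in for subgroupBroadcastFirst(): first lane that is still active.
        int first = -1;
        for (int lane = 0; lane < lanes; lane++)
        {
            if (has_work[lane])
            {
                first = lane;
                break;
            }
        }

        // Stand-in for subgroupAny(has_work) returning false.
        if (first < 0)
            break;

        const uint32_t uniform_state = lane_state[first];
        std::vector<int> batch;
        for (int lane = 0; lane < lanes; lane++)
        {
            if (has_work[lane] && lane_state[lane] == uniform_state)
            {
                has_work[lane] = false;
                batch.push_back(lane);
            }
        }
        batches.push_back(batch);
    }

    return batches;
}
```

The loop runs once per unique key rather than once per lane, so a tile where every pixel shares the same state resolves in a single iteration.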

When everything is done, the resulting framebuffer color and depth is written out to SSBO. GPU bandwidth is kept to a minimum, just like a normal TBDR renderer.

Super-sampling

Just implementing single sampled rendering isn’t enough for this renderer to be really useful. The software renderer is certainly quite fast, but not fast enough to keep up with intense super-sampling. We can fix that now.

For e.g. 8x SSAA, we keep 10 versions of VRAM on the GPU.

  • 1 copy represents the single-sampled VRAM. It holds the super-sampled (resolved) result.
  • 1 copy represents the reference value for single-sampled VRAM. This allows us to track when we should discard the super-samples and splat the single sample to all. This can happen if someone copies to VRAM over a render target for whatever reason.
  • 8 copies which each represent the super-samples. Technically, we can reconstruct a higher resolution image from these samples if we really want to, but only the CRTC could easily do that.

When rendering super-sampled, we load the single-sampled VRAM and reference. If they match, we load the super-sampled version. This is important for cases where we’re doing incremental rendering.

On tile completion we use clustered subgroup ops to do multi-sample resolve, then write out the super-samples, and the two single-sampled copies.

uvec4 ballot_color = subgroupBallot(fb_color_dirty);
uvec4 ballot_depth = subgroupBallot(fb_depth_dirty);

// No need to mask, we only care about valid ballot for the
// first sample we write-back.
if (NUM_SAMPLES >= 16)
{
  ballot_color |= ballot_color >> 8u;
  ballot_depth |= ballot_depth >> 8u;
}

if (NUM_SAMPLES >= 8)
{
  ballot_color |= ballot_color >> 4u;
  ballot_depth |= ballot_depth >> 4u;
}

if (NUM_SAMPLES >= 4)
{
  ballot_color |= ballot_color >> 2u;
  ballot_depth |= ballot_depth >> 2u;
}

ballot_color |= ballot_color >> 1u;
ballot_depth |= ballot_depth >> 1u;

// GLSL does not accept cluster reduction as spec constant.
if (NUM_SAMPLES == 16)
  fb_color = packUnorm4x8(subgroupClusteredAdd(
    unpackUnorm4x8(fb_color), 16) / 16.0);
else if (NUM_SAMPLES == 8)
  fb_color = packUnorm4x8(subgroupClusteredAdd(
    unpackUnorm4x8(fb_color), 8) / 8.0);
else if (NUM_SAMPLES == 4)
  fb_color = packUnorm4x8(subgroupClusteredAdd(
    unpackUnorm4x8(fb_color), 4) / 4.0);
else
  fb_color = packUnorm4x8(subgroupClusteredAdd(
    unpackUnorm4x8(fb_color), 2) / 2.0);

fb_color_dirty = subgroupInverseBallot(ballot_color);
fb_depth_dirty = subgroupInverseBallot(ballot_depth);
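The shift-and-OR cascade on the ballots is a compact way of asking "is any sample in this cluster dirty?". Here is a C++ sketch on a 32-bit mask that mimics it. It assumes each pixel's samples occupy NUM_SAMPLES adjacent lanes, which is how clustered subgroup operations group invocations, and the function names are made up.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: after folding, the bit of the first lane in every
// cluster is set if any lane in that cluster had its bit set.
// Bits of the other lanes contain garbage and must be ignored.
inline uint32_t fold_ballot(uint32_t ballot, unsigned num_samples)
{
    assert(num_samples == 2 || num_samples == 4 ||
           num_samples == 8 || num_samples == 16);

    if (num_samples >= 16)
        ballot |= ballot >> 8;
    if (num_samples >= 8)
        ballot |= ballot >> 4;
    if (num_samples >= 4)
        ballot |= ballot >> 2;
    ballot |= ballot >> 1;
    return ballot;
}

// Reads the folded bit for the first lane of the given cluster.
inline bool cluster_dirty(uint32_t ballot, unsigned num_samples, unsigned cluster)
{
    assert((cluster + 1) * num_samples <= 32);
    return ((fold_ballot(ballot, num_samples) >> (cluster * num_samples)) & 1u) != 0;
}
```

Only the first lane of each cluster writes back the resolved single-sampled value, which is why the stray bits on the other lanes are harmless.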

The main advantage of super-sampling over straight up-scaling is that up-scaling will still have jagged edges, and super-sampling retains a coherent visual look where 3D elements have the same resolution as UI elements. One of my pet peeves is when UI elements have a significantly different resolution from 3D objects and textures. HD texture packs can of course alleviate that, but that’s a very different beast.

Super-sampling also lends itself very well to CRT post-processing shading, which is also a nice bonus.

Dealing with super-sampling artifacts

It’s a fact of life that super-sampling always introduces horrible artifacts if not handled with utmost care. Mitigating this is arguably easier with software renderers over traditional graphics APIs, since we’re not limited by the fixed function interpolators. These tricks won’t make it perfect by any means, but it greatly mitigates jank in my experience, and I already fixed many upscaling bugs that GSdx Vulkan backend does not solve as we shall see later.

Sprite primitives should always render at single-rate

Sprites are always UI elements or similar, and games do not expect us to up-scale them. Doing so either results in artifacts where we sample outside the intended rect, or we risk overblurring the image if bilinear filtering is used.

The trick here is just to force-snap the pixel coordinate we use when rasterizing and interpolating. This is very inefficient of course, but UI shouldn’t take up the entire screen. And if it does (like in a menu), the GPU load is tiny anyway.

const uint SNAP_RASTER_BIT = (1u << STATE_BIT_SNAP_RASTER);
const uint SNAP_ATTR_BIT = (1u << STATE_BIT_SNAP_ATTRIBUTE);

if (SUPER_SAMPLE && (prim_state & SNAP_RASTER_BIT) != 0)
  fb_pixel = tile.fb_pixel_single_rate;

res.request.coverage = evaluate_coverage(
  prim, fb_pixel, i, j,
  res.request.multisample, SAMPLING_RATE_DIM_LOG2);

Flat primitives should interpolate at single-pixel coordinate

Going further, we can demote SSAA interpolation to MSAA center interpolation dynamically. Many UI elements are unfortunately rendered with normal triangles, so we have to be a bit more careful. This snap only affects attribute interpolation, not Z of course.

res.request.st_bb = false;
if (SUPER_SAMPLE &&
    (prim_state & (SNAP_RASTER_BIT | SNAP_ATTR_BIT)) == SNAP_ATTR_BIT)
{
  vec2 snap_ij = evaluate_barycentric_ij(
    prim.b, prim.c, prim.inv_area,
    prim.error_i, prim.error_j, tile.fb_pixel_single_rate,
    SAMPLING_RATE_DIM_LOG2);

  i = snap_ij.x;
  j = snap_ij.y;
  res.request.st_bb = true;
}

Here, we snap interpolation to the top-left pixel. This fixes any artifacts for primitives which align their rendering to a pixel center, but some games are mis-aligned, so this snapping can cause texture coordinates to go outside the expected area. To clean this up, we compute a bounding box of final texture coordinates. Adding bounding boxes can technically cause notorious block-edge artifacts, but that was mostly a thing on PS1 since emulators like to convert nearest sampling to bilinear.

The heuristic for this is fairly simple. If perspective is used, if all vertices in a triangle have exact same Q, we assume it’s a flat UI primitive. The primitive’s Z coordinates must also match. This is done during triangle setup on the GPU. There can of course be false positives here, but it should be rare. In my experience this hack works well enough in the games I tried.
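Spelled out in code, the heuristic could look something like the following C++ sketch. The struct and function names are made up, and I'm assuming that the Q comparison is simply skipped when perspective is not in use. The real check runs in the triangle setup shader.

```cpp
// Hypothetical per-vertex inputs for the heuristic.
struct SetupVertex
{
    float q;
    float z;
};

// Hypothetical helper: true if the triangle looks like a flat UI primitive,
// i.e. exact same Q on all vertices (when perspective is used) and exact same Z.
inline bool looks_like_flat_ui_primitive(const SetupVertex (&v)[3], bool perspective)
{
    // Assumption: without perspective, Q plays no role in the decision.
    const bool same_q = !perspective || (v[0].q == v[1].q && v[1].q == v[2].q);
    const bool same_z = v[0].z == v[1].z && v[1].z == v[2].z;
    return same_q && same_z;
}
```

Exact comparisons are intentional here. A wall that happens to face the camera head-on will also pass, which is the kind of false positive mentioned above.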

Results

Here’s a good example of up-sampling going awry in PCSX2. This is with Vulkan backend:

Notice the bloom on the glass being mis-aligned and a subtle (?) rectangular pattern being overlaid over the image. This is caused by a post-processing pass rendering in a page-like pattern, presumably to optimize for GS caching behavior.

With 8x SSAA in paraLLEl-GS it looks like this instead. There is FSR1 post-upscale in effect here which changes the look a bit, but the usual trappings of bad upscale cannot be observed here. This is another reason to do super-sampling; texture mis-alignment has a tendency to fix itself.

Also, if you’re staring at the perf numbers, this is RX 7600 in a low power state :’)

Typical UI issues can be seen in games as well. Here’s native resolution:

and 4x upscale, which … does not look acceptable.

This UI is tricky to render in upscaled mode, since it uses triangles, but the MSAA snap trick above works well and avoids all artifacts. With straight upscale, this is hard to achieve in normal graphics APIs since you’d need interpolateAtOffset beyond 0.5 pixels, which isn’t supported. Perhaps you could do custom interpolation with derivatives or something like that, but either way, this glitch can be avoided. The core message is basically to never upscale UI beyond plain nearest neighbor integer scale. It just looks bad.

There are cases where PCSX2 asks for high blending accuracy. One example is MGS2, and I found a spot where GPU perf is murdered. My desktop GPU cannot keep 60 FPS here at 4x upscale. PCSX2 asks you to turn up blend-accuracy for this game, but …

What happens here is we hit the programmable blending path with a barrier between every primitive. Ouch! This wouldn’t be bad for the tiler mobile GPUs, but for a desktop GPU, it is where perf goes to die. The shader in question does subpassLoad and does programmable blending as expected. Barrier, tiny triangle, barrier, tiny triangle, hnnnnnnng.

paraLLEl-GS on the other hand always runs with 100% blend accuracy (assuming no bugs of course). Here’s 16xSSAA (equivalent to 4x upscale). This is just 25 W and 17% GPU utilization on RX 7600. Not bad.

Other difficult cases include texture sampling feedback. One particular case I found was in Valkyrie Profile 2.

This game has a case where it’s sampling its own pixel’s alpha as a palette index. Quirky as all hell, and similar to MGS2 there’s a barrier between every pixel.

In paraLLEl-GS, this case is detected, and we emit a magical texture index, which resolves to just looking at the in-register framebuffer color instead. Programmable blending go brr. These cases have to be checked per primitive, which is quite rough on CPU time, but it is what it is. If we don’t hit the good path, GPU performance completely tanks.

The trick here is to analyze the effective UV coordinates, and see if UV == framebuffer position. If we fall off this path, we have to go through texture uploads, which is bad.

ivec2 uv0_delta = uv0 - pos[0].pos;
ivec2 uv1_delta = uv1 - pos[1].pos;
ivec2 min_delta = min(uv0_delta, uv1_delta);
ivec2 max_delta = max(uv0_delta, uv1_delta);

if (!quad)
{
  ivec2 uv2_delta = uv2 - pos[2].pos;
  min_delta = min(min_delta, uv2_delta);
  max_delta = max(max_delta, uv2_delta);
}

int min_delta2 = min(min_delta.x, min_delta.y);
int max_delta2 = max(max_delta.x, max_delta.y);

// The UV offset must be in range of [0, 2^SUBPIXEL_BITS - 1].
// This guarantees snapping with NEAREST.
// 8 is ideal. That means pixel centers during interpolation
// will land exactly in the center of the texel.
// In theory we could allow LINEAR if uv delta was
// exactly 8 for all vertices.
if (min_delta2 < 0 || max_delta2 >= (1 << SUBPIXEL_BITS))
  return ColorFeedbackMode::Sliced;

// Perf go brrrrrrr.
return ColorFeedbackMode::Pixel;

if (feedback_mode == ColorFeedbackMode::Pixel)
{
  mark_render_pass_has_texture_feedback(ctx.tex0.desc);
  // Special index indicating on-tile feedback.
  // We could add a different sentinel for depth feedback.
  // 1024 CLUT instances and 32 sub-banks. Fits in 15 bits.
  // Use bit 15 MSB to mark feedback texture.
  return (1u << (TEX_TEXTURE_INDEX_BITS - 1u)) |
         (render_pass.clut_instance * 32 + uint32_t(ctx.tex0.desc.CSA));
}

It’s very easily full-speed on PCSX2 here, despite the copious number of barriers, but paraLLEl-GS is reasonably close perf-wise, actually. At 8x SSAA.

Overall, we get away with 18 render pass barriers instead of 500+ which was the case without this optimization. You may also notice the interlacing artifacts on the swirlies. Silly game has a progressive scan output, but downsamples it on its own to a field before hitting CRTC, hnnnnng 🙁 Redirecting framebuffer locations in CRTC might work as a per-game hack, but either way, I still need to consider a better de-interlacer. Some games actually render explicitly in fields (640×224), which is very annoying.

This scene in the MGS2 intro also exposes some funny edge cases with sampling.

To get the camo effect, it’s sampling its own framebuffer as a texture, with overlapping coordinates, but not pixel aligned, so this raises some serious questions about caching behavior. PCSX2 doesn’t seem to add any barriers here, and I kinda had to do the same thing. It looks fine to me compared to the software renderer at least.

if (feedback_mode == ColorFeedbackMode::Sliced)
{
  // If game explicitly clamps the rect to a small region,
  // it's likely doing well-defined feedbacks.
  // E.g. Tales of Abyss main menu ping-pong blurs.
  // This code is quite flawed,
  // and I'm not sure what the correct solution is yet.
  if (desc.clamp.desc.WMS == CLAMPBits::REGION_CLAMP &&
      desc.clamp.desc.WMT == CLAMPBits::REGION_CLAMP)
  {
    ivec4 clamped_uv_bb(
      int(desc.clamp.desc.MINU),
      int(desc.clamp.desc.MINV),
      int(desc.clamp.desc.MAXU),
      int(desc.clamp.desc.MAXV));

    ivec4 hazard_bb(
      std::max(clamped_uv_bb.x, bb.x),
      std::max(clamped_uv_bb.y, bb.y),
      std::min(clamped_uv_bb.z, bb.z),
      std::min(clamped_uv_bb.w, bb.w));

    cache_texture = hazard_bb.x > hazard_bb.z ||
                    hazard_bb.y > hazard_bb.w;
  }
  else
  {
    // Questionable,
    // but it seems almost impossible to do this correctly and fast.
    // Need to emulate the PS2 texture cache exactly,
    // which is just insane.
    // This should be fine.
    cache_texture = false;
  }
}

If we’re in a mode where the texture points directly to the frame buffer we should relax the hazard tracking a bit to avoid 2000+ barriers. This is clearly spooky since Tales of Abyss’s bloom effect as shown earlier depends on this to be well behaved, but in that case, at least it uses REGION_CLAMP to explicitly mark the ping-pong behavior. I’m not sure what the correct solution is here.

The only plausible solution to true bit-accuracy with real hardware is to emulate the caches directly, one pixel at a time. You can kiss performance good bye in that case.

One of the worst stress tests I’ve found so far has to be Shadow of the Colossus. Just in the intro, we can make the GPU kneel down to 24 FPS with maximum blend accuracy on PCSX2, at just 2x upscale! Even with normal blending accuracy, it is extremely heavy during the intro cinematic.

At 8x SSAA, perf is still looking pretty good for paraLLEl-GS, but it’s clearly sweating now.

We’re actually still CPU bound on the geometry processing. Optimizing the CPU code hasn’t been a huge priority yet. There’s unfortunately a lot of code that has to run per-primitive, where hazards can happen around every corner that has to be dealt with somehow. I do some obvious optimizations, but it’s obviously not as well-oiled as PCSX2 in that regard.

Deck?

It seems fast enough to comfortably do 4x SSAA. Maybe not in SotC, but … hey. 😀

What now?

For now, the only real way to test this is through GS dumps. There’s a hack-patch for PCSX2 that lets you dump out a raw GS trace, which can be replayed. This works via mkfifo as a crude hack to test in real-time, but some kind of integration into an emulator needs to happen at some point if this is to turn into something that’s useful for end users.

There’s guaranteed to be a million bugs lurking since the PS2 library is ridiculously large and there’s only so much I can be arsed to test myself. At least, paraLLEl-GS has now become my preferred way to play PS2 games, so I can say mission complete.

A potential use case for this is due to its standalone library nature, it may be useful as a very old-school rendering API for the old greybeards around that still yearn for the day of PS2 programming for whatever reason :p
