Graphics Card VRAM Latency Performance Analysis

We’ve gotten used to measuring CPU cache and RAM latencies, so why not do the same for GPUs? Like CPUs, GPUs have evolved multi-level cache hierarchies to bridge the growing gap between compute performance and VRAM speed, and just like on CPUs, we can use a pointer-chasing benchmark (written in OpenCL) to measure a graphics card’s cache and VRAM latency.

VRAM Latency on Ampere and RDNA 2

The caches in AMD’s RDNA 2 graphics cards are very fast, and there is a lot of cache to go around. Compared to Ampere, latency is lower at every level, and the Infinity Cache only adds around 20 ns over an L2 hit while still having lower latency than Ampere’s L2. Surprisingly, RDNA 2’s VRAM latency is roughly the same as NVIDIA’s Ampere, even though RDNA 2 checks two additional levels of cache on the way to memory.

Ampere and RDNA 2 VRAM latency

NVIDIA, in contrast, sticks with a more conventional memory subsystem: only two levels of cache, and high latency at L2. Going from the L1 private to each Ampere SM out to L2 takes more than 100 ns; on RDNA 2, L2 sits around 66 ns away from L0, even with an L1 cache in between. Traversing the large GA102 die appears to cost Ampere GPUs many cycles, penalizing their latency.

This may explain the excellent performance AMD’s RDNA 2 cards show at lower resolutions. RDNA 2’s low-latency L2 and L3 caches give it an advantage with smaller workloads, where occupancy is too low to hide latency. Ampere chips, by comparison, need more parallelism in flight to stand out.

If we compare CPU and GPU, we see a massacre

CPUs are designed to run serial workloads as fast as possible, while GPUs are designed to run massive workloads in parallel. Since the test is written in OpenCL, we can run it unmodified on a CPU to see how it compares to a GPU.

GPU vs. CPU memory latency

The chart above adds a Haswell processor, whose cache and DRAM latencies are so low that a logarithmic scale had to be used; otherwise it would look like a flat line well below the RDNA 2 figures. The Core i7-4770, paired with 1600 MHz DDR3 CL9, can complete a memory round trip in just 63 ns, while a Radeon RX 6900 XT with GDDR6 takes 226 ns for the same operation, over 3.5 times longer.

From another angle, though, the latency of the GDDR6 VRAM itself is not that bad. A CPU or GPU has to check its caches before going out to memory, so we can get a more ‘raw’ view of memory latency by subtracting the time of a last-level cache hit from the time of a full trip to memory. That delta between a last-level cache hit and a miss is 53.42 ns on Haswell and 123.2 ns on RDNA 2.

What about previous-generation GPUs?

The Maxwell and Pascal architectures are very similar, and a GTX 980 Ti likely suffers from its larger die and lower clock speeds, so data takes longer to physically travel across the chip. NVIDIA does not expose the L1 texture cache to OpenCL on either architecture, so unfortunately the first thing we see is the L2 cache.


Turing starts to look more like Ampere: relatively low L1 latency, then L2, and finally memory. L2 latency appears roughly in line with Pascal’s, while raw memory latency looks similar up to 32 MB and climbs higher beyond that.

As for AMD, we have no explanation for why latency is so low below 32 KB. AMD documents an 8 KB L1 data cache on Terascale, but the results don’t match; the test may be hitting some kind of vertex reuse cache (since memory loads are compiled into vertex fetch clauses).

AMD GPU Latency

GCN and RDNA 2 look as expected, and it’s quite interesting to see AMD’s latency drop at every level from one generation to the next.