First of all, keep in mind that this information is only rumor. While we wait for NVIDIA to confirm details, we will comment on what emerges about the next NVIDIA architecture, which we do not expect before 2022.
What can we expect from Lovelace?
One of the things that surprised many of us in GeForce Ampere (RTX 3000) was that NVIDIA doubled the number of FP32 ALUs inside each SM. The trick builds on a change that NVIDIA had already made to the SM in Turing, and it is quite simple to explain.
In all GeForces up to Pascal, the CUDA “cores” had two types of ALU inside them, one integer and one floating point, but the two were switched and shared a single data path. Starting with Volta and Turing, NVIDIA assigned two separate paths, one for integers and the other for floating point, each feeding 16 ALUs in every one of the SM's four partitions. In Ampere, the data path that communicated with the 16 integer ALUs was made to serve another 16 floating-point ALUs as well. Added to the existing ones, these give a maximum of 128 FP32 ALUs per SM, doubling the peak floating-point throughput.
We do not know whether NVIDIA will make a change of the same kind in Lovelace. If the rumors are true, though, NVIDIA will raise performance not by increasing the number of FP32 ALUs per SM or anything similar, but by increasing the total number of SMs.
A huge number of CUDA cores in Lovelace
We have already discussed the rumor that Lovelace has a configuration of 12 GPCs with 6 TPCs per GPC, or in other words, 12 GPCs with 12 SMs each. Now we have to talk about what this could mean if NVIDIA makes no architectural changes, setting aside other parts of the equation that could make these rumors more or less credible.
12 GPC * 12 SM per GPC = 144 SM.
144 SM * 128 FP32 ALUs (CUDA cores) per SM = 18432 CUDA “cores” / FP32 ALUs.
In comparison, the NVIDIA RTX 3090 has 82 SMs with the same configuration, or 10496 CUDA “cores”, so this would be a jump of about 76%. Together with a possible improvement in clock speed, it could mean more than 60 TFLOPS of compute performance. All of this relies exclusively on the rumored figures and ignores other factors.
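The arithmetic above can be reproduced with a short script. This is only a sketch built on the rumored figures: the 1.7 GHz clock is an assumption chosen for illustration, not a leaked or confirmed value, and the TFLOPS figure uses the usual convention of counting one fused multiply-add per ALU per clock as two operations.

```python
# Rumored Lovelace configuration taken from the text above.
GPCS = 12
SMS_PER_GPC = 12
FP32_ALUS_PER_SM = 128   # same per-SM layout as Ampere, per the rumor
RTX_3090_SMS = 82

# Assumption for illustration only: no clock speed has been rumored here.
ASSUMED_CLOCK_GHZ = 1.7


def cuda_cores(sm_count, alus_per_sm=FP32_ALUS_PER_SM):
    """Total FP32 ALUs ("CUDA cores") for a given number of SMs."""
    return sm_count * alus_per_sm


def peak_fp32_tflops(cores, clock_ghz):
    """Theoretical peak, counting one FMA per ALU per clock as 2 FLOPs."""
    return cores * 2 * clock_ghz / 1000.0


lovelace_sms = GPCS * SMS_PER_GPC
lovelace_cores = cuda_cores(lovelace_sms)
ampere_cores = cuda_cores(RTX_3090_SMS)
increase_pct = (lovelace_cores / ampere_cores - 1) * 100
lovelace_tflops = peak_fp32_tflops(lovelace_cores, ASSUMED_CLOCK_GHZ)

print(f"Lovelace (rumored): {lovelace_sms} SMs, {lovelace_cores} cores")
print(f"RTX 3090:           {RTX_3090_SMS} SMs, {ampere_cores} cores")
print(f"Increase in cores:  {increase_pct:.1f}%")
print(f"Peak FP32 at {ASSUMED_CLOCK_GHZ} GHz: {lovelace_tflops:.1f} TFLOPS")
```

Changing `ASSUMED_CLOCK_GHZ` shows how sensitive the TFLOPS estimate is to the final clock, which is one of the factors the rumor leaves open.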
Lovelace's specifications need deeper changes
Normally NVIDIA has matched the number of GPCs to the width of the memory bus: typically 6 GPCs for a 384-bit bus, 5 GPCs for a 320-bit bus, 4 GPCs for a 256-bit bus, and so on. Because the L2 cache partitions are tied to the memory interface, this also fixes the ratio between L2 partitions and GPCs. If the rumors about NVIDIA Lovelace turn out to be true, that equivalence between L2 cache and GPCs would be broken: there would be more GPCs per L2 cache partition, and the required bandwidth would be higher.
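The rule of thumb in the examples above works out to one GPC per 64 bits of bus width. The sketch below encodes that inferred ratio (the constant `BITS_PER_GPC` and both helper functions are illustrative names, not anything NVIDIA publishes) and shows what bus width 12 GPCs would need if the old proportion still held.

```python
# Inferred from the 384/320/256-bit examples in the text; not an official figure.
BITS_PER_GPC = 64


def gpcs_for_bus(bus_width_bits):
    """GPC count implied by the historical one-GPC-per-64-bits pattern."""
    return bus_width_bits // BITS_PER_GPC


def bus_for_gpcs(gpc_count):
    """Bus width that the same pattern would require for a given GPC count."""
    return gpc_count * BITS_PER_GPC


for bus in (384, 320, 256):
    print(f"{bus}-bit bus -> {gpcs_for_bus(bus)} GPCs")

rumored_gpcs = 12
print(f"{rumored_gpcs} GPCs -> {bus_for_gpcs(rumored_gpcs)}-bit bus under the old ratio")
```

A bus that wide is hard to picture on a consumer card, which is why the rumored GPC count implies that the ratio itself has to change.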
A quick look back shows that in the jump from NVIDIA Maxwell to NVIDIA Pascal the number of GPCs was maintained, but the number of TPCs per GPC was increased by one. The same happened from Pascal to Turing. With Ampere, however, NVIDIA did not widen each GPC and instead went from 6 to 7 GPCs.
If Lovelace's rumored specifications are true, then the change that allows so many GPCs to be placed on the chip would have to be in the GPU's L2 cache, which always connects the different GPCs with one another. To scale the number of GPCs there must be a way to scale the L2 cache, but for now we do not know what it is and can only speculate.
One possibility is that, just as a PAM-4 interface has been used to communicate with external memory in the case of GDDR6X, NVIDIA has decided to do the same with the L2 cache, increasing its bandwidth and allowing it to feed the 12 GPCs that would surround it.
Of course, this still does not solve the problem that so many units also require a great deal of VRAM bandwidth, which makes us think that NVIDIA has a few tricks up its sleeve and that it will be a long time before we learn what they are.