Understanding GPU's Texture Mapping Unit and Texture Fill Rate


Started by Pagur April 24, 2024 02:12 PM

3 comments, last by JoeJ 1 week ago

Author

April 24, 2024 02:12 PM

Hi guys, I'm doing some tests to better understand common bottlenecks on modern GPUs.
I'm looking at Sebastian Aaltonen's valuable performance test tool (https://github.com/sebbbi/perftest).
What I'm trying to understand is how those numbers relate to the theoretical GPU texture fill rate.

The first thing I'm having trouble understanding is how the nominal texture fill rate should be computed, since usually it is just:
Texture Rate = Clock Rate * TMUs.

However, taking the AMD Radeon RX 5700 XT as an example, since it is a well-documented GPU, this number doesn't match what I'm reading in the RDNA architecture whitepaper (https://www.amd.com/system/files/documents/rdna-whitepaper.pdf).
The paper states: "[...] the texture mapping unit, which can perform filtering for up to eight texture addresses per clock – again twice the throughput of the prior generation. For each address, the TMU will sample the four nearest neighbors, decompress the data, and perform interpolation. The final texel value is passed back to the SIMD via the response bus".

So from this statement I would say that each TMU can perform up to 8 bilinear samples per clock.
If we take the number of TMUs listed in the GPU specs (https://www.techpowerup.com/gpu-specs/radeon-rx-5700-xt.c3339), which is 160, it seems legit to compute the texture fill rate as follows:
Texture Rate: 1905 MHz x 160 TMU x 8 (samples/clock) = 2.438 TTexel/s.

But this is clearly not the number provided by AMD which is just 304.8 GTexel/s (1905 MHz x 160 TMU).
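
To make the gap between the two figures concrete, here is a minimal Python sketch of both calculations. The clock, TMU count and "eight addresses per clock" come from the post; the function and constant names are only illustrative, and applying the factor of 8 to each of the 160 listed TMUs is exactly the assumption being questioned here, not a confirmed reading of the whitepaper.

```python
# Figures taken from the post (RX 5700 XT).
CLOCK_HZ = 1905e6             # 1905 MHz
TMU_COUNT = 160               # TMU count as listed in the GPU specs
ADDRESSES_PER_TMU_CLOCK = 8   # ASSUMPTION: "eight addresses per clock" applied per listed TMU


def texture_rate(clock_hz, units, samples_per_unit_clock=1):
    """Peak texel rate in texels per second."""
    return clock_hz * units * samples_per_unit_clock


nominal = texture_rate(CLOCK_HZ, TMU_COUNT)
naive_x8 = texture_rate(CLOCK_HZ, TMU_COUNT, ADDRESSES_PER_TMU_CLOCK)

print(f"nominal:  {nominal / 1e9:.1f} GTexel/s")     # 304.8 GTexel/s
print(f"naive x8: {naive_x8 / 1e12:.3f} TTexel/s")   # 2.438 TTexel/s
print(f"ratio:    {naive_x8 / nominal:.0f}x")
```

The two results differ by exactly the factor of 8, so the whole discrepancy comes down to what the whitepaper and the spec sheet each count as one "TMU".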

A second question is related to the performance test itself.

The performance test mentioned above runs 1024x1024 CS threads, each one reading 256 elements from a buffer of a given type; the whole operation is repeated 30 times, for a total of about 8 billion reads.
The total time of those reads is collected via GPU queries.

On the RX 5700 XT the random RGBA8 buffer read takes 12.617 ms, which is roughly 332 reads/clock.
Does anyone have any idea how this number could be inferred from the nominal GPU specs and architecture?
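
For reference, the reads-per-clock figure can be reproduced with a short Python sketch. The thread count, reads per thread, repeat count and timing are the ones given above; the variable names are illustrative, and the sketch assumes the GPU actually ran at the nominal 1905 MHz for the whole test, which the post does not confirm.

```python
THREADS = 1024 * 1024
READS_PER_THREAD = 256
REPEATS = 30
ELAPSED_S = 12.617e-3   # random RGBA8 buffer read time from the post
CLOCK_HZ = 1905e6       # ASSUMPTION: GPU held the nominal clock during the test
TMU_COUNT = 160

total_reads = THREADS * READS_PER_THREAD * REPEATS
reads_per_second = total_reads / ELAPSED_S
reads_per_clock = reads_per_second / CLOCK_HZ

print(f"total reads:      {total_reads:,}")
print(f"reads per second: {reads_per_second:.3e}")
print(f"reads per clock:  {reads_per_clock:.1f}")
print(f"per listed TMU:   {reads_per_clock / TMU_COUNT:.2f} reads/clock")
```

With the exact read count (a little over 8 billion) this comes out at about 335 reads/clock; rounding the total down to 8 billion gives the roughly 332 quoted above. Either way it is a bit more than 2 reads per clock per listed TMU.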

cgrant

1,874

April 24, 2024 06:56 PM

Just an aside: if your test is to understand GPU bottlenecks, texture fill rate doesn't really account for much, as that is all ‘theoretical’. Texture data is stored in memory, whose access will always be the limiting factor when it comes to texturing. Doesn't really answer your question, but just another point to consider…

Pagur

Author

April 24, 2024 07:35 PM

cgrant said:
Just an aside: if your test is to understand GPU bottlenecks, texture fill rate doesn't really account for much, as that is all ‘theoretical’. Texture data is stored in memory, whose access will always be the limiting factor when it comes to texturing. Doesn't really answer your question, but just another point to consider…

Yes, that's absolutely true. In my experience bottlenecks happen more often for other reasons, such as pixel overdraw, anisotropic fetches, bad texture locality or poor occupancy.
Let's say this question is more focused on learning how the GPU works under the hood.
But it can also be useful in practice when profiling, to know which counters to look for.

JoeJ

4,187

April 26, 2024 06:32 AM

Pagur said:
But this is clearly not the number provided by AMD which is just 304.8 GTexel/s (1905 MHz x 160 TMU).

The RX 5700 has a memory bandwidth of 448 GB/s, so assuming a texel is one byte, the 304.8 GTexel/s figure is the closer match.
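
A rough Python sketch of that comparison, using only the 448 GB/s and 304.8 GTexel/s figures from this thread. The helper name and the texel sizes are illustrative, and the model deliberately ignores caches, texture compression and reuse between neighbouring samples, so it is only a lower bound on what the memory system can feed.

```python
BANDWIDTH_BYTES_PER_S = 448e9    # memory bandwidth quoted above
NOMINAL_TEXELS_PER_S = 304.8e9   # 1905 MHz x 160 TMU

# DRAM bytes available per texel if the nominal rate were sustained.
bytes_per_texel_budget = BANDWIDTH_BYTES_PER_S / NOMINAL_TEXELS_PER_S


def dram_limited_rate(bytes_per_texel):
    """Texels per second if every texel had to be fetched from DRAM once.

    ASSUMPTION: no cache hits and no compression.
    """
    return BANDWIDTH_BYTES_PER_S / bytes_per_texel


print(f"budget: {bytes_per_texel_budget:.2f} bytes per texel")
for name, size in (("1-byte texel", 1), ("RGBA8", 4)):
    rate = dram_limited_rate(size)
    print(f"{name}: {rate / 1e9:.0f} GTexel/s from DRAM alone")
```

Under these assumptions the nominal rate leaves about 1.5 bytes of DRAM traffic per texel, so a 4-byte RGBA8 texel can only approach the nominal figure when most fetches hit in cache.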

Pagur said:
it seems legit to compute the texture fill rate as follows: Texture Rate: 1905 MHz x 160 TMU x 8 (samples/clock) = 2.438 TTexel/s.

If I do just 1905 x 160, I get exactly 304,800 (MTexel/s, i.e. 304.8 GTexel/s).
Possibly the 8 samples per clock are pipelined, so it processes 8 samples in parallel, but it takes 8 cycles in total to complete the batch. (Similar to how GCN worked, using 16-wide SIMDs to serve 64 threads, so each instruction takes at least 4 cycles.)
Assuming first-generation RDNA still ran 64 threads in lockstep, feeding each of them a texture sample would take 8 samples x 8 operations.
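
The cadence arithmetic behind that guess, as a small Python sketch. The function name is hypothetical and the model is just "threads divided by lanes served per clock"; it says nothing about how the hardware actually schedules texture requests. The wave32 line is added only for comparison, since RDNA also supports 32-wide waves.

```python
import math


def issue_cycles(wave_size, lanes_per_clock):
    """Clocks needed to push one wave through a unit that serves
    `lanes_per_clock` threads each clock (simplified model)."""
    return math.ceil(wave_size / lanes_per_clock)


# GCN: 64 threads on a 16-wide SIMD.
print("GCN ALU, wave64 on 16 lanes:  ", issue_cycles(64, 16), "cycles")
# The guess above: 64 threads, 8 texture samples per clock.
print("TMU, wave64 at 8 samples/clk: ", issue_cycles(64, 8), "cycles")
# For comparison only: a 32-wide wave at the same sample rate.
print("TMU, wave32 at 8 samples/clk: ", issue_cycles(32, 8), "cycles")
```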

But well, I don't really know what I'm talking about either. ; )
