
1. ## Unpacking ternary digits

I have a shader which needs a map of ternary digits (i.e. each value can be 0, 1, or 2), aka trits.

I know these can be (fairly) efficiently packed at five trits per byte (since 3^5 = 243).

What is the most efficient way to unpack these in shader code? Or is unpacking so slow that I would be better off storing 4 trits per byte and living with the increased memory requirements? (I know ASTC compression uses this packing, but there it's done in hardware, not in shader code.)

I also want to know how to efficiently unpack quints (base-5 digits) stored at 3 per 7 bits (5^3 = 125) ... and storing only 2 quints per byte would be really inefficient.

Thanks

2. Originally Posted by barneypitt
Utterly useless information, no attempt to answer the question.
That's because it's spam (which I've reported); it's just an attempt to raise the page ranking for the link.

If you want a definitive answer to your original question, you're just going to have to test the various methods to see which turns out to be the fastest. A lookup table is one option, but that may end up being slower than just using quotient and remainder. Another option is to perform division using multiply and shift:
x/y = ((x+1) * floor(2^k / y)) >> k, for sufficiently large k (k = 16 works for both 3^5 and 5^3).
x%y = x - y*(x/y)
Using vector operations (e.g. x/ivec4(3,9,27,81)) may be faster than separate calculations, or it may not.

3. Originally Posted by GClements
That's because it's spam (which I've reported); it's just an attempt to raise the page ranking for the link.

If you want a definitive answer to your original question, you're just going to have to test the various methods to see which turns out to be the fastest. A lookup table is one option, but that may end up being slower than just using quotient and remainder. Another option is to perform division using multiply and shift:
x/y = ((x+1) * floor(2^k / y)) >> k, for sufficiently large k (k = 16 works for both 3^5 and 5^3).
x%y = x - y*(x/y)
Using vector operations (e.g. x/ivec4(3,9,27,81)) may be faster than separate calculations, or it may not.
Yes, I wasn't sure whether a LUT would beat quotient and remainder (I hadn't considered the multiply and shift). I'm much more au fait with OpenCL, where I could just put the LUT in constant (or private, or workgroup-local) memory space and be pretty certain of very fast access. But with GLSL you seem to have no say in this and have to assume the worst case (a cache miss).

Is there any way I can influence whether a (small) array gets stored in fast (nearby) memory? Could passing an array as a direct read-only shader argument (rather than using a texture lookup) help? Or declaring and initialising an array within the shader file itself? (It does seem crazy to me that a single array access could end up being the slow option.)

The problem with testing the methods is that I have no idea whether I'd get the same result on different hardware - though I do have up to three GPUs I can test on, I suppose.

Thanks

4. Originally Posted by barneypitt
Is there any way I can influence whether a (small) array gets stored in fast (nearby) memory? Could passing an array as a direct read-only shader argument (rather than using a texture lookup) help? Or declaring and initialising an array within the shader file itself? (It does seem crazy to me that a single array access could end up being the slow option.)
A const-qualified global variable would seem to be the most likely candidate. The main issue there is that GLSL doesn't support anything smaller than a 32-bit int. So you'd either need 243*5 ints, or 243 ints which you unpack with shift/and operations (which might not be any faster than unpacking with div/mod), or ceil(243/2) = 122 ints (packing two 16-bit words into each int). But I don't know how small you'd need to make the array to get the fastest behaviour.

A texture has the advantage that it can store 8-bit values (or even smaller, e.g. 4:4:4:4); I have no idea whether that will help overall.

Originally Posted by barneypitt
The problem with testing the methods is that I have no idea whether I'd get the same result on different hardware - though I do have up to three GPUs I can test on, I suppose.
I strongly suspect that what method is fastest could vary wildly between different types of hardware, possibly even between different high-end desktop GPUs and almost certainly between the high-end desktop GPUs and integrated or mobile GPUs.

If I were doing this, the first thing I would do is figure out an interface that allows the unpacking to be abstracted out, so that the method can be changed without affecting the rest of the code. Then I'd implement all of the plausible methods and allow the choice to be made at run time (i.e. during shader compilation). For extra credit, have the program profile the various techniques (either each time or, if it can remember settings between runs, the first time it's run).
