optimized mipmap generation routines

I am generating mipmaps in software and then compressing them to either DXT1 or ETC1. I cannot find a way to use glGenerateMipmap efficiently: I would have to upload the texture and then read the levels back, which is already slower than compressing on the CPU, and it stalls the application in the meantime. Doing the work on the CPU, I can benefit from idle-priority background threads, so the user will not notice a slowdown (initially there just aren't any mipmaps).

I have written the following, but I would like to know of alternate implementations.

I am wondering whether it can be made faster somehow (obviously with AVX or AVX2, which I don't have) and whether alternate implementations exist which use additional instruction sets (SSE4.2 etc.) or fewer instructions to improve performance on those cores.

My output is:

gcc -mssse3 -msse2 -std=gnu99 -O3 -o test test.c -g -lpthread
~ $ ./test
generic 24bpp: 384ms
ssse3 optimized 24bpp: 76ms
generic 32bpp: 380ms
sse optimized 32bpp: 89ms
sse2 optimized 32bpp: 45ms

It’s better to avoid assigning the result of an SSE “call” to a variable. Instead, nest the calls as much as possible; it’s less readable but much more performant!

I admit I haven’t read your code deeply, but if you’re averaging, wider instructions could definitely help. Another option would be to use OpenCL on the CPU.

[QUOTE=Albatross;1278940]It’s better to avoid assigning the result of an SSE “call” to a variable. Instead, nest the calls as much as possible; it’s less readable but much more performant!

I admit I haven’t read your code deeply, but if you’re averaging, wider instructions could definitely help. Another option would be to use OpenCL on the CPU.[/QUOTE]

I don’t understand how the compiler wouldn’t automatically produce the same code whether or not I assign the results to variables. I know the compiler isn’t very intelligent, but this is really basic; I’m sure it handles it. Anyway, I am already using the widest instructions I can, i.e. 128 bits at a time. I don’t have AVX support on this computer, but I will likely add an option for it once I have a machine that does. Unfortunately I cannot find a suitable byte-shuffle instruction that would allow working with 24-bit data under AVX either.

I still can’t figure out why gcc isn’t smart enough to emit the same instructions (or hopefully even faster ones), so that I could simply use the generic version of the algorithm, recompile it with different CPU settings, and pick the right build at runtime. Maybe it just doesn’t realize that adding four numbers plus a constant of 2 and then dividing by 4 could use the average instruction and be vectorized as well.

I don’t think the compiler is that smart. Maybe your case is very simple, but when using vector units things get complicated quite fast. Nesting the calls helps keep the data in registers as much as possible, which is the key to fully exploiting the vector units.

For a school project I had to implement a color-averaging algorithm that worked on 24bpp images. Take a look at the _mm_shuffle_epi8 instruction; it could help you.

GCC supports automatic vectorization, and it is enabled by default at -O3. If you arrange your for loops properly, the compiler will try to vectorize them. I don’t remember the exact rules a loop has to follow in order to be vectorized, but you can easily find them on Google. You can then find out which loops have been successfully vectorized with the flag -ftree-vectorizer-verbose=n.

However, the target instruction set can only be decided at compile time. And I am confident that well-written hand intrinsics will outperform the code produced by the compiler.