I have implemented my compute shader so that it stores certain lookup values in shared memory, so that the rest of the threads do not need to fetch the same values from global memory.

However, I find that while the first dispatch of the compute shader is faster with shared memory than without it, subsequent dispatches are not. Is there a reason for this? Does shared memory need to be *freed* or *cleaned up* after each compute shader dispatch?
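
For reference, here is a minimal sketch of the pattern described above. It is written in CUDA rather than my actual shader language, and the kernel name `applyLookup`, the table size, the block size, and the buffers are all hypothetical stand-ins, not my real code. It only illustrates the idea: each block copies the lookup table into shared memory once, synchronizes, and then every thread reads from the shared copy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TABLE_SIZE = 64;   // hypothetical lookup table size
constexpr int BLOCK_SIZE = 256;  // hypothetical threads per block (work group)

// Each block stages the lookup table in shared memory, then all of its
// threads read lookup values from the shared copy instead of global memory.
__global__ void applyLookup(const float* table, const int* indices,
                            float* out, int n)
{
    __shared__ float sharedTable[TABLE_SIZE];

    // Cooperative load: the threads of the block stride over the table.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x) {
        sharedTable[i] = table[i];
    }
    __syncthreads();  // make the staged values visible to the whole block

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) {
        out[gid] = sharedTable[indices[gid] % TABLE_SIZE];
    }
}

int main()
{
    const int n = 1 << 20;
    float* table = nullptr;
    int* indices = nullptr;
    float* out = nullptr;

    // Managed memory keeps the sketch short; a real program may use
    // explicit device buffers and copies instead.
    cudaMallocManaged(&table, TABLE_SIZE * sizeof(float));
    cudaMallocManaged(&indices, n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(float));

    for (int i = 0; i < TABLE_SIZE; ++i) table[i] = 0.5f * i;
    for (int i = 0; i < n; ++i) indices[i] = i % TABLE_SIZE;

    // Two back-to-back dispatches, mirroring the first run and a later run.
    const int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (int run = 0; run < 2; ++run) {
        applyLookup<<<blocks, BLOCK_SIZE>>>(table, indices, out, n);
        cudaDeviceSynchronize();
    }

    printf("out[n - 1] = %f\n", out[n - 1]);

    cudaFree(table);
    cudaFree(indices);
    cudaFree(out);
    return 0;
}
```

In this sketch the shared array is declared inside the kernel and is filled again at the start of every dispatch; there is no explicit free or cleanup step between the two runs.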