size of layout

Hi

I have a GTX 570M, which is based on the GF114 chip (Fermi architecture).

According to the Fermi whitepaper released by NVIDIA, the Fermi architecture has a dual warp scheduler.

The paper says:

“SM schedules threads in groups of 32 parallel threads called warps. Each SM has 2 warp schedulers. Fermi’s dual warp scheduler selects two warps and issues one instruction from each warp to a group of 16 cores.”

My actual hardware has 336 CUDA cores and 7 SMs. That works out to 48 CUDA cores per SM.

My question:

What is the best way to set the layout(x=, y=, z=) dimensions in the shader code to maximize performance?

The whitepaper encourages using a multiple of the warp size (32 on Fermi).

  1. So should I set the x, y and z dimensions in the layout keyword of the shader code so that their product is a multiple of 32?
  2. How does this work when there are 48 cores per SM, as in my case?
  3. Would it be better to set the layout so that the product is a multiple of 16 (since each warp works on 16 cores and there are 2 warp schedulers)?
  4. Where can I find more specifics on my exact hardware (GF114, GTX 570M), in terms of core layout and warp schedulers, rather than relying on the general Fermi whitepaper? What the paper states in general can differ from what actually exists on the hardware (for example, the whitepaper says 32 cores per SM, but the GTX 570M has 48).
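
To make questions 1 and 3 concrete, here is a rough Python sketch of the arithmetic I have in mind. It only counts how many 32-thread warps a given x*y*z layout would need and how many lanes of the last warp would go unused. It assumes the hardware pads a partially filled warp up to the full warp size, which is my assumption and not something I have confirmed for GF114. The function name `warps_per_work_group` is just made up for illustration.

```python
WARP_SIZE = 32  # warp size on Fermi, per the whitepaper


def warps_per_work_group(x, y, z, warp_size=WARP_SIZE):
    """Return (threads, warps, idle_lanes) for a layout of x*y*z threads.

    Assumption: a partially filled warp still occupies a full warp,
    so the leftover lanes are counted as idle.
    """
    threads = x * y * z
    warps = (threads + warp_size - 1) // warp_size  # integer ceiling division
    idle_lanes = warps * warp_size - threads
    return threads, warps, idle_lanes


if __name__ == "__main__":
    # A few candidate layouts: multiples of 16, 32 and 48.
    for dims in [(16, 1, 1), (8, 8, 1), (48, 1, 1), (16, 16, 1)]:
        threads, warps, idle = warps_per_work_group(*dims)
        print(dims, "threads:", threads, "warps:", warps, "idle lanes:", idle)
```

If that assumption holds, a product of 16 or 48 would leave half of one warp idle, while a product of 64 or 256 would fill every warp, which is why I suspect a multiple of 32 is the right target. I would like to know whether that is actually how it works.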

thanks