Deferred shading on embedded device

I am thinking of implementing deferred shading on a mobile platform, but I'm not sure whether there will be a performance gain with this technique because of memory bandwidth. I have different types of light sources and meshes with material properties, and I will have to store all the material properties in textures. Is there any optimization I can use here to save memory? I will have to store the normals, texture data, specular, diffuse, ambient, reflection, and opacity of the material for each pixel, and storing all of that will take a large amount of memory. Is there anything I can do here?
Ultimate question: is it a good idea to use deferred shading here, or should I just go with normal (forward) rendering?

[QUOTE=debonair;1280529]I am thinking of implementing deferred shading on a mobile platform, but I'm not sure whether there will be a performance gain with this technique because of memory bandwidth.

I have different types of light sources and meshes with material properties, and I will have to store all the material properties in textures. Is there any optimization I can use here to save memory?

I will have to store the normals, texture data, specular, diffuse, ambient, reflection, and opacity of the material for each pixel, and storing all of that will take a large amount of memory. Is there anything I can do here?

Ultimate question: is it a good idea to use deferred shading here, or should I just go with normal (forward) rendering?[/QUOTE]

Are you using a tile-based rendering GPU (PowerVR, Mali, or Adreno)? If so, possibly.

You’re wise to get to the bottom of this first. On desktop GPUs (sort-last architecture), deferred shading can demand a lot of GPU memory and memory bandwidth (because fragment data is repeatedly written to/read from GPU memory as the frame is rasterized). On tile-based GPUs (sort-middle), it doesn’t have to (because frequently you can keep all of these writes/reads in fast on-chip framebuffer tile cache). But you need to make sure your mobile GPU supports that first.

What you’re looking for is tile-based, on-chip MRT support in your GPU. That is, when the GPU is rendering your screen tiles, you want to be able to rasterize to multiple render targets (MRTs) and then read from those MRTs to generate the final image, all without the GPU ever writing those MRTs out to GPU memory (which is often just standard CPU DRAM).
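To illustrate, here's roughly what a G-buffer MRT FBO setup looks like in OpenGL ES 3.0. The attachment count, formats, and the gbufferFBO/gbufferTex/width/height names are just placeholders for illustration; pick the smallest layout you can actually live with:

[code]
// Hypothetical G-buffer setup: 3 color attachments + depth (OpenGL ES 3.0).
GLuint gbufferFBO, gbufferTex[3], depthRB;

glGenFramebuffers(1, &gbufferFBO);
glBindFramebuffer(GL_FRAMEBUFFER, gbufferFBO);

glGenTextures(3, gbufferTex);
const GLenum fmts[3] = { GL_RGBA8, GL_RGB10_A2, GL_RGBA8 };  // e.g. albedo, normal, misc
for (int i = 0; i < 3; ++i) {
    glBindTexture(GL_TEXTURE_2D, gbufferTex[i]);
    glTexStorage2D(GL_TEXTURE_2D, 1, fmts[i], width, height);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0 + i,
                           GL_TEXTURE_2D, gbufferTex[i], 0);
}

glGenRenderbuffers(1, &depthRB);
glBindRenderbuffer(GL_RENDERBUFFER, depthRB);
glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT24, width, height);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                          GL_RENDERBUFFER, depthRB);

const GLenum bufs[3] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1, GL_COLOR_ATTACHMENT2 };
glDrawBuffers(3, bufs);
[/code]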

For a concept primer (and details for ARM Mali GPUs), see ARM's developer write-ups on Pixel Local Storage (EXT_shader_pixel_local_storage) and tile-based deferred shading.

Some Imagination Tech PowerVR GPUs support on-chip MRT as well, but in a different way. IIRC, with PowerVR you don’t need EXT_shader_pixel_local_storage to utilize the on-chip MRTs. You can just set up an FBO with MRT as usual, render your G-Buffer into its attachments, and then, instead of doing the usual “rebind the G-Buffer textures as shader inputs” thing for your lighting pass, use EXT_shader_framebuffer_fetch in your lighting shader to read the last-written values of the G-Buffer attachments and apply lighting without any FBO reconfiguration. Then, before the end of your render pass, just glInvalidateFramebuffer()/glDiscardFramebufferEXT() your G-Buffer MRT attachments so they never get written out to GPU memory. Result: only your final framebuffer gets written to GPU memory. And if you clear your framebuffer at the beginning of the frame, it is never read back from GPU memory at the start of the frame either.
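To make that concrete, here's a rough sketch of that kind of lighting pass: a fragment shader using EXT_shader_framebuffer_fetch to read the G-buffer attachments in place, followed by the end-of-pass invalidate. The attachment layout, uniform names, and the lighting math here are made up purely for illustration:

[code]
// Hypothetical lighting-pass fragment shader using EXT_shader_framebuffer_fetch.
// Outputs declared "inout" can be read back, so the G-buffer never leaves the tile.
static const char *lightingFS =
    "#version 300 es\n"
    "#extension GL_EXT_shader_framebuffer_fetch : require\n"
    "precision mediump float;\n"
    "layout(location = 0) inout vec4 outColor;   // final color (attachment 0)\n"
    "layout(location = 1) inout vec4 gbufAlbedo; // G-buffer attachment 1 (read only here)\n"
    "layout(location = 2) inout vec4 gbufNormal; // G-buffer attachment 2 (read only here)\n"
    "uniform vec3 lightDir;\n"
    "void main() {\n"
    "    vec3 n = normalize(gbufNormal.xyz * 2.0 - 1.0);\n"
    "    float ndotl = max(dot(n, -lightDir), 0.0);\n"
    "    outColor = vec4(gbufAlbedo.rgb * ndotl, 1.0);\n"
    "}\n";

// At the end of the render pass, drop the G-buffer attachments so the GPU
// never resolves them to memory; only attachment 0 (final color) is written out.
const GLenum discardAttachments[] = { GL_COLOR_ATTACHMENT1, GL_COLOR_ATTACHMENT2 };
glInvalidateFramebuffer(GL_FRAMEBUFFER, 2, discardAttachments);   // ES 3.0
// (On ES 2.0 + EXT_discard_framebuffer: glDiscardFramebufferEXT instead.)
[/code]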

Both of these approaches give you pretty efficient use of GPU memory bandwidth. The downside to the latter is that you need to allocate GPU memory to back your G-buffer attachments, but that memory is never actually used except as a placeholder. There may be a trick to avoid that that I’m not aware of.

For Qualcomm Adreno GPUs, you’re on your own. Don’t know much about these yet. Keep in mind that some of these can run in “desktop” (sort-last) or “tile-based” (sort-middle) mode. My guess is you’re going to want to run tile-based.

For GPU and GPU driver-related details, check with the GPU vendor.

I was thinking of this algorithm; let me know your thoughts on it:

  1. Render in 2 passes. The first pass is a depth-only pass with no lighting or color calculations; it just writes the depth buffer. Disable color writes in this pass.
  2. In the second pass, render all the objects with color writes and all lighting calculations in the shader, but include “layout(early_fragment_tests)” in the shader to force the early Z test. This prevents execution of the FS for occluded pixels (see the sketch after this list).
    This will save the memory used for the MRTs, maybe 5-6 textures.
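Something like this is what I have in mind (depthOnlyProgram, lightingProgram, and drawOpaqueGeometry() are placeholders for my actual code):

[code]
// Pass 1: depth prepass -- depth writes only, no color, trivial fragment shader.
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
glUseProgram(depthOnlyProgram);
drawOpaqueGeometry();

// Pass 2: full shading -- test against the prepass depth, no depth writes.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);
glDepthFunc(GL_LEQUAL);                  // or GL_EQUAL, to shade only the visible surface
glUseProgram(lightingProgram);           // its fragment shader declares:
                                         //   layout(early_fragment_tests) in;  (ES 3.1+)
drawOpaqueGeometry();
[/code]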

Ok, so you’re basically saying forgo deferred rendering and instead do a Z prepass followed by a forward shading lighting pass.

You could. Do you really have that much overdraw? Are you fragment shading bound currently? If so, it “might” help, but it depends on your system. In my experience, GL renderers are more commonly CPU bound or vertex bound. If that is your case as well, submitting and transforming your geometry twice is not going to help. I would check that first.

Also, keep in mind that on PowerVR GPUs, the hardware effectively does this “Z prepass” for you without you having to submit in two passes. Basically it can determine per-pixel Z first based on all submitted fragments, and then only shade the closest fragment (if opaque). Just submit your geometry in this order to get the benefit: opaque, alpha-test, alpha-blend.
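In other words, order your per-frame submission something like this (the bucket names and helpers are just placeholders for however you batch your draws):

[code]
// Submission order that lets PowerVR's hidden-surface removal act like a free Z prepass:
// fully opaque first, then alpha-tested, then alpha-blended last.
drawBucket(opaqueDraws);                       // placeholder per-bucket submit helper
drawBucket(alphaTestDraws);
sortBackToFront(alphaBlendDraws, cameraPos);   // blended geometry still needs its usual sort
drawBucket(alphaBlendDraws);
[/code]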