MSAA resolve: Blit vs texture fetches in shader

Dear all,

I know the topic of efficient MSAA in deferred shading has been discussed in many threads. However, I am looking specifically for advice on, or experience with, the performance characteristics of low-end hardware (ES 3.1 mobile and Intel chips).
The setup is simple. The geometry pass renders to MS textures. Depending on whether a fragment covers all samples of its pixel or not, 0.0 (fully covered) or 1.0 (partially covered, i.e. an MS pixel) is stored either in an 8-bit component of a 32-bit texture or in a separate 8-bit texture.
The lighting pass then shades all non-MS pixels and outputs to the default framebuffer. MS pixels are discarded and marked in a stencil buffer. A second lighting pass shades the marked MS pixels (per sample).
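
To make the two lighting passes concrete, here is a minimal CPU-side sketch of the control flow described above. It is only a model, not GL code: the names `LightingStats` and `runLightingPasses` are hypothetical, and the stencil buffer is represented by a plain array.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bookkeeping for how many shading invocations each pass performs.
struct LightingStats {
    std::size_t perPixelShades = 0;   // pass 1: one shade per non-MS pixel
    std::size_t perSampleShades = 0;  // pass 2: one shade per sample of each MS pixel
};

// edgeFlag[i] != 0 means the geometry pass marked pixel i as an MS pixel.
LightingStats runLightingPasses(const std::vector<std::uint8_t>& edgeFlag,
                                std::size_t samplesPerPixel) {
    LightingStats stats;
    std::vector<std::uint8_t> stencil(edgeFlag.size(), 0);

    // Pass 1: shade non-MS pixels once; MS pixels are skipped ("discarded")
    // and marked in the stencil array instead.
    for (std::size_t i = 0; i < edgeFlag.size(); ++i) {
        if (edgeFlag[i] != 0) {
            stencil[i] = 1;
        } else {
            ++stats.perPixelShades;
        }
    }

    // Pass 2: shade only the stencil-marked pixels, once per sample.
    for (std::size_t i = 0; i < stencil.size(); ++i) {
        if (stencil[i] == 1) {
            stats.perSampleShades += samplesPerPixel;
        }
    }
    return stats;
}
```

With 4x MSAA and one MS pixel out of four, this model performs 3 per-pixel shades plus 4 per-sample shades, instead of 16 if every pixel were shaded per sample.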
Since I only have very limited access to non-desktop ES 3.1 hardware, I am looking for advice on two questions regarding the detection of MS pixels in the lighting pass:

  1. I have two options for how exactly MS pixels are detected:
    - Either I use blitting to copy the texture containing the sample coverage information into a non-MSAA texture (anything >0.0 would indicate an MS pixel). This would allow me to do just one texture fetch in the lighting pass.
    - Or I skip the blitting and fetch all 4 (or 8 or 16) samples in the lighting pass.
    This decision depends on how much faster (if at all) a blit is on mobile compared to the 3 (or 7, or 15) additional texture fetches.

  2. I am unclear on the performance characteristics of 8-bit vs 32-bit textures on those platforms (especially for blitting). So I am debating whether to put the coverage information in a spare component of another FBO color attachment or in an attachment dedicated to it.
    Can I expect blitting from an 8-bit texture to be faster than from a 32-bit one on most mobile platforms?
    Are texture fetches from 8-bit textures faster than fetches from 32-bit ones (not the case on recent desktop hardware)?
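
To illustrate both questions, here is a rough CPU-side sketch. The first two functions model the two detection options for a single pixel; the resolve model ASSUMES the blit averages the samples, which is typical but something I would verify per driver. The third function just counts the raw texel footprint a resolve has to read, which says nothing about real bandwidth on tile-based GPUs or with framebuffer compression. All function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Option 1: model of the resolve blit followed by a single fetch.
// ASSUMPTION: the resolve averages the 8-bit sample values (rounded to nearest).
bool isMsPixelResolved(const std::vector<std::uint8_t>& sampleFlags) {
    if (sampleFlags.empty()) return false;
    unsigned sum = 0;
    for (std::uint8_t f : sampleFlags) sum += f;
    const unsigned n = static_cast<unsigned>(sampleFlags.size());
    const unsigned resolved = (sum + n / 2) / n;
    return resolved > 0;  // "anything > 0.0 indicates an MS pixel"
}

// Option 2: no blit, fetch every sample in the lighting pass.
bool isMsPixelPerSample(const std::vector<std::uint8_t>& sampleFlags) {
    for (std::uint8_t f : sampleFlags) {
        if (f != 0) return true;
    }
    return false;
}

// Raw number of bytes in the multisampled source that a resolve must read.
std::uint64_t resolveSourceBytes(std::uint64_t width, std::uint64_t height,
                                 std::uint64_t samples, std::uint64_t bytesPerTexel) {
    return width * height * samples * bytesPerTexel;
}
```

Under the averaging assumption both options classify a pixel the same way, so the choice is purely about cost. For a 1920x1080 target at 4x MSAA, the source footprint is 8,294,400 bytes for a dedicated 8-bit attachment versus 33,177,600 bytes for a 32-bit one, which is why I suspect (but cannot confirm) that the dedicated attachment blits faster.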

Of course the optimal solution would be to build the stencil buffer during the geometry pass, but that is not possible in OpenGL ES as far as I know.