a 16 vertices batch is ridiculously small
No doubt it is. Sorry, I wasn't clear about this - I didn't mean that those batches have to be rendered using individual draw calls. I meant that the rendering path has to be clustered into such areas.
Using strips that run the full length of the grid requires processing each vertex twice, while the proposed 16x16 clustering reduces the transformation count by up to ~37%.
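
To check the "each vertex is processed twice" claim for full-length strips, here is a rough sketch that counts cache misses for an indexed grid. It assumes a simple FIFO post-transform cache, which real hardware may not match exactly; the names count_transforms and row_strip_indices and the 64x64 grid size are made up for the illustration.

```python
from collections import deque

def count_transforms(indices, cache_size):
    """Count vertex transformations (cache misses), assuming a FIFO cache."""
    cache = deque(maxlen=cache_size)  # when full, the oldest entry drops out
    misses = 0
    for idx in indices:
        if idx not in cache:
            misses += 1
            cache.append(idx)
    return misses

def row_strip_indices(width, height):
    """Triangle indices for a width x height vertex grid,
    drawn one full-length row of quads at a time."""
    for r in range(height - 1):
        for c in range(width - 1):
            a = r * width + c        # vertex on the upper line
            b = (r + 1) * width + c  # vertex on the lower line
            yield from (a, b, a + 1, a + 1, b, b + 1)

W, H, CACHE = 64, 64, 16  # hypothetical grid size, 16-entry cache
transforms = count_transforms(row_strip_indices(W, H), CACHE)
print("transformations per vertex:", transforms / (W * H))
```

Under these assumptions every line of vertices is transformed once as the lower edge of one strip and again as the upper edge of the next, so the ratio comes out just under 2.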

Hm, there is more to it. Actually, when rendering the grid in a diagonal direction, the batches can be as long as you want - 16xN, not necessarily 16x16. In that case, only the top and bottom lines of the batch's vertices have to be processed twice, so the ratio of total transformations to vertex count is:

TpV = 1.0 + 2/CacheSize

TpV for several cache sizes (CS = CacheSize):
CS   | TpV
4    | 1.50
8    | 1.25
12   | 1.17
16   | 1.13
20   | 1.10
24   | 1.08
28   | 1.07
32   | 1.06
36   | 1.06
40   | 1.05
1024 | 1.002
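
The table can be regenerated from the formula with a few lines of Python. This is just a sketch; the function name tpv is made up here, and the values are printed to four decimals so that exact halves such as 1.125 are not rounded differently from the table.

```python
def tpv(cache_size):
    """Transformations per vertex: TpV = 1.0 + 2/CacheSize."""
    return 1.0 + 2.0 / cache_size

# Same cache sizes as in the table above.
for cs in (4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 1024):
    print(f"{cs:<4} | {tpv(cs):.4f}")
```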

In other words, my proposal is to draw the grid not strip-by-strip along its full length, but 16 lines at a time, diagonally. This reduces vertex transformations by about 44% (the theoretical maximum is 50%, i.e. going from two TpV to one), and it is achieved even on hardware with a vertex cache as small as 16 vertices.

A cache size of 32 would reduce transformations by about 47%, which is a negligible improvement over the 44% achieved with CS==16.
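
The percentages follow from comparing TpV against the 2.0 baseline of full-length strips. A small sketch of that arithmetic (the function names are made up for the illustration):

```python
def tpv(cache_size):
    """Transformations per vertex: TpV = 1.0 + 2/CacheSize."""
    return 1.0 + 2.0 / cache_size

def reduction(cache_size, baseline=2.0):
    """Fraction of transformations saved relative to the strip-by-strip baseline."""
    return 1.0 - tpv(cache_size) / baseline

for cs in (16, 32):
    print(f"CS={cs}: {reduction(cs) * 100:.2f}% fewer transformations")
```

This gives 43.75% for CS==16 and 46.875% for CS==32, i.e. doubling the cache buys only about three more percentage points.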