This extension would provide the absolute fastest vertex transfer performance, assuming it is done in hardware. Basically, what the user does is allocate some memory (probably uncached so the card can DMA it). In that memory the user creates, essentially, a display list program. Then, when the user wants to render the scene, he simply calls something like glRunDisplayListEXT on that pointer. The hardware then executes the display list.
The key difference between this display list and the standard OpenGL display list (besides the fact that the user generates it himself) is that execution returns to the user immediately after the call to glRunDisplayListEXT. Essentially, this allows all the vertex transfer functions to be performed by the hardware. That means the CPU never has to wait on the hardware during vertex transfer, and it is free to do other work in the meantime.
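To make the idea concrete, here is a minimal sketch of what a user-built display list might look like. Everything in it is an assumption for illustration: the opcodes, the `DisplayList` struct, and the function names are hypothetical, and `dl_execute_sw` is only a software stand-in for what the hardware would do after the list is submitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes; the proposal does not define an instruction set. */
enum { DL_OP_END = 0, DL_OP_DRAW = 1 };

typedef struct {
    uint32_t *words;    /* would live in uncached, DMA-able memory */
    size_t    capacity; /* size of the buffer in 32-bit words */
    size_t    count;    /* words written so far */
} DisplayList;

/* Append one word; returns 0 on success, -1 if the list is full. */
int dl_emit(DisplayList *dl, uint32_t word)
{
    if (dl->count >= dl->capacity) return -1;
    dl->words[dl->count++] = word;
    return 0;
}

/* Append "DRAW <first_vertex> <vertex_count>" as three words. */
int dl_emit_draw(DisplayList *dl, uint32_t first, uint32_t n)
{
    if (dl->capacity - dl->count < 3) return -1;
    dl_emit(dl, DL_OP_DRAW);
    dl_emit(dl, first);
    dl_emit(dl, n);
    return 0;
}

/* Software stand-in for the hardware: walks the list and returns the
 * number of vertices it would have drawn, or -1 for a malformed list. */
long dl_execute_sw(const DisplayList *dl)
{
    long vertices = 0;
    size_t pc = 0;
    while (pc < dl->count) {
        switch (dl->words[pc]) {
        case DL_OP_END:
            return vertices;
        case DL_OP_DRAW:
            if (dl->count - pc < 3) return -1; /* truncated command */
            vertices += (long)dl->words[pc + 2];
            pc += 3;
            break;
        default:
            return -1; /* unknown opcode */
        }
    }
    return -1; /* ran off the end without DL_OP_END */
}
```

In a real implementation the user would only fill the buffer; the call to glRunDisplayListEXT would hand the pointer to the card and return right away instead of walking the list on the CPU.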
The key to all of this is for this extension to provide a whole suite of helper functions. The display list language will have to be able to do numerous things.
I would imagine that this extension would need something like a matrix array extension that allows all of the ARB_vertex_blend matrices to be loaded directly from an array in DMA-able memory. The CPU would generate these matrices by hand and store them in a list for the Display List EXT to use.
Also, I would imagine that changing textures would be something that the DL needs to be able to do. Fortunately, existing texture object functionality makes this work just fine; all that is needed is a DL instruction to change the texture.
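The matrix array and texture change instructions could look something like the sketch below. The opcode names, the `Mat4` and `DlState` types, and the four-matrix limit are all assumptions made up for this example; the only thing taken from existing OpenGL is that a texture is identified by its texture object name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for state changes inside the display list. */
enum { DL_OP_END = 0, DL_OP_LOAD_MATRIX = 2, DL_OP_BIND_TEXTURE = 3 };

#define DL_MAX_BLEND_MATRICES 4 /* assumed limit, for illustration only */

typedef struct { float m[16]; } Mat4; /* column-major, as in OpenGL */

typedef struct {
    const Mat4 *blend[DL_MAX_BLEND_MATRICES]; /* matrix loaded in each slot */
    uint32_t    texture;                      /* bound texture object name */
} DlState;

/* Executes the state commands of a list. `matrices` is the array the CPU
 * filled in DMA-able memory. Returns 0 on success, -1 on a bad list. */
int dl_run_state(const uint32_t *words, size_t count,
                 const Mat4 *matrices, size_t matrix_count, DlState *st)
{
    size_t pc = 0;
    while (pc < count) {
        switch (words[pc]) {
        case DL_OP_END:
            return 0;
        case DL_OP_LOAD_MATRIX: /* LOAD_MATRIX <slot> <array index> */
            if (count - pc < 3) return -1;
            if (words[pc + 1] >= DL_MAX_BLEND_MATRICES) return -1;
            if (words[pc + 2] >= matrix_count) return -1;
            st->blend[words[pc + 1]] = &matrices[words[pc + 2]];
            pc += 3;
            break;
        case DL_OP_BIND_TEXTURE: /* BIND_TEXTURE <texture object name> */
            if (count - pc < 2) return -1;
            st->texture = words[pc + 1];
            pc += 2;
            break;
        default:
            return -1; /* unknown opcode */
        }
    }
    return -1; /* missing DL_OP_END */
}
```

Referring to matrices by index keeps the list itself small, and the CPU can rewrite the matrix array each frame without touching the list.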
This extension could mesh with vertex_program_NV by allowing entire vertex programs to be in an array much like the matrix array mentioned above. Of course, it would be up to nVidia (or ATI for their variant of vertex programs) to apply this Display List extension to vertex programs.
The really big thing is to have some equivalent of vertex_array_range_NV. Not only will vertex arrays have to be in DMA-able memory, but so will vertex array indices.
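One way to picture this is a single DMA-able range that holds both the vertices and the indices, with draw commands referring to them by byte offset. The sketch below assumes that layout; `DmaArena`, `dma_upload`, and `DlIndexedDraw` are hypothetical names, ordinary memory stands in for the DMA-able range, and 16-bit indices are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned char *base; /* start of the DMA-able range */
    size_t         size; /* total bytes available */
    size_t         used; /* bytes handed out so far */
} DmaArena;

#define DMA_INVALID ((size_t)-1)

/* Copies `bytes` bytes into the arena at the next offset that is a multiple
 * of `align` (which must be non-zero). Returns the byte offset of the copy,
 * or DMA_INVALID if it does not fit. */
size_t dma_upload(DmaArena *a, const void *src, size_t bytes, size_t align)
{
    size_t start = (a->used + align - 1) / align * align;
    if (start > a->size || bytes > a->size - start) return DMA_INVALID;
    memcpy(a->base + start, src, bytes);
    a->used = start + bytes;
    return start;
}

/* An indexed draw that refers to its data only by arena offsets. */
typedef struct {
    size_t   vertex_offset; /* where the vertex array starts */
    size_t   index_offset;  /* where the 16-bit index array starts */
    uint32_t index_count;
} DlIndexedDraw;

/* Returns 1 if the vertex start and the whole index array lie inside the
 * uploaded part of the arena, so the hardware never needs ordinary memory. */
int dl_draw_is_resident(const DmaArena *a, const DlIndexedDraw *d)
{
    size_t index_bytes = (size_t)d->index_count * sizeof(uint16_t);
    return d->vertex_offset < a->used
        && d->index_offset <= a->used
        && index_bytes <= a->used - d->index_offset;
}
```

A real driver would also need to check the vertex range against the largest index used, which this sketch leaves out.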
The only way for this display list to work is for the list to be run completely off of hardware. That means everything, including the DL, must be in DMA-able memory. Also, some limited form of fence_NV should be available so that the user can query when the display list rendering is finished. Because AGP/video memory is a precious resource, the user may want to load new data into vertex arrays and index arrays and render the scene in multiple stages.
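Rendering in stages with a fence might go roughly as follows. This is a simulation only: `FakeDevice` is a made-up stand-in that finishes one submitted list each time it is polled, `dev_submit` plays the role of glRunDisplayListEXT followed by setting a fence, and `dev_fence_done` plays the role of a non-blocking query in the spirit of glTestFenceNV.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated device: completes one submitted list each time it is polled. */
typedef struct {
    uint32_t submitted; /* lists handed to the device so far */
    uint32_t completed; /* lists the device has finished */
} FakeDevice;

typedef uint32_t DlFence; /* 0 means "nothing pending" */

/* Stand-in for submitting a display list and setting a fence after it.
 * Returns immediately, as the proposal requires. */
DlFence dev_submit(FakeDevice *dev)
{
    return ++dev->submitted;
}

/* Non-blocking query: has the device passed this fence yet? */
int dev_fence_done(FakeDevice *dev, DlFence f)
{
    if (dev->completed < dev->submitted) dev->completed++; /* simulated progress */
    return dev->completed >= f;
}

/* Renders `stages` batches through one reusable block of DMA-able memory.
 * Returns how many times the fence had to be polled. */
unsigned render_in_stages(FakeDevice *dev, unsigned stages)
{
    unsigned polls = 0;
    DlFence in_flight = 0;
    for (unsigned s = 0; s < stages; ++s) {
        /* The arrays must not be overwritten while the hardware reads them. */
        while (in_flight != 0) {
            ++polls;
            if (dev_fence_done(dev, in_flight)) in_flight = 0;
        }
        /* ... refill the vertex and index arrays for stage s here ... */
        in_flight = dev_submit(dev);
    }
    /* Wait for the last stage before reusing or freeing the memory. */
    while (in_flight != 0) {
        ++polls;
        if (dev_fence_done(dev, in_flight)) in_flight = 0;
    }
    return polls;
}
```

With two blocks of memory instead of one, the CPU could fill the second while the hardware is still reading the first, which is where the "returns immediately" behavior would really pay off.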