This is just an idea; I haven’t tried it, so I don’t know whether it will work, but I think you might be able to make it do what you want.
Background information: there are several ways to calculate perspective, but the commonly used perspective projection is the “divide-by-Z” technique. In projection matrix terms, it is implemented by copying each vertex’s pre-projection Z coordinate into the W slot of that vertex’s position vector. (With OpenGL’s standard matrices the value copied is actually -Z, because the camera looks down the negative Z axis; I’ll ignore the sign here.) Later, in a fixed-function stage of the OpenGL pipeline (it is there whether you use shaders or the fixed-function pipeline), the Perspective Divide divides the X, Y, Z coordinates of each vertex by that vertex’s W coordinate.
For orthographic projection, the Z coordinate is not copied into W; the W coordinate is left alone, which means it should still have a value of one. Later, during the Perspective Divide, dividing X, Y, Z by W=1 leaves them unchanged.
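To make the difference concrete, here is a minimal sketch of the two cases. The matrices are stripped down so that only the fourth row differs, and they use the positive-Z convention from this post rather than OpenGL’s real -Z; the names `mat_vec`, `perspective_divide`, `PERSPECTIVE` and `ORTHOGRAPHIC` are made up for illustration.

```python
def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def perspective_divide(clip):
    """Divide X, Y, Z by W, as the pipeline does after the vertex stage."""
    x, y, z, w = clip
    return [x / w, y / w, z / w]


# Simplified matrices (assumption: no near/far or field-of-view scaling).
# Fourth row [0 0 1 0] copies Z into W; [0 0 0 1] leaves W alone.
PERSPECTIVE = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0]]
ORTHOGRAPHIC = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

vertex = [2.0, 4.0, 8.0, 1.0]  # X, Y, Z, W with W = 1

persp_ndc = perspective_divide(mat_vec(PERSPECTIVE, vertex))
ortho_ndc = perspective_divide(mat_vec(ORTHOGRAPHIC, vertex))

print(persp_ndc)  # X, Y, Z divided by the vertex's own Z (8)
print(ortho_ndc)  # X, Y, Z unchanged because W stayed at 1
```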
So, to match the size of the orthographic projection to the size of the perspective projection, you need to reproduce, in the orthographic projection, the change of scale that the Z divide produces.
The issue with the Z perspective divide is that each vertex’s coordinates are divided by its own Z coordinate, and since each vertex may have a different Z, each vertex’s XYZ gets scaled differently. To get orthographic projection, you have to use the same divisor for every vertex. That means storing the same value in the W component of every vertex, so that the Perspective Divide scales them all equally. Ideally, that constant W would be the average Z coordinate of all the vertices in your model.
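A quick way to see this is to skip the matrices and do the divide by hand. The sketch below is only an illustration with made-up vertices; `project_with_w` is a hypothetical helper, not an OpenGL call.

```python
def project_with_w(vertices, w_for):
    """Divide each vertex's X and Y by the W that w_for(z) assigns to it."""
    return [(x / w_for(z), y / w_for(z)) for x, y, z in vertices]


# Three vertices with the same X, Y but different depths.
vertices = [(1.0, 1.0, 2.0), (1.0, 1.0, 4.0), (1.0, 1.0, 6.0)]
average_z = sum(z for _, _, z in vertices) / len(vertices)

# Perspective: every vertex is divided by its own Z.
perspective = project_with_w(vertices, lambda z: z)
# Constant W: every vertex is divided by the same average Z.
constant_w = project_with_w(vertices, lambda z: average_z)

print(perspective)  # a different scale per vertex
print(constant_w)   # one shared scale, equal to perspective at the average depth
```

With the per-vertex divisor, the three points land in different places; with the shared divisor they coincide, and the shared scale is the one perspective gives at the average depth.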
Here’s how you should be able to implement it: on the CPU, calculate the average Z of all the vertices in your model. Store that one value as the W component of every vertex in the model, and pass those XYZW coordinates to the GPU. Your Modelview matrix needs to apply exactly the same transformation to the W component as it does to the Z component. That means the fourth row of the Modelview matrix should be identical to the third row (define the third row the way you ordinarily would, then copy it to the fourth row as well).
Now, when you transform your model’s vertices with the Modelview matrix, each W coordinate will be the averaged value of all your model’s Z coordinates, and will have undergone the same transformations as the vertices’ Z coordinates. Your projection matrix should be the same as it would be for Z-divide perspective, except that the fourth row should be [0 0 0 1] rather than [0 0 1 0], so each vertex keeps the W it arrived with instead of having it replaced by its Z, as is usually done. Then, when the Perspective Divide hardware processes your vertices, every vertex in the model is divided by the same averaged Z coordinate. That gives you orthographic projection, but with the same average scale factor the model would have had under a Z-divide perspective projection, which synchronizes the orthographic scale with the perspective scale.
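Here is a rough CPU-side sketch of the end result, under some loud assumptions. It does not use the copied-row Modelview trick: with W no longer equal to 1, the Modelview translation column gets multiplied by W, and a fourth row copied from the third would hand each vertex its own transformed Z rather than the shared average. Instead, the sketch transforms the vertices with an ordinary Modelview matrix, averages the resulting Z, and writes that average into W before a projection whose fourth row is [0 0 0 1]. The matrices, the model and the helper name are all hypothetical, and the positive-Z convention of this post is used.

```python
def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


# Hypothetical Modelview: move the model 5 units along +Z.
MODELVIEW = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 5], [0, 0, 0, 1]]
# Projection whose fourth row is [0 0 0 1], so W passes through untouched.
PROJECTION = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

model = [
    [-1.0, 1.0, -1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [2.0, -2.0, 0.0, 1.0],
]

# Ordinary Modelview transform with W = 1.
eye = [mat_vec(MODELVIEW, v) for v in model]

# Average depth of the transformed model; this becomes every vertex's W.
average_z = sum(v[2] for v in eye) / len(eye)
with_shared_w = [[x, y, z, average_z] for x, y, z, _ in eye]

# Projection keeps W as is, then the divide scales every vertex equally.
clip = [mat_vec(PROJECTION, v) for v in with_shared_w]
ndc = [[x / w, y / w, z / w] for x, y, z, w in clip]

for before, after in zip(eye, ndc):
    print(before[:3], "->", after)
```

Every vertex ends up scaled by the same factor, 1 / average_z, which is the scale a Z-divide perspective projection would apply at the model’s average depth.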
I don’t know if I made it sound complicated, but I think it should really be very easy to do, with just a few modifications to the way you currently do it.