I've read various tutorials on "3D picking" around the web, but most of them seem to be self-contradictory, wrong, or gloss over points without explanation. I've searched the forum here and only found references to posts that then link to other poor-quality explanations off-site.

Essentially, I'm trying to implement ray picking. That is, I'm transforming a point from OS-specific window coordinates to GL world coordinates, then using it to cast a ray into the world to determine which objects were under the cursor when the user clicked the mouse. I already have working spatial data structures for ray casting, so the problem is solely in doing the correct transformations to get from window space to world space.

I'm using the songho.ca page (www.songho.ca/opengl/gl_transform.html) on matrix transformations as a reference.

According to every text I can find on the subject, I need to do the following:

  1. Transform window coordinates to viewport coordinates (by flipping the Y value and removing the viewport translation, for example)
  2. Transform the viewport coordinates to clip-space coordinates
  3. Multiply the clip-space coordinates by the inverse of the (projection * modelview) matrix

The first and third steps are no problem.
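
For concreteness, here is a minimal sketch of step 1, plus the x/y part of the mapping toward normalized device coordinates. It assumes the OS window origin is top-left with Y pointing down, that the viewport is given as (x, y, width, height) in the same sense as glViewport, and that coordinates are treated as continuous rather than pixel-centred. The function names are hypothetical, chosen only for illustration. It deliberately produces x and y only; z and w are left open.

```python
def window_to_viewport(win_x, win_y, window_height, vp_x, vp_y):
    """Step 1: flip Y and remove the viewport translation.

    Assumes window coordinates have a top-left origin and the GL
    viewport has a bottom-left origin (an assumption, not universal).
    """
    gl_y = window_height - win_y          # flip Y into GL's bottom-up convention
    return win_x - vp_x, gl_y - vp_y      # remove the viewport offset


def viewport_to_ndc_xy(x, y, vp_width, vp_height):
    """Map viewport-relative x/y from [0, size] to the [-1, 1] range.

    Only x and y are computed here; z and w are not determined
    by a 2D click alone.
    """
    ndc_x = 2.0 * x / vp_width - 1.0
    ndc_y = 2.0 * y / vp_height - 1.0
    return ndc_x, ndc_y


# Example: a click in the middle of an 800x600 window whose
# viewport covers the whole window.
vx, vy = window_to_viewport(400, 300, 600, 0, 0)
centre = viewport_to_ndc_xy(vx, vy, 800, 600)
```

With a full-window viewport, a click at the centre of the window maps to (0, 0) and the window's bottom-left corner maps to (-1, -1).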

The problem is in the second step. In order to get from clip-space to normalized device coordinates, the clip-space coordinates are divided by their own w component. If all I have are 2D viewport coordinates, where do the z and w coordinates come from?