So how does the perspective projection matrix work now in GL4?

Let me clarify my question:

in the old fixed-function pipeline we used to pass the perspective projection matrix by setting GL_PROJECTION. I also understand that by multiplying a point in camera space by this matrix, we end up with a point in homogeneous coordinates defined in clip space. Clipping occurs, and finally the points are transformed back from homogeneous to Cartesian coordinates by dividing the x, y and z coordinates by the point’s w coordinate. Let me know if I don’t have that right, but that seems roughly correct.

This seemed possible to me in this old pipeline because the multiplication of the vertex coordinates by the matrix was taken care of by the GPU. So in essence the GPU could multiply the point by the matrix, then do the clipping, then do the perspective divide.

But how does that work in the new pipeline, now that the vertex transform is done in the shader? I believe I understand the principle but just want someone to confirm:

  • so the point in camera space is technically in Cartesian space. However, after you have multiplied it by the proj matrix, it ends up being a vec4 point … in other words a point in homogeneous coordinates, which is in clip space. The point is then taken care of by the GPU, which does the clipping and finally converts the vertex coordinates back to Cartesian by dividing by w. Is that correct?

And a final question, since I have this opportunity to reach experts.

  • the “canonical viewing volume” is sometimes referred to as the unit cube. Two questions regarding this. Does the “canonical viewing volume” refer to the “cube” when points are in clip space (and what are the dimensions of that cube, by the way, if it is one?), or to the cube after the perspective divide (the cube with extents (-1,-1,-1) to (1,1,1))?

  • isn’t it wrong to call this the unit cube? (a unit cube has length 1, not 2?)

  • NDC coordinates generally refer to coordinates in the range [0,1]. Isn’t the terminology NDC misused in the GPU world when it refers to a space in which point coordinates are in the range [-1,1]?

  • Finally, could someone confirm in a short answer WHY clipping occurs in “clip space” rather than after the perspective divide (in other words, in what we call NDC space in the GPU world)? Aren’t the planes of the volume still defining a cube in NDC space? Is it only for arithmetic reasons? Someone told me that in clip space coordinates are defined as integers and that this simplified the computation of the Sutherland clipping algorithm. It would be great if someone could shed some light on this.

Thank you so much.

The value written to gl_Position by the vertex shader is in clip space. Using compatibility variables, the vertex shader can do


    gl_Position = gl_ProjectionMatrix * gl_ModelViewMatrix * gl_Vertex;

to get the same behaviour as the fixed-function pipeline.
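
For a core profile context without the compatibility variables, the equivalent is to build the matrix yourself and upload it as a uniform that the vertex shader multiplies by. Here is a minimal sketch of the C/C++ side (the uniform name u_mvp and the variables program and mvp are my own placeholders, not anything standardised):

    // After linking the program: locate the uniform and upload the
    // combined projection * model-view matrix. OpenGL expects
    // column-major storage, hence GL_FALSE for the transpose flag.
    GLint loc = glGetUniformLocation(program, "u_mvp");   // program: a linked GLuint
    glUniformMatrix4fv(loc, 1, GL_FALSE, mvp);            // mvp: const GLfloat[16]

The vertex shader then writes gl_Position = u_mvp * position; where position is a user-declared input.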

The clip volume clips -w <= x <= w (similarly for y and z).

If you perform projective division (division by w) on the vertices resulting from clipping, their positions will all lie within the unit cube, i.e. -1 <= x/w <= 1 (similarly for y and z).
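
To make that concrete, here is a minimal standalone sketch (plain C++; the names are mine) of the clip-volume test followed by the perspective divide:

    #include <cstdio>

    struct Vec4 { float x, y, z, w; };

    // True if a clip-space point satisfies -w <= x,y,z <= w.
    bool insideClipVolume(const Vec4& p) {
        return -p.w <= p.x && p.x <= p.w &&
               -p.w <= p.y && p.y <= p.w &&
               -p.w <= p.z && p.z <= p.w;
    }

    int main() {
        Vec4 p = {1.0f, -2.0f, 3.0f, 4.0f};   // a clip-space position
        if (insideClipVolume(p)) {
            // Perspective divide: the result lands in [-1,1]^3 (NDC).
            std::printf("NDC: %f %f %f\n", p.x / p.w, p.y / p.w, p.z / p.w);
        }
    }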

I don’t think so. E.g. I’ve only ever seen the term “unit sphere” used to refer to a sphere with unit radius (x^2+y^2+z^2=1), not unit diameter.

OpenGL’s normalised device coordinates span -1…+1, and coordinates are “normalised” with respect to that range.

Clipping is performed in homogeneous coordinates so that interpolated attributes are calculated correctly.

In clip space, the mapping between position and attribute value is affine, as all of the transformations up to that point are linear. To perform clipping, you calculate an interpolant t such that the interpolated position lies on the boundary of the clip volume, then linearly interpolate all of the attributes (including the position) with that value.

Once positions are converted from clip coordinates to NDC, you have to perform the same conversion for all attributes (e.g. texture coordinates will have a fourth component even if they didn’t originally).

  • so the point in camera space is technically in Cartesian space. However, after you have multiplied it by the proj matrix, it ends up being a vec4 point … in other words a point in homogeneous coordinates, which is in clip space. The point is then taken care of by the GPU, which does the clipping and finally converts the vertex coordinates back to Cartesian by dividing by w. Is that correct?

More or less, yes. The only thing that a vertex shader is required to do (and it’s not even required if there are later vertex processing shader stages after the VS) is output a clip-space position.

the “canonical viewing volume” is sometimes referred to as the unit cube.

I’d be curious to see where anyone actually calls it a “unit cube”. And yes, those people are wrong to call it “unit”.

Does the “canonical viewing volume” refer to the “cube” when points are in clip space (and what are the dimensions of that cube, by the way, if it is one?), or to the cube after the perspective divide (the cube with extents (-1,-1,-1) to (1,1,1))?

Either one. That’s the thing about a transformation: it doesn’t actually change anything about a scene. It just moves things around. The perspective divide may be a non-linear transformation, but it’s still just transforming vectors from one space to another.

So while the clip-space volume can’t be called a “cube” (it’s a 4D coordinate system, so it’s hard to call it any kind of 3D object), it’s still the same volume. Just with a different shape.

NDC coordinates generally refer to coordinates in the range [0,1].

Not in OpenGL. NDC space is always [-1, 1] in X, Y, and Z.

  • Finally, could someone confirm in a short answer WHY clipping occurs in “clip space” rather than after the perspective divide (in other words, in what we call NDC space in the GPU world)?

Well, that’s simple. By fiat, a camera-space Z of 0 represents the plane containing the camera. And in the traditional perspective projection matrix, the clip-space W is the negation of the camera-space Z.

So what do you suppose happens to the math when you divide by a clip-space W component that happens to be 0 (ie: at the camera)?

Nothing good.

One of the big reasons clipping exists is to avoid this problem entirely. The near-clip plane, when using the traditional perspective projection matrix, is always some finite distance in front of the camera. Therefore, all vertices that get past clipping must have a clip-space W that is positive.

Homogeneous spaces can handle a vertex position which is “at infinity,” which is mathematically what a 0 clip-space W represents. Cartesian math cannot. So, before converting to NDC (which is Cartesian), the vertices are clipped, so that all remaining vertices are in front of the camera.
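
A quick numerical illustration of why that divide is a problem (a contrived C++ example of my own):

    #include <cstdio>

    int main() {
        // A vertex on the camera plane (eye-space z = 0) gets
        // clip-space w = -z = 0 under the traditional perspective
        // projection; the perspective divide then blows up.
        float clip[4] = {0.5f, 0.5f, 0.0f, 0.0f};   // x, y, z, w
        std::printf("%f\n", clip[0] / clip[3]);     // inf (0.5 / 0)
    }

Clipping against the near plane removes such vertices before the divide ever happens.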

[QUOTE=GClements;1263606]Clipping is performed in homogeneous coordinates so that interpolated attributes are calculated correctly.

In clip space, the mapping between position and attribute value is affine, as all of the transformations up to that point are linear. To perform clipping, you calculate an interpolant t such that the interpolated position lies on the boundary of the clip volume, then linearly interpolate all of the attributes (including the position) with that value.

Once positions are converted from clip coordinates to NDC, you have to perform the same conversion for all attributes (e.g. texture coordinates will have a fourth component even if they didn’t originally).[/QUOTE]

In order to implement perspective-correct interpolation, you have to be able to interpolate effectively in pre-projection space while working in a post-projection space. And if you can do that, you could interpolate clipped vertex attributes in the same way.

So I don’t see how interpolation requires pre-projection clipping.

It doesn’t require it (ultimately, you must get the same result however you formulate it), it just makes it conceptually simpler.

Something which I find many people often aren’t clear on is that projection doesn’t simply remove the “w” component from the position (if it did, you’d get screen-space interpolation rather than perspective-correct interpolation). Rather, it moves it from the numerator of the equations defining the position to the denominator of everything else, changing the system of equations from a linear mapping to a rational mapping.

Whichever representation you use, clipping doesn’t change the mapping, only the region over which it is applied.

By describing the operation of clipping prior to projection, it leaves the relative complexity of projection out of it. If you’re familiar with finding the intersection of a line segment with a plane, understanding clipping in clip space should be straightforward.

Thanks for your very good and complete answer.

I have a few more points though that I’d like to clarify:

I understand the concept of the projection matrix and how it is built. But what I don’t understand is: what is the value of w? My understanding is that when you multiply a point by a projection matrix, you assume the point has homogeneous coordinates (x,y,z,w=1). After the multiplication, my understanding is that the persp projection matrix sets the value of w to z (or -z, more exactly), so that when the point is converted back from homogeneous to Cartesian coordinates by dividing its coordinates by w, we perform the perspective divide in the process. Is this correct? Then, if that’s the case, w is different for every vertex!!???

In which case, what does writing -w <= x <= w mean?

Does it mean the vertex is visible only if its x coordinate is greater than -w and lower than w, where w, as I just mentioned, is equal to -z? I am sort of lost there. I am sure I am missing something basic but I don’t know what!

[QUOTE=GClements;1263606]Clipping is performed in homogeneous coordinates so that interpolated attributes are calculated correctly.

In clip space, the mapping between position and attribute value is affine, as all of the transformations up to that point are linear. To perform clipping, you calculate an interpolant t such that the interpolated position lies on the boundary of the clip volume, then linearly interpolate all of the attributes (including the position) with that value.

Once positions are converted from clip coordinates to NDC, you have to perform the same conversion for all attributes (e.g. texture coordinates will have a fourth component even if they didn’t originally).[/QUOTE]

I don’t understand this either. Aren’t translation, scaling and rotation affine transformations in 3D space?

I guess I am also confused by what clipping really does. In my head, it means that if a triangle overlaps the boundaries of the viewing frustum, it is actually divided into two bits of geometry: the bit that is visible and the bit that is not (which can be discarded). In fact, from what I understand, it seems like there is some “safety” band around the frustum, and that if a triangle lies partially inside the frustum and partially outside it, it is kept as is; and if it lies totally outside the frustum and the safety band, it is discarded directly.

Anyway, I think I am really totally confused about the whole GPU rendering pipeline between the moment the geometry is processed by the vertex shader and the moment the fragments are processed by the fragment shader. I found lots of information on the Web, but it is really confusing.

  1. point in eye space (vec3): veye
  2. multiply the point in eye space by the persp proj matrix. At this point it is assumed that the point is vec4(veye, 1) and thus that w = 1.
  3. vclip = veye * Mpersp;

That multiplication is done in the shader, but then vertices are passed to the GPU in clip space - the GPU owns them at this point and processes them. If I understand the persp proj matrix properly, then vclip.w = -z.

  4. The GPU performs clipping. What does this do exactly? I understand that it only keeps the vertex if
    -w <= x <= w && -w <= y <= w && -w <= z <= w, and that at least one of the points making up a triangle must pass this test. But then if the triangle overlaps the clip volume, what does it do?

  5. after clipping, the vertex coordinates are divided by w (perspective divide).

At this stage, vertices are contained in the “canonical viewing volume”. But do we still have vertices outside this volume at all (those which are connected to vertices which are inside the volume)?

  6. triangles are rasterised, i.e. converted into fragments which are then shaded.

Sorry, but I would REALLY appreciate it if someone could explain to me clearly what happens between steps 3 and 6. Thanks a lot again.

[QUOTE=Alfonse Reinheart;1263607]Well, that’s simple. By fiat, a camera-space Z of 0 represents the plane containing the camera. And in the traditional perspective projection matrix, the clip-space W is the negation of the camera-space Z.

So what do you suppose happens to the math when you divide by a clip-space W component that happens to be 0 (ie: at the camera)?

Nothing good.

One of the big reasons clipping exists is to avoid this problem entirely. The near-clip plane, when using the traditional perspective projection matrix, is always some finite distance in front of the camera. Therefore, all vertices that get past clipping must have a clip-space W that is positive.

Homogeneous spaces can handle a vertex position which is “at infinity,” which is mathematically what a 0 clip-space W represents. Cartesian math cannot. So, before converting to NDC (which is Cartesian), the vertices are clipped, so that all remaining vertices are in front of the camera.[/QUOTE]

That seems interesting but I don’t understand, sorry ;-((( What do you mean by “by fiat”? Is it a typo? ;-) Now, you say a camera-space Z of 0 represents the plane containing the camera? This is not clear to me! I thought points were projected onto the image plane, which in the GPU world is located at the near clipping plane. I understand that this is the case (the image plane is at the near clip plane) because of the way the persp matrix is built. But regardless, the image plane is not at 0; it’s some distance away from the eye?

When you say “in the traditional persp proj matrix the clip-space W is the negation of the camera-space Z”, do you actually refer to the fact that the persp proj matrix is built in such a way that it sets vclip.w to -z? So that later, when points are converted back from homogeneous to Cartesian coordinates, the division of the point’s coordinates (x,y,z) actually performs the perspective divide?

Homogeneous spaces can handle a vertex position which is “at infinity,” which is mathematically what a 0 clip-space W represents. Cartesian math cannot. So, before converting to NDC (which is Cartesian), the vertices are clipped, so that all remaining vertices are in front of the camera.

That seems interesting, in fact. Could you give an example? You are talking about a point which would project to the eye, right? Wouldn’t that only happen when the image plane onto which we project the vertex is located at the eye position, in other words when ZnearClipPlane = 0? So, to say things differently: if you can guarantee that the near clipping plane (or the image plane) is ALWAYS some distance from the eye (the centre of projection), I’d think there would never be a situation in which a point is projected to infinity? Am I wrong?

Thanks for your help and for being patient. I REALLY WANT TO UNDERSTAND THIS.

OK, this has moved into a general conversation of spaces and how the perspective projection actually works (rather than just how it gets done in shaders). For that, I really can’t explain things any better than this. Because I wrote it. So if you read through chapters 4 and 5, I would hope you have an understanding of how it works.

That’s correct. Conversion from homogeneous coordinates maps [x,y,z,w] to [x/w,y/w,z/w], so a projection matrix which sets w=-z effectively implements a perspective transformation.
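
To see where that comes from, here is a sketch (plain C++, my own helper names) of the classic symmetric OpenGL perspective matrix; its bottom row is (0, 0, -1, 0), which is exactly what makes the clip-space w equal to -z in eye space:

    #include <cstdio>

    struct Mat4 { float m[4][4]; };          // m[row][col], column vectors
    struct Vec4 { float x, y, z, w; };

    // Symmetric frustum: n/f = near/far distances, r/t = right/top
    // extents of the near plane (as in the classic OpenGL matrix).
    Mat4 perspective(float n, float f, float r, float t) {
        Mat4 P = {};
        P.m[0][0] = n / r;
        P.m[1][1] = n / t;
        P.m[2][2] = -(f + n) / (f - n);
        P.m[2][3] = -2.0f * f * n / (f - n);
        P.m[3][2] = -1.0f;                   // bottom row: w_clip = -z_eye
        return P;
    }

    Vec4 mul(const Mat4& A, const Vec4& v) {
        Vec4 r;
        r.x = A.m[0][0]*v.x + A.m[0][1]*v.y + A.m[0][2]*v.z + A.m[0][3]*v.w;
        r.y = A.m[1][0]*v.x + A.m[1][1]*v.y + A.m[1][2]*v.z + A.m[1][3]*v.w;
        r.z = A.m[2][0]*v.x + A.m[2][1]*v.y + A.m[2][2]*v.z + A.m[2][3]*v.w;
        r.w = A.m[3][0]*v.x + A.m[3][1]*v.y + A.m[3][2]*v.z + A.m[3][3]*v.w;
        return r;
    }

    int main() {
        Mat4 P = perspective(1.0f, 100.0f, 1.0f, 1.0f);
        Vec4 eye = {0.3f, 0.2f, -5.0f, 1.0f};   // point in front of the camera
        Vec4 clip = mul(P, eye);
        std::printf("w_clip = %f\n", clip.w);   // prints 5.0 (= -z_eye)
    }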

It means that the primitive (line, triangle) is clipped to the half-spaces x <= w and x >= -w (i.e. against the planes x=w and x=-w).

Clipping modifies the entire primitive (which is conceptually an infinite set of points), so that it consists only of points which lie inside the clip volume. In practical terms, this means that if all of the primitive’s vertices lie inside the clip volume, the primitive is unchanged by clipping. Vertices which lie outside of the clip volume will be replaced with vertices which lie on the boundary of the clip volume (each such vertex may be replaced with multiple vertices; if two edges share a common vertex, and that vertex lies outside of the clip volume, the replacement vertex will be different for each edge).

Only for a perspective transformation. But clipping doesn’t care about how the clip-space vertices were generated (perspective, orthographic, something else).

They are affine in 3D. But OpenGL works in 4D homogeneous coordinates for everything up to and including the clipping stage. For operations other than perspective projection, the matrices have [0,0,0,1] as the bottom row; for operations other than translation and perspective projection, the matrices have [0,0,0,1] as the right-hand column. When both of these apply, the w coordinate neither affects nor is affected by the transformation, so it can be viewed as a linear 3D transformation (multiplying a 3x3 matrix by a 3x1 vector). Including translation, the operation can be viewed as an affine 3D transformation (i.e. a linear transformation plus addition of a 3x1 vector).

But the calculations are done using 4x4 matrices and 4x1 vectors.
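
For concreteness, a small sketch (plain C++, my own helper) of what the affine case looks like: a translation matrix has bottom row [0,0,0,1], so applied to [x,y,z,w] it adds the offset scaled by w and passes w through unchanged:

    #include <cstdio>

    struct Vec4 { float x, y, z, w; };

    // Equivalent to multiplying by a 4x4 translation matrix:
    // identity 3x3 block, the translation in the right-hand column,
    // [0,0,0,1] as the bottom row.
    Vec4 translate(const Vec4& v, float tx, float ty, float tz) {
        return { v.x + tx * v.w, v.y + ty * v.w, v.z + tz * v.w, v.w };
    }

    int main() {
        Vec4 p = {1.0f, 2.0f, 3.0f, 1.0f};
        Vec4 q = translate(p, 10.0f, 0.0f, 0.0f);
        std::printf("%g %g %g %g\n", q.x, q.y, q.z, q.w);   // 11 2 3 1
    }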

There is no “safety band”. Clipping modifies the primitive so that the clipped primitive lies entirely within the clip volume.

None of this is specific to shaders.

It is not required that w=1. Given a point p=[x,y,z,w], you can divide by its w component to get p’=[x/w,y/w,z/w,1] then transform, or transform then divide the transformed point by its w component. The result is the same either way.

OpenGL’s convention has the matrix as the left operand of the multiplication and a column vector as the right operand (DirectX has a row vector on the left and the matrix is transposed so that any translation is in the bottom row).

If the eye vector’s w component is 1, that’s correct. But more generally, vclip.w = -veye.w * veye.z.

It’s simpler to consider the case of clipping a line segment.

A line segment is an infinite set of points defined by p=p0*(1-t)+p1*t, where t is in the interval [0,1] (i.e. 0<=t<=1) and p0 and p1 are its endpoints. For the purposes of OpenGL’s clipping operation, p, p0 and p1 are 4D homogeneous coordinates in clip space.

When clipping against a half space (i.e. the portion of the space on one side of a given plane; e.g. the plane x=w divides the space into half-spaces x<w and x>w), if both endpoints are within the half-space, the entire line segment is within the half-space, so clipping leaves it unmodified. If both endpoints are outside the half-space, the entire line segment is outside the half-space, and clipping will result in it being discarded.

If one endpoint is inside and one endpoint is outside, then clipping will produce a modified line segment by replacing the endpoint which lies outside the half-space with a new endpoint which lies at the intersection of the line segment and the clip plane. This is done by substituting the above equation for the line into the equation defining the plane to obtain an equation of the form a*t+b=0, solving for t, then substituting that into the equation for the line to obtain the point of intersection.

But as each vertex typically has attributes other than the position, it is necessary to also calculate the correct attribute values for the new vertex. This is done in exactly the same manner as for the position. E.g. if the vertices have texture coordinates, each point on the line has texture coordinates defined similarly to the position, e.g. q=q0*(1-t)+q1*t. The values for the vertex generated by clipping can be obtained by substituting the same value of t that was used for calculating the position.

A triangle (or other convex polygon) can be clipped similarly by clipping each edge as if it was a line segment then (if necessary) joining the vertices generated by clipping with new edges to close the polygon.
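
A minimal sketch of that procedure in plain C++ (the names are mine): clipping a segment against the single half-space x <= w, interpolating a texture coordinate with the same t as the position:

    #include <cstdio>

    struct Vec4 { float x, y, z, w; };
    struct Vec2 { float u, v; };
    struct Vertex { Vec4 pos; Vec2 tex; };   // clip-space position + attribute

    static float lerp(float a, float b, float t) { return a + (b - a) * t; }

    // Clip segment v0-v1 against the half-space x <= w. Returns false
    // if the segment is entirely outside; otherwise replaces the
    // outside endpoint (if any) with the intersection point,
    // interpolating ALL attributes with the same t.
    bool clipAgainstXleW(Vertex& v0, Vertex& v1) {
        float d0 = v0.pos.x - v0.pos.w;   // <= 0 means inside
        float d1 = v1.pos.x - v1.pos.w;
        if (d0 <= 0 && d1 <= 0) return true;    // fully inside: unchanged
        if (d0 > 0 && d1 > 0)   return false;   // fully outside: discard

        // Solve d0 + (d1 - d0)*t = 0 for the boundary crossing.
        float t = d0 / (d0 - d1);
        Vertex hit;
        hit.pos = { lerp(v0.pos.x, v1.pos.x, t), lerp(v0.pos.y, v1.pos.y, t),
                    lerp(v0.pos.z, v1.pos.z, t), lerp(v0.pos.w, v1.pos.w, t) };
        hit.tex = { lerp(v0.tex.u, v1.tex.u, t), lerp(v0.tex.v, v1.tex.v, t) };
        if (d0 > 0) v0 = hit; else v1 = hit;
        return true;
    }

    int main() {
        Vertex a = {{0.0f, 0.0f, 0.0f, 1.0f}, {0.0f, 0.0f}};   // inside
        Vertex b = {{4.0f, 0.0f, 0.0f, 2.0f}, {1.0f, 1.0f}};   // outside
        if (clipAgainstXleW(a, b))
            std::printf("new endpoint: x=%g w=%g u=%g\n",
                        b.pos.x, b.pos.w, b.tex.u);   // x=w=4/3, u=1/3
    }

A full clipper repeats this against each of the six half-spaces (±x, ±y, ±z against w).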

Correct, although the w must be retained because it affects the mapping between the position and the other attribute values.

If you were to simply take a primitive’s vertex positions after perspective projection and divide by w, while leaving all of the attribute values unchanged, rendering the resulting primitive would result in attributes such as texture coordinates being interpolated in screen space rather than in object space (i.e. it wouldn’t be “perspective correct”).

This is why clipping is normally described as occurring in homogeneous clip space (against -w<=x<=w, -w<=y<=w, -w<=z<=w), rather than in NDC (against -1<=x’<=1, -1<=y’<=1, -1<=z’<=1, where [x’,y’,z’] = [x/w,y/w,z/w]).

In effect, you’re not actually clipping against a cube, but against a hyper-pyramid whose projection to NDC is a cube (its base is a 4D cube with vertices (±1,±1,±1,1) and its apex is (0,0,0,0)).

If you’re not entirely comfortable with homogeneous coordinates, it may help to consider the equivalent cases in 2D, where a point [x,y] becomes the line [w*x,w*y,w], which intersects the w=1 plane at [x,y,1]. You can visualise a 3D [x,y,w] space more easily than trying to visualise a 4D [x,y,z,w] space. Points become lines, edges become planes, and clipping against a square becomes clipping against a pyramid.

No. Those have been discarded and (in some cases) replaced by new vertices lying on the boundary of the clip volume.

Don’t be desperate Alfonse. People DO READ ;-). I read your document carefully, and it brought me most of the answers I was looking for. It was very useful and explained things well.

I did regret that you introduced the concept of clip space and the W coordinate without really explaining clearly first what it is (namely, that it is actually set to -z) and why it is different for each vertex, given that clip space is introduced before the concept of perspective projection.

But if you take the time to read the whole set of pages, it unfolds bit by bit. So very good indeed overall.

Thank you again.

Thank you for taking the time to make such a detailed answer. …

I am assuming the perspective projection is the most common/default one :wink:

[QUOTE=GClements;1263620]They are affine in 3D. But OpenGL works in 4D homogeneous coordinates for everything up to and including the clipping stage. For operations other than perspective projection, the matrices have [0,0,0,1] as the bottom row; for operations other than translation and perspective projection, the matrices have [0,0,0,1] as the right-hand column. When both of these apply, the w coordinate neither affects nor is affected by the transformation, so it can be viewed as a linear 3D transformation (multiplying a 3x3 matrix by a 3x1 vector). Including translation, the operation can be viewed as an affine 3D transformation (i.e. a linear transformation plus addition of a 3x1 vector).

But the calculations are done using 4x4 matrices and 4x1 vectors.[/QUOTE]

Yes, certainly. I do get that; however, I have often found that when it is explained that way, it’s rather confusing for people. It is indeed important to make people aware that when they multiply a vec3 by a mat4, what they actually do is “implicitly” use a vec4 whose w coordinate is equal to 1, etc. But anyway…

Sure, I am just talking in the context of the perspective projection here.

Good point, it’s more generic indeed. Thx for noting this.

Excellent, thank you.

[QUOTE=GClements;1263620]Correct, although the w must be retained because it affects the mapping between the position and the other attribute values.

If you were to simply take a primitive’s vertex positions after perspective projection and divide by w, while leaving all of the attribute values unchanged, rendering the resulting primitive would result in attributes such as texture coordinates being interpolated in screen space rather than in object space (i.e. it wouldn’t be “perspective correct”).[/QUOTE]

That’s the bit I am still unclear about. Okay for doing the “clipping” in clip space. But while you explain HOW it is done, let me try to clarify WHY it is done in clip space.

REASON 1) From what Alfonse Reinheart writes in his tutorial, he suggests that ONE of the main reasons is to avoid a possible division by 0 in the perspective divide. By clipping BEFORE the perspective divide is done, you can avoid this case. However, someone suggested to me that clipping also uses something close to the Sutherland clipping algorithm, in which the author showed that the MATHS for the clipping are easier in clip space than in other spaces (or did he JUST show how to do it in clip space?). Not sure if someone could confirm that. Apparently the maths on the GPU are also done not on floating-point numbers but on integers (not sure if this is right either?).

REASON 2) But to come back to your point: while clipping can be done in clip space, I do understand that you need to “interpolate” the other vertex data (if your triangle was clipped). But you suggest that, in fact, this interpolation needs to be done while the geometry is in SOME SORT OF OBJECT space, because once a triangle is projected, the relative “sizes” of the triangle edges with respect to their sizes in object space may have changed? Thus, interpolating vertex data within that space (NDC) would lead to incorrect results? Is that correct?

Very good suggestion, however :wink: I am not sure why the cube becomes a pyramid. Could you please explain how you get to the pyramid shape?

Thanks again SO much.

[QUOTE=mast4as;1263670]
I did regret that you introduced the concept of clip space and the W coordinate without really explaining clearly first what it is (namely, that it is actually set to -z) and why it is different for each vertex[/quote]

Actually, there’s a very important meta point I was making when I did it that way. Namely, that I don’t want you to think of the clip-space W as “actually being set to -z”. First, it’s simply inaccurate: it’s only -cameraZ if you’re doing perspective projection in the traditional style (i.e. camera space at the origin). If you’re doing perspective projection a different way, or are doing orthographic projection, it is something very different.

The more important point I was making is something that the entire book should hopefully teach. Namely, that features are tools, and you should think of them as tools first and foremost. If you are taught to think of, for example, the clip-space W as always being -cameraZ, then it limits how you might find clever uses for the clip-space W.

So I talk about the basic feature first, then explain the most common use of it.

If you were to perform the division by w for the vertex position while leaving all of the other attributes unchanged, interpolation would be in screen space rather than in object space. Similarly, the clipping would generate attributes which are correct for screen-space interpolation rather than object-space interpolation.

I’m not sure that it’s really correct to say that clipping is performed “in” clip space or “in” NDC. The final result must be the same either way.

Rather, the equations are more straightforward if they are derived from the mapping between clip-space positions and other attribute values, rather than from the mapping between NDC positions and other attribute values. And, as Alfonse points out, you don’t have to worry about division by zero.

Any point in 3D space becomes a line through the origin in homogeneous coordinates. Specifically, the point [x,y,z] becomes the line [w*x,w*y,w*z,w] (for all w), which passes through the points [x,y,z,1] and [0,0,0,0]. So almost any 2D or 3D shape becomes some form of hyper-pyramid in homogeneous coordinates, with the original shape forming the base and the origin the apex.
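
A tiny sketch of that equivalence class (plain C++): every multiple [w*x, w*y, w*z, w] of [x,y,z,1] divides back to the same 3D point.

    #include <cstdio>
    #include <initializer_list>

    int main() {
        const float p[3] = {1.0f, 2.0f, 3.0f};   // a 3D point
        for (float w : {0.5f, 1.0f, 2.0f, 10.0f}) {
            float h[4] = {w * p[0], w * p[1], w * p[2], w};
            std::printf("[%g %g %g %g] -> (%g %g %g)\n",
                        h[0], h[1], h[2], h[3],
                        h[0] / h[3], h[1] / h[3], h[2] / h[3]);
        }
    }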

[QUOTE=Alfonse Reinheart;1263672]

The more important point I was making is something that the entire book should hopefully teach. Namely, that features are tools, and you should think of them as tools first and foremost. If you are taught to think of, for example, the clip-space W as always being -cameraZ, then it limits how you might find clever uses for the clip-space W.

So I talk about the basic feature first, then explain the most common use of it.[/QUOTE]

Okay, thanks for clarifying this point, though I would argue that the other way around is better from a pedagogical point of view. I prefer when someone tells me “here is how it works in this particular case” and then moves to the more generic case. It seems to me that if you understand perfectly well how something works within a particular context, it is then easier to see how it applies to different cases and to derive the generic form.

Generic forms are always more abstract than practical or actual cases, and thus harder to teach and less interesting for the student, who is more into getting a result ;-).

Just my 2 cents though. I still very much appreciate your effort.

[QUOTE=GClements;1263677]If you were to perform the division by w for the vertex position while leaving all of the other attributes unchanged, interpolation would be in screen space rather than in object space. Similarly, the clipping would generate attributes which are correct for screen-space interpolation rather than object-space interpolation.

I’m not sure that it’s really correct to say that clipping is performed “in” clip space or “in” NDC. The final result must be the same either way.

Rather, the equations are more straightforward if they are derived from the mapping between clip-space positions and other attribute values, rather than from the mapping between NDC positions and other attribute values. And, as Alfonse points out, you don’t have to worry about division by zero.[/QUOTE]

Sure, though the clipping could also happen in eye space rather than clip space. The equations of the planes making up the frustum are easy to find. The equations for intersecting lines with planes are also easy. Thus we could just use this approach to actually clip triangles if needed (and easily interpolate the triangles’ vertex attributes).

Thus one of my questions still stands: why do the clipping in clip/homogeneous space rather than in Cartesian space (assuming point coordinates are defined in eye space)?

  1. is this because, as Alfonse said, it avoids a possible division by 0? But this is a false argument, because in fact you don’t want to interpolate in screen space anyway, for the reasons you just explained. So in fact we NEVER want to clip geometry after the perspective divide (unless of course you write a renderer that performs shading in screen space and screen space only).

  2. does it somehow simplify calculations?

  3. is it just because vertices are defined with homogeneous coordinates after the perspective projection matrix (at least their w coordinate is likely to be different from 1) that you perform clipping while the vertices are in that state?

  4. … ?

So I am reading the book “Game Engine Architecture” by J. Gregory, and on p. 436 it says:

In clip space, the canonical view volume is a rectangular prism extending from –1 to +1 along the x- and y-axes. Along the z-axis, the view volume extends either from –1 to +1 (OpenGL) or from 0 to 1 (DirectX). We call this coordinate system “clip space” because the view volume planes are axis-aligned, making it convenient to clip triangles to the view volume in this space (even when a perspective projection is being used).

This seems plain wrong to me, or am I missing something again (please HELP ;-)))?

So, it says that clip space IS the canonical viewing volume, and that this is a cube whose min and max extents are (-1,-1,-1) and (1,1,1)?! From our discussion it was clear that, first of all, in 4D it can hardly be described as a cube, and that it’s certainly not bound to the range [-1,1] on any axis at this stage.

It’s only once the points have been divided by w, and thereby remapped to [-1,1], that they are contained in this cube. And even then, what we call a cube is in my opinion misleading. All we really do is remap the x-y coordinates of the point projected onto the image plane to [-1,1], and remap the point’s z coordinate to [0,1] (DirectX) or [-1,1] (OpenGL). So it’s more like a 2D point plus a depth value, but okay, this can be interpreted as a point in a cube if we want.

So am I right to say the definition above is wrong, or am I missing something again?!

  1. is this because, as Alfonse said, it avoids a possible division by 0? But this is a false argument, because in fact you don’t want to interpolate in screen space anyway, for the reasons you just explained. So in fact we NEVER want to clip geometry after the perspective divide (unless of course you write a renderer that performs shading in screen space and screen space only).

Of course, the fact that we do interpolate in screen space makes that irrelevant. The whole point of perspective-correct interpolation is to be able to interpolate within screen space as if you were interpolating in a pre-projection space. And this is something that graphics cards have been able to do since the Voodoo 1 days (rasterization happens in screen-space, and that’s where interpolation happens, so clearly it must be doing this). So there’s no reason to expect that the interpolation in the clipper would be incapable of being perspective-correct.

And thus, interpolation cannot be a reason why clipping needs to be done before the perspective divide.
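
To illustrate that point, here is a minimal sketch (plain C++, my own names) of how perspective-correct interpolation works in screen space: interpolate attribute/w and 1/w linearly, then divide.

    #include <cstdio>

    static float lerp(float a, float b, float t) { return a + (b - a) * t; }

    int main() {
        // Two endpoints of an edge: clip-space w and an attribute u.
        float w0 = 1.0f, u0 = 0.0f;
        float w1 = 4.0f, u1 = 1.0f;
        float t = 0.5f;   // halfway along the edge IN SCREEN SPACE

        // Naive screen-space interpolation (not perspective-correct):
        float naive = lerp(u0, u1, t);                          // 0.5

        // Perspective-correct: interpolate u/w and 1/w, then divide.
        float correct = lerp(u0 / w0, u1 / w1, t)
                      / lerp(1.0f / w0, 1.0f / w1, t);          // 0.2

        std::printf("naive = %g, correct = %g\n", naive, correct);
        // The nearer half of the edge covers more screen area, so the
        // screen-space midpoint corresponds to u = 0.2, not u = 0.5.
    }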

It should also be noted that the OpenGL specification is a specification of behavior, not of implementation. Therefore, it follows the “as if” rule. Namely, the OpenGL spec states that clipping happens before the perspective divide. But if an implementation could implement clipping afterwards, and still make it appear as if it happened before, then that would be a perfectly conforming implementation.

After all, that’s the whole point of guard-band clipping. The implementation doesn’t actually clip such triangles at all. But since it has hardware to make sure you can’t tell the difference… it’s still conforming OpenGL behavior.

So your question doesn’t really matter.

Except that the vertex shader generates positions in clip space, and the implementation itself doesn’t know what the eye-space positions were (if such a thing even exists; if you don’t need eye-space positions for lighting calculations, you can use a combined model-view-projection matrix to convert directly from object space to clip space).

Object space and eye space are also homogeneous. Nothing requires the w component of the object-space vertex position to be 1. Nothing actually requires the model-view matrix to be affine (although fixed-function lighting won’t work correctly if this isn’t the case), meaning that even if the object-space w is 1, the eye-space w need not be.

Vertex positions are homogeneous in object space, eye space and clip space. But in clip space, the clipping planes are fixed. If you performed clipping in object space or eye space, you’d have to transform the clip-space clipping planes by the inverse of the projection or model-view-projection matrix (which assumes that those are actually invertible). And it would mean clipping against arbitrary planes rather than the more convenient case where all of the coefficients are 0, 1 or -1.

Also, it wouldn’t work with shaders. Or rather, it would require the vertex shader to calculate the interpolants itself (i.e. you would need the equivalent of gl_ClipDistance even for the predefined clip planes). And if the shader got it wrong, the end result would be anyone’s guess.

In NDC, it’s a unit cube. In clip space, it’s the homogeneous representation of that cube.

After clipping and division by w (in either order), they are contained within that cube.

Whether it’s actually “wrong” would depend upon the context. Homogeneous coordinates aren’t 4D coordinates, they’re a 4D representation of 3D coordinates. So it’s not exactly “wrong” to talk about the clipped homogeneous coordinates as lying within a cube, as the 3D coordinates which they represent do actually lie within a cube.

Thanks very much, GClements ;=). I think I got all your points.

Because I come from a CPU/RT background and haven’t done any GPU work for a very, very long while, this aspect of the rendering pipeline has always been a complete mystery to me. I think I have made some good progress with your explanations. The idea that “triangles” are projected onto the screen and that, thus, vertex data needs to be interpolated in a particular way is something I hadn’t thought about before, because with algorithms such as ray tracing you always deal with attributes in object space. Which is why I had a hard time understanding this “properly interpolated” issue in the first place.

As for the clip space thingy, your last answer is good. I guess my difficulty is that I am a visual person and like to visualise what things are in 3D space, and the concept of 4D coordinates is abstract to the point where I can’t visualise it properly. Hence the struggle.

Last couple of questions hopefully:

  • what algorithms are used to perform clipping using 4D homogeneous coordinates?
  • I spent a fair amount of time on the OpenGL website. I found the reference card, the core profile specs, etc., but I was wondering (you spoke about the specs yourself) whether there is any document detailing what’s actually expected in terms of behaviour (what the output is supposed to be). I couldn’t find that! I understand the implementation is left to the vendor/driver developer; however, I wonder if such a document exists?
  • Also, is there any GPU vendor that has documented what algorithms they actually implement for things such as clipping, etc.?

Thanks again, you have been incredibly patient and your answers were very valuable.