
Thread: Optimize grid rendering for post T&L cache

  1. #1
    Junior Member Regular Contributor
    Join Date
    Dec 2010
    Location
    Oakville, ON, CA
    Posts
    107

    Optimize grid rendering for post T&L cache

    Hi!
    I am rendering a large terrain (stored as a heightmap texture) using triangle strips that run the full length of the map. The optimization I am going to make is viewing-frustum culling. So instead of drawing the map as a whole, I will draw it piece-by-piece, selecting only the parts trapped inside the viewing volume.

    Those pieces are going to be standard rectangular grids of relatively small size. The question is what the optimal dimensions of those batches would be (9x9 points, 17x17 points...) and what the best drawing algorithm is to ensure that each vertex of a batch is processed only once with the help of the post-transform cache. Are optimizations for the post-T&L cache still relevant at all, or am I just outdated?

    As I understand it, the best way is to draw the grid using strips progressing in a diagonal direction (so the first one would actually be a single triangle, the next will have 3 triangles, and so on). In that case the required capacity of the post-T&L cache will be equal to the number of points in a grid row, right? But how does one determine the size of the post-T&L cache, what is it in general, and what will it be in the near future?
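    To make the traversal concrete, here is an untested sketch of the index-buffer generation I have in mind (the row-major layout, vertex (x,y) -> index y*N+x, is just my own convention):
    Code :
    #include <vector>
     
    //Each quad is split along its anti-diagonal; band k collects every triangle
    //lying between anti-diagonals k and k+1, so the bands contain 1, 3, 5, ...
    //triangles and only two partial diagonals of vertices are live at a time.
    std::vector<int> DiagonalGridIndices(int N){ //N x N vertex grid
     std::vector<int> ib; //plain triangle list (GL_TRIANGLES)
     for(int k=0;k<=2*N-3;k++){
      //"upper" halves of the quads whose lower-left corner is on diagonal k-1
      for(int x=0;x<=k-1&&x<N-1;x++){
       int y=k-1-x; if(y<0||y>=N-1)continue;
       ib.push_back((y+1)*N+x); ib.push_back(y*N+x+1); ib.push_back((y+1)*N+x+1);
      }
      //"lower" halves of the quads whose lower-left corner is on diagonal k
      for(int x=0;x<=k&&x<N-1;x++){
       int y=k-x; if(y<0||y>=N-1)continue;
       ib.push_back(y*N+x); ib.push_back(y*N+x+1); ib.push_back((y+1)*N+x);
      }
     }
     return ib;
    }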
    Last edited by Yandersen; 08-06-2014 at 06:36 PM.

  2. #2
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,220
    Quote Originally Posted by Yandersen View Post
    ...viewing-frustum culling. So instead of drawing the map as a whole, I will draw it piece-by-piece, selecting only the parts trapped inside the viewing volume. Those pieces are going to be standard rectangular grids of relatively small size. The question is what the optimal dimensions of those batches would be (9x9 points, 17x17 points...)
    It depends on your hardware. You're trading culling cost (typically a CPU cost) against batch size (typically a GPU concern, though batch dispatch overhead matters if the batches are too small). Best bet: test and see.

    Are optimizations for the post-T&L cache still relevant at all, or am I just outdated?
    Very relevant.

    As I understand it, the best way is to draw the grid using strips progressing in a diagonal direction (so the first one would actually be a single triangle, the next will have 3 triangles, and so on). In that case the required capacity of the post-T&L cache will be equal to the number of points in a grid row, right? But how does one determine the size of the post-T&L cache, what is it in general, and what will it be in the near future?
    Read these:

    * This page is a humorous but informative introduction.
    * Tom Forsyth's Linear-Speed Vertex Cache Optimisation.
    * Ignacio Castaño's Optimal Grid Rendering. More on that here and Forsyth's comments on that here.
    * Check out the ACMR (average cache miss ratio) stats on Ignacio's page. Also see his spiel on why ACMR might not be the absolute best metric to use (but is better than most), suggesting average transform to vertex ratio (ATVR) instead.
    Last edited by Dark Photon; 08-07-2014 at 06:05 PM.

  3. #3
    Junior Member Regular Contributor
    Join Date
    Dec 2009
    Posts
    212
    Quote Originally Posted by Yandersen View Post
    But how does one determine the size of the post-T&L cache, what is it in general, and what will it be in the near future?
    That question cannot be answered, because there is no simple FIFO post-T&L cache in modern desktop GPUs. A FIFO cache is inherently serial, and GPUs like to work in parallel, so you can't say the size of the cache is X.

    I think the common approach used today is to split the primitive stream into multiple "packets", so that the number of distinct vertices within every packet is less than or equal to the GPU's workgroup size, and then those packets are transformed independently. So you get full sharing of vertices within a packet, but no sharing between packets.
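    For illustration only - this is my own simplification, not anything vendor-documented - a CPU-side emulation of that packet model could look like this:
    Code :
    #include <set>
    #include <vector>
     
    //Greedy packet split: each packet may reference at most PacketSize distinct
    //vertices; every distinct vertex in a packet costs exactly one transform,
    //and nothing is shared across packet boundaries.
    int CountPacketTransforms(const std::vector<int>& ib, unsigned PacketSize){
     int transforms=0;
     std::set<int> packet; //distinct vertices referenced by the current packet
     for(size_t i=0;i+2<ib.size();i+=3){ //plain triangle list: 3 indices at a time
      std::set<int> next(packet);
      next.insert(ib[i]); next.insert(ib[i+1]); next.insert(ib[i+2]);
      if(next.size()>PacketSize){ //the triangle doesn't fit: close the packet
       transforms+=(int)packet.size();
       packet.clear();
       packet.insert(ib[i]); packet.insert(ib[i+1]); packet.insert(ib[i+2]);
      }else packet.swap(next);
     }
     transforms+=(int)packet.size(); //flush the last packet
     return transforms;
    }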

  4. #4
    Junior Member Regular Contributor
    Join Date
    Dec 2010
    Location
    Oakville, ON, CA
    Posts
    107
    Quote Originally Posted by Dark Photon View Post
    Read these:

    * This page is a humorous but informative introduction.
    * Tom Forsyth's Linear-Speed Vertex Cache Optimisation.
    * Ignacio Castaño's Optimal Grid Rendering. More on that here and Forsyth's comments on that here.
    * Check out the ACMR (average cache miss ratio) stats on Ignacio's page. Also see his spiel on why ACMR might not be the absolute best metric to use (but is better than most), suggesting average transform to vertex ratio (ATVR) instead.
    Thanks, Photon, I've been through all of these top Google results already. I was just wondering whether this info might be outdated, so I posted the question to check if anything has changed since those were published.
    Forsyth's general optimization technique can't do better than an algorithm developed specifically for a particular well-defined situation (a regular grid in my case), would you agree? And Castaño's proposal of reversing the strips' direction is not as good as diagonal drawing if the post-T&L cache can fit N vertices, where N is the width of the grid in vertices. In other words, the width of a diagonally-drawn grid should be equal to the quantity of transformed vertices the cache can fit, so the only question left is what that size is and how to determine it.
    However, the question becomes blurry considering that:

    Quote Originally Posted by mbentrup
    That question cannot be answered, because there is no simple FIFO post-T&L cache in modern desktop GPUs. A FIFO cache is inherently serial, and GPUs like to work in parallel, so you can't say the size of the cache is X.

    I think the common approach used today is to split the primitive stream into multiple "packets", so that the number of distinct vertices within every packet is less than or equal to the GPU's workgroup size, and then those packets are transformed independently. So you get full sharing of vertices within a packet, but no sharing between packets.
    Have to admit, I haven't dug that deep into the new functionality yet... Does it mean that the effective size of the cache may vary depending on how the user configures the rendering pipeline?
    Last edited by Yandersen; 08-09-2014 at 01:53 PM.

  5. #5
    Senior Member OpenGL Guru Dark Photon's Avatar
    Join Date
    Oct 2004
    Location
    Druidia
    Posts
    3,220
    Quote Originally Posted by Yandersen View Post
    Thanks, Photon, I've been through all of these top Google results already. I was just wondering whether this info might be outdated, so I posted the question to check if anything has changed since those were published.
    Vertex cache optimization is still very relevant, but the main thing that's changed (AFAIK) is that there is no longer dedicated memory for the post-T&L (post-vertex-shader) vertex cache on the GPU. Circa 2007, when the GPU world converted to generic stream processors, shared memory was put on the cards to store/load intermediate computations, including transformed vertices. The more varyings (interpolators) output by the vertex shader, the fewer vertices you can fit in your post-vertex-shader (transformed-vertex) cache.

    However, even back in the day you had a whole 16KB of space per GPU multiprocessor (now it's up to 64KB), so space for storing transformed vertices is not really a problem. And even before GPUs flipped to generic stream processors, the fixed-function GPUs of the time had a post-vertex-shader cache size of 45 vertices. That's "plenty" big enough to get near-optimal performance. With a good triangle indexer, the incremental benefit of increasing the cache size is very small.
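    For a rough feel of the numbers (my own back-of-envelope assumptions, not vendor specs): if each transformed vertex carries a clip-space position plus 7 vec4 varyings, that's 8 * 16 = 128 bytes per vertex, so:
    Code :
    16384 B (16KB) / 128 B per vertex = 128 transformed vertices per multiprocessor
    65536 B (64KB) / 128 B per vertex = 512 transformed vertices per multiprocessor
    Either way, that's comfortably above the old fixed-function cache sizes.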

    Forsyth's general optimization technique can't do better than an algorithm developed specifically for a particular well-defined situation (a regular grid in my case), would you agree?
    It's been a while since I've implemented Forsyth's algorithm, but I think you're probably right because you're employing information his algorithm doesn't know about (and doesn't want to assume) to maximize reuse. However, the whole point is moot if your rendering is not vertex transform bound. I suspect even if you are, your performance won't be that far from what you can get applying his technique, but try both and see! It'll be fun to check this.

    And Castaño's proposal of reversing the strips' direction is not as good as diagonal drawing if the post-T&L cache can fit N vertices, where N is the width of the grid in vertices.
    Do you really know what the size of the cache is? Or do you want to impose requirements on what it must at least be?

    If not, that's the benefit of a cache-size-oblivious method. It's not going to just fall on its face if you happen to run on the wrong GPU.

    Does it mean that the effective size of the cache may vary depending on how the user configures the rendering pipeline?
    Yes. Related to what I was talking about above, it depends on which GPU you're running on (how much GPU shared memory per SM), potentially which driver you're running (i.e. how it chooses to divvy up that shared memory), and how much varying data your vertex shader outputs (which you have complete control over).
    Last edited by Dark Photon; 08-09-2014 at 04:42 PM.

  6. #6
    Junior Member Regular Contributor
    Join Date
    Dec 2009
    Posts
    212
    Quote Originally Posted by Yandersen View Post
    Have to admit, I haven't dug that deep into the new functionality yet... Does it mean that the effective size of the cache may vary depending on how the user configures the rendering pipeline?
    Most DX10+ GPUs have multiple processing units, so draw batches are split into smaller batches and each is processed on a different unit. There is caching/sharing of vertices within a processing unit, but not between different processing units. So if you render a batch of 100 triangles and it gets split 50/50, then triangle 50 would have all vertices of triangles 1-49 cached, but triangle 51 would have none of those vertices accessible in the cache.

    So you can no longer say that the last "x" vertices are cached. However, in general it is still useful to optimize the vertices, because you benefit from the caching within the split batches.

  7. #7
    Junior Member Regular Contributor
    Join Date
    Dec 2010
    Location
    Oakville, ON, CA
    Posts
    107
    Thank you, Dark Photon! Still, IMO Forsyth's general algorithm is better than Castaño's approach. I'll probably use it, because my diagonal drawing will work only if all of a segment's vertices are processed by a single unit; if any vertices are sent to another one, it will break the chain.
    Quote Originally Posted by mbentrup View Post
    Most DX10+ GPUs have multiple processing units, so draw batches are split into smaller batches and each is processed on a different unit. There is caching/sharing of vertices within a processing unit, but not between different processing units. So if you render a batch of 100 triangles and it gets split 50/50, then triangle 50 would have all vertices of triangles 1-49 cached, but triangle 51 would have none of those vertices accessible in the cache.

    So you can no longer say that the last "x" vertices are cached. However, in general it is still useful to optimize the vertices, because you benefit from the caching within the split batches.
    I can see how triangle processing could be split and shared in the case of a plain triangle array, but I don't understand how it is possible at all with an index buffer containing primitive restart indices - splitting the triangles in that case would require scanning and mapping the buffer. Do I need to move from strips with PRI to indexed triangles and disable PRI in order to get my triangles automatically distributed between the processing units? Or do I understand it wrong, and that functionality must be configured manually in some way?
    Last edited by Yandersen; 08-10-2014 at 06:48 AM.

  8. #8
    Junior Member Regular Contributor
    Join Date
    Dec 2010
    Location
    Oakville, ON, CA
    Posts
    107
    I was going to implement Tom Forsyth's optimization, but found the problem too interesting for copy-pasting code - I want to "reinvent" it on my own.
    But before I start, I want to understand exactly how the vertex cache works.

    Say I rendered a triangle with indices {1,2,3}, so they are all in the cache now.
    Q1: does the cache look like {3,2,1}?
    Then another triangle with vertices {4,2,3} comes in.
    Q2: what does the cache look like now? Is it {4, 3,2,1} or {3,2,4, 1}?
    In other words, what happens to reused vertices that already reside in the cache - are they relocated to the top of the cache, or do they keep their previous positions while only the new vertices are added to the top?

    So far I assume that the vertex cache corresponds to a simple FIFO model. If so, then a class emulating it should look like this:
    Code :
    //The class emulating the GPU's post-T&L vertex cache as a simple FIFO
    #include <cstdlib> //for malloc/free
     
    class CVertexCache{
     
     unsigned int Size; //Quantity of indices the cache can fit
     int* Indices;      //FIFO buffer storing the cached vertex indices
     
    public:
     
     int VerticesProcessed;   //Stats: total quantity of indices fed in
     int VerticesTransformed; //Stats: quantity of indices that missed the cache
     
    private:
     
     //Do not allow copying as there is a dynamic array inside
     void operator=(const CVertexCache&);
     CVertexCache(const CVertexCache&);
     
    public:
     
     //Destructor frees the dynamic array
     ~CVertexCache(){ free(Indices); }
     
     //Check the index-capacity of the cache
     int GetSize()const{ return Size; }
     //Set the desired cache capacity; returns true unless malloc fails
     bool SetSize(unsigned int QuantityOfIndexSlots){
      free(Indices); Indices=NULL; Size=0;
      if(!QuantityOfIndexSlots)return true;
      if((Indices=(int*)malloc(QuantityOfIndexSlots*sizeof(int)))==NULL)return false;
      Size = QuantityOfIndexSlots;
      for(unsigned i=0;i<Size;i++)Indices[i]=-1; //-1 marks an empty slot
      return true;
     }//SetSize---------------------------------------------------------------------
     
     //Default constructor
     CVertexCache():Size(0),Indices(NULL),VerticesProcessed(0),VerticesTransformed(0){}
     //Constructor that allows specifying the cache size at initialization
     CVertexCache(unsigned int QuantityOfIndexSlots)
     :Size(0),Indices(NULL),VerticesProcessed(0),VerticesTransformed(0){
      SetSize(QuantityOfIndexSlots);
     }//Constructor-----------------------------------------------------------------
     
     //Empty the cache buffer and reset the statistics
     void Clear(){
      VerticesProcessed=0; VerticesTransformed=0;
      for(unsigned i=0;i<Size;i++)Indices[i]=-1;
     }//Clear-----------------------------------------------------------------------
     
     //Return the index value stored in the specified slot; -1 is returned for
     //out-of-bounds requests or if the specified slot is empty
     int operator[](unsigned int SlotID)const{
      if(SlotID>=Size)return -1; //SlotID is unsigned, so no negative check needed
      return Indices[SlotID];
     }//Operator[]------------------------------------------------------------------
     
     //Return a number in the range [0.0...1.0] representing the relative position
     //of the given vertex in the cache: 1.0 corresponds to the newest slot,
     //1.0/Size to the oldest one, and 0.0 is returned if the index is not cached
     double GetPosition(int Index)const{
      for(unsigned i=0;i<Size;i++)if(Indices[i]==Index)
       return double(Size-i)/double(Size);
      return 0.0;
     }//GetPosition-----------------------------------------------------------------
     
     //Feed in a new index; on a cache miss it is pushed into the FIFO buffer,
     //shifting all stored indices by one slot; the restart index is silently
     //ignored (-1 has the same 32-bit pattern as 0xffffffff)
     void InputVertex(int Index, int PrimitiveRestartIndexValue=-1){
      if(Index==PrimitiveRestartIndexValue)return;
      VerticesProcessed++;
      if(!Size)return;
      for(unsigned i=0;i<Size;i++)if(Indices[i]==Index)return; //cache hit
      VerticesTransformed++; //cache miss: the vertex shader would run here
      for(unsigned i=Size-1;i>0;i--)Indices[i]=Indices[i-1];
      Indices[0]=Index;
     }//InputVertex-----------------------------------------------------------------
     
     //Run a whole index buffer through the cache; refer to the stats afterwards;
     //the returned value tells how many times, on average, each distinct vertex
     //was transformed (1.0 is the theoretical optimum)
     double Process(const int* IndexBuffer, unsigned int IndexCount,
                    int PrimitiveRestartIndexValue=-1){
      Clear();
      if(!IndexBuffer||!IndexCount)return 0.0;
      unsigned VertexCount=0; //quantity of distinct vertices referenced
      for(unsigned i=0;i<IndexCount;i++){
       int Index = IndexBuffer[i];
       if(Index==PrimitiveRestartIndexValue)continue;
       InputVertex(Index,PrimitiveRestartIndexValue);
       //count each distinct index once, at its last occurrence (O(n^2) scan)
       unsigned s = i;
       for(unsigned v=i+1;v<IndexCount;v++)if(IndexBuffer[v]==Index)s=v;
       if(s==i)VertexCount++;
      }
      if(!VertexCount)return 0.0;
      return double(VerticesTransformed)/double(VertexCount);
     }//Process---------------------------------------------------------------------
     
    };//CVertexCache________________________________________________________________

    As I understand post-T&L-targeted mesh optimization, the goal is to have each vertex transformed no more than once. Therefore my class CVertexCache has a function Process() which emulates the GPU's behavior on a given index buffer and returns the ratio of transformation operations to the total quantity of distinct vertices the buffer references. According to my check of a regular grid with row-progressing triangles, the average transformation ratio is close to 2 per vertex even with a cache size as small as 4. That means that ideal optimization of such a mesh will make rendering at most 2 times faster. It does not seem to be a really big benefit, IMHO...
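    For reference, the check was done roughly like this (untested sketch; RowOrderGridIndices is just my helper building a row-by-row triangle list, and CVertexCache is the class above):
    Code :
    #include <cstdio>
    #include <vector>
     
    //Build a row-by-row triangle-list index buffer for a W x H vertex grid
    std::vector<int> RowOrderGridIndices(int W,int H){
     std::vector<int> ib;
     for(int y=0;y<H-1;y++)for(int x=0;x<W-1;x++){
      int i=y*W+x;
      ib.push_back(i);   ib.push_back(i+1);   ib.push_back(i+W); //lower triangle
      ib.push_back(i+1); ib.push_back(i+W+1); ib.push_back(i+W); //upper triangle
     }
     return ib;
    }
     
    int main(){
     std::vector<int> ib=RowOrderGridIndices(64,64);
     CVertexCache cache(4); //tiny 4-entry FIFO
     //prints a ratio close to 2.0 for this row-progressing layout
     printf("%f\n",cache.Process(&ib[0],(unsigned)ib.size()));
     return 0;
    }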
    Last edited by Yandersen; 08-14-2014 at 06:58 PM.

  9. #9
    Junior Member Regular Contributor
    Join Date
    Dec 2010
    Location
    Oakville, ON, CA
    Posts
    107
    Oh, it seems I now understand the grid-rendering problem well enough to come to a conclusion. Optimal grid rendering can be achieved only if the number of points in the grid is no higher than the number of vertices the post-transform cache can hold. If so, the drawing must be performed in the diagonal direction, starting from one corner and progressing to the opposite corner of the mesh. As the size of the cache is typically very small (~24, right?), the solution is to divide the grid into small batches and render them individually.

    The goal is to have each vertex transformed only once. For each grid batch this is true for all points except the border ones: each edge is shared between two neighbouring batches, so those vertices are processed twice in total; each of the 4 corner points is shared between 4 batches, so it is processed 4 times. Let's calculate how many transformations are made in total and divide that by the number of vertices in a batch - this tells us how many transformation operations are done per individual vertex (on average). In the best theoretical case (a post-T&L cache of unlimited size) this value is 1 (each vertex processed only once). Let's see how fast the number of redundant transformations grows as the cache capacity shrinks:

    GridWidthMax = CacheSize;
    TotalVertices = GridWidthMax*GridWidthMax = CacheSize*CacheSize;
    VerticesTransformed = { 1 * (GridWidthMax-2)^2 } + { 2 * (GridWidthMax-2) * 4 } + { 4 * 4 } = (CacheSize+2)^2
    TransformationsPerVertexAvg = VerticesTransformed / TotalVertices = { (CacheSize+2)^2 } / { CacheSize^2 } = (1+2/CacheSize) ^ 2

    Now let's apply that formula to some cache sizes:

    Code :
    Cache size: | TransformationsPerVertex:
    4           |  2.25
    8           |  1.56
    12          |  1.36
    16          |  1.27
    20          |  1.21
    24          |  1.17
    28          |  1.15
    32          |  1.13
    36          |  1.11
    40          |  1.10
    1024        |  1.004
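    (For the record, the table was produced with a trivial snippet along these lines:)
    Code :
    #include <cstdio>
     
    int main(){
     int sizes[]={4,8,12,16,20,24,28,32,36,40,1024};
     for(int i=0;i<11;i++){
      double c=sizes[i];
      printf("%-11d |  %.4g\n",sizes[i],(1.0+2.0/c)*(1.0+2.0/c)); //(1+2/CacheSize)^2
     }
     return 0;
    }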

    Looking at the table, I see almost no benefit from caches larger than 16 vertices. Given that this is a bit lower than what the average card has, the answer for the topic, IMO, should be: divide the grid into smaller batches of 16x16 points and render each continuously in a diagonal fashion.

    I think a grid-specific optimization algorithm could take cachesize==16 as a predefined constant and do well with it.
    Last edited by Yandersen; 08-14-2014 at 09:04 PM.

  10. #10
    Senior Member OpenGL Pro Aleksandar's Avatar
    Join Date
    Jul 2009
    Posts
    1,158
    I'm sorry if I'm rude, but I think you are wasting your time.

    The post-transform cache size varies from GPU to GPU and from generation to generation. There is no point in making something that works well only on your machine (or maybe there is?). That's aside from the fact that a 16-vertex batch is ridiculously small.

    All algorithms used in the last decade are cache-oblivious. They are usually based on the assumption of a (modified) LRU policy. Try to find Mark Kilgard's "Modern OpenGL Usage: Using Vertex Buffer Objects Well" from 2008. It is also an old document, but it will serve to explain the basic concepts.

    Forsyth's algorithm is quite old, and GPU architecture has changed a lot since those days. Many years ago I tried implementing several algorithms for optimizing for the post-transform vertex cache, and while they had some impact on G80 GPUs, there was no visible improvement on GF100 (Fermi). What I've been using since then is regular strip-like drawing with primitive restart at the end of each strip, and it serves me well; the sketch below shows what I mean.
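    Concretely, that drawing boils down to something like this (sketch; gridVAO and indexCount are placeholders for your own setup):
    Code :
    //One index buffer holds every row of the grid as a triangle strip,
    //with the rows separated by the restart index 0xFFFFFFFF.
    glEnable(GL_PRIMITIVE_RESTART);
    glPrimitiveRestartIndex(0xFFFFFFFFu);
    glBindVertexArray(gridVAO); //placeholder VAO: vertex attributes + index buffer
    glDrawElements(GL_TRIANGLE_STRIP, indexCount, GL_UNSIGNED_INT, 0);
    glDisable(GL_PRIMITIVE_RESTART);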
