"gl-streaming" - Lightweight OpenGL ES command streaming framework

Hi,
I wrote an OpenGL ES command streaming framework for embedded systems - “gl-streaming”.

It is intended to make it possible to run OpenGL programs on an embedded system which has no GPU.

It is a server-client execution model for OpenGL, similar in function to GLX, but it is completely independent of the X server and GLX, so it runs on embedded systems which don't support X and GLX.

The client runs an OpenGL program, but does not execute the OpenGL commands itself.
It simply sends the OpenGL commands to the server over the network, so the client system does not need to have a GPU.

The server receives the OpenGL commands, executes them, and displays the graphics on the monitor connected to the server system.
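For illustration, a client-side stub boils down to something like this (a simplified sketch, not the exact structs used in the real code; send_to_server() stands in for the actual transport):


#include <stdint.h>
#include <GLES2/gl2.h>

/* Client-side stub: instead of calling into a GPU driver, serialize the
   command into a small struct and ship it to the server over the network. */
typedef struct {
  uint32_t   cmd;   /* command id, e.g. GLSC_glClear */
  GLbitfield mask;  /* the call's argument(s) */
} gls_glClear_t;

GL_APICALL void GL_APIENTRY glClear(GLbitfield mask)
{
  gls_glClear_t c = { GLSC_glClear, mask };
  send_to_server(&c, sizeof(c));  /* hypothetical network send */
}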

gl-streaming is…

“fast”
It runs at 60 frames per second.

“simple”
The source tarball is under 30 KB!

“lightweight”
The gl_server consumes only 2 MB of RAM!

“low latency”
Its performance is suitable for gaming.

source code & demo

If you are interested in this project, please post a comment.

Thank you.

Looks extremely interesting!

This does look fine indeed :slight_smile:

What about the possibility of transferring the graphics output from the server side back to the client side, via something like a glse_SwapBuffers() call at the end of each frame?
(so that each GPU picture computed on the server side can be displayed back on the screen of a client that doesn't have a GPU)

And how about merging set_server_address/set_server_port and set_client_address/set_client_port?


void set_server_address_port(server_context_t *c, char *addr, uint16_t port)
{
  strncpy(c->server_thread_arg.addr, addr, sizeof(c->server_thread_arg.addr));
  c->server_thread_arg.port = port;
}

void set_client_address_port(server_context_t *c, char *addr, uint16_t port)
{
  strncpy(c->popper_thread_arg.addr, addr, sizeof(c->popper_thread_arg.addr));
  c->popper_thread_arg.port = port;
}


About glse_SwapBuffers(), I saw this in gl_client/main.c:


gls_cmd_get_context();
gc.screen_width = glsc_global.screen_width;
gc.screen_height = glsc_global.screen_height;
printf("width:%d height:%d\n", glsc_global.screen_width, glsc_global.screen_height);
init_gl(&gc);

=> perhaps we could add some minimalistic GLUT-like calls, so that a screen_width and screen_height adapted to the size of the client's screen can be set for copying each GPU picture (computed on the server side) back to the client side? Something like the sketch below.
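A minimal sketch of what such a call might look like (glsClientWindowSize(), the graphics_context_t type name and the gls_cmd_resize() helper are all hypothetical, purely for illustration):


/* Hypothetical helper: make the server render at the client's screen size,
   so the read-back picture maps 1:1 onto the client display. */
void glsClientWindowSize(graphics_context_t *gc, int width, int height)
{
  gc->screen_width  = width;
  gc->screen_height = height;
  gls_cmd_resize(width, height);  /* hypothetical: ask the server to resize its EGL surface */
}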

Thanks!
The gl_client program can be run on an ordinary PC, so please try it if you have a Raspberry Pi. :slight_smile:

[QUOTE=The Little Body;1255697]What about the possibility of transferring the graphics output from the server side back to the client side, via something like a glse_SwapBuffers() call at the end of each frame?

And how about merging set_server_address/set_server_port and set_client_address/set_client_port?[/QUOTE]

Thanks!
It's good to simplify it like that. I'll improve it.

It's possible for gl_client to get the rendered image back, in the same way glGenBuffers() in glclient.c already retrieves return values from the server.
But bandwidth may be a challenge. :slight_smile:
I'll try to implement glReadPixels and check the performance.
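For a rough idea of the numbers: an uncompressed 1280x720 RGBA framebuffer at 60 fps is 1280 × 720 × 4 bytes × 60 ≈ 221 MB/s, roughly 1.8 Gbit/s, far more than the Raspberry Pi's 100 Mbit Ethernet can carry.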

To minimize the amount of data needed to transmit the rendered image over the network, we can:

1) compress the rendered image on the server side
2) transmit the compressed image over the network
3) decompress the compressed image on the client side

For example, each image can be compressed to JPEG if the client can handle that format directly in hardware
(wavelet/filtering + Huffman compression can be employed instead if the client doesn't have the hardware needed to handle JPEG pictures)

Or use MPEG, MJPEG or another motion-picture compression scheme if the client side has the hardware needed to handle one of them
(on the PSP platform, for example, we can use the MJPEG hardware decompression engine)
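As an illustration of step 1, here is a minimal sketch of server-side JPEG compression, assuming the TurboJPEG API from libjpeg-turbo is available (the compress_frame() wrapper and the RGBA frame layout are my assumptions, not gl-streaming's actual buffer format):


#include <turbojpeg.h>

/* Compress one RGBA frame to JPEG before sending it to the client.
   Returns the JPEG size in bytes (0 on failure); *jpeg_buf is
   allocated by TurboJPEG and must be released with tjFree(). */
unsigned long compress_frame(const unsigned char *rgba,
                             int width, int height,
                             unsigned char **jpeg_buf)
{
  tjhandle tj = tjInitCompress();
  unsigned long jpeg_size = 0;

  if (tj == NULL)
    return 0;
  if (tjCompress2(tj, rgba, width, 0 /* pitch: width * 4 */, height,
                  TJPF_RGBA, jpeg_buf, &jpeg_size,
                  TJSAMP_420, 85 /* quality */, TJFLAG_FASTDCT) != 0)
    jpeg_size = 0;
  tjDestroy(tj);
  return jpeg_size;
}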

[QUOTE=The Little Body;1255707]To minimize the amount of data needed to transmit the rendered image over the network, we can:

1) compress the rendered image on the server side
2) transmit the compressed image over the network
3) decompress the compressed image on the client side[/QUOTE]

The Raspberry Pi has a hardware-accelerated h264 decoder and encoder, so these may be useful for reducing transfer bandwidth.
But an h264 encoder usually introduces long latency, and reading data back from GPU memory is usually very slow.
Furthermore, h264 decoding is a very heavy task for non-accelerated clients.
So MJPEG may be more efficient in some cases.

JPEG compression is lossy, and something like gl-streaming can’t automatically know whether this is acceptable. You’d need a glHint() (or a custom equivalent) to allow the application to control whether lossy compression can be used. Clearly, it shouldn’t be used for depth or stencil buffers, or integer colour buffers. In general, JPEG compression is a poor fit for anything with hard edges (e.g. wireframe).

MJPEG is just a container for multiple distinct JPEG frames. It doesn’t offer any additional compression over JPEG itself.
MPEG offers significant compression, but is also lossy, and requires that the frames are samples of a single animated image (also, the relative timing must be known for motion estimation to be useful). Consecutive calls to glReadPixels() don’t have to use the same source rectangle, nor can you be sure that they even refer to the same “scene” (the client is free to use a framebuffer as a “scratch” area for e.g. generating texture data). IOW, consecutive calls to glReadPixels() don’t necessarily constitute a single video stream.
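To make the glHint() idea above concrete, a custom hint could look something like this (the GLS_FRAME_LOSSY_HINT token is hypothetical, its value picked arbitrarily for illustration, and the gl-streaming client would have to intercept it rather than forward it to the real GL):


/* Hypothetical gl-streaming hint target for frame read-back compression. */
#define GLS_FRAME_LOSSY_HINT 0x8F60

glHint(GLS_FRAME_LOSSY_HINT, GL_FASTEST);  /* lossy (e.g. JPEG) is acceptable */
glHint(GLS_FRAME_LOSSY_HINT, GL_NICEST);   /* frames must be bit-exact */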

We could limit the compression to the Huffman stage only, so as to have a lossless compression.

And I think that an RGB[A] to YCbCr[A] colour-space conversion, instead of compressing/decompressing directly in an RGB[A] colour space, can slightly improve the compression ratio of the Huffman stage.

Note that with a relatively gentle quantization step (one that only drops one or two bits of the standard 8-bit depth of each colour component, for example), the Huffman compression can be much more efficient
(this loses a little precision, but I don't think it would really be distinguishable). A sketch of the conversion and quantization follows below.
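A minimal sketch of that conversion and quantization for one pixel, using fixed-point approximations of the standard JFIF coefficients (the function name and the choice of dropping 2 bits are mine):


#include <stdint.h>

/* Convert one RGB pixel to YCbCr (JFIF full-range coefficients, scaled by
   256) and drop the 2 low bits of each component, so the Huffman stage
   sees fewer distinct symbols. Relies on arithmetic right shift of the
   negative intermediates, as on gcc. */
static void rgb_to_ycbcr_quantized(uint8_t r, uint8_t g, uint8_t b,
                                   uint8_t *y, uint8_t *cb, uint8_t *cr)
{
  int yy  =        (( 77 * r + 150 * g +  29 * b) >> 8); /* ~0.299/0.587/0.114 */
  int cbv = 128 + ((-43 * r -  85 * g + 128 * b) >> 8);  /* ~-0.169/-0.331/0.5 */
  int crv = 128 + ((128 * r - 107 * g -  21 * b) >> 8);  /* ~0.5/-0.419/-0.081 */

  *y  = (uint8_t)yy  & ~0x03;  /* keep 6 of 8 bits */
  *cb = (uint8_t)cbv & ~0x03;
  *cr = (uint8_t)crv & ~0x03;
}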

You seem to be assuming that the pixel data only needs to be “visually” correct, but that isn’t always the case. It’s entirely possible that the data will be subject to additional processing which can magnify any errors. If you’re going to add lossy compression (even if it’s only “slightly” lossy), it needs to be optional.

I also see in gl_client/glclient.c that a lot of the forwarded gl* functions systematically have a very similar first line of code:


gls_glBindBuffer_t *c = (gls_glBindBuffer_t *)(glsc_global.tmp_buf.buf + glsc_global.tmp_buf.ptr);
gls_glBlendFuncSeparate_t *c = (gls_glBlendFuncSeparate_t *)(glsc_global.tmp_buf.buf + glsc_global.tmp_buf.ptr);
...
gls_glDrawElements_t *c = (gls_glDrawElements_t *)(glsc_global.tmp_buf.buf + glsc_global.tmp_buf.ptr);
gls_glBindAttribLocation_t *c = (gls_glBindAttribLocation_t *)(glsc_global.tmp_buf.buf + glsc_global.tmp_buf.ptr);

=> I think this can be replaced with a more generic #define pattern like this (the ## token-pasting operator is needed so that FUNC is substituted inside the type name):


#define GLS_POINTER(PTR, FUNC) gls_##FUNC##_t *PTR = (gls_##FUNC##_t *)(glsc_global.tmp_buf.buf + glsc_global.tmp_buf.ptr)

which you can then use as the first line of a lot of your forwarded gl* functions:


GLS_POINTER(c, glBindBuffer);
GLS_POINTER(c, glBlendFuncSeparate);
...
GLS_POINTER(c, glDrawElements);
GLS_POINTER(c, glBindAttribLocation);

The same goes for the push_batch_command() calls, which can use this #define, for example:


#define GLS_PUSH(FUNC) push_batch_command(sizeof(gls_##FUNC##_t))

so that the push_batch_command() calls are generated by the #define:


GLS_PUSH(glBindBuffer);
...
GLS_PUSH(glUniform1f);
...
GLS_PUSH(command);

This makes the source code of a forwarded call such as glBindBuffer() much more compact:


GL_APICALL void GL_APIENTRY glBindBuffer (GLenum target, GLuint buffer)
{
  GLS_POINTER(c, glBindBuffer);
  c->cmd = GLSC_glBindBuffer;
  c->target = target;
  c->buffer = buffer;
  GLS_PUSH(glBindBuffer);
}


By default, we could apply no compression at all to the picture transmission, and add one or more optional command-line argument(s) to select the type of compression used between the client and the server, for example, as in the sketch below.
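A minimal sketch of such an option using getopt_long(); the --compress flag and the mode names are hypothetical:


#include <getopt.h>
#include <string.h>

enum { COMPRESS_NONE, COMPRESS_JPEG, COMPRESS_MJPEG };

/* Parse an optional -c/--compress argument; the default is no compression. */
static int parse_compression(int argc, char **argv)
{
  static const struct option opts[] = {
    { "compress", required_argument, NULL, 'c' },
    { NULL, 0, NULL, 0 }
  };
  int mode = COMPRESS_NONE, ch;

  while ((ch = getopt_long(argc, argv, "c:", opts, NULL)) != -1) {
    if (ch == 'c') {
      if (strcmp(optarg, "jpeg") == 0)  mode = COMPRESS_JPEG;
      if (strcmp(optarg, "mjpeg") == 0) mode = COMPRESS_MJPEG;
    }
  }
  return mode;
}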

@shodruk,

Can this work from one Linux box to another Linux box, so that I can run some tests?
(I don't have a Raspberry Pi :frowning: )

The GLS_POINTER / GLS_PUSH scheme can be enhanced a little more to make the source code even more compact:
(with no change at all to the size or speed of the binary executable code, because this only uses #define)


#define GLS_PTR_FUNC(PTR, FUNC) gls_##FUNC##_t *PTR = (gls_##FUNC##_t *)(glsc_global.tmp_buf.buf + glsc_global.tmp_buf.ptr); \
  PTR->cmd = GLSC_##FUNC

#define GLS_PUSH_FUNC(FUNC) push_batch_command(sizeof(gls_##FUNC##_t))


GL_APICALL void GL_APIENTRY glBindBuffer (GLenum target, GLuint buffer)
{
  GLS_PTR_FUNC(c, glBindBuffer);

  c->target = target;
  c->buffer = buffer;

  GLS_PUSH_FUNC(glBindBuffer);
}

That is an interesting point.
It gives me the idea of using this library not only for rendering but also for GPGPU.
Fortunately, it is theoretically possible to run gl_server on many hosts.
This would make it possible to build a GPGPU supercomputer out of a number of Raspberry Pis! :surprise:

The Little Body,

Thanks!
I’m ashamed that I didn’t use that macro.
Maybe I love copy&paste. :smiley:

[QUOTE=The Little Body;1255732]Can this work from one Linux box to another Linux box, so that I can run some tests?
(I don't have a Raspberry Pi :frowning: )[/QUOTE]

gl_client runs on a normal PC, but gl_server doesn't, because the library is hard-wired to the RPi's headers and uses OpenGL ES 2.0 for now.
I'll make a PC & OpenGL version later.

[QUOTE=shodruk;1255735]gl_client runs on a normal PC, but gl_server doesn't, because the library is hard-wired to the RPi's headers and uses OpenGL ES 2.0 for now.
I'll make a PC & OpenGL version later.[/QUOTE]

I have downloaded the gl-streaming-master.zip file
=> I'm starting to see how the gl_client/glclient.c and gl_client/glclient.h parts could be enhanced to be handled by platforms other than the RPi platform

Note that the GLS_PTR_FUNC / GLS_PUSH_FUNC scheme seems relatively similar to what is used in the GLX protocol, after seeing this at http://utah-glx.sourceforge.net/docs/render_buffer_interface.txt :slight_smile:


__glx_Vertex3f(GLfloat x, GLfloat y, GLfloat z)
{
    char* buffer = NULL;
    __GLX_GET_RENDER_BUFFER(buffer, 70, 16, 0);
    __GLX_PUT_float(buffer, x);
    __GLX_PUT_float(buffer, y);
    __GLX_PUT_float(buffer, z);
}

void __glx_DrawPixels(GLsizei width, GLsizei height,
                      GLenum format, GLenum type, const GLvoid *pixels)
{
    char* buffer = NULL;
    int s = GLX_image_size(width, height, format, type);
    __GLX_GET_RENDER_BUFFER(buffer, 173, 40, s);
    __GLX_PUT_PIXEL_DATA_ARGUMENTS;
    __GLX_PUT_sizei(buffer, width);
    __GLX_PUT_sizei(buffer, height);
    __GLX_PUT_enum(buffer, format);
    __GLX_PUT_enum(buffer, type);
    __GLX_PUT_buffer(buffer, pixels, width, height, format, type, s);
}

But adding support for more client platforms is perhaps not a good idea, because of the big increase in source code that it implies :frowning:
(the client part being where the OpenGL commands are sent from, and the server part where they are executed)

The use of gettimeofday() and get_diff_time() doesn't seem to give the best resolution/speed on all platforms :frowning:

For the wait_for_data() call in gl_client/glclient.c, perhaps we could use clock_gettime() instead?
(as indicated in the Linux Clock section at http://tdistler.com/2010/06/27/high-performance-timing-on-linux-windows)

And/or use clock_nanosleep() instead of usleep()?
(clock_nanosleep) A sketch of both calls follows below.
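A minimal sketch of the two calls, assuming POSIX.1-2001 (older glibc needs -lrt); the helper names are mine, and this is not gl-streaming's actual wait_for_data():


#include <time.h>

/* Monotonic timestamp in microseconds (immune to wall-clock jumps). */
static long long now_us(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

/* Sleep for the given number of microseconds using clock_nanosleep(),
   restarting if interrupted by a signal. */
static void sleep_us(long us)
{
  struct timespec req = { us / 1000000, (us % 1000000) * 1000 };
  struct timespec rem;
  while (clock_nanosleep(CLOCK_MONOTONIC, 0, &req, &rem) != 0)
    req = rem;  /* interrupted: retry with the remaining time */
}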

Honestly, this is the first time I've seen their code. :slight_smile:
I used structs and they used macros, but there are some similarities between the two code bases, because the objective is similar.
Their code is very useful, but I love reinventing the wheel. :smiley:

Thanks. I’ll consider using nanosleep().