Let me give an example I ran into on some hardware. Running standard code that uses very basic functionality to render textured quads resulted in wrong colors and shades, and even the textures were not showing. Sometimes the background was a black screen when cube mapping was used. How would a conformance test be able to detect such bugs? What you are describing is a conformance test that checks the interface, not the implementation details, which can go as low as feeding the wrong values to hardware registers. A bug like that is not detectable just by checking compliance with the specification.
Maybe a two-stage test would do it, where the first stage is the conformance or API verification. The second stage compares the output of a test-case program against a pre-rendered image. This is not a per-fragment check. With approximation algorithms, a good test package can verify that both images are close, so that the driver is not producing weird colors from Mars.
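To make the second stage concrete, here is a minimal sketch of an approximate image comparison. It is only one possible approach, not an established test suite's method: the function names `mean_abs_diff` and `images_close`, the in-memory image layout (rows of RGB tuples with 0-255 channels), and the tolerance of 4.0 are all assumptions for illustration.

```python
def mean_abs_diff(rendered, reference):
    """Average absolute per-channel difference between two same-sized RGB images.

    Each image is a list of rows; each row is a list of (r, g, b) tuples.
    This layout is an assumption made for the sketch.
    """
    if len(rendered) != len(reference):
        raise ValueError("image heights differ")
    total = 0
    count = 0
    for row_a, row_b in zip(rendered, reference):
        if len(row_a) != len(row_b):
            raise ValueError("image widths differ")
        for pixel_a, pixel_b in zip(row_a, row_b):
            for a, b in zip(pixel_a, pixel_b):
                total += abs(a - b)
                count += 1
    if count == 0:
        raise ValueError("images are empty")
    return total / count


def images_close(rendered, reference, tolerance=4.0):
    """Aggregate check, not a per-fragment one.

    Small rounding differences between drivers are accepted, while a grossly
    wrong result (black screen, missing texture) is rejected. The default
    tolerance is a hypothetical value and would need tuning per test case.
    """
    return mean_abs_diff(rendered, reference) <= tolerance


# Hypothetical 2x2 reference image and two candidate driver outputs.
reference = [[(200, 100, 50), (200, 100, 50)],
             [(200, 100, 50), (200, 100, 50)]]
slightly_off = [[(201, 99, 50), (201, 99, 50)],
                [(201, 99, 50), (201, 99, 50)]]
black_screen = [[(0, 0, 0), (0, 0, 0)],
                [(0, 0, 0), (0, 0, 0)]]

print(images_close(slightly_off, reference))
print(images_close(black_screen, reference))
```

In a real harness the rendered image would come from reading back the framebuffer after the test-case program runs, and the reference would be the pre-rendered image stored with the test. A mean over the whole image can hide a small, badly wrong region, so a stricter package might also compare per tile.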