however, older Nvidia hardware (GeForce 3 and to some degree GeForce 4) decompresses DXT1 textures at lower quality than DXT3/5 ones.
At this point, that’s hardware so old that it doesn’t bear worrying about. Certainly not to the point of doubling my compressed texture sizes to deal with it.
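To put a number on "doubling": DXT1 stores each 4x4 block in 8 bytes, while DXT3 and DXT5 use 16 bytes per block. Here is a rough sketch of the size arithmetic; the function and constant names are made up for illustration and are not part of any GL header.

```c
#include <stddef.h>

/* Bytes per 4x4 texel block: 8 for DXT1, 16 for DXT3 and DXT5.
   These names are hypothetical, chosen only for this example. */
enum { DXT1_BLOCK_BYTES = 8, DXT5_BLOCK_BYTES = 16 };

/* Size in bytes of one mip level. Dimensions that are not a
   multiple of 4 are rounded up to whole blocks. */
size_t dxt_image_size(size_t width, size_t height, size_t block_bytes)
{
    size_t blocks_x = (width + 3) / 4;
    size_t blocks_y = (height + 3) / 4;
    return blocks_x * blocks_y * block_bytes;
}
```

For a 256x256 level that works out to 32768 bytes as DXT1 versus 65536 bytes as DXT5.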
when I use “glCopyTexSubImage2D()” to copy a buffer, the CPU usage rises noticeably,
Why would you possibly expect otherwise?
First, any copy where one side is uncompressed and the other is compressed (or where the two use different compressed formats) has to either compress or decompress the image data. That requires the CPU. There’s no getting around it, because no IHV is going to bother extracting the DXT decompression logic from the texture unit and making it available elsewhere.
Second, even if the formats are both the same, DXT is a block-based format. You can’t simply do a server-side memory copy of the data unless both the source and destination rectangles are 4-pixel aligned.
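The alignment condition can be checked up front. This is only a sketch with hypothetical helper names; it deliberately ignores the case of a rectangle that ends exactly at the edge of a texture whose size is not a multiple of 4, and it says nothing about what any particular driver actually does with an aligned copy.

```c
#include <stdbool.h>

/* True if the rectangle starts on a 4x4 block boundary and covers
   only whole blocks. Hypothetical helper, not a GL function. */
bool dxt_rect_is_block_aligned(int x, int y, int width, int height)
{
    return x >= 0 && y >= 0 && width > 0 && height > 0 &&
           x % 4 == 0 && y % 4 == 0 &&
           width % 4 == 0 && height % 4 == 0;
}

/* A same-format copy could in principle move whole blocks only when
   the source and destination rectangles are both block aligned. */
bool dxt_copy_is_block_aligned(int src_x, int src_y,
                               int dst_x, int dst_y,
                               int width, int height)
{
    return dxt_rect_is_block_aligned(src_x, src_y, width, height) &&
           dxt_rect_is_block_aligned(dst_x, dst_y, width, height);
}
```

A 64x64 copy from (0, 0) to (16, 32) passes this check; moving the destination to (18, 32) fails it, because the copy would then straddle block boundaries and the affected blocks would have to be decoded and re-encoded.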
Third, quite simply, nobody copies compressed textures. It’s an unexercised piece of code, and therefore unoptimized. Most people simply have no need to copy bits of compressed textures around.
I need realtime texture compression like decompression,so is it possible?
That sentence makes no sense; compression is not like decompression.
And you are getting realtime texture compression. Drivers will compress your textures as fast as you could expect them to. And since no IHV is going to bother to put any form of texture compression into their hardware, you’re getting software compression.
It’s the best you can reasonably expect.