[OpenCV] GPU CUDA Performance Comparison

Since I am a big fan of massively multithreaded GPU computing (using NVIDIA’s CUDA), I made a benchmark of some imaging functions. The OpenCV library by Willow Garage has optimized many image processing functions for NVIDIA GPUs, so I thought it would be nice to compare the performance of the CPU versus the GPU for the same functions. I did this for multiple GPUs. I did not, however, use NVIDIA’s Tesla cards, simply because although the Teslas have a lot of RAM, they are priced at well over €1000 yet are as slow as any €200 GeForce.

Note: I am still (2016) updating this post! See the Google sheet for more details:
https://docs.google.com/spreadsheets/d/1hNzS1-8wBZSx-SQ9ayrqKNmyCKmyzbzp5JkFjflPN4s/pub?output=html

Performance Table

For work, I had to test a lot of different GPUs with some OpenCV GPU CUDA C++ functions; the GPUs will end up in some rack servers. Anyhow, I tested some functions. You can click the figure to get a link to the performance table that is hosted on Google Docs.

Models tested

Computing Model             Compute Capability   Cores        Frequency [MHz]   Speedup (avg.)
Intel i7-2600K              ~                    4 (1 used)   3400              1x
Intel Xeon E5620            ~                    4 (1 used)   2400              0.68x
NVIDIA GTX 560 (ASUS)       2.1                  336          810               22.4x
NVIDIA GTX 570 (EVGA)       2.1                  480          810               31.9x
NVIDIA GTX 670 (EVGA)       3.0                  1344         950               34.96x
NVIDIA GTX 680 (EVGA)       3.0                  1536         1058              34.90x
NVIDIA GTX 770 (EVGA)       3.0                  -            -                 19.8x
NVIDIA GTX Titan X (EVGA)   5.2                  -            -                 83.6x

* I used Ubuntu 12.04, CUDA 4.2, OpenCV 2.4 C++ (latest SVN snapshot), and the NVIDIA 295.51 driver.

Functions tested

matchTemplate, minMaxLoc, remap, dft, cornerHarris, integral, norm, meanShift, BruteForceMatcher, magnitude, add, log, mulSpectrums, resize, cvtColor, erode, threshold, pow, projectPoints, solvePnPRansac, GaussianBlur, filter2D, pyrDown, pyrUp, equalizeHist, reduce.

Link to the full spreadsheet (Google Docs Spreadsheet)
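
For reference, here is a minimal sketch of what such a CPU-versus-GPU timing can look like with the OpenCV 2.4 gpu module. This is an illustration, not the exact benchmark code I used; the 4000×4000 image, the threshold parameters, and the single warm-up call are just example choices.

    #include <iostream>
    #include <opencv2/core/core.hpp>
    #include <opencv2/imgproc/imgproc.hpp>
    #include <opencv2/gpu/gpu.hpp>

    using namespace std;
    using namespace cv;

    int main()
    {
        // Example 4000x4000 8-bit image filled with random values
        Mat src(4000, 4000, CV_8UC1), dst;
        randu(src, 0, 255);

        // CPU timing
        int64 t0 = getTickCount();
        threshold(src, dst, 128, 255, THRESH_BINARY);
        double cpuMs = (getTickCount() - t0) * 1000.0 / getTickFrequency();

        // GPU timing: upload once and do a warm-up call, so the measured
        // time is the kernel itself rather than transfers and initialization
        gpu::GpuMat d_src(src), d_dst;
        gpu::threshold(d_src, d_dst, 128, 255, THRESH_BINARY); // warm-up
        int64 t1 = getTickCount();
        gpu::threshold(d_src, d_dst, 128, 255, THRESH_BINARY);
        double gpuMs = (getTickCount() - t1) * 1000.0 / getTickFrequency();

        cout << "CPU: " << cpuMs << " ms, GPU: " << gpuMs << " ms" << endl;
        return 0;
    }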

Graphs Per Function

Intel i7-2600K vs GPUs (click on image for large view)

GPUs vs GPUs (click on image for large view)

Test Setup

The ASUS GTX 560 (left) and the EVGA GTX 570 SC (right)

The EVGA GTX 570 & GTX 680 GPUs in the Dell R5500 rack server

Conclusion

In terms of value for money, the GTX 670 (€400) with 2 GB of RAM is very nice. There is absolutely no reason to buy the GTX 680, since it costs €100 more. Then again, the GTX 570 costs €300, which is nice, but it only has 1.25 GB of RAM, which can be dangerous when working with large images (nasty errors).
It is clear that GPU computation is BLOODY fast. But I HAVE to note that only a SINGLE core of the CPU was used for the normal CPU functions. These algorithms have not really been optimized for multithreading, if I’m not mistaken. On the other hand, speed increases of more than 20x are too much for any Intel CPU to catch up with. GPU computing is a must if fast image processing is important.

GPU + GPU = Multi GPU

Multi GPU? Yes! Using two GTX 670s, you can use 2688 CORES. That means that if you don’t keep your GPUs on a leash, they might become self-aware. You have been warned.
Oh yes, MULTI GPU! OpenCV natively supports only one GPU per function call, but of course you can use more if you want. OpenCV itself suggests Intel’s TBB (Threading Building Blocks) for this. OpenCV once started with OpenMP (open-source parallel/multithreaded processing), but does not support it any more. Luckily, if you know your way around OpenMP, it is quite easy to implement.
You can use multiple GPUs in OpenCV; there are some functions you can use for this. I tend to use OpenMP: make a simple parallel loop with some conditions, and within each thread use the “gpu::setDevice” C++ function to set which device to use in that thread. For example, when you have two GPUs, it is a good idea to let OpenMP use “num_threads(2)”, so each GPU gets its own thread, and then call “gpu::setDevice(omp_get_thread_num())” inside the loop. I got a speed increase of 40~80% using two GPUs; see the nice setup I had in my desktop where I tried it, shown below. The cards will eventually end up in the rack server, purely for GPU computation, for which they are ideal.

Testing two EVGA GTX 670 SC GPUs. ***2688 CORES*** LOL

Code for Multi-GPU in OpenCV with OpenMP

    #include <iostream>
    #include <omp.h>
    #include <opencv2/gpu/gpu.hpp>

    using namespace std;
    using namespace cv;

    int main()
    {
        bool useMGPU = true;
        bool useMP = true;
        int numGPUs = gpu::getCudaEnabledDeviceCount();
        omp_set_nested(1); // Turn on nested OpenMP (to use parallel loops inside your loop)
        #pragma omp parallel if (useMP) num_threads(2)
        {
            #pragma omp for
            for (int i = 0; i < 10; i++) {
                // If multi-GPU support is on, assign a device based on the thread number
                int threadID = omp_get_thread_num();
                if (useMGPU && numGPUs > 1) {
                    cout << "Setting GPU#" << threadID << " for i#" << i << endl;
                    gpu::setDevice(threadID);
                }
                // Your GPU code here. The device has been set

                //..

                // Check that the GPU is still set correctly throughout the loop (device should equal threadID)
                if (useMGPU) {
                    cout << "Had set GPU#" << gpu::getDevice() << " with tID#" << threadID << " (i#" << i << ")" << endl;
                }
            }
        }
        return 0;
    }
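
To compile a sketch like this on Linux (assuming OpenCV was installed with pkg-config support; the file name is just an example):

    g++ multi_gpu.cpp -o multi_gpu -fopenmp `pkg-config --cflags --libs opencv`

Don’t forget -fopenmp; without it the OpenMP pragmas are ignored, and the omp_* calls will not link.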

Tim Zaman

MSc Biorobotics. Specialization in computer vision and deep learning. Works at NVIDIA.


9 Responses

  1. Tim Zaman says:

    I was just running the internal OpenCV benchmark that you can compile with the examples.

  2. CudaWarped says:

    Great article. I realise that this article is quite old now, but as you have updated the times, including results for a GTX 770 and Titan X, I have a couple of questions if you have time.
    I am trying to get similar results to yours, but I am not having much luck. For example, my integral image on my i5 and i7 takes 55 ms and 33 ms respectively on a 4000×4000 8U image, and I would expect both to be quicker than the CPUs you tested in 2012. Were you definitely using a 4000×4000 image for this test?
    Were you using OpenCV with IPP enabled or without?

    • Tim Zaman says:

      I am unsure. But in my vast OpenCV experience (seriously), IPP isn’t that interesting anymore, but TBB is. Check if you built with TBB. TBB gives really great speedups. I am unsure whether I have used those. Do not underestimate the 2600K though; it worked with fast memory and it’s as decent as any i5. It will not easily beat an i7 though.
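
      By the way, a quick way to check your own build is to print the build information and look for TBB in the output; a one-liner sketch:

          #include <iostream>
          #include <opencv2/core/core.hpp>

          int main()
          {
              // Prints compiler flags, enabled modules, and the parallel framework (e.g. TBB)
              std::cout << cv::getBuildInformation() << std::endl;
              return 0;
          }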

      • CudaWarped says:

        Thanks for the quick reply; as you have a lot of experience with OpenCV, I would like to pick your brain. My set-up is an i7 6700HQ, DDR4, GTX 980M on Windows, with OpenCV 3.0 compiled with TBB and CUDA SDK 7.5.
        I think there may be something wrong with the calculation of the integral image using the CPU on my set-up (i7 6700HQ, DDR4). I have built OpenCV 3.0 with TBB, but it does not make any difference to the calculation of the integral image: the calculation just uses a simple loop on a single core, and the CPU usage doesn’t ever exceed 30%. My execution time is 21.1 ms, which is a lot slower than your Intel i7-2600K (3.4 GHz); do you think that this is a problem with my configuration?
        My real interest is the CUDA comparison, which I am having difficulty recreating on my set-up. Did you use a host or GPU timer?
        I have a 980M, which should be in the same ballpark as the GTX 770. However, when I calculate the integral image on the GPU, the execution of the kernels takes ~2 ms, but the time before control is returned to the host, because of memory copies and other overheads, is ~30 ms. Your times on the GTX 770 are 10 times quicker than this; can you share your test code?

      • CudaWarped says:

        Sorry, I forgot to mention that for certain sizes (4000×4000 being one) the calculation of the integral image using CUDA is incorrect. Did you validate the results?

      • CudaWarped says:

        Quick update: OpenCV 3.0’s integral function has an optimised SSE version if the integral image type is an integer. I am assuming that your output image type was 32S?
        My laptop i7 time is ~7.4 ms when the integral image is an integer and ~21.1 ms when the integral image is a float.

  3. Konstantin Dols says:

    Nice comparison and great results!

    Do you know a source that explains and compares the other processing steps needed for image stitching, like the feature matcher, bundle adjustment and SIFT feature extraction?

    Actually my code takes at least 25 sec on a 6-core CPU, and I would like to get a speedup using an NVIDIA GTX 560.

    Regards, Konstantin.

  4. Vartan says:

    Thanks for the nice article. Could you by any chance make the benchmark source code available? I’d like to run it on some older GPUs I have.

  1. October 5, 2013

    […] a blog post at http://www.timzaman.com/?p=2256 , author compares various GPUs for few openCV functions. He concludes as, “ In terms of value for […]