A few weeks ago, I decided to implement my own convolution operations for the GPU. My motivation was the need for an implementation that could be easily modified. Unfortunately, most implementations available online are either slow or a big mess code-wise:
I played around with all three of the above and even tried writing my own vanilla CUDA implementation (big mistake! It ran ~8 times slower than cuda_convnet).
I then discovered the paper Fast Training of Convolutional Networks through FFTs, which was quite an interesting read (if only I had found it earlier!). FFT-based convolutions had crossed my mind before, but I suspected the filter sizes were too small for convolution in the Fourier domain to be worthwhile. As it turns out, FFT-based convolutions are quite competitive, mainly for the following reasons:
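The key idea is the convolution theorem: a convolution in the spatial domain becomes a pointwise multiplication in the Fourier domain. Here is a minimal sketch of that equivalence in 1-D, using a naive pure-Python DFT for clarity (a real implementation would of course use cuFFT); the function names are my own, and the signals are zero-padded to length `len(a) + len(b) - 1` so the circular convolution matches the linear one:

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT, normalized by 1/n."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def conv_direct(a, b):
    """Linear convolution computed directly, for comparison."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def conv_fft(a, b):
    """Linear convolution via the convolution theorem:
    zero-pad, transform, multiply pointwise, transform back."""
    n = len(a) + len(b) - 1  # pad to avoid circular wrap-around
    A = dft(a + [0.0] * (n - len(a)))
    B = dft(b + [0.0] * (n - len(b)))
    return [c.real for c in idft([x * y for x, y in zip(A, B)])]
```

Note that once both operands are padded to the feature-map size, the transform cost no longer depends on the filter size, which is part of why the approach stays competitive even for small filters.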
I also discovered that Sander Dieleman had experimented with FFT convolutions for Theano. Unfortunately, his implementation does not currently include backpropagation of gradients. Moreover, it is written in high-level Theano, which I suspect is not flexible enough for an efficient implementation.
After the above failed attempts at doing my own convolutions, the FFT approach was a refreshing angle. It took some time to figure out how the batched cuFFT operations with advanced data layout work (which, by the way, I'd prefer any day over fiddling with indexing errors in ordinary convolutions!).
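For readers unfamiliar with cuFFT's advanced data layout: `cufftPlanMany` describes a whole batch of transforms over one flat buffer via a stride between consecutive elements of a signal (`istride`) and a distance between the first elements of consecutive signals (`idist`). The toy model below mimics those read semantics in pure Python; the function itself is hypothetical (the real API is C and also takes output-side parameters `ostride`/`odist`), but the parameter names mirror cuFFT's:

```python
import cmath

def batched_dft(buf, n, istride, idist, batch):
    """Transform `batch` signals of length `n`, gathering each signal from the
    flat buffer `buf` the way cuFFT's advanced layout parameters describe:
    element t of signal b lives at index b * idist + t * istride."""
    out = []
    for b in range(batch):
        x = [buf[b * idist + t * istride] for t in range(n)]
        out.append([sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                        for t in range(n)) for k in range(n)])
    return out

# The same two length-4 signals in two layouts give identical transforms:
# contiguous: [a0 a1 a2 a3 | b0 b1 b2 b3] -> istride=1, idist=4
# interleaved: [a0 b0 a1 b1 a2 b2 a3 b3] -> istride=2, idist=1
contig = batched_dft([1, 2, 3, 4, 5, 6, 7, 8], n=4, istride=1, idist=4, batch=2)
inter = batched_dft([1, 5, 2, 6, 3, 7, 4, 8], n=4, istride=2, idist=1, batch=2)
```

The payoff is that a stack of feature maps can be transformed in one batched call without first rearranging it into a contiguous layout.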
I now have a working implementation with the following highlights:
The implementation is still a work in progress but looks promising in terms of speed. It even comes with a crude Theano wrapper. Benchmarks will follow as soon as the Theano integration is done. I have yet to figure out how to properly handle buffers and reuse FFTs in the backpropagation functions.