Use of GPUs #474
Jumping in with a quick note: the Achilles heel of GPU compute (excluding new platforms like the Apple M1 with really fast integrated graphics) is the time it takes to transfer data to/from the GPU. Thus, it may be hard to beat the CPU when dealing with single, relatively small patterns (1k x 1k is probably near the smallest one might consider). In PyEBSDIndex this is mitigated by inherently performing many calculations on a large batch of patterns. However, it may well be worth it - some simple tests would help.

Yes, take a look at my kernels. Be aware that I interleave the batches of patterns, so each pattern looks like it has N channels and is W x H in 2D size (versus being W x H with N slices in a volume). In my case this significantly reduced my global memory fetches within the GPU.

Also look at the gputools package, https://github.com/maweigert/gputools - they might have a lot of what you want. They inspired a lot of my initial efforts.

Final note: the GPU compute landscape is a mess. Yes, OpenCL is currently the most cross-platform framework, but Apple has said that OpenCL is officially deprecated (though not yet removed from the latest OS). CUDA is NVIDIA only, and Apple and NVIDIA still hate each other. Windows and OpenCL can be done, but not as easily as elsewhere ... I think a lot of cross-platform commercial software will end up rewriting for multiple frameworks/platforms, and a lot of the machine learning community says NVIDIA/CUDA or nothing. You might want to take a look at Vulkan/MoltenVK. It is definitely geared more towards rendering than compute, but it might serve your needs well. I have not fully comprehended the interplay between OpenCL and Vulkan, but I think there is something there, and that is why I hold out hope that OpenCL can be a long-term solution.
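To make the interleaving concrete, here is a minimal PyOpenCL sketch (not taken from PyEBSDIndex; the shapes, names and the toy mean-pattern kernel are my own assumptions). The pattern axis is moved last so that the N values belonging to one detector pixel are contiguous in memory, and one work-item per pixel then reads them sequentially:

```python
import numpy as np
import pyopencl as cl

# Toy data: a batch of n patterns, each h x w pixels
n, h, w = 16, 60, 60
patterns = np.random.rand(n, h, w).astype(np.float32)

# Interleave: put the pattern axis last so the n values belonging to one
# detector pixel sit next to each other in memory, (h, w, n) not (n, h, w)
interleaved = np.ascontiguousarray(np.moveaxis(patterns, 0, -1))

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

pats_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=interleaved)
mean_g = cl.Buffer(ctx, mf.WRITE_ONLY, size=h * w * 4)  # 4 bytes per float32

src = """
__kernel void mean_pattern(__global const float *pats,
                           __global float *mean_out,
                           const int n_pats)
{
    // One work-item per detector pixel; its n_pats values are contiguous
    // because of the interleaved (pixel-major) layout
    int pix = get_global_id(0);
    float s = 0.0f;
    for (int i = 0; i < n_pats; i++)
        s += pats[pix * n_pats + i];
    mean_out[pix] = s / (float)n_pats;
}
"""
prg = cl.Program(ctx, src).build()
prg.mean_pattern(queue, (h * w,), None, pats_g, mean_g, np.int32(n))

mean_pattern = np.empty((h, w), dtype=np.float32)
cl.enqueue_copy(queue, mean_pattern, mean_g)
```

With the volume layout (N, H, W) the same loop would stride by H*W elements between reads; the interleaved layout keeps each work-item's reads contiguous, which is presumably where the reduction in global memory fetches comes from.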
Thank you for this valuable input, @drowenhorst-nrl.
I think this could easily be adopted in kikuchipy, since we use Dask to spread the workload across all available CPUs. Dask does this by operating on chunks of the full pattern array. A chunk is typically 100 MB in size, and the signal (detector) axes are never chunked. Thus it would seem to make sense to send chunks on to the GPU.
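As a rough sketch of what sending chunks to a GPU-backed function could look like (process_chunk_on_gpu is a placeholder name, not an existing kikuchipy function):

```python
import numpy as np
import dask.array as da

# Hypothetical 4D pattern array: (nav rows, nav cols, det rows, det cols)
patterns = np.random.rand(100, 100, 60, 60).astype(np.float32)

# Chunk only the navigation axes (Dask picks chunk sizes close to the
# configured target, ~100 MB); -1 keeps each signal (detector) axis whole
lazy = da.from_array(patterns, chunks=("auto", "auto", -1, -1))

def process_chunk_on_gpu(chunk):
    # Placeholder: move the chunk to the GPU, run a kernel, copy back
    return chunk

result = lazy.map_blocks(process_chunk_on_gpu, dtype=np.float32)
out = result.compute()
```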
You're describing what you're doing in the following, right? If I understand correctly, you're "allocating" a (16 or more patterns, n detector rows, n detector columns) 32-bit floating point array on the GPU, which is then passed to the CL kernel. Our Dask chunks are usually always 4D, so in our case ...
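Assuming the navigation axes of such a 4D chunk would simply be flattened into one batch axis before transfer (an assumption on my part, not something stated above), a minimal sketch:

```python
import numpy as np

# A single hypothetical 4D chunk: (nav rows, nav cols, det rows, det cols)
chunk = np.random.rand(8, 8, 60, 60)

# Flatten the two navigation axes into one batch axis and cast to float32,
# giving the (n patterns, n detector rows, n detector columns) layout
batch = chunk.reshape(-1, *chunk.shape[-2:]).astype(np.float32)
print(batch.shape)  # (64, 60, 60)
```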
Looks like a good reference, and perhaps something we could depend on for some functionality.
We should try to take advantage of GPUs by writing some GPU kernels. I don't have an NVIDIA GPU available, so my choice would be to use PyOpenCL instead of CuPy.
@drowenhorst-nrl has written some kernels in PyEBSDIndex that we could take inspiration from, e.g. the static background subtraction used in one of the Radon transform functions. Such a kernel could be an alternative to our background subtraction.
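Not the PyEBSDIndex kernel itself, but a minimal PyOpenCL sketch of static background subtraction over a batch of patterns (shapes and names are illustrative only):

```python
import numpy as np
import pyopencl as cl

# Toy data: a batch of patterns and a static background with the same
# detector shape
n, h, w = 16, 60, 60
patterns = np.random.rand(n, h, w).astype(np.float32)
static_bg = patterns.mean(axis=0).astype(np.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

pats_g = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=patterns)
bg_g = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=static_bg)

src = """
__kernel void remove_static_bg(__global float *pats,
                               __global const float *bg,
                               const int n_pixels)
{
    // One work-item per pattern element; the background repeats every
    // n_pixels elements
    int i = get_global_id(0);
    pats[i] -= bg[i % n_pixels];
}
"""
prg = cl.Program(ctx, src).build()
prg.remove_static_bg(queue, (n * h * w,), None, pats_g, bg_g, np.int32(h * w))

corrected = np.empty_like(patterns)
cl.enqueue_copy(queue, corrected, pats_g)
```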
In general, I think more per-pattern operations in the kikuchipy.pattern module could be replaced by PyOpenCL kernels, like image rescaling. We have CPU acceleration from Numba here, but it would be good to test GPU acceleration as well.
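As an illustrative sketch of GPU-side intensity rescaling with pyopencl.elementwise.ElementwiseKernel (the min/max are computed on the host for simplicity; this is not how kikuchipy's Numba-accelerated rescaling is implemented):

```python
import numpy as np
import pyopencl as cl
import pyopencl.array as cla
from pyopencl.elementwise import ElementwiseKernel

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Toy single pattern; min/max computed on the host for brevity
pattern = np.random.rand(60, 60).astype(np.float32)
vmin = np.float32(pattern.min())
vmax = np.float32(pattern.max())

pat_g = cla.to_device(queue, pattern.ravel())

# Elementwise rescaling of intensities to [0, 1]
rescale = ElementwiseKernel(
    ctx,
    "float *x, float vmin, float vmax",
    "x[i] = (x[i] - vmin) / (vmax - vmin)",
    "rescale",
)
rescale(pat_g, vmin, vmax)

rescaled = pat_g.get().reshape(pattern.shape)
```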
Other resources: