As part of any major project, it occasionally happens that you assume something is a ‘solved problem’ when its really not.
The issue appears when you’re trying to do this stuff within a specific hardware environment (CUDA), and you cannot call host functions from the device, and the cuBLAS libraries cater only to large matrices processed in parallel
Anyway, without going into architectural specifics, say for whatever reason you needed a small dense matrix solver, including LU Decomposition with maximal (or complete) pivoting for numerical stability, here it comes.
Before I code dump, I have to give serious thanks to all the folks on the stackexchange network for their help in this, especially talonmies who seemed to pop up in every question I had!
Moving on; I’ve got two versions of this code; one is my cpu-bound experiments, and the second is my ‘proof of concept’ gpu code implementing the cpu bound algorithm. I know its not parallelised and I’m probably missing lots of optimisation tricks, but it works.
I had a serious amount of pain to understand this problem, so hopefully this will make someone elses life easier.
Today you, tomorrow me.