Post:LU Decomposition in C (and under CUDA)

LU Decomposition in C (and under CUDA)

As part of any major project, it occasionally happens that you assume something is a ‘solved problem’ when its really not.

In my case it was solving small linear systems, of the form Ax=B, where A is an nxn matrix, B is a n vector. This is a problem that’s been solved in libraries such as LAPACK, LINPACK, BLAS, etc etc.

The issue appears when you’re trying to do this stuff within a specific hardware environment (CUDA), and you cannot call host functions from the device, and the cuBLAS libraries cater only to large matrices processed in parallel

Anyway, without going into architectural specifics, say for whatever reason you needed a small dense matrix solver, including LU Decomposition with maximal (or complete) pivoting for numerical stability, here it comes.

Before I code dump, I have to give serious thanks to all the folks on the stackexchange network for their help in this, especially talonmies who seemed to pop up in every question I had!

Moving on; I’ve got two versions of this code; one is my cpu-bound experiments, and the second is my ‘proof of concept’ gpu code implementing the cpu bound algorithm. I know its not parallelised and I’m probably missing lots of optimisation tricks, but it works.

CPU (ALT)

GPU (ALT)

I had a serious amount of pain to understand this problem, so hopefully this will make someone elses life easier.

Today you, tomorrow me.

10 Comments

  1. Vo Hung

    Why u include gsl lib? i run code with no lib and it run same. i beginner so dont understand why lib is included. i no include gsl_matrix gsl_vector…. and the code run same for GPU

  2. Vo, you’re dead right; the gsl includes are hangovers from an experiment. Thanks for the heads up. If you come up with any improvements or optimisations I’ve be more than happy to include them here (and in my own projects! :P ) Thanks for reading.

  3. Vo Hung

    Ok. When I have time, i try to see.

  4. Wertos

    Seems like the link to your GPU implementation is broken unfortunately.

    • Apologies; that’s what you get for mixing pastebin systems. Updated with both paste.ubuntu.com and actually logged-in pastebin versions. Thanks for letting me know.

      YMMV, let me know if its broken, haven’t looked at the code in a few weeks.

  5. Daniel

    Hi,
    how is that in your gpu code you use fabs function,
    isn’t that a host call ?

    • Nope, fabs (and fabsf) have been in the device API since at least CUDA 2.0

      As in all things CUDA related, consult the guide

  6. frank

    hi, I think the last step to solve x=Qz should be x[q[i]]=xtmp[i];

    • AFAI Can remember, the point of this section is just a simple vector transposition; Q containing the mapping of x->x (in this case), so for the final matrix x, initial matrix xi, and transposition matrix Q, x[i] should contain the value of xi[where Q[i] points to ].

      May be wrong, but its been a long day already! And it gave correct results the last time I tested it.

  7. gamar

    Hi I really like what you did but have a question isn’t this code
    doing A*B = X
    I run your code after I set my virabile and got something not understand it.

    float a[]={4,-2,1,-3,-1,4,1,-1,3};
    const float b[]={0,1,2};

    Here is the result :
    0:x:A|B
    -1.000:4.0,-2.0,1.0,|0.000
    -1.000:-3.0,-1.0,4.0,|1.000
    -1.000:1.0,-1.0,3.0,|2.000

    TPB:1,BPG:1
    Ran solve
    time: 0.00015488 s

    0:x:A//from where this number coming??!
    0.500:4.0,1.0,-2.0,
    1.500:-0.8,4.8,-2.5,
    1.000:0.2,0.6,0.9,
    ….
    I guess I do understand this part but the rest !!no
    A =

    4 -2 1
    -3 -1 4
    1 -1 3

    B= [0 1 2]

    B =

    0 1 2

    X=A\B’

    X =

    0.5000
    1.5000
    1.0000

    Thanks for sharing

Leave a Reply

%d bloggers like this: