Saturday 21 May 2011

CUDA, random numbers inside kernels

Evolutive algorithms have an intrinsic stochastic nature, therefore they make large use of random numbers generators, for instance the C/C++ rand() function. Anyway, when GPGPU comes into play, using random numbers could be tricky.

The naive solution of create and pour them into GPU's global memory is a bad idea, because of the huge bandwidth that would be wasted. Therefore, it takes a device-bound generator. Nvidia's CURAND library makes it easy to generate random numbers directly inside CUDA kernels.

A bit of theory...

Obviously, it is impossible to generate really random numbers on a deterministic machine. Any random function just applies some kind of transformation on another number, determining a succession that looks like a random one. Anyway, two successions starting from the same number (the seed) will be completely identical. If our kernels start from the same seed, regardless the algorithm, they will produce the same results. They must be different. Moreover, you have to store the previous numbers (let's call them global states) in order to produce the next ones during the following execution of each kernel. Fortunately, CUDA allows the storage and update of our partials directly inside GPU's memory.

...and some sample code
I post the example code - comments will follow.

__global__ void setup_kernel ( curandState * state, unsigned long seed )
{
    int id = threadIdx.x;
    curand_init ( seed, id, 0, &state[id] );
} 

__global__ void generate( curandState* globalState ) 
{
    int ind = threadIdx.x;
    curandState localState = globalState[ind];
    float RANDOM = curand_uniform( &localState );
    globalState[ind] = localState; 
}

int main( int argc, char** argv) 
{
    dim3 tpb(N,1,1);
    curandState* devStates;
    cudaMalloc ( &devStates, N*sizeof( curandState ) );
    
    // setup seeds
    setup_kernel <<< 1, tpb >>> ( devStates, time(NULL) );

    // generate random numbers
    generate <<< 1, tpb >>> ( devStates );

    return 0;
}


Obviously, you have to include not only CUDA's includes and libraries, but CURAND kernel's as well (curand_kernel.h). Let's see what happens in the code, starting from main.

First, we create a curandState pointer, that will point to our global states.

Kernel setup_kernel will invoke curand_init(), that takes some seeds (I used the seconds since the Epoch, but it's free to the user) and sets the global states.

Generate() kernel creates N threads. Each thread will have its own random floating point number, by using a local copy of its global state. In this case, the random number it's sampled from (0,1) with a uniform probability, but CURAND gives the possibility to sample from a standard normal distribution as well.

Finally, we store the new seed into the global memory, and return.

32 comments:

uhooi said...
This comment has been removed by a blog administrator.
Anonymous said...

Hi - I am certainly glad to find this. cool job!

Eisenhiem said...

I need to generate random INTEGERS using curand. so how can we do that? also, what if I need to generate random integers within a specific range? are their any sample codes for that? appreciate Your help

Unknown said...

I think that you have two possible ways to do it:

1) use curand_uniform to obtain a random floating point number from a uniform distribution, then map it to your integer interval. Pseudocdode:

float rnd_number = curand_uniform();
int rnd_integer_from_A_to_B = A + rnd_number * (B-A);

I use something similar for choosing reactions inside my biochemical simulations.

2) calling curand() should return "the next quasirandom 32-bit element"; I never tried it, I think you can do something like this:

int rnd_integer_from_A_to_B = A + curand() % (B-A);

let me know :)

Eisenhiem said...

Hey ... Thanks for the quick reply. I have tried both. And i think I would go ahead with the second option.
Thanks again.

Pedro Rezende said...

Hello! Thanks for the tutorial!
Don't you have any performance issues generating the curandState vector? Here seems to be very low calling curand_init so many times, and I'm looking for a solution for this.
Thank you!!!

Unknown said...

Hi Pedro,

you don't have to call curand_init() several times - just the first one. It's like the srand() function: you call it in order to initialize the seeds; after that, you fetch new pseudo-random numbers inside kernels by means of curand_uniform().

I hope I answered your question. Feel free to ask otherwise. :)

Unknown said...

Wow, thx a lot about the example!
It helps me a lot about my gpu implementation~!

Unknown said...

Wow, thx a lot about the example!
It helps me a lot about my gpu implementation~!

FromRussiaWithLowe said...

Hello dear sir, sorry for my english, there is one question: can i use curand_init(...) inside same kernel with curand_uniform(...) call? In such way i can save some time instead of spending it to invoke "init" kernel.
And one more: what is the meaning of sequence number in curand_init? (may be i am dummy a little but i can't find info for this question)
Thx for your attention!

Unknown said...

Hi! first of all: sure, you can use curand_init in the same kernel.

About the sequence number, it ensures that the states in various threads won't have any correlation. If you use the same seed in each thread then the successions will be identical. By specifying a sequence number (here referred to the thread index) your successions "will not have statistically correlated values".

Please check the CURAND guide for further informations!

Anonymous said...

Very helpful!!! Thx a lot! :)

Anonymous said...

Nice work, regards

Unknown said...

This is very informative blog. It has solved my problem of generating RANDOM numbers on device. Do i have to keep this code in a while loop if i need to generate random numbers in each run of the code?

Anonymous said...

thanks for your help .

offspring said...

Hello and well done for your great Job. I used your curand code and made some measurements 3 very large data structures (4000000 float numbers) and it almost took one minute to produce the pseudo random numbers! Do you know why is this might happening?

Thanks beforehand.

offspring said...

Hello and well done for your great Job. I used your curand code and made some measurements 3 very large data structures (4000000 float numbers) and it almost took one minute to produce the pseudo random numbers! Do you know why is this might happening?

Thanks beforehand.

Anonymous said...

Can you suggest some method to generate normally distributed random numbers on CUDA C device?

Unknown said...

@V. Sumati: It depends on how you have implemented your code. Can you post an example?

@offspring: how did you generate your samples? all inside a single kernel, or by spawning multiple threads in a parallel fashion? RNG is a computationally expensive task, btw.

Talking about normally distributed random deviates, you can easily generate them by using

curand_normal()

which wasn't supported at the time of this post, but now it is.

http://docs.nvidia.com/cuda/curand/index.html

Joren said...

If you rely on the statistical properties of the RNG, then DON'T use the modulo operator to map the number on to your interval!! This has caused me some headaches in the past :-)

Joren said...

If you rely on the statistical properties of the RNG, then DON'T use the modulo operator to map the number on to your interval!! This has caused me some headaches in the past :-)

Anonymous said...

I am wondering how to generate 2 random numbers per thread ?

can I use curand_uniform( &localState ) twice to get two diff random numbers in the same thread
(I don't want to double the threads to init the states)

Thanks for the nice post to explain the random number no. generation in gpu.

Unknown said...

sure, you can use it twice, it updates the local state and you're fine. you can also use different functions for other distributions, exploiting the same local state.

Unknown said...

can we capture seconds in GPU and supply it as seed ??

Unknown said...

can i capture seconds in GPU and can input them as seed to generate random numbers in device one by one

Unknown said...

can we use seconds as input instead of local state

Anonymous said...

You are my hero, dude.

rngStates[id] = localState;

Thats what i was missing all the time.

Anderson Oliveira Sousa said...

Thanks Man!
I've changed the kernel code to:
"float RANDOM = curand_uniform( &globalState[idx] );"

and the kernel finished it ran 2x faster!

However, running the setup_kernel is very slow (miliseconds). Why? Is there any ohter option?

sunipkm said...

Sorry if I am asking a foolish question, but I did not get the line "dim3 tpb(N,1,1)". Could someone kindly explain?

Unknown said...

Whenever you launch a kernel you must specify how many threads per block (i.e., tpb) you want, and how many blocks you want in the grid (i.e., bpg).

In this oversimplified and non-compiling example, I create a single block with N threads per block.

Naturally, you want to replace N with an arbitrary number of threads, if you really want to compile this example. What is really important is that you get the relationship between N and the rest of the data structures.

sunipkm said...

So i basically launch kernel with something like <<<48,2>>>, and that will be it, right?

Unknown said...

Yes. In such a case, the number of states is equal to N=48*2=96.