Many applications rely on **random number generators**, for instance the C/C++ **rand()** function. When GPGPU comes into play, however, using random numbers can be tricky. The naive solution of generating them on the host and pouring them into the GPU's global memory is a bad idea, because of the huge **bandwidth** that would be wasted. What we need is a device-bound **generator**: Nvidia's CURAND library makes it easy to **generate random numbers** directly **inside CUDA kernels**.

**A bit of theory...**

Obviously, it is impossible to generate *really random* numbers on a deterministic machine. Any random function just applies some kind of transformation to another number, producing a sequence that *looks like* a random one. However, two sequences starting from the same number (the seed) will be completely identical: if our kernels start from the same seed, regardless of the algorithm, they will produce the same results. **The seeds must be different.** Moreover, you have to **store the previously generated numbers** (let's call them *global states*) in order to produce the next ones during the following execution of **each** kernel. Fortunately, CUDA allows us to store and update these partial states directly inside the GPU's memory.

**...and some sample code**

Here is the example code; comments will follow.

```c
#include <curand_kernel.h>
#include <time.h>

__global__ void setup_kernel( curandState * state, unsigned long seed )
{
    int id = threadIdx.x;
    curand_init( seed, id, 0, &state[id] );
}

__global__ void generate( curandState* globalState )
{
    int ind = threadIdx.x;
    curandState localState = globalState[ind];
    float RANDOM = curand_uniform( &localState );   // use the sample here
    globalState[ind] = localState;                  // store the updated state
}

int main( int argc, char** argv)
{
    dim3 tpb(N,1,1);
    curandState* devStates;
    cudaMalloc ( &devStates, N*sizeof( curandState ) );

    // setup seeds
    setup_kernel <<< 1, tpb >>> ( devStates, time(NULL) );

    // generate random numbers
    generate <<< 1, tpb >>> ( devStates );

    return 0;
}
```

Obviously, you have to include not only CUDA's **headers** and libraries, but CURAND's kernel header as well (curand_kernel.h). Let's see what happens in the code, starting from main.

First, we create a **curandState pointer** that will point to our **global states**.

The setup_kernel kernel invokes **curand_init()**, which takes a seed (I used the seconds since the Epoch, but the choice is up to the user) and sets the global states.

The generate() kernel creates N threads. Each thread obtains its own random floating point number by using a local copy of its global state. In this case, the random number is sampled from (0,1] with a **uniform** probability, but CURAND also makes it possible to sample from a standard normal distribution.

Finally, we store the updated state back into global memory, and return.

## 32 comments:

Hi - I am certainly glad to find this. cool job!

I need to generate random INTEGERS using curand. How can we do that? Also, what if I need to generate random integers within a specific range? Are there any sample codes for that? Appreciate your help.

I think that you have two possible ways to do it:

1) use curand_uniform to obtain a random floating point number from a uniform distribution, then map it to your integer interval. Pseudocode:

float rnd_number = curand_uniform();

int rnd_integer_from_A_to_B = A + rnd_number * (B-A);

I use something similar for choosing reactions inside my biochemical simulations.

2) calling curand() should return "the next quasirandom 32-bit element"; I never tried it, but I think you can do something like this:

int rnd_integer_from_A_to_B = A + curand() % (B-A);

let me know :)

Hey ... Thanks for the quick reply. I have tried both. And i think I would go ahead with the second option.

Thanks again.

Hello! Thanks for the tutorial!

Don't you have any performance issues generating the curandState vector? It seems to be very slow calling curand_init so many times, and I'm looking for a solution for this.

Thank you!!!

Hi Pedro,

you don't have to call curand_init() several times - just the first one. It's like the srand() function: you call it in order to initialize the seeds; after that, you fetch new pseudo-random numbers inside kernels by means of curand_uniform().

I hope I answered your question. Feel free to ask otherwise. :)

Wow, thx a lot for the example!

It helps a lot with my GPU implementation~!


Hello dear sir, sorry for my English. One question: can I use curand_init(...) inside the same kernel as the curand_uniform(...) call? That way I can save some time instead of spending it invoking an "init" kernel.

And one more: what is the meaning of the sequence number in curand_init? (Maybe I'm a little dumb, but I can't find info on this question.)

Thanks for your attention!

Hi! first of all: sure, you can use curand_init in the same kernel.

About the sequence number: it ensures that the states in the various threads won't have any correlation. If you use the same seed in each thread, the sequences will be identical. By specifying a sequence number (here set to the thread index) your sequences "will not have statistically correlated values".

Please check the CURAND guide for further information!

Very helpful!!! Thx a lot! :)

Nice work, regards

This is a very informative blog. It has solved my problem of generating RANDOM numbers on the device. Do I have to keep this code in a while loop if I need to generate random numbers in each run of the code?

Thanks for your help.

Hello and well done for your great job. I used your curand code and made some measurements on 3 very large data structures (4,000,000 float numbers), and it took almost one minute to produce the pseudo-random numbers! Do you know why this might be happening?

Thanks beforehand.


Can you suggest some method to generate normally distributed random numbers on a CUDA C device?

@V. Sumati: It depends on how you have implemented your code. Can you post an example?

@offspring: how did you generate your samples? all inside a single kernel, or by spawning multiple threads in a parallel fashion? RNG is a computationally expensive task, btw.

Talking about normally distributed random deviates, you can easily generate them by using curand_normal(), which wasn't supported at the time of this post, but now it is.

http://docs.nvidia.com/cuda/curand/index.html

If you rely on the statistical properties of the RNG, then DON'T use the modulo operator to map the number onto your interval! This has caused me some headaches in the past :-)


I am wondering how to generate 2 random numbers per thread.

Can I use curand_uniform( &localState ) twice to get two different random numbers in the same thread?

(I don't want to double the threads to init the states.)

Thanks for the nice post explaining random number generation on the GPU.

Sure, you can use it twice; it updates the local state and you're fine. You can also use different functions for other distributions, exploiting the same local state.

Can we capture the seconds on the GPU and supply them as the seed to generate random numbers on the device, one by one? Can we use the seconds as input instead of the local state?

You are my hero, dude.

rngStates[id] = localState;

That's what I was missing all the time.

Thanks Man!

I've changed the kernel code to:

"float RANDOM = curand_uniform( &globalState[idx] );"

and the kernel ran 2x faster!

However, running the setup_kernel is very slow (milliseconds). Why? Is there any other option?

Sorry if I am asking a foolish question, but I did not get the line "dim3 tpb(N,1,1)". Could someone kindly explain?

Whenever you launch a kernel you must specify how many threads per block (i.e., tpb) you want, and how many blocks you want in the grid (i.e., bpg).

In this oversimplified and non-compiling example, I create a single block with N threads per block.

Naturally, you want to replace N with an arbitrary number of threads, if you really want to compile this example. What is really important is that you get the relationship between N and the rest of the data structures.

So I basically launch the kernel with something like <<<48,2>>>, and that will be it, right?

Yes. In such a case, the number of states is equal to N=48*2=96.
