## GPGPU Programming with CUDA for Color Space Conversion

General-purpose computing on graphics processing units (GPGPU, rarely GPGP) is the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).

For a moderately long time, I have been interested in programming GPUs using CUDA. The problem, however, was that I did not have access to any GPU. Now that I am back in school, I finally get a chance to play with CUDA and program GPUs. The computers in my lab have an NVIDIA GeForce GTX 1050 Ti, which I am also planning to buy in the near future. Since I am new to the GPU world, I do not know enough to discuss GPU architectures or which one is the best, so I will move quickly to the programming part.

In this article, I will show a comparative analysis of a simple image conversion (from RGB to YUV and vice versa) on the CPU and the GPU. The plan is to first understand how the operation is done on the CPU, and then explain how that job can be parallelized and moved to the GPU. So our first goal is to understand what we are doing on the CPU. Understanding this article does not require detailed knowledge of YUV and RGB; we only need the conversion formula. Anyone interested in the details of these conversions may look at this Wikipedia article.

The following piece of code converts an RGB image (in *.ppm format) to a *.yuv image. The input to the function is an image in *.ppm format, which has R, G and B values for all the pixels. So if an image is 1000 x 700, there are 1000 x 700 pixels, and for each pixel we have one R, one G and one B value between 0 and 255. If we look at the following C++ code, the formula for computing the Y, U, V values from the R, G, B values is obvious. It is also clear that the code visits the pixels sequentially: it runs through all the pixels and converts them one by one. So if the image is large, the conversion takes a significant amount of time on the CPU.

```cpp
// Convert RGB to YUV444, all components in [0, 255]
YUV_IMG rgb2yuv(PPM_IMG img_in)
{
    YUV_IMG img_out;
    int i;
    unsigned char r, g, b;
    unsigned char y, cb, cr;

    img_out.w = img_in.w;
    img_out.h = img_in.h;
    img_out.img_y = (unsigned char *)malloc(sizeof(unsigned char) * img_out.w * img_out.h);
    img_out.img_u = (unsigned char *)malloc(sizeof(unsigned char) * img_out.w * img_out.h);
    img_out.img_v = (unsigned char *)malloc(sizeof(unsigned char) * img_out.w * img_out.h);

    for (i = 0; i < img_out.w * img_out.h; i++) {
        r = img_in.img_r[i];
        g = img_in.img_g[i];
        b = img_in.img_b[i];

        y  = (unsigned char)( 0.299*r + 0.587*g +  0.114*b);
        cb = (unsigned char)(-0.169*r - 0.331*g +  0.499*b + 128);
        cr = (unsigned char)( 0.499*r - 0.418*g - 0.0813*b + 128);

        img_out.img_y[i] = y;
        img_out.img_u[i] = cb;
        img_out.img_v[i] = cr;
    }
    return img_out;
}
```

If we look at the code carefully, we observe one interesting thing: the calculation on each pixel does not depend on any other pixel. So it is possible to do the calculation on each pixel separately (and even at the same time, if we have the resources) and then collect the results to produce the output. For the sake of understanding, let us assume (just theoretically) that converting one pixel from RGB to YUV takes 1 sec for a person named Cansa Perouli Unagi (CPU). With only one CPU, converting an image of size 100 x 70 would take 7000 sec, which is equivalent to about 1.94 hours. However, if we have 7000 persons who can each convert one pixel in 1 sec, we can convert the whole image in 1 sec by deploying this team of 7000 persons, named Great Perouli Unit (GPU). This is what we need to do.

A GPU has a significantly larger number of cores than a CPU. So we can launch many threads (thousands) and complete these kinds of parallelizable jobs significantly faster. To write a program for the GPU, we need to identify the single job that we want to perform on different chunks of data, write it as a kernel, and then call that kernel from the CPU. Let us divide the workflow:

• Copy the image from CPU to GPU
• Operate on the image
• Copy result from GPU to CPU

#### Copy the image from CPU to GPU

The CPU and the GPU are two separate entities with separate memories. So the first step for doing anything on the GPU is to make the data available there. For that, we first need to allocate memory on the GPU and then copy the image over.

```cpp
cudaMalloc((void **)&d_r, sizeof(unsigned char) * img_out.w * img_out.h);
cudaMalloc((void **)&d_g, sizeof(unsigned char) * img_out.w * img_out.h);
cudaMalloc((void **)&d_b, sizeof(unsigned char) * img_out.w * img_out.h);

cudaMalloc((void **)&d_y, sizeof(unsigned char) * img_out.w * img_out.h);
cudaMalloc((void **)&d_cb, sizeof(unsigned char) * img_out.w * img_out.h);
cudaMalloc((void **)&d_cr, sizeof(unsigned char) * img_out.w * img_out.h);

cudaMemcpy(d_r, img_in.img_r, sizeof(unsigned char) * img_out.w * img_out.h, cudaMemcpyHostToDevice);
cudaMemcpy(d_g, img_in.img_g, sizeof(unsigned char) * img_out.w * img_out.h, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, img_in.img_b, sizeof(unsigned char) * img_out.w * img_out.h, cudaMemcpyHostToDevice);
```

#### Operate on the image

With the data on the GPU, we can now write the kernel: the function that every GPU thread executes on its own chunk of data. Each thread computes its global index from its thread and block IDs and converts exactly one pixel, which is precisely the per-pixel job we identified on the CPU.

```cpp
// Kernel: each GPU thread converts exactly one pixel.
// NOTE: this assumes the launch configuration creates exactly width*height
// threads; otherwise the pixel count should be passed in as a parameter and
// the index checked against it before touching memory.
__global__ void rgb2yuvKernel(unsigned char *imgr, unsigned char *imgg, unsigned char *imgb,
                              unsigned char *imgy, unsigned char *imgcb, unsigned char *imgcr)
{
    unsigned char r, g, b;
    unsigned char y, cb, cr;

    // Global index of this thread: one thread per pixel.
    int index = threadIdx.x + blockIdx.x * blockDim.x;

    r = imgr[index];
    g = imgg[index];
    b = imgb[index];

    y  = (unsigned char)( 0.299*r + 0.587*g +  0.114*b);
    cb = (unsigned char)(-0.169*r - 0.331*g +  0.499*b + 128);
    cr = (unsigned char)( 0.499*r - 0.418*g - 0.0813*b + 128);

    imgy[index] = y;
    imgcb[index] = cb;
    imgcr[index] = cr;
}
```
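To complete the workflow, the kernel has to be launched from the CPU and the result copied back (the third step in the list above). A sketch of the host side, with an assumed block size of 256; since the grid is rounded up, a few trailing threads run past the last pixel unless the image size is a multiple of the block size, so in practice the pixel count should also be passed into the kernel and checked there:

```cpp
int n = img_out.w * img_out.h;   // total number of pixels
int threadsPerBlock = 256;       // assumed block size
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up

// Launch one thread per pixel.
rgb2yuvKernel<<<blocks, threadsPerBlock>>>(d_r, d_g, d_b, d_y, d_cb, d_cr);

// Copy the result back from GPU to CPU.
cudaMemcpy(img_out.img_y, d_y, sizeof(unsigned char) * n, cudaMemcpyDeviceToHost);
cudaMemcpy(img_out.img_u, d_cb, sizeof(unsigned char) * n, cudaMemcpyDeviceToHost);
cudaMemcpy(img_out.img_v, d_cr, sizeof(unsigned char) * n, cudaMemcpyDeviceToHost);

// Free the device memory once we are done with it.
cudaFree(d_r); cudaFree(d_g); cudaFree(d_b);
cudaFree(d_y); cudaFree(d_cb); cudaFree(d_cr);
```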