A Comparison between GPU-based Volume Ray Casting Implementations: Fragment Shader, Compute Shader, OpenCL, and CUDA

Volume rendering is an important area of study in computer graphics, due to its applications in fields such as medicine, physics simulations, and the oil and gas industry. The most widely used method for volume rendering today is ray casting. Nevertheless, there is a variety of parallel APIs that can be used to implement it. Thus, it is important to evaluate the performance of ray casting across different parallel APIs to help programmers select one of them. In this paper, we present a performance comparison using OpenGL ® with fragment shader, OpenGL ® with compute shader, OpenCL, and CUDA.


Introduction
The visualization of volume datasets has gained importance in recent decades in fields such as medicine, physics simulations, and the oil and gas industry. Several methods have been used to evaluate the volume rendering equation numerically: viewport-aligned planes (using 3D textures), object-aligned planes (using 2D textures), shear-warp, splatting, and ray casting [1]. Ray casting is currently the standard in volume visualization, due to its parallel nature and rendering quality. Since the method is highly parallel, GPU-based ray casting is an attractive way to implement it at interactive frame rates and low cost.
One of the problems to be considered in GPU-based ray casting is the ray/volume intersection. There are two main methods to calculate this intersection: rasterization [2] and the ray/box intersection test [3]. Given the entry and exit points of a ray in the volume, the problem is commonly reduced to sampling the volume at constant steps, classifying the samples, and compositing them. This is usually combined with acceleration techniques such as early ray termination [4].
There are several APIs that allow the implementation of general-purpose parallel programs on the GPU. The most commonly used are OpenCL [5] and CUDA [6]. For ray casting, using the OpenGL ® [7] pipeline would seem a good option, since it is optimized for graphics problems; however, not all stages of the pipeline are programmable, and the programmable stages are mainly limited to processing geometries, vertices, and fragments [7]. Nevertheless, OpenGL ® 4.5 now allows general-purpose computing through the compute shader [8]. One advantage of using the compute shader over CUDA or OpenCL is that it is integrated directly into OpenGL ®, which can be significant for a graphics-driven application.
In this paper, we present a performance comparison of four volume ray casting implementations, using: OpenGL ® with fragment shader, OpenGL ® with compute shader, OpenCL, and CUDA. With this study we hope to answer two main questions in the development of a volume ray caster: (1) which of these four versions is currently the most efficient approach for ray casting in a given system? (2) which ray/volume intersection approach is faster?
The rest of this paper is organized as follows: Section 2 makes a survey of the related work. Section 3 briefly describes volume ray casting, showing how this method is optimized with parallel APIs. Then, Section 4 shows some details of the volume ray casting implementations used for testing. In Section 5, test datasets and results are presented. Finally, Section 6 discusses the conclusions and future work.

Related Work
GPUs have become an important tool to accelerate computation due to their versatility and powerful parallel processors. For that reason, traditional volume rendering approaches have been adapted to exploit this hardware. The earliest methods to exploit the GPU were 2D and 3D texture slicing [9] [10]. The disadvantages of 2D texture slicing were the lack of tri-linear interpolation between slices, the need to keep three copies of the volume, and brightness differences in the final render when switching between copies. In the case of 3D texture slicing, the sample distance along each ray varied due to the perspective projection. That was fixed with the use of spherical shells, at the cost of more complex geometry. Furthermore, these slicing approaches often exhibit visual artifacts and less flexibility compared to ray casting. For ray casting on GPUs, Roettger et al. [11] and Kruger and Westermann [2] were among the first to introduce the method, using a multi-pass approach. Two years later, Hadwiger et al. [12] implemented a single-pass ray casting.
To determine which volume rendering method is better, the performance of different methods has been studied; Lacroute [13] compared shear-warp with ray casting and splatting. Our work, in contrast, focuses directly on the performance of volume ray casting, targeting the use of OpenGL ®, OpenCL, and CUDA. Even though these APIs have similar capabilities, they present differences that can make one API a better option than the others under certain circumstances. Several authors have evaluated the performance of these APIs individually. Garland et al. [14] surveyed experiences gained in applying CUDA to a diverse set of problems and reported the parallel speedups over sequential code running on the CPU. Shen et al. [15] analyzed the performance of OpenCL on multi-core CPUs and compared it with OpenMP.
Moreover, other authors have compared the performance of two or more GPU parallel APIs. OpenCL and CUDA were compared in different studies with SystemC simulation [16], Monte Carlo simulation of a quantum spin system [17], and several other programs [18] [19]. In general, those works showed that CUDA runs faster than OpenCL in the default mode, and that with some optimizations the performance of OpenCL becomes comparable to CUDA. Satish et al. [20] implemented a CUDA sorting algorithm that proved to be 4 times faster than a GPU sorting algorithm using the OpenGL ® pipeline. OpenGL ®, OpenCL, and CUDA were also compared with implementations of the cardiac monodomain equations [21] and cloth modelling [22]. Those studies showed that there is no single best API for all problems, since the best choice varies from problem to problem.
Some studies have focused on measuring the performance of volume ray casting using a single parallel API. Marsalek et al. [23] demonstrated a proof of concept of a simple CUDA ray caster. Schubert and Scholl [24] implemented a multi-volume renderer using CUDA with different mixing techniques, obtaining real-time results. The performance gain of CUDA-based volume rendering with respect to a CPU has also been measured in [25] and [26]. Fangerau and Kromer [25] claim that their implementation runs fastest with CUDA using a block size configuration of 8 × 8.
Furthermore, other authors have compared the performance of volume ray casting between different implementations. On one hand, Kiss et al. [27] compared CUDA and OpenCL volume rendering implementations, obtaining similar milliseconds per frame with both technologies. On the other hand, comparisons between the OpenGL ® fragment shader and CUDA have also been made. Weinlich et al. [28] compared OpenGL ® with CUDA under compute capability 1.1 and 2.0. They concluded that the compute capability 1.1 implementation is too slow due to the lack of 3D textures, and that the compute capability 2.0 implementation is slightly better than the OpenGL ® one. Shi et al. [29] implemented volume ray casting with OpenGL ® and CUDA, showing a performance improvement of about 15% for CUDA, but they used small datasets in their tests. Mensmann et al. [30] took advantage of CUDA's capabilities to implement OpenGL ® and CUDA slab-based ray casting for bricked volumes. Their results showed CUDA to be slightly better than OpenGL ® for their implementation. In general, CUDA showed better performance than OpenGL ® or OpenCL.
In our work, we make a performance comparison of volume ray casting using fragment shader, compute shader, OpenCL, and CUDA. Unlike previous work, we compare more than two parallel ray casting implementations, and we take the compute shader into account. To the best of our knowledge, this shader had not been tested for ray casting in previous work.

Volume Ray Casting
Volume rendering is a technique for rendering a three-dimensional scalar field by projecting its samples without requiring an intermediate reconstruction of the data [31]. The technique simulates the interaction between light and a participating medium (the volume), and is modeled by Equation 1.
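Equation 1 is lost in this extraction; based on the term definitions that follow, it is presumably the standard emission-absorption volume rendering integral along a ray of length D, which would read:

```latex
I(D) = \int_{0}^{D} c\bigl(x(\lambda)\bigr)\,\tau\bigl(x(\lambda)\bigr)\,
       \exp\!\left(-\int_{0}^{\lambda} \tau\bigl(x(\lambda')\bigr)\,d\lambda'\right) d\lambda
```

Here the exponential factor attenuates each sample's contribution by the absorption accumulated between the eye and that sample.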
The values c(x(λ)) and τ (x(λ)) represent the color and absorption of a sample at a distance λ, and x(λ) is the parametrization of the ray [32]. There are several ways to evaluate the rendering equation. Generally, the ray is quantized into equally-spaced samples, a transfer function is applied to each sample, and a compositing operation is performed over the classified samples. Ray casting evaluates the volume rendering equation directly by casting a ray for every pixel of the final image, with its origin at the eye position, as depicted in Figure 1.
As volume ray casting evaluates the volume rendering equation directly, it commonly achieves better visual results than other approaches. Furthermore, as the calculation is independent for every ray, the process can be easily parallelized.
To start the traversal of a ray, its entry and exit points in the volume need to be calculated. The way this intersection is computed can have a significant impact on the overall performance of the algorithm. It is generally calculated with one of two approaches: (1) rasterization [2], and (2) the ray/box intersection test [3].

Intersection using Rasterization
One efficient way to obtain the intersection between view rays and the volume is OpenGL ® rasterization. The idea is to use the parallel rasterization power of the graphics card to generate the volume/ray intersection for each pixel of the framebuffer. A simple color cube, representing the boundaries of the volume, is rendered (see Figure 2) [2]. Since colors and texture coordinates are both represented in unit space [0, 1]³, each color of the cube also represents a 3D texture coordinate of the volume. Thus, the color of each rasterized pixel represents the texture coordinate of the intersection of one ray with the volume. This method is used with back-face culling to obtain the first hit of each ray (entry point, Figure 2a), and with front-face culling to obtain the last hit (exit point, Figure 2b). Furthermore, the direction of the ray can be obtained as the exit point minus the entry point (Figure 2c).

Ray/Box Intersection Test
The intersection of an individual ray with the volume can also be obtained analytically. A divide-and-conquer algorithm is commonly used: intersecting a ray with a box can be reduced to intersecting the ray with each slab of the volume [3]. This test has been optimized for parallel architectures, avoiding thread divergence [33]. An implementation of this test is shown in Listing 1.
Listing 1: General code for a ray/box intersection test optimized to avoid thread divergence

Implementations
The implementations of volume ray casting in the different parallel APIs share the same basic algorithm. Nevertheless, every API has its own particularities that are worth mentioning. This section presents some implementation details of the volume ray casting used in this work, considering fragment shader, compute shader, OpenCL, and CUDA. We only show the implementations using the rasterization method, as the ray/box intersection algorithm was already presented in Listing 1 and can easily be adapted to each API. Furthermore, this section presents an implementation for adding diffuse lighting to the volume ray casting.

Fragment shader
Given that some geometry must be rendered with OpenGL ® to activate a fragment shader program, we can compute the intersection of every ray with the volume during rasterization, with minimal extra overhead. In this case, only two rendering passes are necessary. The first pass calculates the exit point of each ray. The second pass calculates the entry point of the ray and the ray direction, and numerically evaluates the volume rendering equation, as shown in Listing 2. During the second pass, the front faces of the volume cube are rendered. For each rasterized pixel, we get the entry point of the ray into the volume, represented by the interpolated variable first. The required textures are passed to the shader through samplers; they hold the per-pixel exit point of the ray, the transfer function, and the volume itself. The exit point of the ray (lasHit) is fetched with the function texelFetch. With first and lasHit, it is possible to calculate the direction (direction) and the length (D) of the ray. Then, the ray traverses the volume step by step in a loop. On each step, the volume is sampled at the current ray position, and the position is advanced h units in the ray direction. For accessing the transfer function and volume textures, we use the function texture, as we want interpolated texels and voxels, respectively. The code is optimized with early ray termination; that is, when the ray accumulates an opacity value greater than a threshold (opacityThreshold), the loop finishes. Finally, the resulting color and opacity are stored in the currently attached framebuffer, indicated by the output variable vFragColor.

Compute shader
Listing 3 shows the implementation of the volume ray casting using the compute shader. The logic and the code are similar to those of the fragment shader. In this case, the rasterized images with the entry and exit points of the rays are rendered with OpenGL ® and stored in OpenGL ® textures. The compute shader can access textures through samplers in the same way as the fragment shader. The only difference is the output texture, which has to be bound as an image with the glBindImageTexture function. To launch the parallel kernel, the function glDispatchCompute must be called, indicating the total number of threads and the number of threads per block. Unlike the fragment shader, a kernel is executed for every thread, and the variable gl_GlobalInvocationID differentiates one thread from another. As the execution of the kernel is asynchronous, the function glMemoryBarrier must be called with the parameter GL_SHADER_IMAGE_ACCESS_BARRIER_BIT to ensure that the result is written to the texture before it is used in the final rendering. Finally, a full-screen quad is rasterized with the output texture onto the screen.

OpenCL
The OpenCL implementation of the volume ray casting is presented in Listing 4. In this case, an API other than OpenGL ® is used to compute the rendered image, so interoperability functions must be used to allow the two APIs to communicate. First, every OpenGL ® texture must be bound to an OpenCL image using clCreateFromGLTexture. Before invoking the kernel, OpenCL must acquire every OpenGL ® texture to be used with clEnqueueAcquireGLObjects, and release them after use with clEnqueueReleaseGLObjects. The function clEnqueueNDRangeKernel must be called to launch the kernel, indicating the total number of threads and the number of threads per block. As in the compute shader case, the kernel is executed once per thread, and each thread is uniquely identified with the function get_global_id. Finally, clFinish must be called to ensure that the output image is ready to be rendered by OpenGL ®. Some data type manipulation functions were added to the OpenCL code to make it more readable.

CUDA
Finally, the CUDA-based implementation is shown in Listing 5. Like OpenCL, CUDA has functions to allow interoperability with OpenGL ®. Almost all OpenGL ® textures must be mapped to a CUDA array with the functions cudaGraphicsGLRegisterImage and cudaGraphicsSubResourceGetMappedArray. Then the array must be mapped to a CUDA texture with the function cudaBindTextureToArray. The only exception to this rule is the output image: CUDA cannot write to a texture, so the array corresponding to the output texture must be mapped to a CUDA surface with cudaBindSurfaceToArray. This mapping between an OpenGL ® texture and a CUDA texture only has to be performed once; afterwards, the OpenGL ® texture shares the same GPU memory address as the CUDA texture or surface. Thus, CUDA can access the OpenGL ® output texture without further overhead.
As in the compute shader and OpenCL cases, the kernel is executed once per thread, and each thread is uniquely identified with the variables blockIdx, blockDim, and threadIdx. When the kernel is launched, the number of threads per block and the number of blocks must be indicated. Finally, cudaDeviceSynchronize must be called to wait until the end of the kernel execution. Some data type manipulation functions were added to the CUDA code to make it more readable.

Diffuse Lighting
To illuminate a volume, a normal has to be obtained for every voxel using finite differences. The finite difference in a discrete scalar field such as the volume is calculated as the subtraction of the current voxel from its neighbor along each axis, which gives the gradient vector. This vector can be normalized and used as the normal vector of the volume at the corresponding world position of the voxel. Listing 6 shows GLSL code with a function to calculate the diffuse lighting. The same logic can easily be implemented in the compute shader, OpenCL, and CUDA.

Test and Results
In this section we show the results obtained in this research. First, the system configuration and the datasets used for testing are presented. Then, the methodology used for every test is explained. Finally, the results are shown, along with the corresponding discussion.

Environment and Datasets
Tests were performed on a conventional PC with an Intel(R) Core(TM) i7-3770 CPU at 3.40 GHz, 8 GB of RAM, 64-bit Windows 7, and an Nvidia GeForce GTX 660 with the Kepler microarchitecture. That video card supports OpenGL ® 4.5, OpenCL 1.2, and CUDA with compute capability 3.0.
All tested datasets were rendered from the same viewing angle with the same rotation. We selected three datasets: a visible male CT from the Visible Human project [34] (Figure 3), an explicit equation (Figure 4), and a stag beetle [35] (Figure 5). To test different volume resolutions, the datasets were downsampled to five resolutions, as shown in Table 1.
Furthermore, two transfer functions were used to display the male dataset. The first transfer function (a) defines the skin as an opaque surface (male TF1); thus, most rays stop after hitting the skin, and ray casting should finish faster. The second transfer function (b) defines semi-transparent low-density tissue and opaque bones (male TF2); thus, rays traverse the semi-transparent area until a fully opaque voxel is reached. For the explicit equation we chose two transfer functions: (a) full opacity at the first hit of a particular surface (lemni TF1), and (b) many semi-transparent surfaces (lemni TF2). Finally, the beetle dataset was rendered with two transfer functions: (a) an opaque surface at the beetle's exoskeleton (beetle TF1), and (b) a semi-transparent exoskeleton (beetle TF2).

Test Methodology
Tests were designed to compare performance as fairly as possible. First, the compute shader, OpenCL, and CUDA versions were tested with different block sizes. The block sizes were chosen to keep a power-of-two number of threads per block (Table 2), covering different width/height ratio combinations. Other block sizes and configurations (e.g., 16 × 16 and 32 × 16) were also tested in a less exhaustive way, as they did not present better results than those shown in the table. Due to lack of space, only the best result for a block size configuration is shown in the results section, for each dataset and transfer function. Furthermore, tests were performed using rasterization and the ray/box intersection test, for all datasets and transfer functions, with a fixed screen resolution of 1024 × 768. Each pixel of the screen is mapped to a thread, and the block size indicates how those threads are grouped. If, for example, we have a 4 × 4 block size, then the threads are grouped in a matrix of 1024/4 × 768/4 = 256 × 192 blocks, with 16 threads per block. The results are also compared with those of the fragment shader, for which there is no block size to choose.
Finally, an additional test was performed adding diffuse illumination to the volume ray casting. The illumination was only tested with opaque transfer functions (as seen in Figure 6), because with transparent functions the rendering yields incoherent visual effects. This test was done for the fragment shader, compute shader, OpenCL, and CUDA. Calculating the illumination requires fetching data from neighboring voxels and extra computation during rendering, which can introduce a performance overhead that may differ depending on the parallel API used.

Results
First, the execution times in milliseconds per frame are presented in Table 3 and Figure 7 for the fragment shader, compute shader, OpenCL, and CUDA. The results are shown for the two ray/volume intersection methods: rasterization (R) and the ray/box intersection test (RB). Due to lack of space, only the time for the best block size configuration is presented, alongside the corresponding configuration, for compute shader, OpenCL, and CUDA. The Appendix includes an example table with all block configurations for a single dataset. For measuring the R and RB tests, a timer is started at the beginning of the main loop and stopped at the end, and the result is averaged over the number of rendered frames. We also ran both methods (R and RB) including and excluding the display of the final image on the screen, to measure the display overhead. This overhead is bounded between 0.1 and 0.5 milliseconds in all cases.
The results show that for larger datasets the execution time of the volume rendering is higher, as the rays have to traverse larger volume areas. The results are also consistent with the transfer function transparency: the more transparent the transfer function, the higher the execution time, again because the rays traverse larger volume areas.
In the table, some values are highlighted to indicate the best result between R and RB, for each implementation, dataset, and transfer function. In the compute shader case, RB is faster than R in 96% of the cases, by up to 0.92 milliseconds. RB is faster in the OpenCL case in more than 83% of the cases, with a difference between R and RB of up to approximately 5.26 milliseconds. In the CUDA case, however, R is the best option in 86% of the cases, being up to 0.63 milliseconds faster than RB. In general, the best block size is 16 × 8 for the smaller datasets (less than 40 MB), 16 × 8 or 8 × 8 for intermediate volumes (around 150 MB), and 4 × 4 or 8 × 4 for the larger volumes (around 600 MB), in almost all cases. As can be observed, the best results tend to come from block configurations with a width/height ratio of 1 : 1 or 1 : 2. For smaller volumes, a high number of threads per block gives better performance, while for larger volumes, fewer threads per block perform better.

Now, taking the results of the fragment shader and the best results between R and RB for compute shader, OpenCL, and CUDA, we can draw some interesting comparisons. For almost all volumes, the compute shader shows the best results. It is from 1.03× to 1.85× faster than the fragment shader, from 1.5× to 7.1× faster than OpenCL, and from 1.05× to 1.3× faster than CUDA. The second best performance is obtained by CUDA, which is from 1.1× to 5.3× faster than OpenCL. Nevertheless, the better option between fragment shader and CUDA varies depending on the size of the volume and the transfer function: CUDA is better with larger volumes and opaque transfer functions (TF1), whereas the fragment shader presents better results for smaller volumes and semi-transparent transfer functions (TF2).
Furthermore, the worst performance is shown by OpenCL, which is only better with respect to the fragment shader in some of the volumes at their maximum resolution.
Finally, an additional test was conducted to measure the performance of each parallel API with diffuse-illuminated volume ray casting. The results are shown in Table 4 and Figure 8. We only used the first transfer function (TF1) of each volume, as these transfer functions render an opaque surface. The results of this test show a similar behavior to the previous test. As before, some values are highlighted to indicate the best result between R and RB, for each implementation and dataset. In the compute shader case, RB is faster than R in 93.3% of the cases, with little difference in execution time. RB is faster in the OpenCL case in 86.6% of the cases in the table, again with little difference. In the CUDA case, however, R is the best option in 93.3% of the cases, again with little difference.
For smaller volume sizes (less than 40 MB), the best block configuration is 16 × 8 in almost all cases. For the larger volume sizes (around 600 MB), the best options are 4 × 2, 4 × 4, and 8 × 4; the only exception among the larger volumes is the OpenCL case, where the best option is 8 × 8. As in the previous test, block configurations with a 1 : 1 or 1 : 2 ratio tend to be the best choice, with more threads per block for smaller volumes and fewer threads per block for larger volumes.
Comparing the best results between RB and R for each parallel API against the fragment shader, we found that the compute shader has the fastest results among all the implementations, with the exceptions of the male 128 × 128 × 311 with respect to fragment shader and CUDA, and the beetle 416 × 208 × 247 with respect to the fragment shader. As before, CUDA is the second best option, being between 1.5× and 4.2× faster than OpenCL, and faster than the fragment shader in some cases. The fragment shader shows better performance than OpenCL in almost all cases, except for the volumes at maximum resolution, and better performance than CUDA in some smaller-volume cases.
Finally, the increase in execution time from adding diffuse lighting depends on the API. The increase for the fragment shader is between 1.3× and 2.8×, for the compute shader between 1.3× and 1.7×, for OpenCL between 1.08× and 2.3×, and for CUDA between 1.2× and 1.7×. As can be observed, the lighting overhead is lower for compute shader and CUDA than for fragment shader and OpenCL. Furthermore, the increases for the male dataset are higher than those for the lemni dataset in almost all cases. We did not obtain conclusive results comparing the beetle dataset with the other two, as the results vary depending on the API and the size of the volume.

Conclusions and Future Work
In this paper, we presented a performance comparison of volume rendering implementations using fragment shader, compute shader, OpenCL, and CUDA. The implementations were tested with three datasets at different volume resolutions, from smaller volumes (less than 40 MB), through intermediate volumes (around 150 MB), to larger volumes (around 600 MB). Each volume was tested with different transfer functions and, for the GPGPU implementations, different block sizes (4 to 128 threads per block). As expected, the performance of our ray casting implementations decreases with larger volumes and with semi-transparent transfer functions.
From the tests, we conclude that the compute shader implementation is, in general, the best option for implementing a basic volume ray caster. The compute shader shows the best results in almost all tested cases, being between 1.03× and 2.9× faster than the fragment shader, between 1.5× and 7.1× faster than OpenCL, and between 1.01× and 1.3× faster than CUDA. The next best option is either the fragment shader or CUDA. CUDA shows better results than the fragment shader in almost all cases with larger and intermediate volumes, with a speedup from 1.1× to 2.4×, whereas for smaller volumes there is no overall winner between the two. Finally, OpenCL has the worst performance in almost all cases, being between 1.01× and 7.1× slower than the other APIs. The only exception is for the volumes at their highest resolutions, where OpenCL is up to 1.4× faster than the fragment shader.
Furthermore, an additional test was performed to measure the overhead of calculating diffuse illumination in each tested volume. The addition of the illumination makes the rendering execution time to be up to 2.8× times slower than the normal rendering, as it has more calculation and more texture fetches. The increment in the execution time for fragment shader is up to 2.8×, for compute shader is up to 1.7×, for OpenCL is up to 2.3×, and for CUDA is up to 1.7×. Notice that the impact of the lighting calculation is smaller for compute shader and CUDA, being consistent with the results without lighting, where they were also the fastest approaches.
We also found that there is not a global winner in terms of execution time between ray/volume intersection methods (rasterization and ray/box). Nevertheless, the ray/box intersection presents better performance for most of the cases with compute shader and OpenCL, while the raster intersection test presents a better performance for most of the cases with CUDA. Thus, the best overall results of our test were obtained by combining compute shader with ray/box intersection test.
In our tests, we observed that the best block size varies with the volume size. For larger volume datasets, configurations such as 4 × 4, 8 × 4, and 4 × 2 are the best choices. The configurations 4 × 4 and 4 × 2 are worth noting, as their thread count is smaller than the CUDA warp size (32 threads). Theoretically, with these configurations, half or more of the threads of each warp will be idle, leaving a large amount of processing power unused, which should be inefficient. As the volume size becomes smaller, configurations with more threads per block give better results; for example, 8 × 8 or 16 × 8 for volumes up to around 150 MB. Block configurations with a 1 : 1 or 1 : 2 ratio present better results than those with rectangular aspect ratios, which suggests that 1 : 1 and 1 : 2 configurations yield more coalesced accesses to the volume texture and thread execution with less divergence. With rectangular block configurations, the threads fetch data from distant areas of the volume that may represent different materials, so it is likely that many threads terminate early and sit idle waiting for the other threads in the block to finish. Furthermore, for larger volumes the best option is fewer threads per block (e.g., 16 or 32), while for smaller volumes the best option is more threads per block (e.g., 64 or 128). This may be related to cache performance, since larger block sizes require access to larger volume areas at the same time. For smaller volumes, the required voxels may fit in the GPU cache, whereas larger volumes are more challenging for the cache. The same cache behavior may explain why block configurations with fewer threads than the warp size work well for larger volumes: with larger volumes, each thread fetches from distant areas of the volume, generating cache misses that can serialize the fetches. For that reason, even a thread count below the warp size can be beneficial, as it avoids a massive cache miss for the warp.
As future work, it would be interesting to test the implementations in a newer GPU, with more capabilities, especially for newer versions of OpenCL and CUDA. Furthermore, as is usual in volume rendering, many rays may be early-terminated due to opacity accumulation; thus, using a method like persistent threads [36] could be beneficial for opaque transfer functions, and the speedup obtained should be studied.
Table 5 shows an example of the tests conducted for a single dataset and parallel API. Every block configuration is tested, and the average execution time per frame in milliseconds is shown. A similar table was generated for every dataset and parallel API tested, to determine the best block configuration for rendering that volume. In this case, we can observe that the opaque transfer function (TF1) is always faster than the transparent transfer function (TF2), and that adding illumination always makes rendering slower. Furthermore, block configurations with a 1 : 1 or 1 : 2 ratio always represent better options than configurations with rectangular aspect ratios. The best block configuration is 4 × 4 for the non-illuminated volumes, and 4 × 2 for the illuminated volumes.