Analytical Evaluation of the 2D-DCT using parallel processing

One of the current research areas in the field of computer science is distributed computing systems. In distributed systems, software is partitioned into modules that are executed concurrently on a number of processors. A major difficulty in using distributed and parallel computing systems has been ease of use. There is no clear methodology that programmers can follow to use these systems effectively. This work seeks to assess the viability of using analytic performance analysis to assist in the evaluation of candidate algorithms through its application to a case study. The resulting models help us to estimate the total execution time and the optimal number of processors.


I. Introduction
A major difficulty in using distributed and parallel computing systems has been ease of use. There is no clear and simple methodology that programmers can follow to use these systems effectively. This work seeks to assess the viability of using analytic performance analysis to assist in the evaluation of candidate parallel and distributed algorithms through a case study: the Discrete Cosine Transform (DCT). Since the DCT is computation intensive, we focus on developing a DCT algorithm that can be executed in parallel on a shared memory multiprocessor architecture, and on finding the least execution time and the optimal number of processors.
The Discrete Cosine Transform (DCT) was first proposed by Ahmed et al. (1974), and it has become increasingly important in recent years. The DCT has been widely used in signal processing of image data, especially in coding for compression, because of its near-optimal performance. Because of the widespread use of DCTs, research into fast algorithms for their implementation has been rather active [2]-[6]. Since the DCT is computation intensive, the development of high-speed hardware and real-time DCT processor design has also been an object of research [7]-[9]. This work shows different approaches to implementing a two-dimensional DCT on a Sequent system, and formulates an analytical model for each of these implementations that helps us to estimate the total execution time and the optimal number of processors.
The rest of this paper is organized as follows. Section II gives some background information. Section III proposes an implementation for the one-dimensional DCT, which is used to estimate the two-dimensional DCT, and two different approaches for the two-dimensional DCT. Results are discussed in Section IV. Finally, conclusions are given in Section V.

II. Background
Different approaches have been used to find an efficient algorithm for computing the one-dimensional and the two-dimensional DCT. Although the approaches and the resulting algorithms are quite different, they share the common goal of achieving speed and accuracy. We pursue the same goals, but by means of parallel processing.
The main purpose of parallel processing is to perform computations faster than can be done with a single processor by using a number of processors concurrently.
For a given data sequence {x(i) : i = 0, 1, ..., N-1}, the one-dimensional DCT is defined for m = 0, 1, ..., N-1, with a scale factor that takes one value for m = 0 and another otherwise. The input sequence can be represented by the column vector x.
Let an (M x N) matrix [g] represent a black-and-white digital picture, where the matrix element g_mn may be interpreted as the gray level or intensity of the pixel at location (m, n). Let [G] be the two-dimensional DCT of [g]. The uv-element of [G] is defined for u = 0, 1, ..., M-1 and v = 0, 1, ..., N-1.
Since loops provide the greatest potential for parallelism to be exploited by multiprocessor systems, it is reasonable and effective to focus our attention on determining the optimum degree of parallelism, i.e., how many processors to use to compute the two-dimensional DCT. However, maximum parallel execution, i.e., executing all the processes in parallel, may not provide the least-time-cost solution because of increased communication costs.
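As an illustration of the definitions above, the following Python sketch assumes the widely used orthonormal DCT-II normalization (scale factor sqrt(1/N) for m = 0 and sqrt(2/N) otherwise). The normalization and the function names dct_1d and dct_2d are assumptions made here for illustration and may differ from the convention used in this paper. The two-dimensional transform is computed by the row-column method.

```python
import math

def dct_1d(x):
    """N-point DCT of a sequence, assuming the orthonormal DCT-II scaling."""
    n = len(x)
    out = []
    for m in range(n):
        # Assumed scale factor: one value for m = 0, another otherwise.
        scale = math.sqrt(1.0 / n) if m == 0 else math.sqrt(2.0 / n)
        acc = 0.0
        for i in range(n):
            acc += x[i] * math.cos((2 * i + 1) * m * math.pi / (2 * n))
        out.append(scale * acc)
    return out

def dct_2d(g):
    """Row-column 2D DCT of an M x N matrix given as a list of rows."""
    rows = [dct_1d(row) for row in g]                 # M N-point DCTs along the rows
    cols = [dct_1d(list(col)) for col in zip(*rows)]  # N M-point DCTs along the columns
    return [list(r) for r in zip(*cols)]              # transpose back to M x N
```

With this normalization a constant image concentrates all of its energy in the element G[0][0], which is a quick sanity check on any implementation.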

III. Implementation using Parallel Processing
The one-dimensional DCT can be easily implemented with two nested loops. Each iteration of the outer loop can be considered as a module, DCT(i), and can be assigned to a processor. Therefore, in the computation of the DCT for a sequence of size N there are N modules or processes, each with execution time t1D. A DOALL approach can be used to execute all of the N independent modules in parallel using N processors. The main obstacle would be the unavailability of processors. For this alternative, we want to find the total time cost and the optimum number of processors using some of the results of Garg [10].
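The nested-loop structure and the DOALL execution of the N modules can be sketched as follows. This is a minimal Python illustration, not the Sequent implementation: the names dct_module, dct_sequential and dct_doall are hypothetical, the orthonormal DCT-II scaling is assumed, and a thread pool stands in for the p processors.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def dct_module(x, i):
    """Module DCT(i): one iteration of the outer loop, producing one output value."""
    n = len(x)
    # Assumed orthonormal DCT-II scale factor.
    scale = math.sqrt(1.0 / n) if i == 0 else math.sqrt(2.0 / n)
    acc = 0.0
    for k in range(n):  # inner loop over the input samples
        acc += x[k] * math.cos((2 * k + 1) * i * math.pi / (2 * n))
    return scale * acc

def dct_sequential(x):
    """Outer loop executed on a single processor."""
    return [dct_module(x, i) for i in range(len(x))]

def dct_doall(x, p):
    """DOALL: the N independent modules are handed to a pool of p workers."""
    with ThreadPoolExecutor(max_workers=p) as pool:
        return list(pool.map(lambda i: dct_module(x, i), range(len(x))))
```

The sketch shows only the structure of the computation. In standard CPython, threads do not speed up a CPU-bound loop like this one, so it should not be used to reproduce timing results.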
If p processors are used to compute the DCT, then each processor will execute a maximum of ⌈N/p⌉ modules. Fcost is the time cost to execute the fork operation and Jcost is the time cost to execute the join operation. On a shared memory multiprocessor like the Sequent Symmetry system, it was found [11] that the cost to create multiple processes was additive and directly proportional to the number of processors. Therefore, the time cost to create p processes is p*tx, where tx is the process creation time. tc is the access time in a shared memory multiprocessor.
In a sequential communication model, the total time to access the data is p*N*tc, because each processor needs the complete data sequence to perform its own computation, and the sequence has to be sent sequentially to each processor. The total time to collect the data is N*tc. The completion time for the parallel DCT in a sequential communication model is obtained from these costs, and from it the optimum number of processors for the least-time-cost solution.
If the communication model is a broadcast, then the total time to distribute the data is N*tc, since the data set size is N and the data can be sent to each of the p processors at the same time. The total time to collect the data is also N*tc: each process computes one output value, and the values have to be collected sequentially. As before, the completion time for the parallel DCT in a broadcast communication model, and the optimum number of processors to execute the N modules with the least time cost, follow from these costs.
For the two-dimensional DCT, in the computation of the DCT over the columns some of the data is already in the process, and some of the data is being computed by other processes. The model therefore includes the total time to move data between the different processes, in addition to the costs above, both in the completion time of the parallel two-dimensional DCT and in the optimum number of processors to execute an N x N two-dimensional DCT with the least time cost.
If the communication model is broadcast, then the total time to distribute the data is N^2*tc, since the data size is N^2 and the data can be sent to each of the p processors at the same time. The total time to collect the data is also N^2*tc. The total time to move data between the different processes after the one-dimensional DCT over the rows is N^2*tc. These terms determine the completion time for the parallel two-dimensional DCT with a broadcast communication model.
In the second approach, we change the assignment of work to each processor. If p processors are used to compute the two-dimensional DCT, then each processor computes at least ⌊N/p⌋ one-dimensional DCTs over the rows, and then over the columns. There is no need to move data between processes. The completion time for the parallel two-dimensional DCT and the optimum number of processors to execute an N x N two-dimensional DCT with the least time cost are derived as before.
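To make the one-dimensional cost terms concrete, the sketch below assumes that the completion time is the plain sum of the costs listed above: fork, process creation, data distribution, computation of ⌈N/p⌉ modules, data collection, and join. That additive form, the closed-form optimum obtained by treating N/p as continuous, and all function names are assumptions made here for illustration. They are not taken from the paper's own equations.

```python
import math

def t_sequential(p, n, t1d, tx, tc, fcost=0.0, jcost=0.0):
    """Assumed completion time, sequential communication model."""
    # fork + create p processes + send N values to each of p processors
    # + ceil(N/p) modules per processor + collect N results + join
    return fcost + p * tx + p * n * tc + math.ceil(n / p) * t1d + n * tc + jcost

def t_broadcast(p, n, t1d, tx, tc, fcost=0.0, jcost=0.0):
    """Assumed completion time, broadcast communication model."""
    # Distribution costs N*tc once, independent of p.
    return fcost + p * tx + n * tc + math.ceil(n / p) * t1d + n * tc + jcost

def p_opt_sequential(n, t1d, tx, tc):
    """Minimizer of p*(tx + N*tc) + (N/p)*t1D with N/p treated as continuous."""
    return math.sqrt(n * t1d / (tx + n * tc))

def p_opt_broadcast(n, t1d, tx, tc):
    """Minimizer of p*tx + (N/p)*t1D with N/p treated as continuous."""
    return math.sqrt(n * t1d / tx)

def best_p(cost, n, max_p, **params):
    """Integer search for the processor count with the least modelled time."""
    return min(range(1, max_p + 1), key=lambda p: cost(p, n, **params))
```

Because the number of modules per processor is an integer, the closed-form value is only a guide. The integer search in best_p can return a different processor count from the rounded continuous optimum.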

IV. Evaluation and Results
The objective is to show different alternatives for the implementation of the two-dimensional DCT, and the corresponding performance models to express the estimated execution time and the optimum number of processors to achieve the least execution time. In this section, we evaluate some of these alternatives and compare them with the corresponding analytical model.
The methodology used to achieve these objectives was:
♦ Different approaches were implemented on the Sequent system and the actual execution times were recorded.
♦ The actual execution times obtained from the implementation of the one-dimensional DCT, varying the number of processors and the size of the data, were used to estimate the execution time of each individual module, the communication time to distribute and collect data, the process creation time, and the Fork and Join cost.
♦ Some of these data were available using TCAS [11], but we used multiple regression to estimate the different values needed.
♦ The estimated values were used in the two-dimensional DCT models.
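The regression step can be sketched as follows. This minimal Python example assumes an additive one-dimensional model with sequential communication, T = (Fcost + Jcost) + tx*p + tc*(p + 1)*N + t1D*⌈N/p⌉, and fits its four coefficients by ordinary least squares. The model form, the helper names, and the parameter values are assumptions for illustration. The timings are synthetic and noise-free, not measurements from the Sequent system.

```python
import math

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit(rows, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def regressors(p, n):
    """Columns for: Fcost + Jcost, tx, tc, t1D (assumed additive model)."""
    return [1.0, float(p), float((p + 1) * n), float(math.ceil(n / p))]

# Synthetic timings generated from made-up parameter values.
true_params = [5.0, 2.0, 0.01, 30.0]
grid = [(p, n) for p in (1, 2, 4, 8) for n in (64, 128, 256)]
rows = [regressors(p, n) for p, n in grid]
times = [sum(c * v for c, v in zip(true_params, r)) for r in rows]
estimate = fit(rows, times)  # should recover true_params up to rounding error
```

With real measurements the fitted coefficients would carry noise, and only the sum Fcost + Jcost is identifiable from this model, not the two costs separately.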
Table 1 shows some of the real execution times obtained after varying the number of processors and the data size for the one-dimensional DCT using a sequential communication model. R indicates that our model fits the data in more than 99.9 percent of the cases. Figures 1 and 2 show the estimated execution time and the real execution time.
The second approach to the two-dimensional DCT was implemented. In this alternative, there is no inter-processor communication. The resulting execution times are shown in Table 2. Figures 3 and 4 show the estimated execution time and the real execution time for the two-dimensional DCT.

V. Conclusions
The major objective of this work was to use the Discrete Cosine Transform as a case study in the effective use of a shared memory multiprocessor through parallel processing. Once we have an analytical model of the execution time that describes the behavior of a specific algorithm, we can estimate the optimum number of processors to be assigned to obtain the least execution time.
Different algorithm structures were presented in this work for the DCT. Some of them were implemented on the Sequent system and compared with their corresponding performance models, with good predictive accuracy. The statistical approach proved to be an adequate technique.
Given the strong relationship between the total execution time, the number of processors, and the data size, it was possible through multiple linear regression to find some of the coefficients of the model, or the relationships between them, such as the access time, the execution time of each iteration, the process creation time, and the Fork and Join cost, to be used in our predictions. Using these primitive function performance estimates, it was possible to explore the behavior of alternative approaches with high accuracy, even beyond the processor limits of the existing Sequent machine.
In the future, it would be interesting to implement the algorithm using pipelining and to compare it with parallel processing. It would also be desirable to implement some of the fast algorithms for the DCT on multiprocessors, and to determine whether the implementation of other single-processor algorithms leads to better execution times on multiprocessor systems.
CLEI ELECTRONIC JOURNAL, VOLUME 1, NUMBER 1, PAPER 3, JUNE 1998
The two-dimensional (M x N) DCT can be implemented by M N-point DCTs along the rows of [g], followed by N M-point DCTs along the columns of the matrix obtained after the row transformation.
For the two-dimensional DCT, we will show two different approaches. In the first one, each process consists of two steps. The first step computes an N-point one-dimensional DCT over the rows of the original matrix. The second step computes an N-point one-dimensional DCT over the columns of the matrix obtained after the first step. If p processors are used to compute the two-dimensional DCT, then each processor computes a one-dimensional DCT over N/p rows: processor 0 computes a one-dimensional DCT over rows 0, p, 2p, and so on, processor 1 computes a one-dimensional DCT over rows 1, p+1, 2p+1, and so on. When all the processes are completed, each processor computes a one-dimensional DCT over N/p columns. t1D(N) is the execution time of an N-point one-dimensional DCT. Fcost is the time cost to execute the fork operation and Jcost is the time cost to execute the join operation. The time cost to create p processes is p*tx, where tx is the process creation time. tc is the access time. In a sequential communication model, the total time to distribute the data for the computation of the one-dimensional DCT over the rows is N^2*p*tc. The total time to collect the data is N^2*tc.
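The cyclic assignment of rows to processors and the two-step structure of the first approach can be sketched as follows. This is a sequential Python simulation in which each value of k stands for one processor. The function names are hypothetical, the matrix is assumed to be square (N x N), and the cyclic assignment of columns in the second step is an assumption, since the text only states that each processor handles N/p columns.

```python
def rows_for_processor(k, p, n):
    """Indices handled by processor k of p: k, k + p, k + 2p, ..."""
    return list(range(k, n, p))

def assignment(p, n):
    """Cyclic assignment of n rows (or columns) to p processors."""
    return [rows_for_processor(k, p, n) for k in range(p)]

def first_approach(g, p, dct_1d):
    """Two-step 2D DCT of an N x N matrix; dct_1d is any 1D transform."""
    n = len(g)
    rows = [None] * n
    for k in range(p):                      # each k stands for one processor
        for r in rows_for_processor(k, p, n):
            rows[r] = dct_1d(g[r])          # step 1: DCT over the assigned rows
    # All processes must finish step 1 before step 2 starts.
    cols = [list(c) for c in zip(*rows)]
    out_cols = [None] * n
    for k in range(p):
        for c in rows_for_processor(k, p, n):
            out_cols[c] = dct_1d(cols[c])   # step 2: DCT over the assigned columns
    return [list(r) for r in zip(*out_cols)]
```

When p does not divide N, some processors receive one more row than others, which is why the per-processor work is bounded by ⌈N/p⌉ one-dimensional DCTs in each step.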

Table 1: Execution time, one-dimensional DCT
Each processor works independently. The estimated values obtained for the one-dimensional DCT are used in our model to estimate the execution time and the optimum number of processors.