Expanding the VPE-qGM Environment Towards a Parallel Quantum Simulation of Quantum Processes Using GPUs

Quantum computing proposes quantum algorithms exponentially faster than their classical analogues when executed by a quantum computer. As quantum computers are currently unavailable for general use, one approach for analyzing the behavior and results of such algorithms is simulation on classical computers. Since this simulation is inefficient due to the exponential growth of its temporal and spatial complexities, solutions for these two problems are essential in order to increase the simulation capabilities of any simulator. This work proposes a methodology defined by two main steps: the first consists of the sequential implementation of the abstractions corresponding to the Quantum Processes and Quantum Partial Processes defined in the qGM model, reducing the memory consumption related to multidimensional quantum transformations; the second is the parallel implementation of such abstractions, allowing their execution on GPUs. The results obtained by this work embrace the sequential simulation of controlled transformations up to 24 qubits. In the parallel simulation approach, Hadamard gates up to 20 qubits were simulated with a speedup of ≈50× over an 8-core parallel simulation, a significant performance improvement in the VPE-qGM environment when compared with its previous limitations.


Introduction
Quantum Computing (QC) is a computational paradigm, based on Quantum Mechanics (QM), that predicts the development of quantum algorithms. In many scenarios, these algorithms can be faster than their classical versions, as described in [1] and [2]. However, such algorithms can only be efficiently executed by quantum computers, which are still under development and restricted in the number of qubits.
In this context, simulation using classical computers allows the development and validation of basic quantum algorithms, anticipating the knowledge related to their behaviors when executed on quantum hardware. Despite the variety of quantum simulators already proposed, such as [3,4,5,6,7,8,9], the simulation of quantum systems using classical computers is still an open research challenge.
Quantum simulators powered by clusters have already been proposed in order to accelerate the computations, as can be seen in [6] and [7]. The downside of this approach is the need for expensive computing resources in order to build a cluster powerful enough to handle the computations associated with quantum systems described by many qubits. Moreover, a bottleneck generated by inter-node communication limits the performance of such simulators. These issues motivate the search for new solutions focused on the modeling, interpretation and simulation of quantum algorithms.
The VPE-qGM (Visual Programming Environment for the Quantum Geometric Machine Model), previously described in [10] and [11], is a quantum simulator being developed with both characterizations in mind: visual modeling and distributed simulation of quantum algorithms. VPE-qGM presents the application and evolution of the simulation through integrated graphical interfaces. Considering the high processing cost of simulations, this work aims to improve the simulation capabilities of the VPE-qGM environment to establish support for more complex quantum algorithms.
This work describes the improvements of the VPE-qGM's simulation capabilities considering two approaches. The first is the implementation of the concepts of Quantum Process (QP) and Quantum Partial Process (QPP), in order to reduce the computation performed during the simulation. The second is the extension of its execution library to allow the use of GPUs to accelerate the computations.

CLEI ELECTRONIC JOURNAL, VOLUME 16, NUMBER 3, PAPER 03, DECEMBER 2013

This paper is structured as follows. Section 2 contains the background on quantum computing, GPU computing and some remarks on the VPE-qGM environment. Related works following three different approaches for quantum simulation are described in Section 3. The implementation of Quantum Processes and Quantum Partial Processes is depicted in Section 4. In Section 5, parallel quantum simulation on GPUs is described. Results are discussed in Section 6, together with a detailed analysis of sequential and parallel simulations for different types of quantum transformations. Conclusions and future work are considered in Section 7.

Preliminaries
In order to understand and evaluate the contributions of this work, some concepts related to the two main areas (quantum computing and GPU computing) are discussed in the following subsections.

Quantum Computing
QC predicts the development of quantum computers that explore the phenomena of QM (state superposition, quantum parallelism, interference, entanglement) to obtain better performance in comparison to their classical counterparts [12]. Quantum algorithms are modeled considering some mathematical concepts, summarized below.
In QC, the qubit is the basic information unit, being the simplest quantum system, defined by a unitary and bi-dimensional state vector. Qubits are generally described in Dirac's notation [12] by |ψ⟩ = α|0⟩ + β|1⟩. The coefficients α and β are complex numbers corresponding to the amplitudes of the respective states of the computational basis (state space). They respect the condition |α|² + |β|² = 1, which guarantees the unitarity of the state vector of the quantum system, represented by (α, β)ᵗ.
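As an illustration, a qubit and its normalization condition can be checked with a small NumPy sketch (the values chosen for α and β below are arbitrary examples):

```python
import numpy as np

# A qubit |psi> = alpha|0> + beta|1> as a 2-component complex state vector.
alpha, beta = 1 / np.sqrt(2), 1j / np.sqrt(2)
psi = np.array([alpha, beta], dtype=np.complex128)

# Unitarity condition: |alpha|^2 + |beta|^2 = 1.
norm = np.sum(np.abs(psi) ** 2)
assert np.isclose(norm, 1.0)
```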
The state space of a quantum system with multiple qubits is obtained by the tensor product of the state spaces of its subsystems. Considering a quantum system with two qubits, |ψ⟩ = α|0⟩ + β|1⟩ and |φ⟩ = γ|0⟩ + δ|1⟩, the composed state is described by the tensor product |ψ⟩ ⊗ |φ⟩ = αγ|00⟩ + αδ|01⟩ + βγ|10⟩ + βδ|11⟩. The state transition in an N-dimensional quantum system is performed by unitary quantum transformations defined by square matrices of order N = 2ⁿ, where n is the number of qubits in the system.
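The tensor product above corresponds directly to NumPy's Kronecker product; a minimal sketch with arbitrary example amplitudes:

```python
import numpy as np

# |psi> = alpha|0> + beta|1>,  |phi> = gamma|0> + delta|1>
psi = np.array([0.6, 0.8], dtype=np.complex128)   # alpha, beta
phi = np.array([1.0, 0.0], dtype=np.complex128)   # gamma, delta

# Two-qubit state: amplitudes ordered |00>, |01>, |10>, |11>.
state = np.kron(psi, phi)
# state == [alpha*gamma, alpha*delta, beta*gamma, beta*delta]
assert np.allclose(state, [0.6, 0.0, 0.8, 0.0])
```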
As an example, the matrix notation for the Hadamard and Pauli X transformations is given by H = (1/√2)[[1, 1], [1, −1]] and X = [[0, 1], [1, 0]], respectively. An application of the Hadamard transformation to the quantum state |ψ⟩, denoted by H|ψ⟩, generates a new global state. Quantum transformations simultaneously applied to different qubits imply the tensor product, also named Kronecker product, of the corresponding matrices, as described in (3).
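These standard matrices, the application H|ψ⟩ and the Kronecker product of simultaneous gates can be sketched as:

```python
import numpy as np

# Standard Hadamard and Pauli X matrices.
H = (1 / np.sqrt(2)) * np.array([[1, 1], [1, -1]], dtype=np.complex128)
X = np.array([[0, 1], [1, 0]], dtype=np.complex128)

# H|0> yields the uniform superposition (|0> + |1>)/sqrt(2).
ket0 = np.array([1, 0], dtype=np.complex128)
assert np.allclose(H @ ket0, [1 / np.sqrt(2), 1 / np.sqrt(2)])

# H applied simultaneously to two qubits: the 4x4 matrix H (x) H.
H2 = np.kron(H, H)
assert np.allclose(H2 @ np.kron(ket0, ket0), [0.5, 0.5, 0.5, 0.5])
```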
Besides the multi-dimensional transformations obtained by the tensor product, controlled transformations also modify the state of one or more qubits considering the current state of other qubits in a multi-dimensional quantum state.
The CNOT quantum transformation receives the tensor product of two qubits |ψ⟩ and |φ⟩ as input and applies the NOT (Pauli X) transformation to one of them (the target qubit), considering the current state of the other (the control). Figure 1(a) shows the matrix notation of the CNOT transformation and its application to a generic two-dimensional quantum state. The corresponding representation in the quantum circuit model is presented in Figure 1(b). Controlled transformations can be generalized (Toffoli) in a similar way [12].
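The action of the standard CNOT matrix on a basis state can be verified numerically; a small sketch:

```python
import numpy as np

# CNOT: the first qubit controls a NOT (Pauli X) on the second (target) qubit.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=np.complex128)

# |10> -> |11>: the control is |1>, so the target qubit is flipped.
ket10 = np.array([0, 0, 1, 0], dtype=np.complex128)
assert np.allclose(CNOT @ ket10, [0, 0, 0, 1])

# |01> is unchanged: the control is |0>.
ket01 = np.array([0, 1, 0, 0], dtype=np.complex128)
assert np.allclose(CNOT @ ket01, ket01)
```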
By the composition and synchronization of quantum transformations, it is possible to execute computations exploring the potentialities of quantum parallelism. However, an exponential increase in memory space usually arises during a simulation and, consequently, a loss of performance when simulating multi-dimensional quantum systems. Such behavior is the main limitation of simulations based on matrix notation. Hence, optimizations for an efficient representation of multi-dimensional quantum transformations are required to obtain better performance and reduce both memory consumption and simulation time.
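A quick back-of-the-envelope calculation illustrates this exponential growth (assuming 16-byte double-precision complex amplitudes and counting only the state vector, not the transformation matrices):

```python
# State-vector memory alone grows as 2**n complex amplitudes; with 16-byte
# complex128 values, 30 qubits already require 16 GiB.
for n in (10, 20, 30, 40):
    gib = (2 ** n) * 16 / 2 ** 30
    print(f"{n} qubits: {gib:g} GiB")
```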

GPU Computing
The GPGPU (General Purpose Computing on Graphics Processing Units) programming paradigm has become one of the most interesting approaches for HPC (High Performance Computing) due to its good balance between cost and benefit.For suitable problems, the high parallelism and computational power provided by GPUs can accelerate several algorithms, including the ones related to QC.
The parallel architecture of GPUs offers great computational power and high memory bandwidth, being suitable to accelerate many applications. The performance improvements are due to the use of a large number of processors (cores). The GPU's architecture is composed of SMs (Streaming Multiprocessors), memory controllers, registers, CUDA processors and different memory spaces that are used to reduce bandwidth requirements and, hence, achieve better speedups.
The CUDA parallel programming model [13] provides abstractions such as threads, grids, shared memory spaces, and synchronization barriers to help programmers efficiently exploit all the available resources. It is based on the C/C++ language with some extensions that allow access to the GPU's internal components. A program consists of a host-code and a device-code. The host-code runs on the CPU and consists of a non-intensive and mostly sequential computational load. It is used to prepare the structures that will run on the GPU and, eventually, for a basic pre-processing phase. The device-code runs on the GPU itself, representing the parallel portion of the related problem.
Although the CUDA programming model is based on the C/C++ language, other languages and libraries are also supported. In the specific case of this work, the extension for the Python language, named PyCuda [14], was chosen over a lower-level, better-performing language for two main reasons:
• Prototyping with the Python language results in faster and easier development due to the few coding restrictions imposed;
• The host-code comprises methods for the creation of the basic structures that are later copied to the GPU. Such creation is based on string formatting and the manipulation of multidimensional structures, which can be easily prototyped with Python.
On the other hand, the device-code performs a more restricted and intensive computation, and is still implemented in the C language as a regular CUDA kernel.
By using features of PyCuda such as garbage collection, readable error messages, and faster development, the technical part of the development process becomes easier and greater attention can be given to the algorithmic problem. A basic PyCuda workflow is shown in Figure 2. Binary executables are obtained from a C-like CUDA source code (CUDA kernel) generated by PyCuda as a string, allowing runtime code generation. The kernel is also compiled during runtime and stored in a semi-permanent cache for future reuse if the source code is not modified.

VPE-qGM Environment
The VPE-qGM environment is being developed in order to support the modeling and distributed simulation of algorithms from QC, considering the abstractions of the qGM (Quantum Geometric Machine) model, previously described in [15].

qGM Model
The qGM model is based on the theory of coherent spaces, introduced by Girard in [16]. The objects of the process domain D∞ (see, e.g. [17,18]) are relevant in the context of this work since they can define coherent sets that interpret possibly infinite quantum processes.
The qGM model substitutes the notion of quantum gates by the concept of synchronization of elementary processes (EPs).The memory structure that represents the state space of a quantum system associates each position and corresponding stored value to a state and an amplitude, respectively.The computations are conceived as state transitions associated to a spatial location, obtained by the synchronization of classical processes, characterizing the computational time unit.Based on the partial representation associated with the objects of the qGM model, it is possible to obtain different interpretations for the evolution of the states in a quantum system.
In the qGM model, an EP (elementary process) can read from many memory positions of the state space but can only write to one position. For example, the application H|ψ⟩, described in (2.1), is composed of two classical operations. A Quantum Process (QP) that represents the H transformation is obtained when two EPs, associated to the operations described in (i) and (ii), are synchronized. Such construction is illustrated in Figure 3. The parameters of the EPs define a behavior similar to the vectors that comprise the corresponding definition matrix. During the simulation, both EPs are simultaneously executed, modifying the data in the memory positions according to the behavior of the quantum matrix associated to the H transformation, thus simulating the evolution of the quantum system. The interpretation for the concept of QPPs is obtained from the partial application of a quantum gate. Consider the gate H⊗2, defined in (3). Each line i of the corresponding matrix H⊗n, where n = 2, is characterized by an EPi with a computation defined in Eq. (4), where h represents one element of H⊗2, indexed by i (line) and j (column). Therefore, the synchronization of the EPi, for i ∈ {0, 1, 2, 3}, is the equivalent in the qGM model to the computation generated by the matrix H⊗2.
All possible subsets of EPs interpret different QPPs. A QPP corresponds to a matrix with a subset of defined components and a disjoint subset of undefined components (indicated by the bottom element ⊥).
Considering as context the elements of the computational basis (|00⟩, |01⟩, |10⟩, |11⟩), it is possible to obtain a fully described two-qubit state |Φ1⟩ by the union (interpreting the amalgamated sum on the process domain of the qGM model) of four partial states, defined from Eq. (5) to Eq. (8). Hence, the state |Φ1⟩ is approximated by all states |Φ^i_1⟩⊥, with i ∈ {0.0, 0.1, 1.0, 1.1}. Considering as context the values (0 or 1) of the first qubit, the partial states in Eqs. (9) and (10) are obtained. Now, the states |Φ^0.x_1⟩⊥ and |Φ^1.x_1⟩⊥ are other possible partial states of |Φ1⟩. Although it is not the focus of this work, the qGM model provides interpretations for other quantum transformations, such as projections for measurement operations.

Related Work
Today, quantum simulators with different approaches are available. The most relevant ones are described in the next subsections, representing the best solutions achieved so far for the sequential and parallel simulation of quantum algorithms.

QuIDDPro
The QuIDDPro simulator proposed in [9] was developed in the C language and explores structures called QuIDDs (Quantum Information Decision Diagrams). QuIDDs are an efficient representation of multidimensional quantum transformations and states that are defined, in matrix notation, by blocks of repeated values. QuIDDs are capable of identifying such patterns and of creating simple graphs that represent the data, assuring low memory consumption and efficient data access.
A QuIDD is a representation based on decision graphs, and computations are performed directly over this structure.For states with the same amplitude, an extremely simple QuIDD is obtained.For states with many different amplitudes, no compression can be reached.However, such states are not usual in QC.
Quantum transformations and quantum states are represented as shown in Figure 4. The solid edge leaving each vertex assigns the logic value 1 to the corresponding bit that comprises the index of the desired state. The dashed edge assigns the logic value 0 to its respective vertex. When a terminal node is reached, an index points to an external list that stores the amplitude generated by the values of each edge in the traveled path. Results obtained by QuIDDPro are shown in [19]. Instances of Grover's algorithm up to 40 qubits were simulated, requiring 407 KB of memory, while other simulation packages were limited to systems of up to 25 qubits. However, due to the sequential simulation, 8.23 × 10⁴ seconds were required.

Massive Parallel Quantum Computer Simulator
The Massive Parallel Quantum Computer Simulator (MPQCS) [6] is parallel software for quantum simulation. Simulations can be performed on high-end parallel machines, clusters or networks of workstations. Algorithms can be described through a universal set of quantum transformations, e.g. {H, S, T, CNOT}. Although these transformations can be combined to describe any quantum algorithm, such a restricted set imposes limitations on the development process, since more complex operations must be specified only in terms of those transformations. However, this simplicity allows the application of more aggressive optimizations in the simulator, as the computation patterns are more predictable.
Since the MPQCS explores distributed simulation to improve performance, the MPI (Message Passing Interface) is used for communication. The downside of this approach is the overloading of the interconnection system due to the large amount of data transferred during the simulation. As the interconnection is used to send data related to the state vector of the quantum system to the corresponding processing nodes, and such state vector grows exponentially, a high-capacity interconnection is required.
The MPQCS simulator was executed on supercomputers such as the IBM BlueGene/L, Cray X1E, and IBM Regatta p690+. The main results point to the simulation of algorithms up to 36 qubits, requiring approximately 1 TB of RAM and 4096 processors.
In [20], the simulation of Shor's algorithm was performed on the JUGENE supercomputer, factoring the number 15707 into 113 × 139. Such task required 262,144 processors. The execution time and memory consumption were not published.

General-Purpose Parallel Simulator for Quantum Computing
The General-Purpose Parallel Simulator for Quantum Computing (GPPSQC) [7] is a parallel quantum simulator for shared-memory environments.The parallelization technique relies on the partition of the matrix of the quantum transformation into smaller sub-matrices.Those sub-matrices are then multiplied, in parallel, by sub-vectors corresponding to partitions of the state vector of the quantum system.
The simulator also considers an error model that allows the insertion of minor deviations into the definition of the quantum transformations, in order to simulate the effects of decoherence in the algorithms.
By using the parallel computer Sun Enterprise (E4500), with 8 UltraSPARC-II (400 MHz) processors, 1 MB of cache, 10 GB of RAM and the Solaris 2.8 (64-bit) operating system, systems up to 29 qubits were supported. The results containing the simulation time (expressed in seconds) required for Hadamard gates from 20 to 29 qubits are shown in Figure 5. A speedup of 5.12× was obtained for a 29-qubit Hadamard transformation using 8 processors.

Quantum Computer Simulation Using the CUDA Programming Language
The quantum simulator described in [8] uses the CUDA framework to explore the parallel nature of quantum algorithms.In this approach, the computations related to the evolution of the quantum system are performed by thousands of threads inside a GPU.
This approach considers a defined set of one-qubit and two-qubit transformations, being a more general solution in terms of supported transformations compared to the proposals of [6] and [7]. The use of a more expressive set of quantum transformations expands the possibilities for describing the computations of a quantum algorithm.
Among the limitations of quantum simulation with GPUs, memory capacity is the most restrictive one, limiting the simulation presented in [8] to systems with a maximum of 26 qubits. As an important motivation for this approach, the simulation time can achieve speedups of 95× over a highly optimized CPU simulation.

Quantum Processes in the VPE-qGM Library
The execution library of the EPs, called qGM-Analyzer, contains optimizations to control the exponential growth of the memory space required by multi-dimensional quantum transformations [21]. Previous results showed a reduction in the memory space requested during the computations, simulating algorithms up to 11 qubits. However, that approach still demanded a high computational time due to the exponential growth in the number of EPs executed.
The execution of EPs considers a strategy that dynamically generates the values corresponding to the components of the matrix associated with the quantum transformation being executed. Starting from these optimizations, this work extends the qGM-Analyzer in order to establish support for the simulation of quantum transformations through QPs and QPPs.
The main extensions consider the representation of controlled and non-controlled transformations, including all related possible synchronizations.The specifications of these and other new features are described in the following subsections.

Non-Controlled Quantum Gates
The QP is able to model any multi-dimensional quantum transformation. Figure 6 shows a QP associated with a quantum system comprised of 3 qubits (q = 3), including its representation using EPs and the structure of such component in the qGM-Analyzer. ML stores the matrices associated with two-dimensional quantum transformations. Each line in ML is generated by the functions indicated in the second column of the QP Table. These functions (U0, U1 and U2) describe the corresponding quantum transformation of the application modeled in the VPE-qGM.
The tuples of each line are obtained by changing the values of the parameters x1 and x2.The first value of the tuple corresponds to the value obtained by the scalar product between the corresponding functions.The second indicates the column in which the value will be stored.
The matrix order (n) in ML is defined by the number of functions (Uk) grouped together. In Figure 6, the first matrix in ML, indicated by M1, has n = 2. Similarly, M2 has n = 1.
Figure 6: QP and its representation by applying EPs.
Note that the order of each matrix in ML can be arbitrarily determined, as long as it remains consistent with the conditions for multi-dimensional quantum transformations. However, it is important to remember that when n is large enough, e.g. n > 10, the memory consumption becomes a limitation. Hence, the balance between the order and the number of matrices in ML (|ML|) directly affects the performance of the simulation.
Each line in all matrices in ML has a binary index, a string with n bits. For instance, in Figure 6, the index 000 selects the first line of each matrix (m00 from the first matrix and m0 from the second), allowing the computation of the amplitude associated to the first state of the computational basis of the system (|000⟩).
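This line-selection scheme can be sketched as follows, using the matrix orders [2, 1] of Figure 6 (split_index is a hypothetical helper written for illustration, not part of qGM-Analyzer):

```python
# Split a q-bit binary row index among the matrices in ML according to
# their orders: the first matrix consumes 2 bits, the second consumes 1.
def split_index(index_bits, orders):
    parts, pos = [], 0
    for n in orders:
        parts.append(index_bits[pos:pos + n])
        pos += n
    return parts

print(split_index("000", [2, 1]))   # ['00', '0']: line m_00 and line m_0
print(split_index("101", [2, 1]))   # ['10', '1']
```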
Besides ML, it is necessary to create a list (see (11)) containing auxiliary values for indexing the amplitudes of the state space, which must be multiplied by each value of the matrices in ML. In that list, q indicates the total number of qubits in the quantum application.
The computation that results in the total evolution of the state vector of a quantum system is defined by the recursive expression in Eq. (12), considering the following notation:
• |ML| is the number of matrices in ML;
• P is a base position (starting at P = 0) for indexing the amplitudes of the states in the quantum system;
• m indicates a matrix in ML (starting at m = 1);
• lm is a line index l of a matrix m;
• nm is the order n of a matrix m;
• SL stores the SizesList;
• T′ is the tuple indexed by ML(m, lm, clm);
• k is the amplitude of one state in the computational basis.
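The recursion of Eq. (12) can be sketched in simplified form as below (our own notation, not the exact qGM-Analyzer data structures): each new amplitude is computed by recursing over the matrices in ML, so the full tensor-product matrix is never materialized.

```python
import numpy as np

def sizes_list(matrices):
    # sizes[m] = product of the orders of the matrices after m (the "SizesList")
    sizes, s = [], 1
    for M in reversed(matrices):
        sizes.append(s)
        s *= M.shape[0]
    return sizes[::-1]

def new_amplitude(matrices, sizes, row, state, m=0, base=0, scale=1.0 + 0j):
    # Accumulates sum_col (M1 (x) M2 (x) ...)[row, col] * state[col] recursively.
    if m == len(matrices):
        return scale * state[base]
    M, stride = matrices[m], sizes[m]
    r = (row // stride) % M.shape[0]   # line of this factor selected by `row`
    return sum(new_amplitude(matrices, sizes, row, state,
                             m + 1, base + c * stride, scale * M[r, c])
               for c in range(M.shape[0]))

# Validation against the explicit Kronecker product for H (x) H applied to |00>.
H = (1 / np.sqrt(2)) * np.array([[1, 1], [1, -1]], dtype=np.complex128)
ML = [H, H]
state = np.array([1, 0, 0, 0], dtype=np.complex128)
out = [new_amplitude(ML, sizes_list(ML), row, state) for row in range(4)]
assert np.allclose(out, np.kron(H, H) @ state)
```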
According to the specifications of the qGM model, a QP can be represented as a synchronization of QPPs. In this conception, it is possible to divide the QP described in Figure 6 into two QPPs, as presented in Figure 7. QPP0 is responsible for the computation of all new amplitudes of the partial states in the subset {0, 1, 2, 3} of the computational basis and is represented by the graphical component on the left side of Figure 7. Similarly, on the right side, the graphical component associated to QPP1 is shown, computing the amplitudes in the subset {4, 5, 6, 7} of the computational basis independently from the execution of QPP0.
Figure 7: Two possible QPPs generated from the QP described in Figure 6.
The QPPs contribute the possibility of establishing partial interpretations of a quantum transformation, allowing the execution of the simulation even in the presence of uncertainties regarding some parameters/processes. Therefore, in the local context of the computation of each QPP, it is possible to generate only a restricted subset of the elements associated to a quantum gate. Complementary QPPs (which interpret distinct line sets) can be synchronized and executed independently (in different processing nodes). The bigger the number of synchronized QPPs, the smaller the computation executed by each one, resulting in a lower execution cost per QPP.
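A minimal sketch of this row partitioning for a 3-qubit Hadamard gate (the explicit matrix is used here only for validation; the simulator generates the rows dynamically):

```python
import numpy as np

# Two complementary QPPs split the 8 output rows; each half of the new
# state vector can be computed independently, e.g. on different nodes.
H = (1 / np.sqrt(2)) * np.array([[1, 1], [1, -1]], dtype=np.complex128)
U = np.kron(np.kron(H, H), H)                  # H applied to all 3 qubits
state = np.zeros(8, dtype=np.complex128)
state[0] = 1                                   # |000>

out = np.empty(8, dtype=np.complex128)
for rows in (range(0, 4), range(4, 8)):        # QPP_0 and QPP_1
    for r in rows:
        out[r] = U[r] @ state
assert np.allclose(out, U @ state)
```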

Definition of Controlled Quantum Gates
For non-controlled quantum gates, it is possible to model the whole evolution of the global state of a quantum system with only one QP. However, this is not possible for controlled quantum gates. The main difference can be seen in Figure 8, in which the following conditions are described:
• In the generation of the transformation H⊗2, the expression (H, H) is maintained for all vectors, changing only the corresponding parameters;
• In the description of the CNOT transformation, different expressions are required.
This difference occurs due to the interpretation of the CNOT transformation, which can be extended to multi-dimensional transformations.
The complete description of the CNOT transformation is obtained by the expressions in Eq. (13), which define a set of QPPs called QPP Set. The structures of the QPPs for the CNOT transformation are illustrated in Figure 9. QPP1, shown in Figure 9(a) and associated to Exp1, describes the evolution of the states in which the state of the control qubit is |1⟩ (requiring the application of the Pauli X transformation to the target qubit). The evolution of the states in which the control qubit is |0⟩ is modelled by Exp2 and generates QPP2, illustrated in Figure 9(b). As these states are not modified, the execution of QPP2 is not mandatory. In general, |QPP Set| = |Exp| = 2^nC, where nC is the total number of control qubits in all gates applied. However, only the QPPs in a subset (QPP Subset) of the QPP Set need to be created/executed. If only one controlled gate is applied, then |QPP Subset| = 1. When the synchronization of controlled gates is considered, |QPP Subset| = 2^nC − 1. As an example, consider the synchronization of two CNOT transformations shown in Figure 10(a). In the VPE-qGM environment, this configuration is modelled using the expressions in (14). Hence, |QPP Set| = 4. However, QPP4, associated to the expression Exp4, does not change any amplitude in the system and should not be created/executed.
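The counting argument above can be sketched as follows (an illustration of the sizes of QPP Set and QPP Subset, not the simulator's code):

```python
from itertools import product

# With nC control qubits, the expressions enumerate the 2**nC control
# patterns; when only controlled gates are synchronized, the all-zero
# pattern leaves every amplitude unchanged and needs no QPP.
nC = 2
qpp_set = list(product((0, 1), repeat=nC))
qpp_subset = [p for p in qpp_set if any(p)]    # drop the all-zero pattern
assert len(qpp_set) == 2 ** nC                 # |QPP Set| = 4
assert len(qpp_subset) == 2 ** nC - 1          # |QPP Subset| = 3
```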
When controlled gates are synchronized with non-controlled gates (different from Id), all the amplitudes are modified. Therefore, QPP Subset = QPP Set. The configuration illustrated in Figure 10(b) is modelled through the expressions defined in (15). Now, two QPPs, identified by QPP1 and QPP2 and respectively associated to the expressions Exp1 and Exp2, are considered. However, it is not possible to discard the execution of QPP2, as it modifies the amplitudes of some states. Those changes are due to the H transformation, which is always applied to the last qubit, regardless of the control state of the CNOT transformation.

Recursive Function
After the creation of all the necessary QPPs, a recursive operator is applied to the matrices in ML for computing the amplitudes of the new state vector of the quantum system. This operator dynamically generates all the values associated to the resulting matrix, originally obtained by the tensor product of the transformations defining the quantum application. The algorithmic description of this procedure, with some optimizations, is shown in Figure 11.
The execution cost of this algorithm grows exponentially as new qubits are added. Despite the high cost related to temporal complexity, it presents a low cost related to spatial complexity, since only temporary values are stored during the simulation.

Parallel Quantum Simulation on GPUs

This section describes the extension of the qGM-Analyzer library focused on the support for the parallel execution of QPPs on a GPU. As such efforts are in their initial steps, only non-controlled transformations are considered for now.
The computation required by each CUDA thread comprises the individual computation of 4 amplitudes of the new state vector of the quantum system.

Constant-Size Source Matrices and Auxiliary Data
Following the specifications of Section 4, a QPP is defined by building small matrices that are combined by an iterative function in order to dynamically generate the elements of the transformation matrix, i.e., the matrix that would be obtained by performing the Kronecker product of the initial matrices. For the execution of a QPP, the system makes use of the following parameters, all of them stored in NumPy 'array' objects:
• List of Matrices (matrices): matrices generated by the host-code;
• List of Positions (positions): the position of an element in the corresponding matrix, necessary to identify the amplitude of the state vector that will be accessed during the simulation;
• Width of Matrices (width): number of columns, considering the occurrence of zero-values and the original dimension of the matrices;
• Column Size of Matrices (columns): number of non-zero elements in each column;
• Multiplicatives (mult): auxiliary values for indexing the amplitudes of the state vector;
• Previous Values (previousElements): number of elements stored in the previous matrices.
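As a purely illustrative example, such parameter arrays for a QPP built from two Hadamard factors might look as follows; the field names follow the list above, but the exact shapes and contents are our assumption, and the actual qGM-Analyzer layout may differ:

```python
import numpy as np

H = (1 / np.sqrt(2)) * np.array([[1, 1], [1, -1]], dtype=np.complex64)

matrices = np.concatenate([H.ravel(), H.ravel()])         # flattened elements
positions = np.array([0, 1, 0, 1, 0, 1, 0, 1], np.int32)  # column of each element
width = np.array([2, 2], dtype=np.int32)                  # columns per matrix
columns = np.array([2, 2], dtype=np.int32)                # non-zeros per column
mult = np.array([2, 1], dtype=np.int32)                   # amplitude index strides
previousElements = np.array([0, 4], dtype=np.int32)       # elements before matrix m
```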

Allocation of Data into the GPU
When allocating structures into the GPU, data must be stored in the most suitable memory space (global, shared, constant or texture memory) to achieve good performance. As the QP parameters remain unchanged during an execution, they are allocated into the GPU's constant memory. Such data movement is performed by PyCuda as described in the following steps:
1. In the CUDA kernel, constant data is identified with the __device__ __constant__ qualifiers;
2. The device-memory address of the variable is obtained in the host-code by applying the corresponding PyCuda call (module.get_global);
3. Data copy is also performed in the host-code in order to transfer the data stored in the host-memory to the device-memory address corresponding to the variable. Such process is performed by the PyCuda call pycuda.driver.memcpy_htod(address, dataSource).
Notice that dataSource is a NumPy [22] object stored in the host memory.
This procedure is done for all variables cited in Subsection 5.1. Furthermore, the host-code contains two NumPy 'array' objects that store the current state vector (readMemory) and the resulting state vector (writeMemory) after the application of the quantum transformations. Additionally, the parameter writeMemory has all its positions zeroed before each step of the simulation. Next, the data related to readMemory and writeMemory are copied to the global memory space of the GPU. The following methods consolidate such process: 1. readMemory = numpy.array(numpy.zeros(2**q), dtype=numpy.complex64, order='C') creates a NumPy array with all its values equal to zero, residing on the host side. The desired current state is then manually configured.
2. readMemory_gpu = gpuarray.to_gpu(readMemory) represents the copy of the current (input) state vector from the host-memory to the device-memory through a PyCuda call.
3. writeMemory_gpu = gpuarray.zeros(2**q, dtype=numpy.complex64, order='C') is the new (output) state vector of the system, created only on the device side and initialized with all its values equal to zero by a PyCuda call.

CUDA Kernel
The CUDA kernel is an adaptation of the recursive algorithm presented in Figure 11 into an iterative procedure, as GPU kernels may not contain recursive calls. As this kernel is inspired by the Kronecker product, it operates over an arbitrary number of source matrices. Each CUDA thread has its own internal stack and iteration control to define the access limits inside each matrix.
The computation of each thread can be depicted in seven steps, described as follows.
Step 1: Initialization of variables in the constant memory, which are common to all CUDA threads launched by a kernel. TOTAL_ELEMENTS, LAST_MATRIX_ELEMENTS and STACK_SIZE are defined at run-time by the PyCuda interpreter. For text formatting purposes, this Subsection uses the symbol ⋄ as a shorthand for the declaration __device__ __constant__.

⋄ int lastPositionsC[LAST_MATRIX_ELEMENTS];
⋄ int widthC[STACK_SIZE + 1];

Step 2: Shared memory allocation and initialization are both performed by all CUDA threads within a block. SHARED_SIZE is defined at run-time by the PyCuda interpreter and, in general, assumes the value blockDim.x × 4.

__shared__ cuFloatComplex newAmplitudes[SHARED_SIZE];

Step 3: Definition of the access limits of a matrix, determining which elements each CUDA thread will access depending on its id and resident block. The begin, count and end arrays are local to each thread and help control the indexation of the elements of each matrix in matricesC and positionsC. The (thId & (widthC[c] - 1)) operation is analogous to the modulo operation thId % widthC[c], but is performed as a bitwise AND, which is more efficient on the GPU.
Step 4: Forwarding in the matrices is performed analogously to a recursive step, providing the partial multiplications among the current elements according to the indexes in count.
Step 5: Shared memory writing is related to a partial update of the amplitude in the state vector, as in the fragment c = 0; for (int j = 0; j < 4; j++) { res = make_cuFloatComplex(0, 0); ... }.
Step 6: Index changing in the previous matrices generates the next values associated to the resulting matrix. This process occurs until all the indexes reach the last element of the corresponding line in all matrices.
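The per-thread computation sketched in the steps above can be illustrated with a simplified host-side Python analogue (this is not the CUDA kernel itself, and the function and variable names are ours): each output amplitude of a tensor product of 2×2 gates is a sum over the input amplitudes, where the element of the Kronecker product is obtained by picking one element from each source matrix according to the bits of the row and column indexes.

```python
import numpy

def apply_tensor_product(matrices, state):
    """Apply (M_0 x M_1 x ... x M_{q-1}) to a vector of 2**q amplitudes,
    computing each output amplitude independently, as one CUDA thread
    would (no stack or shared memory in this simplified sketch)."""
    q = len(matrices)
    n = 2 ** q
    out = numpy.zeros(n, dtype=numpy.complex64)
    for i in range(n):            # one "thread" per output amplitude
        acc = 0j
        for j in range(n):        # column index of the Kronecker product
            elem = 1 + 0j
            for k in range(q):    # pick one element from each 2x2 matrix
                row = (i >> (q - 1 - k)) & 1
                col = (j >> (q - 1 - k)) & 1
                elem *= matrices[k][row][col]
            acc += elem * state[j]
        out[i] = acc
    return out

# Example: H x H applied to |00> yields the uniform superposition.
H = numpy.array([[1, 1], [1, -1]], dtype=numpy.complex64) / numpy.sqrt(2)
state = numpy.zeros(4, dtype=numpy.complex64)
state[0] = 1
result = apply_tensor_product([H, H], state)
```

The bit decomposition replaces the explicit construction of the full 2^q × 2^q matrix, which is the key to the memory savings discussed earlier.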

Results
The results of this work are divided into:
• Performance analysis considering the optimizations for the sequential simulation based on the concepts of QPs and QPPs. Benchmarks consisting of algorithms for reversible logic synthesis, based on controlled transformations, and case studies of Hadamard transformations are used;
• Performance analysis of the parallel simulation of Hadamard transformations using GPUs.
In this work, the focus on Hadamard transformations as a benchmark is justified by their high computational cost, which represents the worst case for the simulation in the VPE-qGM, since all transformations are always applied to the global state of the system instead of following a gate-by-gate approach. This guarantees that our solution is generic and can deal with classical, superposed and entangled states using the same algorithmic solution.

Sequential Simulation Results
For the validation and performance analysis of the sequential simulation of quantum algorithms using QPs and QPPs in the VPE-qGM, two case studies were considered. The first consists of benchmarks, fundamentally composed of controlled gates, selected among those available in [23]. This choice is justified by two main aspects:
• Availability of source code for the generation of the algorithms;
• Quantum algorithms with many qubits and dozens/hundreds of gates.
The validation methodology considers, for each case study, 10 simulations. The hardware on which the simulations were performed has the following main characteristics: Intel Core i5-2410M processor at 2.3 GHz, 4 GB RAM and Ubuntu 11.10 64 bits.
Execution time and memory usage were monitored. The main performance comparison was made against the previous version of the qGM-Analyzer, which supports the simulation of quantum algorithms using EPs, considering the optimizations described in [21]. The main features of each algorithm and the results obtained are presented in Tables 1 and 2, where quantum algorithms of up to 24 qubits were simulated.
The memory consumption is similar in both versions of the library, since optimizations for controlling memory usage had already been added to the environment. As the optimizations regarding QPs and QPPs only affect quantum gates, the high memory cost is due to the storage of the state vector of the quantum system. The simulation with QPs and QPPs shows a time reduction when compared to the execution with EPs.
As presented in Figure 12(a), the simulation of controlled transformations using QPPs showed a time reduction of approximately 99% when compared to the simulation with EPs. This performance improvement is due to the optimization focused on identifying the QPPs that change any amplitude in the state space. Hence, only a subset of the QPPs is executed, i.e., only part of the transformation is applied.
As an example, for each one of the 41 steps in the algorithm gf2^6, only the operations corresponding to two vectors in a matrix of order 2^18 are executed. Consequently, only a subset of the total number of amplitudes that describe the state of the quantum system is altered, requiring a smaller number of operations.
When the same algorithm is simulated using EPs, 2^18 EPs are executed in each step of the simulation. In this approach, all amplitudes of the state space are recomputed, even though most values do not change. Due to the exponential growth in the number of EPs, the simulation of algorithms with more than 18 qubits becomes unfeasible in this representation. The algorithms gf2^7, gf2^8 and mod1024adder were not included in Figure 12(a) because they are not supported by the older version of the qGM-Analyzer.
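The contrast between the two approaches can be illustrated in plain Python (an illustration of the general idea, not the environment's implementation): a controlled gate such as CNOT only permutes the amplitudes whose control bit is 1, so an update restricted to those positions touches a fraction of the state vector, while the dense approach rebuilds the full matrix and recomputes every amplitude. The 3-qubit register below, with the control on the most significant bit and the target on the least significant one, is a hypothetical example.

```python
import numpy

q = 3
state = numpy.arange(2 ** q, dtype=numpy.complex64)  # arbitrary test amplitudes

# Sparse update, in the spirit of QPPs: only amplitudes whose control
# bit (most significant) is set are moved; the rest are left untouched.
sparse = state.copy()
touched = 0
for i in range(2 ** q):
    if (i >> (q - 1)) & 1:               # control bit set
        j = i ^ 1                        # flip the target (least significant) bit
        if i < j:                        # swap each pair only once
            sparse[i], sparse[j] = sparse[j], sparse[i]
            touched += 2

# Dense update: build the full 2^q x 2^q CNOT matrix and recompute
# every amplitude, analogous to the EP-based simulation.
cnot = numpy.zeros((2 ** q, 2 ** q), dtype=numpy.complex64)
for i in range(2 ** q):
    j = i ^ 1 if (i >> (q - 1)) & 1 else i
    cnot[j][i] = 1
dense = cnot @ state
```

Both paths produce the same final state, but the sparse path moves only half of the amplitudes here, and in the controlled benchmarks above the affected fraction is far smaller.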
The percentage of time reduction for the simulation of Hadamard gates using QPs is shown in Figure 12(b). In these cases, the improvement was not as significant as in the benchmarks with controlled gates, since all the amplitudes of the state space are modified. Hence, both approaches result in the generation and computation of the elements associated to 2^q vectors. A reduction of 29% in simulation time may be attributed to the generation of all elements in the same QP. In the EP approach, 2^q different components are executed, resulting in a larger number of operations.

Parallel Simulation Results
The analysis of the parallel simulation using GPUs is based on the simulation of Hadamard transformations, always considering the global state of the quantum system. In order to compare the performance of this new proposal, the results of the parallel simulation obtained by the VirD-GM, detailed in [24], are used as reference, since they represent the best performance for Hadamard transformations achieved in this project so far. The parallel simulation was performed on a desktop with the following configuration: Intel Core i7-3770 processor at 3.4 GHz with hyper-threading, 8 GB RAM, and Ubuntu 12.04 64 bits. The execution algorithm has the same complexity as the one depicted in Figure 11, but is implemented in Java. The methodology for the parallel simulation considers the execution of 15 simulations of each instance of the Hadamard transformations over 1, 2, 4, and 8 cores.
For the simulation with GPUs, the tests were performed on the same desktop with an NVIDIA GT640 GPU. The software components are: PyCuda 2012.1 (stable version), NVIDIA CUDA 5 and NVIDIA Visual Profiler 5.0. Hadamard transformations of up to 20 qubits were simulated. The data to be analyzed, including the simulation time, were obtained with the NVIDIA Visual Profiler after 30 executions of each application.
Table 3 contains the simulation time, in seconds, for Hadamard transformations ranging from 14 to 20 qubits, considering the parallel simulation by the VirD-GM and the parallel simulation using GPUs proposed in this work. The results show a significant performance improvement when using a GPU. The standard deviation of the simulation times collected for the VirD-GM reached a maximum value of 3.9%, for H^⊗14 executed with 8 cores. A complete 8-core simulation of the configurations with 19 and 20 qubits would require approximately 1 and 4 hours, respectively. Due to such elevated simulation times, those case studies were not simulated in the VirD-GM.
The speedups presented in Figure 13 reached a maximum of ≈ 240× when compared with the single-core execution in the VirD-GM. The GPU-based simulation outperformed the best parallel simulation in the VirD-GM by a factor of ≈ 50× for the 18-qubit Hadamard running on 8 processing cores. This improvement is explained by the number of CUDA cores available (384), as well as by the hierarchical memory architecture that allows a high data throughput. As the simulation performed by the VirD-GM does not consider further optimizations regarding memory access and uses desktop computers, the performance obtained by the GPU execution was much better. While applications with more than 18 qubits require at least 8 processing cores in the VirD-GM, such Hadamard transformations can be simulated with one mid-end device, such as an NVIDIA GT640.
Regarding the GPU execution, the NVIDIA Visual Profiler identified a local memory overhead associated with the memory traffic between the L1 and L2 caches, caused by the global memory accesses corresponding to the reads of the readMemory variable, as described in Step 5 of Subsection 5.3.

Conclusion and Future Work
The VPE-qGM environment introduces a novel approach for the simulation of quantum algorithms on classical computers, providing graphical interfaces for the modeling and simulation of the algorithms.
The sequential simulation of quantum transformations (controlled or not) presented in this work reduces the number of operations required to perform state evolutions in a quantum system. The performance improvement, discussed in Section 6, allows the simulation of quantum algorithms of up to 24 qubits.
In order to establish the foundations for a new solution to deal with the temporal complexity of such simulation, this work also describes an extension of the qGM-Analyzer library that allows the use of GPUs to accelerate the computations involved in the evolution of the state of a quantum system.The contribution of this proposal to the environment already established around the VPE-qGM is the first step towards a simulation of quantum algorithms in clusters of GPUs.
Although the support for the simulation of quantum transformations using GPUs described in this work is in its initial stages, it has already significantly improved the simulation capabilities of the VPE-qGM. Elevated speedups, from ≈ 50× relative to an 8-core parallel simulation to ≈ 240× over the single-core simulation, were obtained when comparing this novel solution to the simulation provided by the VirD-GM environment.
Simulations of Hadamard transformations up to 20 qubits were performed. Without the contributions of this work, such simulations could not be executed in the VPE-qGM or in the VirD-GM due to the elevated execution time. The Hadamard transformation was chosen as a case study due to its high computational cost, as it represents the worst case for the simulation in this environment.
When comparing these results with related works, it is expected that other simulators will outperform our current solution for two main reasons:
• Current solutions consider a universal set of quantum transformations, i.e., a restricted set of quantum operations that simplifies optimizations for the simulation but imposes restrictions during the development of a quantum algorithm. Our solution follows a generic approach in order to support any unitary quantum transformation;
• The data in the GPU's global memory is frequently accessed. By moving sections of such data to the GPU's shared memory and partitioning the computation of the CUDA threads into sub-steps to control memory access, performance can be improved.
An important consideration is that parallelization techniques are not the only way to improve performance. Future work also considers several algorithmic optimizations to reduce the amount of computation, among which the following can be highlighted:
• As the matrices that define quantum transformations are orthogonal, only the upper (or lower) triangular part may be considered in the computations. The most significant expected impact is a decrease in memory accesses and required storage; reductions in the number of computations are also possible;
• A more sophisticated kernel that predicts multiplications by zero-value amplitudes of the state vector may avoid unnecessary operations. A major reduction in the number of computations is expected.
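The second optimization can be sketched in a few lines of Python (a host-side illustration of the idea, not the proposed kernel; the function name is ours): before multiplying, each input amplitude is tested against zero, and all products involving it are skipped.

```python
import numpy

def apply_skipping_zeros(matrix, state, tol=0.0):
    """Matrix-vector product that skips every column whose input
    amplitude is zero, avoiding all multiplications against it."""
    n = len(state)
    out = numpy.zeros(n, dtype=numpy.complex64)
    skipped = 0
    for j in range(n):
        if abs(state[j]) <= tol:   # zero-value amplitude: no contribution
            skipped += 1
            continue
        for i in range(n):
            out[i] += matrix[i][j] * state[j]
    return out, skipped

# Example: H x H applied to the classical state |00>. Only one of the
# four input amplitudes is nonzero, so three columns are skipped.
H = numpy.array([[1, 1], [1, -1]], dtype=numpy.complex64) / numpy.sqrt(2)
HH = numpy.kron(H, H)
state = numpy.zeros(4, dtype=numpy.complex64)
state[0] = 1
result, skipped = apply_skipping_zeros(HH, state)
```

For classical or sparsely superposed states the saving grows with the number of zero amplitudes, which motivates the expected major reduction in computations.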

Figure 1: Representations of the CNOT gate.

• the normalization value 1/√2 is multiplied by the sum of the amplitudes of the two states and the result is stored in the state |0⟩;
• the normalization value is multiplied by the subtraction of the amplitudes of the two states and the result is stored in the state |1⟩.

Figure 3: Representation of the Hadamard transformation using EPs.

Figure 5: Simulation time in seconds for the Hadamard simulation [7].

Figure 8: Parameters of the H and CNOT gates.

Figure 9: QPPs for the modeling of the CNOT gate.

Figure 12: Improvement percentage for the case studies for sequential simulation. (a) Improvement percentage for the simulation with QPPs. (b) Improvement percentage for the simulation of Hadamard gates with QPs.

Figure 13: Speedups for the GPU simulation relative to the number of cores considered in the parallel simulation by the VirD-GM.
NS: Not Supported. Simulation time over 4 hours.

Table 2: Quantum algorithms simulated using QPs.

Table 3: Summary of the simulations using the VirD-GM and the GPU.