A Parallel 2T-LE Algorithm Refinement with MPI

This paper describes an implementation of a parallel refinement algorithm based on the sequential 2T-LE algorithm (Longest-Edge bisection into two triangles). The proposed algorithm refines the triangulation of a given geometric mesh in parallel. The parallel implementation was run on a Linux cluster whose nodes communicate through the Message Passing Interface (MPI). The results show that as the size of the problem increases, the parallel algorithm performs increasingly better than the sequential algorithm.


Introduction
In the analysis of complex scientific and engineering problems, the use of geometric mesh models is common. Real applications operate on geometric meshes with a large volume of data (memory limitations) and with high computing power (HCP) requirements, which makes them natural candidates for parallel processing [1]. Existing parallel methods for mesh generation (triangulation) break up the original problem into N subproblems that are processed simultaneously by a set of processors P [2,3,4]. Some authors present decoupled solutions, both in mesh generation and in refinement [5,6]. Decoupling the mesh subproblems removes the need to synchronize communication between the processors while the triangulation is performed.
In the literature there are various parallel algorithms for solving this problem. Some are aimed at ensuring the quality of the Delaunay refinement [7], or at solutions in unstructured networks [8]. However, the reported parallel algorithms solve the problem by decoupling the subproblems, thereby avoiding the synchronization of the boundaries between two processors. Other algorithms deliver a nonconforming solution, which is valid for some situations but not for all cases.
A parallel algorithm can be used to refine all the triangles of a mesh or to refine a set of triangles marked on a geometric mesh. In both cases the quality of the geometric mesh must be maintained. Similarly, meshes refined in parallel must retain the same element quality and properties as meshes generated and refined sequentially [6].
This paper studies the efficiency of a parallel algorithm that refines a set of triangles marked on a geometric mesh. The refinement propagates to neighboring triangles that can be on other processors, and this must be handled efficiently to retain the validity (conformity) of the mesh. The communication between processors is done with MPI (Message Passing Interface).
The following section details the problem of refining geometric meshes, and section 3 presents the parallel algorithm for solving the problem. Section 4 shows the testing instances, whose results are analyzed in section 5. Finally, section 6 shows the conclusions.

Refinement of a Triangulation
In general, a refinement of a geometric mesh tries to reconstruct good quality triangulations, where geometric criteria are used to define the quality of the triangles, which is related to their equiangularity (how equilateral the triangles are). The following quality criterion is considered in this paper.
• Definition 1: A triangulation ℑ_n has a certain quality q if its minimum angle is bounded below by q. If α is the minimum angle, then α ≥ q, where q ≤ 18°.
The most widely used geometric measurement is that of the smallest angle α(ℑ_n) in the triangulation.
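The quality measure of Definition 1 can be sketched in a few lines of Python. This is only an illustration: the function names (`triangle_angles`, `min_angle`, `has_quality`) and the mesh layout (a list of vertex coordinates plus a list of index triples) are assumptions made here, not part of the implementation described in this paper.

```python
import math

def triangle_angles(a, b, c):
    """Interior angles, in degrees, at the vertices a, b and c."""
    def angle_at(p, q, r):
        v1 = (q[0] - p[0], q[1] - p[1])
        v2 = (r[0] - p[0], r[1] - p[1])
        cos_a = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
        # Clamp to [-1, 1] to guard against rounding before acos.
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
    return angle_at(a, b, c), angle_at(b, c, a), angle_at(c, a, b)

def min_angle(verts, tris):
    """Smallest interior angle over all triangles of the mesh."""
    return min(min(triangle_angles(*(verts[i] for i in tri))) for tri in tris)

def has_quality(verts, tris, q_degrees):
    """True if the minimum angle of the triangulation is at least q."""
    return min_angle(verts, tris) >= q_degrees

# Unit square split along one diagonal: two right isosceles triangles.
verts = [(0, 0), (1, 0), (1, 1), (0, 1)]
tris = [(0, 1, 2), (0, 2, 3)]
print(min_angle(verts, tris))        # close to 45 degrees
print(has_quality(verts, tris, 18))
```

In this representation a q-stable refinement step is one after which `has_quality` still holds for the same q.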
From the definition of quality of a triangulation we get:
• Definition 2: An algorithm is said to be q-stable if it retains the mesh's initial quality.
The conformity of a mesh (Figure 1) is defined as:
• Definition 3: A mesh is conformal or valid if the intersection of two neighboring triangles is a common side or a common vertex.
Various refining techniques have been proposed to refine triangle meshes in 2D and tetrahedral meshes in 3D. One of these methods is the 2T-LE algorithm (longest-edge bisection into two triangles, Definition 4), proposed by Rivara [9,10]. This procedure divides each triangle τ_i of a set of triangles ℑ_n by adding an edge from the midpoint v_s of the longest edge to the opposite vertex, creating two new triangles with the same area (Fig. 2). The method propagates the refinement to the neighboring elements of the mesh in such a way that a larger element τ_j ∉ ℑ_n is also refined to maintain the mesh's conformity.
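The conformity condition of Definition 3 can be tested by looking for hanging vertices, that is, mesh vertices lying strictly inside an edge of a triangle that does not use them. The sketch below is a simplified check under assumptions made here: the name `is_conformal` is hypothetical, every entry of `verts` is assumed to be used by some triangle, and overlapping triangles are not detected.

```python
def is_conformal(verts, tris, eps=1e-12):
    """Return False if some vertex lies strictly inside an edge of a
    triangle that does not have it as a corner (a hanging vertex)."""
    for tri in tris:
        for i in range(3):
            a, b = verts[tri[i]], verts[tri[(i + 1) % 3]]
            ex, ey = b[0] - a[0], b[1] - a[1]
            for k, p in enumerate(verts):
                if k in tri:
                    continue
                px, py = p[0] - a[0], p[1] - a[1]
                cross = ex * py - ey * px      # zero when a, b, p are collinear
                dot = ex * px + ey * py        # position of p along the edge
                if abs(cross) <= eps and 0 < dot < ex * ex + ey * ey:
                    return False
    return True

# Square split along a diagonal: conformal.
verts = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(is_conformal(verts, [(0, 1, 2), (0, 2, 3)]))

# Only the lower triangle is bisected at the diagonal midpoint (vertex 4):
# the upper triangle now has a hanging vertex on its longest edge.
verts_h = verts + [(0.5, 0.5)]
print(is_conformal(verts_h, [(0, 1, 4), (1, 2, 4), (0, 2, 3)]))
```

The second mesh is the situation that the propagation of the refinement has to repair.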
• Definition 4: 2T-LE divides a triangle τ_i* of a mesh of triangles ℑ_n by drawing a line from the midpoint v_s of the longest edge to the opposite vertex, thus forming two new triangles, τ_i^1 and τ_i^2, with equal area. This method of bisection propagates to the neighboring triangle in such a way that the conformity of the mesh is maintained.
The result is a conformal mesh, and the transition between large and small elements is smooth. Rivara [9,10] proves Theorem 1.
• Theorem 1: Given an input triangulation ℑ_0 of acceptable quality, which is the discretization of a PSLG (Planar Straight Line Graph) geometry in 2D, the LE algorithms ensure that the triangulations obtained by means of iterative and arbitrary refinement are of the same quality as the input triangulation, in the sense that the smallest angle of each triangulation is greater than or equal to one half of the smallest angle of ℑ_0.
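As a small illustration of Definition 4 and of the angle bound in Theorem 1, the following sketch bisects a single triangle by its longest edge and then repeats the bisection on every child for a few levels. It works on an isolated triangle given by coordinates, so no propagation to neighbors is involved; the function names are hypothetical and the example triangle is chosen here only for convenience.

```python
import math

def area(tri):
    (ax, ay), (bx, by), (cx, cy) = tri
    return abs((bx - ax) * (cy - ay) - (cx - ax) * (by - ay)) / 2

def smallest_angle(tri):
    """Smallest interior angle of the triangle, in degrees."""
    def angle_at(p, q, r):
        v1 = (q[0] - p[0], q[1] - p[1])
        v2 = (r[0] - p[0], r[1] - p[1])
        cos_a = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
        return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
    a, b, c = tri
    return min(angle_at(a, b, c), angle_at(b, c, a), angle_at(c, a, b))

def bisect_longest_edge(tri):
    """Join the midpoint of the longest edge with the opposite vertex."""
    a, b, c = tri
    # Each candidate is (edge start, edge end, opposite vertex).
    p, q, opp = max([(a, b, c), (b, c, a), (c, a, b)],
                    key=lambda e: math.dist(e[0], e[1]))
    m = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    return (p, m, opp), (m, q, opp)

def refine(tri, levels):
    """Bisect every triangle by its longest edge, `levels` times."""
    tris = [tri]
    for _ in range(levels):
        tris = [child for t in tris for child in bisect_longest_edge(t)]
    return tris

t0 = ((0, 0), (4, 0), (0, 3))            # right triangle with area 6
t1, t2 = bisect_longest_edge(t0)
print(area(t1), area(t2))                # two children of equal area

alpha0 = smallest_angle(t0)
for level in range(1, 6):
    alpha = min(smallest_angle(t) for t in refine(t0, level))
    print(level, alpha >= alpha0 / 2)    # bound stated in Theorem 1
```

For this triangle the smallest angle never drops below that of the input, which is consistent with the bound of one half of the initial smallest angle.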
Based on an empirical study characterizing the performance of the longest-edge bisection method [11], the following lemma is formulated: 2,… Then the area of the initial triangulation ℑ_n covered with almost equilateral triangles increases as n increases. That is, the percentage of almost equilateral triangles also increases.
The algorithms based on LE bisection ensure that:
• The triangulation does not degenerate.
• The refinement terminates, because the edges to be bisected decrease in length over a finite number of steps.
• The final triangulation is conformal.
• As the triangulation is refined overall, a large number of almost equilateral triangles is obtained.
In the implementation of this method, the propagation takes place until non-conformal triangles are no longer produced or until a boundary is reached according to the 2T-LE algorithm described below.

2T-LE Algorithm
Input: ℑ_n*, triangulated mesh with marked triangles for refining.
while v_s is a non-conformal vertex do
    look for a non-conformal triangle τ_j adjacent to edge(v_p, v_q)
    Bisect LE(τ_j*)
end while
Output: ℑ_n, conformal triangulated mesh.
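The loop above can be sketched in Python as follows. This is a simplified variant written for illustration, not the implementation used in this work: every non-conformal triangle is bisected again by its own longest edge until no hanging midpoint remains, the mesh is a plain vertex list plus a list of index triples, and all names (`refine_2tle`, `bisect_le`, `find_non_conformal`, the `split` dictionary) are assumptions. The search for a non-conformal triangle rescans the whole mesh, which is acceptable only for a small example.

```python
import math

def longest_edge(verts, tri):
    """Return (p, q, opposite) where (p, q) is the longest edge of tri."""
    a, b, c = tri
    candidates = [(a, b, c), (b, c, a), (c, a, b)]
    return max(candidates, key=lambda e: math.dist(verts[e[0]], verts[e[1]]))

def bisect_le(verts, tris, split, tri):
    """Bisect tri by its longest edge, reusing the midpoint if it exists."""
    p, q, opp = longest_edge(verts, tri)
    key = frozenset((p, q))
    if key not in split:
        (px, py), (qx, qy) = verts[p], verts[q]
        verts.append(((px + qx) / 2, (py + qy) / 2))
        split[key] = len(verts) - 1
    m = split[key]
    tris.remove(tri)
    tris.extend([(p, m, opp), (m, q, opp)])

def find_non_conformal(tris, split):
    """A triangle is non-conformal if one of its edges already has a midpoint."""
    for a, b, c in tris:
        if any(frozenset(e) in split for e in ((a, b), (b, c), (c, a))):
            return (a, b, c)
    return None

def refine_2tle(verts, tris, marked):
    verts, tris, split = list(verts), list(tris), {}
    for tri in marked:                      # bisect the marked triangles
        bisect_le(verts, tris, split, tri)
    while (tri := find_non_conformal(tris, split)) is not None:
        bisect_le(verts, tris, split, tri)  # propagate until conformal
    return verts, tris

# Marked triangle (0, 1, 2) shares its longest edge with a larger neighbor
# whose own longest edge lies on the boundary.
verts = [(0, 0), (2, 0), (1, 1), (2, -3)]
tris = [(0, 1, 2), (0, 1, 3)]
new_verts, new_tris = refine_2tle(verts, tris, marked=[(0, 1, 2)])
print(len(new_verts), len(new_tris))
```

In this example the neighbor is first bisected by its own longest (boundary) edge and only then by the shared edge, which is the propagation to a larger element described in the text.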

Parallel Model of the 2T-LE Algorithm with MPI
The proposed parallel algorithm consists in breaking up domain Ω_i into µ_i subproblems and assigning each subproblem to a slave processor ρ_i. In each processor ρ_i the triangles τ_i marked in its domain are refined, and the conflicts at the interfaces are resolved with the other processors. Once the local refinements have been made, the triangulations ℑ_n from the processors ρ_i (Figure 3) are joined in the master.
The master processor uses, besides the 2T-LE refinement algorithm, other algorithms required for the process, such as:
• Reading input files and loading the data structures.
• Sending and coordination of data with the slave processors: it carries out the exchange of messages for sending and receiving data between and from the slaves.
• Generation of output data: it creates the output file with the same structure as the input files.
The slave processors use the following algorithms:
• Receiving the data structures.
• Refinement: it corresponds to the 2T-LE algorithm.
• Conformity: it bisects a non-conformal triangle in an interface, derived from the refinement, leaving it conformal. For this it joins the vertex on the non-conformal side (midpoint of the side) with the opposite vertex.
• Sending and coordination of messages to exchange data with the master.
Figure 3 presents a scheme of the proposed model: a diagram of the work done by the Master process and by the Slave processes, with arrows showing the communications and the synchronization between them.

Message Passing Interface (MPI)
A problem that must be solved is the conflict that arises at the interfaces shared by two slave processors. When a slave processor refines a triangle, the refinement can reach an interface (a boundary with another processor) while a neighboring processor simultaneously reaches the same interface by refining a triangle in its own domain that shares an edge at that interface.
In the parallel refinement model in each partition, there are two types of collisions to be solved: (i) adaptation collisions and (ii) end detection.
(i) Adaptation collisions correspond to the creation of new vertices when bisecting an edge that belongs to an interface, so it is necessary to send a message to the neighboring processor that shares the same affected edge.
The conflict is resolved in two steps:
1. Refinement step: at the start, the master processor sends each slave processor the partition that it has to process; during the processing, the slave sends the collision messages to the master processor and continues refining the other triangles marked in its domain.
2. Conformation step: the master processor collects the collision messages and stores them, to be sent to the destination processor once it finishes its local refinement.
(ii) End detection corresponds to the instance in which a slave processor has finished its local refinement and sends a message to the master processor reporting that it has finished refining the marked triangles in its partition. The slave waits for an end confirmation from the master processor. If there are conflicts in some of the slave's interfaces, the master sends it the stored collision messages so that it can conform the shared triangles. The slave then sends an end message to the master again; the master collects the end messages of all the slave processors, ending the computation.
The refinement and conformation process is considered ended when all the slave processors have finished, have sent their messages, and received the corresponding end confirmations.
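The message flow described above can be illustrated with a small sequential simulation. No MPI calls are made here: in an actual implementation each logged message would correspond to an MPI send and a matching receive between a slave and the master. The function name, the message tags and the data layout are assumptions introduced for this sketch. It is also assumed that a slave which has already bisected an interface edge itself does not need to conform it again, and that all collision messages arrive before the end messages are processed.

```python
from collections import defaultdict

def simulate_protocol(local_splits, owners):
    """local_splits: {slave: set of interface edges bisected locally}
    owners: {edge: (slave_a, slave_b)}, the two slaves sharing the edge."""
    log = []
    pending = defaultdict(set)   # master-side store: destination -> edges

    # Refinement step: every slave reports its interface collisions.
    for slave in sorted(local_splits):
        for edge in sorted(local_splits[slave]):
            a, b = owners[edge]
            dest = b if slave == a else a
            pending[dest].add(edge)
            log.append(("COLLISION", slave, dest, edge))

    # End detection and conformation step.
    conformed = {}
    for slave in sorted(local_splits):
        log.append(("END", slave))               # local refinement finished
        incoming = pending.pop(slave, set())     # collisions stored by the master
        todo = sorted(incoming - local_splits[slave])
        for edge in todo:
            log.append(("CONFORM", slave, edge)) # bisect the interface triangle
        conformed[slave] = todo
        log.append(("END_CONFIRMED", slave))     # second end message acknowledged
    return conformed, log

# Three slaves in a row; edge "e12" is shared by slaves 1 and 2,
# edge "e23" by slaves 2 and 3.
owners = {"e12": (1, 2), "e23": (2, 3)}
local_splits = {1: {"e12"}, 2: {"e12", "e23"}, 3: set()}
conformed, log = simulate_protocol(local_splits, owners)
print(conformed)
for message in log:
    print(message)
```

In this scenario only slave 3 has to conform a triangle, because the edge it shares with slave 2 was bisected on the other side of the interface.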

Implementation and Test Data of the Parallel Algorithm
The parallel model was implemented on a cluster of 13 Dell PowerEdge 2950 stations, each with two Intel Dual-Core Xeon 5110 processors at 1.6 GHz and 4 MB of cache memory. Twelve stations have 2 GB of RAM and one has 4 GB. Each station has a 146-GB SAS hard disk and two Gigabit network interfaces.
All the stations ran 32-bit Scientific Linux 3.0.8. One station was configured with all the services of a gLite/LCG Computing Element and a gLite Storage Element, and 11 stations were configured as gLite Worker Nodes for computing work.

Test Data
To validate the performance of the implementation of the parallel model, three classical grids for the triangulation refinement problem were used: Key, S, and Africa. Table 1 shows the characteristics of the test grids. Each grid was refined with three percentages of marked triangles: 10%, 20%, and 30%. Figure 4 shows the grids used.

Distribution of Triangles to be Refined per Node
Each test grid of Table 1 was partitioned to distribute the load among the slave processors. The load was distributed among 2, 4, 8 and 16 slaves. As a summary, Figure 5 shows the distribution of the load in the four parallel cases (2, 4, 8 and 16 slave nodes) for the Key grid with 10% of the triangles marked. Figure 5 shows that the load is balanced across the processors not only in terms of the number of triangles of the grid, but also in terms of the marked triangles, which are those that must actually be processed in each slave node.
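One simple way to quantify the balance discussed here is the ratio between the largest load and the mean load, which equals 1.0 for a perfectly balanced partition. The sketch below applies it to both the total and the marked triangle counts per slave. The function name and the counts are placeholders chosen for illustration; they are not the values plotted in Figure 5.

```python
def imbalance(loads):
    """Largest load divided by the mean load (1.0 means perfect balance)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean

# Placeholder per-slave counts for a 4-slave partition (not measured data).
total_triangles = [250, 250, 250, 250]
marked_triangles = [26, 24, 25, 25]

print(imbalance(total_triangles))
print(imbalance(marked_triangles))
```

Checking the marked triangles separately matters because they, and not the total count, determine the refinement work of each slave.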

Results with Sequential 2T-LE Algorithm
For comparison, a sequential version of the 2T-LE algorithm was implemented and executed on the same cluster as the parallel model. Each grid was run sequentially with 10%, 20% and 30% of the triangles marked for refining. As a summary, Table 2 shows the results for the case of 30% of triangles to be refined in each test grid.

Results with Parallel 2T-LE Algorithm
To validate the implementation of the parallel model of the 2T-LE algorithm, it was executed on the cluster with the load distributed over 2, 4, 8 and 16 slave nodes. The results of the four test grids were obtained with 10%, 20% and 30% of the triangles to be refined. As a summary, Table 3 presents the results of the Key grid with 30% of triangles to be refined on 2, 4, 8 and 16 slave nodes. For comparison and for the speed-up analysis, the results of the sequential algorithm are also shown. As seen in Table 3, computation time decreases as the number of processors increases, achieving the speed-ups shown in the Speed Up row. Figure 6 presents the results graphically. It is seen that as the size of the problem increases, the speed-up of the parallel algorithm improves. That is, the speed-up achieved with 30% of triangles to be refined is better than with 20% (a lower computational requirement), and this in turn is better than with 10%. Figure 7 summarizes the speed-ups for the Key grid in the three cases of refinement (10%, 20% and 30%) as the number of slave nodes is increased from 2 to 16 (through 4 and 8).
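The text does not state the formula behind the Speed Up row; the sketch below assumes the usual definition, the sequential time divided by the parallel time, and also reports the efficiency per slave node. The timings in the example are placeholders chosen for illustration and are not the measurements of Table 3.

```python
def speedup(t_sequential, t_parallel):
    """Usual definition of speed-up: sequential time over parallel time."""
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, n_slaves):
    """Speed-up divided by the number of slave nodes."""
    return speedup(t_sequential, t_parallel) / n_slaves

# Placeholder timings in seconds (not taken from the paper).
t_seq = 8.0
t_par = {2: 5.0, 4: 3.2, 8: 2.0, 16: 1.6}

for n_slaves, t in sorted(t_par.items()):
    print(n_slaves, speedup(t_seq, t), efficiency(t_seq, t, n_slaves))
```

Counting only the slave nodes in the efficiency is an assumption made here; including the master processor would lower the value slightly.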

Figure 2. Partition from the longest side of the initial triangle τ_0

Figure 5. Distribution of the load of the Key grid with 10% of triangles to be refined.

Figure 7. Summary of speed up with Key grid with 10%, 20% and 30% of triangles to be refined.

Table 2: Results with sequential 2T-LE algorithm for Key grid

Table 3: Results with parallel algorithm for Key grid with 30% of triangles to be refined

Table 4 presents a summary with the average execution times for three of the test grids.