A Bi-Objective Clustering Algorithm for Gene Expression Data

Clustering algorithms are a common method for data analysis in many science field. They have become popular among biologists because of ease to discovery similar cellular functions in gene expression data. Most approaches consider the gene clustering as an optimization problem, where an ad-hoc cluster quality index is optimized which can be defined regarding gene expression data or biological information. However, these approaches may not be sufficient since they cannot guarantee to generate clusters with similar expression patterns and biological coherence. In this paper, we propose a bi-objective clustering algorithm to discover clusters of genes with high levels of co-expression and biological coherence. Our approach uses a multi-objective evolutionary algorithm (MOEA) that optimizes two index based on gene expression level and biological functional classes. The algorithm is tested on three real-life gene expression datasets. Results show that the proposed model yields gene clusters with higher levels of co-expression and biological coherence than traditional approaches.


Introduction
In the field of molecular biology multiple techniques have been applied for measuring the levels of RNA, DNA, proteins or metabolites. Microarray is one of the most popular techniques for global and simultaneous collection of thousands of these levels under different conditions: a time series during a biological process, experiments of different tissue samples, among others [1]. In bioinformatics, a gene expression microarray, can be represented with a matrix n × m, that contains the expression of n probes (for example genes) on m conditions (samples, tests, cases, among others). This matrix has the advantage of allowing the study of problems such as differential gene expression [2,3], patterns of genes with (dis)similar expression levels [4,5], prediction of response to treatment [6,7], detection of gene mutations [8,9].
Many of these gene expression matrix have a considerable size, so that to study the biological data it is required the use of appropriate computational methods. Clustering techniques have proved to be useful when handling large volumes of data simultaneously while, in the case of genes with (dis)similar expression, identifying groups involved in the same cellular function or that are regulated in the same way. This process, as stated in [10], becomes the goal of clustering since it aims to group n genes or conditions into k subsets in such a way that similar (according to their expression) elements are grouped together, while different elements belong to different groups. The problem of clustering can be seen as an optimization problem where some cluster quality index (objective) is optimized [11]. Quality index may represent different characteristics of the clusters [12], such as compactness, separation among clusters or connectivity within a cluster.
Popular clustering algorithms such as k-Means [13], Hierarchical clustering [14,15], PAM (Partitional Around Medoids) [16,17] and SOM (Self Organizing Maps) [18] have been used to find patterns of genes in microarray experiments optimizing one validity index. These approaches are known as Single-Objective Clustering (SOC) [19]. However, they have some problems: one index is rarely equally well applicable to different types of dataset, i.e. one criterion may not fit well with data distribution when regions of the data contains clusters of diverse shapes [20]; moreover, many of these algorithms often fall into local optima and sometimes they can deliver results that are meaningless for data analysis (e.g. when is optimized only separation validity index it could yield one group with all genes).
In order to overcome these issues, multi-objective optimization (MOO) approach arises as alternative. The works presented in [21,22,23,24] have explored the application of MOO to the clustering problem. In this case, we called Multi-Objective Clustering (MOC), and it corresponds to the simultaneous optimization of several validity index. This approach is a two-step process: (i) discovery of clusters by any clustering algorithm during the optimization of a quality index, and (ii) construction of an "optimal" set of solutions that correspond to various tradeoffs between different objectives [20]. The work of [25] uses a MOC approach incorporating a method called SiMM-TS (Significant Multi-class Membership -Two Stage). In the first stage, through a genetic algorithm (GA) the clustering process is performed based on expression levels and then SiMM points (genes are located in the regions of overlap of two or more clusters) are identified, which are temporarily removed from the data set. In the last stage, it performs a new clustering process, but without the SiMM points, through a genetic algorithm called MOGA (Multi-Objective Genetic Algorithm) which optimizes Xie-Beni (XB) and J m index. In [23], they also use MOGA but adding a technique of majority vote which identifies genes almost always grouped together in most Pareto optimal solutions. These genes are used as training data for a classifier, whereas the remaining genes are used as test data. Finally, both results are combined to obtain the final clustering of genes. In a more recent work [26], it is proposed a technique for gene clustering which optimizes several validity index (Davies-Bouldin (DB), Xie-Beni (XB), J m , P BM and S) and adds an interactive component through a decision maker (DM). The DM evaluates clustering solutions obtained based on the biological expertise in the area. If they are biologically related they are labeled as high-quality clustering. Clearly, the efectiveness of the technique will depend on domain knowledge and expertise at the biological level of DM.
In previous studies it is possible to note that clusters are generated using information about expression level and only in one work the biological information is used but not during the step of generation of clusters. If both information were used at the same time it could allow to find gene clusters with stronger relationship in expression levels and common biological properties. In this paper we propose a Dual-MOC model that optimizes two index which use information based on gene expression and biological functional classes. The goal is to produce gene clusters with higher levels of co-expression and biological coherence than SOC approaches.

Multi-Objective Clustering
A Multi-Objective Clustering (MOC) approach aims to determine the clustering C * for which: Where Ω is the set of feasible clustering, C is a clustering of a given dataset E, and P t , t = 1, . . . , m is a set of m different objective functions (quality index), i.e. that clustering C * corresponds to a cluster solution that has the best optimized m criteria P [27,19]. In SOC problems a solution with a higher objective function value is better than (in maximization case) one with a small objective function value. However, in MOC problems, there is no natural ordering in the objective space and often objectives functions tend to be in conflict [28], i.e. improve one involves worsening another. Because of that it is required to reach a "tradeoff" situation where all objective functions are satisfied to an acceptable degree. Such situation implies to find a set of best solutions, which are called non-dominated solutions [19].

Non-dominated solutions
Given two clustering solutions C 1 and C 2 ∈ Ω, solution C 1 is said to dominate solution C 2 (denoted as In other words, C 1 C 2 if and only if: C 1 is better than C 2 in at least one objective function P t and there is no worse objective function P t where C 2 is better than C 1 .

Pareto optimal set
Pareto optimal set Π is defined as: In other words, a Pareto optimal set Π corresponds to a set of non-dominated solutions, such that there is no other solution C within or outside the set that dominates any of them.

Proposed Model
In this section, we describe in detail our bi-objective gene clustering model called Dual-MOC. Figure 1 shows a basic schema of the model. The proposed model consists of two phases. In the first, called information integration, distance matrices of expression and biological functional similarity are combined. In the second, clustering step, a popular Multi-Objective Evolutionary Algorithm (MOEA) called NSGA-II (Nondominated Sorting Genetic Algorithm) [29] is used to optimize two objective functions.

Phase I: Information integration
In this phase, we use a distance matrix that quantifies how genes are correlated regarding expression value profiles. We measure such correlation using Pearson correlation (ρ) coefficient [30]. This matrix, is called distance of expression profiles (D EP ). In addition, we use a similarity matrix that measures the relationship among genes by functional similarity of them considering biological annotations terms of Gene Ontology (GO). We compute such relation through wang functional similarity (W S) [31]. This matrix, is called distance of biological functionality (D BF ).
Both matrices are combined in a dual similarity matrix according to [32,33] using a weighting parameter α ∈ [0, 1]. So that, a gene pair G 1 , G 2 have a weighted distance D α (G 1 , G 2 ) combining their distances of gene expression profiles D EP (G 1 , G 2 ) and biological functionality D BF (G 1 , G 2 ) as follows: Thus, if the α = 1 then be considering only distance of expression profiles, and if α = 0, would be considered only distances of biological functionality. Other α values consider to a greater or lesser degree the information of matrices of similarity.

Phase II: Clustering step
In this second phase, the bi-objective gene clustering process is performed. It is based on NSGA-II algorithm since it is the standard multi-objective algorithm [25] [23] [34] [35] and it has been demonstrated superiority in performance over other MOEAs when the goal is to yield clustering solutions of gene expression data.
Our Dual-MOC algorithm creates the initial population with clustering solutions produced by single objective approaches. Such initial population is used as input of the NSGA-II algorithm which is detailed above.
• An individual in the population is represented as a vector of integer values, where each one denote a cluster medoid. A cluster medoid is the most central gene located in a cluster, i.e. the sum of the distances to other gene of the cluster is minimum. Thereby, each individual has the same length as the number of clusters k, and each position in this chromosome can have an integer value from 1 to n, n being the number of genes of the dataset. Each element in dataset will be allocated to the nearest cluster medoid according to the dual similarity matrix.
• As selection method, we use a binary tournament along with Pareto ranking and crowding distance (which are characteristic of NSGA-II). To identify the winners individuals of each tournament, the individual belonging to the lower Pareto ranking is chosen and, in the case of a tie, the one with the highest crowding distance is selected.
• As evolutionary operators, we use single-point crossover and a controller-random mutation [36]. In the crossover, a single crossover point on both individuals (parents) is randomly selected. All cluster medoid beyond that point in either integer vectors are swapped between the two individuals. In controllerrandom mutation, a random position is selected from the chromosome and the corresponding gene index is replaced by a randomly chosen gene index from the dataset, so that the chromosome does not contain repeated gene index after mutation.
Our Dual-MOC algorithm performs the clustering process optimizing two objective functions. We use the Co-expression Indicator (CI) [37] and Biological Homogeneity Index (BHI) [38]. They optimise the relationships among genes at expression level and biological functionality, respectively. To compute CI is used the gene expression data, meanwhile to BHI a reference set of functional classes is required.

Computational Experiments
The goal of the experiments is to asses the performance of Dual-MOC to produce gene partitions with better relationships of expression and biologically meaningful clusters.

Datasets
Data used for experiments corresponds to three real-life microarray gene expression datasets: yeast cell cycle, medulloblastoma metastasis and arabidopsis thaliana. The data were normalized so that each row has mean 0 and variance 1. In Table 1, we show a summary of main features of datasets. Here, column original corresponds to the number of genes available in microarray. Column reduced corresponds to the number of genes available after of a pre-processing stage where duplicated genes and missing values of expression levels are remove from microarray. Such reduced number of genes are finally used in computational experiments.

Yeast cell cycle
The yeast cell cycle dataset was collected from http://faculty.washington.edu/kayee/cluster. It shows the fluctuation of expression levels of approximately 6000 genes over two cell cycles (17 time points). Out of these, 384 genes are selected to be cell-cycle regulated and to be unique [26].

Medulloblastoma metastasis
The Medulloblastoma metastasis dataset was contributed by [39]. The dataset was collected from the Gene Expression Omnibus (GEO) using the code GSE468. It contains 2059 probes in 23 primary medulloblastomas clinically designated as either metastatic or non-metastatic samples [40]. Out of these probes, 1465 correspond to genes and 585 are selected to be unique. The data are publicly available at http: //www.biolab.si/supp/bi-cancer/projections/info/meduloblastomiGSE468.htm.

Arabidopsis thaliana
The Arabidopsis thaliana dataset was collected from http://anirbanmukhopadhyay.50webs.com/data. html and it shows the gene expresion values of 138 genes over eight time points (15 min, 30 min, 60 min, 90 min, 3 h, 6 h, 9 h, and 24 h). Out of these, 133 genes are selected to be unique.

Distance Metrics
As previously mentioned, we use two measures of distance, Pearson correlation coefficient to measure the degree of similarity of genes according to their expression level and Wang functional similarity [31] that measures the relationship regarding the biological information using Gene Ontology (GO).

Pearson correlation coefficient
The Pearson correlation coefficient (ρ) is a measure of similarity indicating how and how strongly gene expression of two genes (G x , G y ) for several different conditions are related. This correlation coefficient is computed as follows: where x i , y i are the gene expression of G x and G y in the i-th condition;x,ȳ correspond to the mean of gene expression in all conditions; and, σ x , σ y are the standard deviation of the gene expression of x and y.
In our experiments, we compute 1−ρ Gx,Gy . Thus, values close to 0 would indicate high similarity between the genes G x , G y .

Wang functional similarity
Wang functional similarity (W S) [31] is an information content (IC) based metric which determines the similarity of two GO (Gene Ontology) terms based on both the locations of these terms in the GO graph and their relations with their ancestor terms. First, the measure defines the semantic similarity between one go term and a GO term set GO = {go 1 , go 2 , . . . , go k }, that is the maximum semantic similarity between term go and any of the terms in set GO. Calculated as follows: Where, S GO (go, go i ) is the semantic similarity between go, go i and k is the number of terms in set GO. Therefore, given two genes G x and G y annotated by GO term sets GO 1 = {go 11 , go 12 , ..., go 1m } and GO 2 = {go 21 , go 22 , ..., go 2m } respectively, their Wang functional similarity is defined as: This measure obtains values within the range from 0 to 1. We use 1 − W S(G 1 , G 2 ), so that values close to 0 indicates that gene G 1 and G 2 are similar semantically.

Objective function (CI)
We use the co-expression indicator (CI) as objective function. This indicator was proposed by [37] who defines it as the average of correlations between the genes of the same cluster C g based on expression levels. It is calculated as follows: where M is the number of samples, E mGx and E mGy are the expression for sample m of the genes G x and G y respectively.Ē Gx andĒ Gy are the mean of the M expression values of the genes G x and G y respectively. σ E Gx and σ E Gy are the standard deviation of the M expression values of the genes G x and G y respectively. In our experiments we consider a variation proposed by [33] as follows: where k is max number of clusters, and therefore we measure the average co-expression of the clustering CI(solution). It variation obtains value within the range from 0 to 1, so that values close to 1 indicates that gene in clusters are co-expressed.

Objective function (BHI)
We use the Biological Homogeneity Index (BHI) as objective function. It measures the biological significance of clusters. This criteria was proposed by [38] and it requires a reference set of functional classes which is created using some functional annotation tool. BHI is calculated as follows: where K is the maximum number of partitions of the clustering solution, n j is the number of annotated genes in cluster D j . C(x) and C(y) are the functional classes where gen x and gen y are contained. Now, suppose that two annotated genes x and y belong to the same cluster D produced by some algorithm. Let C(x) and C(y) be the functional classes that contain the genes x and y, respectively. The function I(C(x) = C(y)) will have the value 1 if C(x) matches with C(y) [26], i.e. genes are listed in the same functional class. In this case at least one match is considered sufficient. BHI obtains value within the range from 0 to 1, so that values close to 1 indicates that clustering solution has more biologically homogeneous groups.

Functional Classes
We create functional classes of three datasets using FatiGO [41]. For cell cycle dataset of 384 genes, 243 were annotated and functional classes containing at least 10% genes were selected. It produced 5 overlapping classes corresponding to biological processes. For medulloblastoma metastasis of 585 genes, 484 were annotated and functional classes containing at least 5.4% genes were selected. It produced 11 overlapping classes corresponding to biological processes. For arabidopsis thaliana of 133 genes, 81 were annotated and functional classes containing at least 12.8% genes were selected. It produced 10 overlapping classes corresponding to biological processes. A summary of the functional classes is shown in Table 2.  [25,23,26,42] applied to gene clustering.

Performance metric
To compare the efficiency of Dual-MOC, we use the metric hypervolume (HV) [43], because according to [44] it is the most widely used metric of performance in multi-objective scenarios. HV is a unary metric that measures the volume (area in our case) in the objective function space covered by members of an Pareto set Q. Here, HV for every i ∈ Q computes an area a i regarding a reference point W . Thereby, the union of all areas a i define the hypervolume value.

Contestant algorithms
In our experiments, we aim to demonstrate the advantage of bi-objective clustering that simultaneously optimizes objective function based on expression and biology compared with SOC approaches based on optimizing only one objective at a time. For this purpose, we compare our model against three different algorithms which optimized separately expression and biology criteria. The competitive algorithms are: k-Means: It is a partitioning method that uses a minimum "within-class sum of squares from the centers" criterion to select the clusters [38] i.e., partition the set of n observations into k clusters, so that each observation belongs to the cluster closest to the mean.
PAM: Partitioning Around Medoids is another partitioning method that is used to cluster the types of data in which the mean of objects is not defined or available. This algorithm finds the representative object (i.e., medoid, which is the multidimensional version of the median) of each cluster mean while a cost function is minimized [45].
UPGMA: Unweighted Pair Group Method with Arithmetic mean is perhaps the most commonly used clustering method with microarray data sets [38]. It is an agglomerative hierarchical method that yield a dendrogram can be cut at a chosen height to produce the desired number of clusters [46]. It uses a dissimilarity matrix in order to decide if two expression or biological profiles are close or not.

Development Tools
All the algorithms used in the experiment were implemented using libraries of R [47] and computational tests performed on a computer with Intel(R) Core(TM) i7-3930K CPU 3.20GHz of 6 cores, 16GB RAM running Ubuntu OS, kernel version 3.13.0-87-generic. UPGMA (hclust) is available in the base distribution of R. PAM and k-means are available in the "amap" library [48]. Some features of NSGA-II and HV used were implemented by authors and others are available in the "mco" library [49] and "nsga2R" library [50]. The distance of expression profiles was calculated using functions of the base distribution of R. The distance of biological functionality using the "GOSemSim" library [51]. The objective function CI (Coexpression Indicator) was implemented by authors and the objective function BHI (Biological Homogeneity Index) was computed using the "clValid" library [52].

Results and Discussion
We first show the α parameter incidence in terms of the size of the objective space covered by Dual-MOC in the 20 runs. To do this, we calculate the hypervolume (HV) indicator using as reference point the worst values of the objective functions, i.e. (0, 0). Figure 2 shows the HV values obtained for all datasets.  In Figure 2, the best values of HV are identified by a black circle. In the case of yeast dataset (Fig. 2a), the best value of HV (0.74897) is reached when α = 0.2. In the case of medulloblastoma (Fig. 2b), the best value of HV (0.63658) is reached with α = 0.8. Meanwhile, arabidopsis thaliana (Fig. 2c) obtained an HV value of (0.60392) when α = 0.1. A summary with the best α values and related k configuration found by Dual-MOC in terms of the size of objective space covered (HV) are shown in Table 3. Based on the best configurations in Table 3, we compare our results against three SOC algorithm: k-Means, PAM and UPGMA under three cases: (a) using only distance of expression profile, (b) only distance of biological functionality and (c) combination of both.
For each SOC algorithm, we report the best value found according to both CI and BHI index out of a 20 runs. Note that in the case of k-Means algorithm, two best values are available. The solution labeled as "k-Means-exp" corresponds to the solution with the highest value of CI, on which BHI value was computed. While the solution labeled as "k-Means-bio" corresponds to the solution with the highest value of BHI, on which CI was computed.

Results for yeast cell cycle data
Case a: expression Here, we show the results considering α = 1, i.e., clustering of genes using only distance of expression profile. To facilitate visualization we consider only results for k = {5, 7, 9, 10}. Table  4 shows the CI and BHI values for three SOC algorithms and the non-dominated solutions of Dual-MOC model. Here, it is evident that all Dual-MOC solutions achieve better values (bold) than the SOC algorithms at least in one of the objectives under evaluation. In particular, in the case of k = {7, 10} values (bold and underline) of CI and BHI defeat the best (in italic) SOC algorithm in both index.   Table 5, we show the results using only distance of biological functionality (α = 0). Here, for instance when the dataset is partitioned into k=5 clusters, we have that two of the seven non-dominated solutions achieve better values (in bold and underline) in BHI and CI index than the SOC algorithms. Similarly when the partition has k=7 clusters, we have five (of nine) solutions of Dual-MOC exceed SOC solutions values in BHI criteria. For other cases, solutions found by Dual-MOC achieve better values (bold) than the SOC algorithms at least in one of the objectives under evaluation. In particular, when k = 10 solutions yield by Dual-MOC achieved higher values of BHI than the best (italic) SOC algorithm solution.

Results for medulloblastoma metastasis data
Case a: expression In Table 6, superiority of Dual-MOC regarding the best (in italic) SOC solution is clearly observed. It achieves better values in most non-dominated solutions (in bold) in CI criterion. In the case of k = 7, nine out of ten of the non-dominated solutions outperforms the best SOC solution on the mentioned criterion. For the remaining cases of k, there is also at least one non-dominated solution that is better in both criteria. Meanwhile, for k=10, 11 (in bold and underline) of the 15 non-dominated solutions dominate all the SOC solutions. Table 7 shows the results of considering an α = 0, i.e., clustering of genes using only distance of biological functionality. Here, Dual-MOC achieves better values (in bold) in BHI criterion than

Results for arabidopsis thaliana data
Case a: expression In Table 8, we show the results using only distance of expression profiles (α = 1).
Here, for example in cases k = {5, 7, 9}, our model found at least one solution with the best values in both objectives (bold and underline). For k=10, there are six of twelve non-dominated solutions that are better in CI criterion than the best SOC solution.
Case b: biology In Table 9, we observe that for all k cases our proposed model achieved at least one solution with the best values in both objectives function (bold and underline) regarding the best values (italic) the SOC algorithms. When the dataset was partitioned in k = 10 groups, Dual-MOC demonstrated superiority since it achieved the highest values (bold and underline) for both CI and BHI criteria in 6 out of 15 non-dominated solutions.

Best combination
Here, we show the best α combination according to the HV value achieved. To do this, we use the information in Table 3. Thus, in Figure 3 we show for each dataset, the best set of non-dominated solutions yield by Dual-MOC and the best solutions obtained by competitive algorithms. Here, axes indicate values of both objective functions achieved by the algorithms in their respectively k configuration.       There is only one solution in the approximate Pareto front that does not outperform the solution obtained by k-means-exp. Instead, in CI criterion our proposed model has about five solutions that are better than all SOC algorithm.
As it was previously shown, non-dominated sets yield by Dual-MOC consist of a large number of solutions with tradeoffs in both objectives which in many cases are simultaneously better than solutions yield by SOC algorithm in all criteria optimized. Such situation involves that gene clusters discovered by Dual-MOC have higher levels of co-expression and biological coherence than traditional approaches. The selection of a single solution will depend on the instance of the problem and the expertise of the user that is interested on the results. Automatically recommending a single solution in a multi-objective problem remains an open problem and it is an active field of research called Multi-Criteria Decision-Making [53].

Statistical test
To prove whether Dual-MOC and SOC algorithms perform similarly or not, we have carried out Kruskal-Wallis statistical test [54] followed by Nemenyi post-hoc test [55]. Here, we consider the mean values in each objective function of solutions in non-dominated set with the best combination and SOC algorithms solutions obtained over 20 consecutive runs. In this test, as a null hypothesis, we assume that there is no significant difference between pairwise, whereas the alternative hypothesis is that there is significant difference between them. As p-value threshold, we considered an overall 5% significance level. The results in terms of p-values are reported in Table 10. Here, we observe that all the p-values are lower than 0.0166 (0.05/3). This is strong evidence that Dual-MOC obtains median values of CI and BHI criteria better than SOC algorithms.

Conclusions and Further Work
This paper proposes a bi-objective algorithm for gene expression data that integrates expression data and external biological knowledge controlled by an α parameter and based on Gene Ontology (GO) annotations. The performance of the clustering proposal has been tested for three real-life gene expression datasets and compared with three widely used single objective clustering algorithms. The highest values of the optimized objective functions are achieved in non-dominated sets with an α = 0.2 and k = 10 for yeast cell cycle, α = 0.8 with k = 9 for medulloblastoma metastasis and α = 0.1 and k = 8 for arabidopsis thaliana.
Results indicate that our bi-objective clustering algorithm achieves higher values of co-expression indicator (CI ) and Biological Homogeneity Index (BHI) than competing algorithms.
The statistical test allows to demonstrate that Dual-MOC outperforms traditional single-objective clustering techniques across a diverse range of data sets. It implies gene clusters found by our model have higher levels of co-expression and biological coherence than benchmark algorithms.
Future work will incorporate local search strategies to improve the solutions found by the algorithm and we will set the α value automatically for each dataset used.