Particle Swarm Optimization for Triclustering High Dimensional Microarray Gene Expression Data

 

Swathypriyadharsini P1 , K Premalatha2

1Research Scholar, Anna University, Chennai, India.

2Professor, Department of Computer Science and Engineering, Bannari Amman Institute of Technology, Sathyamangalam, India.

*Corresponding Author E-mail: swa.pspd@gmail.com, kpl_barath@yahoo.co.in

 

ABSTRACT:

The study of high dimensional microarray gene expression data represents the large computational challenge due to its huge volume of the data. Many clustering techniques are applied to extract the coexpressed genes over the samples. Biclustering improved the traditional clustering by grouping the genes that similarly expressed over only a subset of samples. However, to cluster the high dimensional data with three dimensions such as genes, samples and time points, Triclustering technique is employed for grouping the coexpressed genes over a subset of samples under a subset of time points which imposes huge computational burden. In this paper, Particle Swarm Optimization technique is applied to extract the triclusters from the high dimensional data with objective function as Mean Square Residue. The algorithm is applied to three real life microarray gene expression data and the performance of the work is analyzed using the objective function. The biological significances of the extracted triclusters from all the three datasets are also analyzed. The biological significance analysis are also compared with other triclustering algorithms and the proposed work outperforms the other algorithms.

 

KEYWORDS: Particle Swarm Optimization, Triclustering, High Dimensional Data, Microarray Gene Expression Data, Mean Square Residue.

 

 


INTRODUCTION:

Microarray technology is very effective in measuring the expression values of thousands of genes at a time under multiple experimental conditions and it has reformed the study of microarray gene expression data[1]. High dimensional data in microarray technology considered here measures the expression values of thousands of genes under a relatively small number of experimental samples or conditions across different time points in a single experiment. Since the amount of such data is very large, many computational methods are required for analyzing such datasets. Clustering technique is an unsupervised approach that groups a set of objects as clusters in which intra-cluster similarity is high and inter-cluster similarity is low[2,3].

 

 

 

Clustering is one of the machine learning algorithm that are applied in various fields[4,5,6]. The traditional clustering methods like k-means clustering algorithm which fails for clustering genes which are coexpressed over subset of samples[7,8]. This problem can be solved through biclustering which clusters subset of genes that are similar under subset of samples[9]. New algorithms need to be developed for addressing mining problems such as clustering, classification. Disadvantage of this approach is each dimension is reconstructed independently. So, there is loss of implicit data in multidimensional records[10]. Triclustering deals with three dimensional datasets which finds coexpressed genes under subset of experimental conditions across subset of time points or space.

 

A very first triclustering algorithm proposed for mining coherent clusters from three dimensional data is called “triCluster”[11]. It is the extension of the biclustering algorithm which uses heuristic approach to retrieve biclusters with mean square residue as the objective function. A time-frequency based full-space algorithm is proposed which uses a measure of functional correlation set between time course vectors of different genes[12]. Next, the tricluster algorithm is proposed that aims to retrieve groups of genes that have similar expression profiles over a subset of samples during a subset of time points. It also provides set of metrics to assess the quality of the triclusters and tested the approach on real life microarray datasets[13].

 

Later, “gTricluster” algorithm is introduced which extends clique search technique for three dimension data and retrieves triclusters using spearman rank correlation similarity measure[14]. The δ-TRIMAX algorithm is the extension of triCluster algorithm which extracts triclusters with mean square residue score below a threshold value δ[15]. TriGen algorithm implementing the optimization technique of genetic algorithm for extracting triclusters from temporal datasets[16]. The multi-objective optimization algorithm EMOA-δ-TRIMAX is proposed which extends δ-TRIMAX algorithm with the genetic algorithm optimization technique[17]. Common Optimization techniques like genetic algorithms are also applied in medical data mining for producing effective results[18].

 

Particle Swarm Optimization (PSO) is a heuristic search technique which is inspired by the swarming behaviour of biological population such as schools of fish or flock of birds[19]. Each bird or fish is represented as a single solution in the search space and it is called as particle. The PSO and its several variants have been popular because of its ease of implementation, intuitiveness and the ability to solve non linear problems. It is widely used in many branches of engineering including artificial neural networks, wireless sensor networks and many others.

 

MATERIALS AND METHODS:

Particle Swarm Optimization:

Particle Swarm Optimization (PSO) was introduced by Kennedy and Eberhart which is inspired from the controlled motion of swarm of birds analyzing the notion of “collective intelligence”, also known as “swarm intelligence” in biological populations[20]. The most commonly used example is the behaviour of fish school. These animals are characterized by its movement where each individual have only a limited intelligence of focussing only on its position in the swarm. It has some simple rules such as “going with the same speed of neighbours”, “moving in same direction” or “staying close to neighbours”. The global intelligence of swarm is a direct consequence of local interactions between different particles.

 

Initially, a random set of solutions called as initial swarm, propagates in the design space over a number of iterations. Particles adjust their flying directions based on its own flying experience and also its neighbour flying experience. It is inspired by the visualization of flocks of birds, adapting the environment and finding the food sources by implementing information sharing approach. Many revised variants of PSO have emerged with various concepts but the basic technique is population based optimizer, a number of parameters such as inertia weight and confidence factors give the simple version of PSO[21, 22].

 

PSO algorithm:

The steps of PSO algorithm is given as follows

Initialize a population is represented as, where Xi is a particle that moves within the multidimensional search space.  A particles position and velocity are its properties. The particle position represents the candidate solution.  The particle velocity is the information about direction and varying rate. A particle in d-dimensional search space is given as , the velocity for a particle i is given as  and the best previously visited position of  a particle is given as .

 

For each particle, evaluate the optimization fitness function in N variables and compare the particles fitness with its pbest. If the current value is better than the pbest, then set pbest to current location X in N-dimensional space.

 

Update each particles velocity according to the equation(1).

 

 

 

(1)

Where  is the particle velocity of dth component after (k+1)th update.  is currently the best particles solution after (k+1)th update.  is the global best solution of the dth component after (k+1)th update.  and  are the positive constants known as acceleration coefficients which controls the movement of the particles.  is the inertia weight which controls the effect of previous particles velocity on next one.  and are random variables within the range [0,1].

 

Update each particles position according to the equation (2).

                          (2)

Where  is the particles position after (k+1)th update and  is the particles velocity after (k+1)th update. The position and the velocity of each particle are updated consecutively for much iteration until a termination condition is reached[23].

 

Loop to step 2 until stopping criterion is met which may be a good fitness function or reaching a maximum number of iterations.

 

Triclustering:

Traditional clustering algorithms are applied to analyze the microarray data which work in the full dimensional space. For a microarray gene expression dataset say D, clustering technique considers the value of each object in all the dimensions and groups the similar expression patterns together. Biclustering technique emerged to overcome this strict constraint in clustering. Biclustering extracts co-expressed patterns of a subset of genes under a subset of the samples or experimental conditions. Triclustering emerges an an evolution of biclustering by mining gene expression patterns across samples and time points. The clustering of similar genes that exhibit similar behaviour only under a subset of conditions over only some set of time points is known as triclustering. Figure 1 shows the three dimensional matrix with genes in one dimension, samples in next dimension and time points as the third dimension.

 

 

Figure 1: Three Dimensional Data Matrix

 

Objective Function:

Tricluster is given as where , and . The cuboid C represents subset of genes which have similar expression values over a subset of samples during subset of time points. The objective function is the Mean Square Residue (MSR) of the tricluster which is modelled as in equations (3) and (4) [24].

                                   (3)

 

 

           (4)

 

Where  is the mean of genes under samples at a time point,  is the mean of the genes over time under a sample,  is the mean of a gene in time under the samples,  is the mean of the genes under a sample and a time point,  is the mean of the values of a gene at a time point under samples,  is the mean of a gene under a sample at all time points and  is the mean value of all values in the tricluster.

 

Where  is the mean of genes under samples at a time point,  is the mean of the genes over time under a sample,  is the mean of a gene in time under the samples,  is the mean of the genes under a sample and a time point,  is the mean of the values of a gene at a time point under samples,  is the mean of a gene under a sample at all time points and  is the mean value of all values in the tricluster.

 

RESULTS:

Dataset Description:

PSO algorithm is implemented on three real life temporal high dimensional datasets which are obtained from Gene Expression Omnibus (GEO). Three datasets are experiments of various diseases. All the experiments are time series experiments that have the behaviour of genes under different samples across different time points.

 

Dataset 1:

The real life time series dataset GDS5192 is a experiment of Homo Sapiens. It holds 22158 Affymetrix human genome U133A 2.0 probe ids. Cultures of primary fibroblasts from neonatal skin treated by Egr1 and Tgfb1 were measured at 24 and 48 hours.  Two biological replicates per time point were measured.

 

Dataset 2

This time series dataset is obtained from GEO with the accession code GDS1489 which is experiment for Mus Musculus for growth hormone treatment. It holds 12488 genes in Affymetrix Murine Genome U74A Version 2 Array. It has expression profiling of adipocytes treated with growth hormone and control for 30 mins, 4th hour and 48th hour in three independent experiments.

 

Dataset 3:

The temporal dataset GSE74259 is a time course experiment of Saccharomyces cerevisiae which holds 10928 genes Affymetrix Yeast Genome 2.0 Array. The cultures of BY4741 yeast were grown in non-fermentable yeast peptone glycerol ethanol at 30 degree Celsius and treated for 0, 60, 120 and 240 mins that is for two replicates.

 

The PSO algorithm is implemented in MATLAB version 7.14 (R2012a). The number of particles, number of iterations and inertia weight is fixed throughout for all the three datasets [19]. Table 1 shows the parameters and values considered for the PSO algorithm. Table 2 shows the average fitness function over 10 runs for all three datasets. The lower MSR value indicates that the quality of the triclusters is good. Thus, the MSR value is low for all the three datasets indicating that the extracted triclusters are with highly coexpressed genes.

 

Table 1: Parameters and Values considered for CS and PSO

Parameter

Value

Number of particles (p)

50

Number of iterations

20

C1 and C2

2.1

Inertia Weight (ω)

0.9

 

Table 2: The Average Objective Function over 10 Runs

Generation

MSR

Dataset 1

Dataset 2

Dataset 3

1

3.90E-02

1.55E-01

2.12E-01

5

3.85E-02

1.54E-01

2.06E-01

10

3.82E-02

1.53E-01

2.05E-01

15

3.80E-02

1.51E-01

2.04E-01

20

3.77E-02

1.47E-01

2.03E-01

25

3.74E-02

1.46E-01

2.02E-01

30

3.64E-02

1.41E-01

1.99E-01

35

3.57E-02

1.09E-01

1.97E-01

40

3.48E-02

1.06E-01

1.89E-01

45

2.81E-02

1.00E-01

1.80E-01

50

2.62E-02

1.00E-01

1.79E-01

 

Each dataset have I genes, J samples and K time points which is given as input to the proposed work and it produces clusters which has i genes, j samples and k time points for which   and  . Figure 2, 3 and 4 shows the sample triclusters obtained from the proposed PSO algorithm for the three datasets respectively.

 

Figure 2: Sample Tricluster for Dataset 1

Figure 3: Sample Tricluster for Dataset 2

 

Figure 4: Sample Tricluster for Dataset 3

 

DISCUSSION:

Performance Comparison:

The performance of the proposed work with different representations is compared with other triclustering algorithms based on two validation indexes. The first measure is the Triclustering Quality Index (TQI) which is given in equation (5) [25].

 

                                       (5)

Where  is the mean squared residue of the tricluster  and  is the volume of the tricluster . The volume of the ith tricluster is defined as  where ,  and  represent the number of genes, samples and time points of the ith tricluster. A lower TQI represents the better quality of the triclusters.

 

The second measure is the Statistical Difference from Background (SDB) score which signifies whether a set of n triclusters are statistically different from the background data matrix or not [25]. The SDB score is given in the equation (6).

 

 

Where n is the total number of triclusters extracted by the algorithm.  represents the mean squared residue of the ith tricluster and  is the mean square residue of the jth random tricluster having the same number of genes, samples and time points as the ith resultant tricluster. The higher the value of the denominator denotes the better the quality of the resultant tricluster. Hence, lower SDB score signifies better performance of the algorithm. Table 3 shows the comparison of the performance of various algorithms in terms of SDB and TQI indexes.

 

Table 3: Performance Comparison

Algorithm

SDB

Average TQI

Tricluster using PSO for Dataset 1

0.2572

4.21E-09

Tricluster using PSO for Dataset 2

0.3323

3.27E-09

Tricluster using PSO for Dataset 3

0.2095

2.01E-09

δ-TRIMAX

0.4679

3.08E-05

TRICLUSTER

0.4773

3.35E-05

 

Biological Significance

Dataset 1

The biological significance of the genes belonging to the clusters is identified by performing Gene ontology and KEGG pathway enrichment analysis. David ontology tool which is freely available in the internet is used for the gene enrichment analysis [26]. Gene ontology analysis is performed to find the biological significance of the gene belonging to the triclusters. The genes which have lower p-value are selected representing the higher significance level. The functional process of the genes in the tricluster is identified through the functional annotation. The genes from three sample triclusters are given as input for performing functional annotation. Table 4 shows the functional annotation results for Dataset 1. According to the p-value, the genes are ordered. The functional property “Alternative Splicing” has the lowest p-value 3.0E-244 with the gene count of 8819. In the Tricluster 3, there are 4262 genes with the property “Cytoplasm” having the lower p-value 1.7E-32.  In Tricluster 5, “Sequence variant” is the main function having 9566 genes with low p-value 2.6E-30.

 


Table 4: Functional Annotation of Dataset 1

Tricluster

Category

Term

Gene Count

p-value

Benjamini

Tricluster 1

UP_KEYWORDS UP_SQL_FEATURE

UP_KEYWORDS

Alternative splicing

Splice variant

polymorphism

8819

6398

9355

3.0E-244

8.1E-109

1.5E-61

2.2E-241

1.9E-104

3.6E-59

Tricluster 3

GOTERM_CC_DIRECT

UP_KEYWORDS UP_KEYWORDS

Cytoplasm

Disease mutation

Coiled coil

4262

2105

2459

1.7E-32

2.7E-32

2.2E-27

1.3E-29

2.8E-30

2.0E-25

Tricluster 5

UP_SEQ_FEATURE

UP_KEYWORDS  UP_KEYWORDS

Sequence variant

Metal binding

Nucleotide binding

9566

2938

1488

2.6E-30

2.2E-30

1.4E-24

3.1E-26

3.1E-26

1.0E-22

 

Table 5: Functional Annotation of Dataset 2

Tricluster

Category

Term

Gene Count

p-value

Benjamini

Tricluster 1

UP_KEYWORDS

UP_KEYWORDS

GOTERM_CC_DIRECT

Phosphoprotein

Acetylation

Extracellular exosome

3409

1666

1380

7.9E-292

3.5E-208

1.8E-111

4.3E-289

9.4E-206

2.4E-108

Tricluster 3

GOTERM_MF_DIRECT

UP_KEYWORDS

UP_KEYWORDS

Protein binding

Nucleus

Ubl conjugation

2032

1966

747

6.6E-132

3.3E-119

3.1E-68

2.0E-128

6.0E-117

3.4E-66

Tricluster 5

GOTERM_CC_DIRECT

UP_KEYWORDS

UP_KEYWORDS

Cytosol

Isopeptide bond

Transcription regulation

889

495

808

3.0E-58

4.4E-51

1.7E-49

9.8E-56

4.0E-49

1.2E-47

 

Table 6: Functional Annotation of Dataset 3

Tricluster

Category

Term

Gene Count

p-value

Benjamini

Tricluster 1

GOTERM_BP_DIRECT

GOTERM_CC_DIRECT

UP_KEYWORDS

Transport

Membrane

mRNA transport

598

1203

287

1. 9E-10

2.5E-10

7.6E-6

5.1E-7

1.9E-7

6.9E-3

Tricluster 3

GOTERM_CC_DIRECT

KEGG_PATHWAY

GOTERM_CC_DIRECT

 

Nucleus

Metabolic pathways

Plasma membrane

1515

469

346

7.9E10

9.2E-9

1.7E-4

5.9E-7

1.0E-6

3.1E-2

Tricluster 5

GOTERM_BP_DIRECT

GOTERM_CC_DIRECT

UP_KEYWORDS

Mitotic nuclear division

Spindle pole body

Mitosis

921

65

103

6.5E-6

1.2E-5

2.3E-4

1.8E-2

2.4E-3

8.2E-2

 


 

 

Dataset 2:

Table 5 shows the functional annotation of Dataset 2. In Tricluster 1, 3409 genes are with the functional property “Phospho protein” with a very low p-value 7.9E-289. In Tricluster 3, 2032 genes have “Protein binding” property with a p-value 6.6E-132. In Tricluster 5, “Cytosol” is the top function with a p-value 3.0E-58 containing 889 genes.

 

Dataset 3:

Table 6 gives the functional annotation results of the triclusters extracted from Dataset 3. In Tricluster 1, the biological process function “Transport” has the lowest p-value as1.9E-10 with gene count 598 but the second cellular component function “Membrane” has the larger gene count 1203 with p-value 2.5E-10. In Tricluster 3, cellular component function “Nucleus” has the p-value 7.9E-10 with 1515 genes. In Tricluster 5, the function “Mitotic nuclear function” is the highest with 921 genes and 6.5E-6 as the p-value.

 

Biological Significance Comparison:

Fig. 7 shows the biological significance comparison of the proposed work with the existing algorithms. This is performed by calculating the number of genes in the triclusters which hit the TFBS analysis with the lowest p-value. The proposed PSO based triclustering algorithm has higher percentage of genes in the tricluster with lower p-value than the other triclustering algorithms. The inverse trend of genes hitting the TFBS analysis is observed with proposed work which has largest population at the lowest p-values whereas the other algorithms have increasing population with increasing p-values.  For a very low p-value, the proposed algorithm has larger number of genes contained for it and it decreases has the p-values increases.


 

 

Figure 5: Biological Significance Comparison

 


CONCLUSION:

Clustering techniques are applied to analyze gene expression data from microarray experiments. Biclustering has the ability to mine subgroups of genes and conditions from the data set. Triclustering contains a subset of genes that contains information related to the behaviour of some genes from under some conditions over certain time periods. In this work, particle swarm optimization is employed to identify the triclusters in high dimensional data.  The three real life high dimensional datasets are chosen for experimental analysis. Mean Square Residue of the triclusters is the objective function which is very lower for all the three datasets. The Biological significance analysis of the extracted triclusters of all the three datasets is analysed. Also, the biological significance analysis that are performed are compared with other triclustering algorithms in terms of the p-values obtained and the results prove that the proposed algorithm performs better than the existing triclustering algorithms 

 

 

REFERENCES:

1.     Gunavathi C, Premalatha K, Sivasubramanian K. A Survey on Feature Selection Methods in Microarray Gene Expression Data for Cancer Classification. Research J. Pharm. and Tech. 2017; 10(5): 1395-1401.

2.     Nallakaruppan M. K, P. Ilango, N. Deepa, Anand Muthukumarappan. Clustering of Wireless Sensor Network Data. Research J. Pharm. and Tech. 2017; 10(1): 73-82.

3.     Senthil Kumar T, Boselin Prabhu S R, S. Sophia. The Impact of Clustering Mechanism in Dense Wireless Sensor Network. Research J. Engineering and Tech. 7(1): 19-23.

4.     Meenakshi K, Safa M, Karthick T, Sivaranjani N. A Novel Study of Machine Learning Algorithms for Classifying Health Care Data. Research J. Pharm. and Tech. 2017; 10(5): 1429-1432.

5.     Rakesh Kumar Soni, Shrikant B. Burje. Tumor Detection Algorithm Using Boundary Based Approach. Int. J. Tech. 1(2): 101-106.

6.     B. Lokapriya Nandan, Velin Marita Sequeira, Kanika Verma, K. Ramanathan. Exploring beta-tubulin-protein interactome using computational techniques. Research J. Pharm. and Tech. 8(12): 1679-1684.

7.     Rajeshri Lanjewar, Tripti Sharma. A Determining Approach for Customer Behavior Analysis Using K-Mean Clustering. Int. J. Tech. 2011; 1(2): 125-129.

8.     Santanu Halder, Abul Hasnat, Debotosh Bhattacharjee, Mita Nasipuri. A Novel Low Space Image Storing and Reconstruction Method by K-Means Clustering Algorithm. Int. J. Tech. 2014; 4(1): 186-196.

9.     Cheng Y and Church GM. Biclustering of expression data. Proceedings of International Conference on Intelligent Systems Molecular Biology, 2000; 93-103.

10.   Kumaran U, Neelu Khare, A Sai Suraj. Privacy Preserving in Data Mining Technical: A Review. Research J. Pharm. and Tech 2016; 9(11): 2023-2026.

11.   Zhao L and Zaki MJ. TRICLUSTER: an effective algorithm for mining coherent clusters in 3D microarray data. Proceedings of the 2005 ACM SIGMOD international conference on Management of data. 2015; 694–705.

12.   Feng J, Barbano PE and Mishra B. Time-frequency feature detection for timecourse microarray data. Proceedings of the ACM Symposium on Applied Computing. 2004; 128-132.

13.   Zhao L and Zaki MJ. TRICLUSTER: an effective algorithm for mining coherent clusters in 3D microarray data. Proceedings of the 2005 ACM SIGMOD international conference on Management of data. 2015; 694–705.

14.   Jiang H. Zhou S, Guan J and Zheng Y. gTRICLUSTER: A More General and Effective 3D Clustering Algorithm for Gene-Sample-Time Microarray Data.  Data Mining for Biomedical Applications.2006; 3916: 48–59.

15.   Bhar A, Haubrock M, Mukhopadhyay A, Maulik U, Bandyopadhyay S and Wingender E. δ-TRIMAX: Extracting triclusters and analyzing coregulation in time series gene expression data. Springer. 2012; 7534: 165–177.

16.   Aviles DG, Escudero CR, Alvarez FM and Riquelme JC. TriGen: A genetic algorithm to mine triclusters in temporal gene expression data. Neurocomputing. 2014; 132: 42–53.

17.   Bhar A, Haubrock M, Mukhopadhyay A, Maulik U, Bandyopadhyay S and Wingender E. Multiobjective triclustering of time-series transcriptome data reveals key genes of biological processes. BMC Bioinformatics. 2015; 16.

18.   G. Manikandan. A Comparative Analysis on the Applicability of Various Mutation Types for Achieving Privacy in medical Data Mining. Research J. Pharm. and Tech. 2017; 10(8): 2451-2455.

19.   Kennedy J and Eberhart RC. Particle Swarm Optimization. Proceedings of the 1995 IEEE International Conference on Neural Networks. 1995; 1942-1948.

20.   Kennedy J and Eberhart RC. Swarm Intelligence. Academic Press, London. 2001.

21.   Eberhart RC, Simpson P and Dobbins R. Computational Intelligence PC Tools, Academic Press. 1996.

22.   Eberhart, R., Kennedy, J., “A new optimizer using particle swarm theory”, MHS’95. Proceedings of the Sixth International Symposium on MicroMachine and Human Science. pp. 39–43, 1995.

23.   Shi Y, Eberhart R. A modified particle swarm optimizer. 1998 IEEE International Conference on Evolutionary Computation Proceedings, IEEE World Congress on Computational Intelligence. 1998; 69–73.

24.   Swathypriyadharsini P and Premalatha k. TrioCuckoo: A Multi Objective Cuckoo Search Algorithm for Triclustering Microarray Gene Expression Data. Journal of Information Science and Engineering. 2018; 34(6): 1617-1631.

25.   Bhar, A., Haubrock, M., Mukhopadhyay, A., Maulik, U., Bandyopadhyay, S. and Wingender, E., “Coexpression and coregulation analysis of time-series gene expression data in estrogen-induced breast cancer cell,” Algorithms Mol Biol., Vol. 8, 2013.

26.   Huang DW, Sherman BT and Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols. 2009; 4(1): 44-57.

 

 

 

 

 

 

Received on 26.12.2018          Modified on 05.02.2019

Accepted on 10.03.2019        © RJPT All right reserved

Research J. Pharm. and Tech. 2019; 12(5):2222-2228.

DOI: 10.5958/0974-360X.2019.00370.6