Particle Swarm Optimization for Triclustering High Dimensional Microarray Gene Expression Data
Swathypriyadharsini P1 , K Premalatha2
1Research Scholar, Anna University, Chennai, India.
2Professor, Department of Computer Science and Engineering, Bannari Amman Institute of Technology, Sathyamangalam, India.
*Corresponding Author E-mail: swa.pspd@gmail.com, kpl_barath@yahoo.co.in
ABSTRACT:
The study of high dimensional microarray gene expression data represents the large computational challenge due to its huge volume of the data. Many clustering techniques are applied to extract the coexpressed genes over the samples. Biclustering improved the traditional clustering by grouping the genes that similarly expressed over only a subset of samples. However, to cluster the high dimensional data with three dimensions such as genes, samples and time points, Triclustering technique is employed for grouping the coexpressed genes over a subset of samples under a subset of time points which imposes huge computational burden. In this paper, Particle Swarm Optimization technique is applied to extract the triclusters from the high dimensional data with objective function as Mean Square Residue. The algorithm is applied to three real life microarray gene expression data and the performance of the work is analyzed using the objective function. The biological significances of the extracted triclusters from all the three datasets are also analyzed. The biological significance analysis are also compared with other triclustering algorithms and the proposed work outperforms the other algorithms.
KEYWORDS: Particle Swarm Optimization, Triclustering, High Dimensional Data, Microarray Gene Expression Data, Mean Square Residue.
INTRODUCTION:
Clustering is one of the machine learning algorithm that are applied in various fields[4,5,6]. The traditional clustering methods like k-means clustering algorithm which fails for clustering genes which are coexpressed over subset of samples[7,8]. This problem can be solved through biclustering which clusters subset of genes that are similar under subset of samples[9]. New algorithms need to be developed for addressing mining problems such as clustering, classification. Disadvantage of this approach is each dimension is reconstructed independently. So, there is loss of implicit data in multidimensional records[10]. Triclustering deals with three dimensional datasets which finds coexpressed genes under subset of experimental conditions across subset of time points or space.
A very first triclustering algorithm proposed for mining coherent clusters from three dimensional data is called “triCluster”[11]. It is the extension of the biclustering algorithm which uses heuristic approach to retrieve biclusters with mean square residue as the objective function. A time-frequency based full-space algorithm is proposed which uses a measure of functional correlation set between time course vectors of different genes[12]. Next, the tricluster algorithm is proposed that aims to retrieve groups of genes that have similar expression profiles over a subset of samples during a subset of time points. It also provides set of metrics to assess the quality of the triclusters and tested the approach on real life microarray datasets[13].
Later, “gTricluster” algorithm is introduced which extends clique search technique for three dimension data and retrieves triclusters using spearman rank correlation similarity measure[14]. The δ-TRIMAX algorithm is the extension of triCluster algorithm which extracts triclusters with mean square residue score below a threshold value δ[15]. TriGen algorithm implementing the optimization technique of genetic algorithm for extracting triclusters from temporal datasets[16]. The multi-objective optimization algorithm EMOA-δ-TRIMAX is proposed which extends δ-TRIMAX algorithm with the genetic algorithm optimization technique[17]. Common Optimization techniques like genetic algorithms are also applied in medical data mining for producing effective results[18].
Particle Swarm Optimization (PSO) is a heuristic search technique which is inspired by the swarming behaviour of biological population such as schools of fish or flock of birds[19]. Each bird or fish is represented as a single solution in the search space and it is called as particle. The PSO and its several variants have been popular because of its ease of implementation, intuitiveness and the ability to solve non linear problems. It is widely used in many branches of engineering including artificial neural networks, wireless sensor networks and many others.
MATERIALS AND METHODS:
Particle Swarm Optimization:
Particle Swarm Optimization (PSO) was introduced by Kennedy and Eberhart which is inspired from the controlled motion of swarm of birds analyzing the notion of “collective intelligence”, also known as “swarm intelligence” in biological populations[20]. The most commonly used example is the behaviour of fish school. These animals are characterized by its movement where each individual have only a limited intelligence of focussing only on its position in the swarm. It has some simple rules such as “going with the same speed of neighbours”, “moving in same direction” or “staying close to neighbours”. The global intelligence of swarm is a direct consequence of local interactions between different particles.
Initially, a random set of solutions called as initial swarm, propagates in the design space over a number of iterations. Particles adjust their flying directions based on its own flying experience and also its neighbour flying experience. It is inspired by the visualization of flocks of birds, adapting the environment and finding the food sources by implementing information sharing approach. Many revised variants of PSO have emerged with various concepts but the basic technique is population based optimizer, a number of parameters such as inertia weight and confidence factors give the simple version of PSO[21, 22].
PSO algorithm:
The steps of PSO algorithm is given as follows
Initialize a population is represented as
,
where Xi is a particle that moves within the multidimensional search space. A
particles position and velocity are its properties. The particle position represents
the candidate solution. The particle velocity is the information about
direction and varying rate. A particle in d-dimensional search space is given
as
,
the velocity for a particle i is given as
and
the best previously visited position of a particle is given as
.
For each particle, evaluate the optimization fitness function in N variables and compare the particles fitness with its pbest. If the current value is better than the pbest, then set pbest to current location X in N-dimensional space.
Update each particles velocity according to the equation(1).
(1)
Where is the particle velocity of dth
component after (k+1)th update.
is
currently the best particles solution after (k+1)th update.
is
the global best solution of the dth component after (k+1)th update.
and
are
the positive constants known as acceleration coefficients which controls the
movement of the particles.
is
the inertia weight which controls the effect of previous particles velocity on
next one.
and
are
random variables within the range [0,1].
Update each particles position according to the equation (2).
(2)
Where is the particles position after (k+1)th
update and
is
the particles velocity after (k+1)th update. The position and the velocity of
each particle are updated consecutively for much iteration until a termination
condition is reached[23].
Loop to step 2 until stopping criterion is met which may be a good fitness function or reaching a maximum number of iterations.
Triclustering:
Traditional clustering algorithms are applied to analyze the microarray data which work in the full dimensional space. For a microarray gene expression dataset say D, clustering technique considers the value of each object in all the dimensions and groups the similar expression patterns together. Biclustering technique emerged to overcome this strict constraint in clustering. Biclustering extracts co-expressed patterns of a subset of genes under a subset of the samples or experimental conditions. Triclustering emerges an an evolution of biclustering by mining gene expression patterns across samples and time points. The clustering of similar genes that exhibit similar behaviour only under a subset of conditions over only some set of time points is known as triclustering. Figure 1 shows the three dimensional matrix with genes in one dimension, samples in next dimension and time points as the third dimension.
Figure 1: Three Dimensional Data Matrix
Objective Function:
Tricluster is given as
where
,
and
. The cuboid C represents subset of genes which have similar
expression values over a subset of samples during subset of time points. The
objective function is the Mean Square Residue (MSR) of the tricluster which is
modelled as in equations (3) and (4) [24].
(3)
(4)
Where
is the mean of genes under samples at a time point,
is the mean of the genes over time under a sample,
is the mean of a gene in time under the samples,
is the mean of the genes under a sample and a time point,
is the mean of the values of a gene at a time point under samples,
is the mean of a gene under a sample at all time points and
is the mean value of all values in the tricluster.
Where
is the mean of genes under samples at a time point,
is the mean of the genes over time under a sample,
is the mean of a gene in time under the samples,
is the mean of the genes under a sample and a time point,
is the mean of the values of a gene at a time point under samples,
is the mean of a gene under a sample at all time points and
is the mean value of all values in the tricluster.
RESULTS:
Dataset Description:
PSO algorithm is implemented on three real life temporal high dimensional datasets which are obtained from Gene Expression Omnibus (GEO). Three datasets are experiments of various diseases. All the experiments are time series experiments that have the behaviour of genes under different samples across different time points.
Dataset 1:
The real life time series dataset GDS5192 is a experiment of Homo Sapiens. It holds 22158 Affymetrix human genome U133A 2.0 probe ids. Cultures of primary fibroblasts from neonatal skin treated by Egr1 and Tgfb1 were measured at 24 and 48 hours. Two biological replicates per time point were measured.
Dataset 2
This time series dataset is obtained from GEO with the accession code GDS1489 which is experiment for Mus Musculus for growth hormone treatment. It holds 12488 genes in Affymetrix Murine Genome U74A Version 2 Array. It has expression profiling of adipocytes treated with growth hormone and control for 30 mins, 4th hour and 48th hour in three independent experiments.
Dataset 3:
The temporal dataset GSE74259 is a time course experiment of Saccharomyces cerevisiae which holds 10928 genes Affymetrix Yeast Genome 2.0 Array. The cultures of BY4741 yeast were grown in non-fermentable yeast peptone glycerol ethanol at 30 degree Celsius and treated for 0, 60, 120 and 240 mins that is for two replicates.
The PSO algorithm is implemented in MATLAB version 7.14 (R2012a). The number of particles, number of iterations and inertia weight is fixed throughout for all the three datasets [19]. Table 1 shows the parameters and values considered for the PSO algorithm. Table 2 shows the average fitness function over 10 runs for all three datasets. The lower MSR value indicates that the quality of the triclusters is good. Thus, the MSR value is low for all the three datasets indicating that the extracted triclusters are with highly coexpressed genes.
Table 1: Parameters and Values considered for CS and PSO
|
Parameter |
Value |
|
Number of particles (p) |
50 |
|
Number of iterations |
20 |
|
C1 and C2 |
2.1 |
|
Inertia Weight (ω) |
0.9 |
Table 2: The Average Objective Function over 10 Runs
|
Generation |
MSR |
||
|
Dataset 1 |
Dataset 2 |
Dataset 3 |
|
|
1 |
3.90E-02 |
1.55E-01 |
2.12E-01 |
|
5 |
3.85E-02 |
1.54E-01 |
2.06E-01 |
|
10 |
3.82E-02 |
1.53E-01 |
2.05E-01 |
|
15 |
3.80E-02 |
1.51E-01 |
2.04E-01 |
|
20 |
3.77E-02 |
1.47E-01 |
2.03E-01 |
|
25 |
3.74E-02 |
1.46E-01 |
2.02E-01 |
|
30 |
3.64E-02 |
1.41E-01 |
1.99E-01 |
|
35 |
3.57E-02 |
1.09E-01 |
1.97E-01 |
|
40 |
3.48E-02 |
1.06E-01 |
1.89E-01 |
|
45 |
2.81E-02 |
1.00E-01 |
1.80E-01 |
|
50 |
2.62E-02 |
1.00E-01 |
1.79E-01 |
Each dataset have I genes, J samples and K time points which is given as input to the proposed work and it produces clusters which has i genes, j samples and k time points for which and . Figure 2, 3 and 4 shows the sample triclusters obtained from the proposed PSO algorithm for the three datasets respectively.
Figure 2: Sample Tricluster for Dataset 1
Figure 3: Sample Tricluster for Dataset 2
Figure 4: Sample Tricluster for Dataset 3
DISCUSSION:
Performance Comparison:
The performance of the proposed work with different representations is compared with other triclustering algorithms based on two validation indexes. The first measure is the Triclustering Quality Index (TQI) which is given in equation (5) [25].
(5)
Where
is the mean squared residue of the
tricluster
and
is the volume of the tricluster
. The volume of the ith tricluster is
defined as
where
,
and
represent the number of genes, samples
and time points of the ith tricluster. A lower TQI represents the better
quality of the triclusters.
The second measure is the Statistical Difference from Background (SDB) score which signifies whether a set of n triclusters are statistically different from the background data matrix or not [25]. The SDB score is given in the equation (6).
Where n is the total number
of triclusters extracted by the algorithm.
represents the mean squared residue of
the ith tricluster and
is the mean square residue of the jth
random tricluster having the same number of genes, samples and time points as
the ith resultant tricluster. The higher the value of the denominator denotes
the better the quality of the resultant tricluster. Hence, lower SDB score
signifies better performance of the algorithm. Table 3
shows the comparison of the performance of various algorithms in terms of SDB
and TQI indexes.
Table 3: Performance Comparison
|
Algorithm |
SDB |
Average TQI |
|
Tricluster using PSO for Dataset 1 |
0.2572 |
4.21E-09 |
|
Tricluster using PSO for Dataset 2 |
0.3323 |
3.27E-09 |
|
Tricluster using PSO for Dataset 3 |
0.2095 |
2.01E-09 |
|
δ-TRIMAX |
0.4679 |
3.08E-05 |
|
TRICLUSTER |
0.4773 |
3.35E-05 |
Biological Significance
Dataset 1
The biological significance of the genes belonging to the clusters is identified by performing Gene ontology and KEGG pathway enrichment analysis. David ontology tool which is freely available in the internet is used for the gene enrichment analysis [26]. Gene ontology analysis is performed to find the biological significance of the gene belonging to the triclusters. The genes which have lower p-value are selected representing the higher significance level. The functional process of the genes in the tricluster is identified through the functional annotation. The genes from three sample triclusters are given as input for performing functional annotation. Table 4 shows the functional annotation results for Dataset 1. According to the p-value, the genes are ordered. The functional property “Alternative Splicing” has the lowest p-value 3.0E-244 with the gene count of 8819. In the Tricluster 3, there are 4262 genes with the property “Cytoplasm” having the lower p-value 1.7E-32. In Tricluster 5, “Sequence variant” is the main function having 9566 genes with low p-value 2.6E-30.
Table 4: Functional Annotation of Dataset 1
|
Tricluster |
Category |
Term |
Gene Count |
p-value |
Benjamini |
|
Tricluster 1 |
UP_KEYWORDS UP_SQL_FEATURE UP_KEYWORDS |
Alternative splicing Splice variant polymorphism |
8819 6398 9355 |
3.0E-244 8.1E-109 1.5E-61 |
2.2E-241 1.9E-104 3.6E-59 |
|
Tricluster 3 |
GOTERM_CC_DIRECT UP_KEYWORDS UP_KEYWORDS |
Cytoplasm Disease mutation Coiled coil |
4262 2105 2459 |
1.7E-32 2.7E-32 2.2E-27 |
1.3E-29 2.8E-30 2.0E-25 |
|
Tricluster 5 |
UP_SEQ_FEATURE UP_KEYWORDS UP_KEYWORDS |
Sequence variant Metal binding Nucleotide binding |
9566 2938 1488 |
2.6E-30 2.2E-30 1.4E-24 |
3.1E-26 3.1E-26 1.0E-22 |
Table 5: Functional Annotation of Dataset 2
|
Tricluster |
Category |
Term |
Gene Count |
p-value |
Benjamini |
|
Tricluster 1 |
UP_KEYWORDS UP_KEYWORDS GOTERM_CC_DIRECT |
Phosphoprotein Acetylation Extracellular exosome |
3409 1666 1380 |
7.9E-292 3.5E-208 1.8E-111 |
4.3E-289 9.4E-206 2.4E-108 |
|
Tricluster 3 |
GOTERM_MF_DIRECT UP_KEYWORDS UP_KEYWORDS |
Protein binding Nucleus Ubl conjugation |
2032 1966 747 |
6.6E-132 3.3E-119 3.1E-68 |
2.0E-128 6.0E-117 3.4E-66 |
|
Tricluster 5 |
GOTERM_CC_DIRECT UP_KEYWORDS UP_KEYWORDS |
Cytosol Isopeptide bond Transcription regulation |
889 495 808 |
3.0E-58 4.4E-51 1.7E-49 |
9.8E-56 4.0E-49 1.2E-47 |
Table 6: Functional Annotation of Dataset 3
|
Tricluster |
Category |
Term |
Gene Count |
p-value |
Benjamini |
|
Tricluster 1 |
GOTERM_BP_DIRECT GOTERM_CC_DIRECT UP_KEYWORDS |
Transport Membrane mRNA transport |
598 1203 287 |
1. 9E-10 2.5E-10 7.6E-6 |
5.1E-7 1.9E-7 6.9E-3 |
|
Tricluster 3 |
GOTERM_CC_DIRECT KEGG_PATHWAY GOTERM_CC_DIRECT
|
Nucleus Metabolic pathways Plasma membrane |
1515 469 346 |
7.9E10 9.2E-9 1.7E-4 |
5.9E-7 1.0E-6 3.1E-2 |
|
Tricluster 5 |
GOTERM_BP_DIRECT GOTERM_CC_DIRECT UP_KEYWORDS |
Mitotic nuclear division Spindle pole body Mitosis |
921 65 103 |
6.5E-6 1.2E-5 2.3E-4 |
1.8E-2 2.4E-3 8.2E-2 |
Dataset 2:
Table 5 shows the functional annotation of Dataset 2. In Tricluster 1, 3409 genes are with the functional property “Phospho protein” with a very low p-value 7.9E-289. In Tricluster 3, 2032 genes have “Protein binding” property with a p-value 6.6E-132. In Tricluster 5, “Cytosol” is the top function with a p-value 3.0E-58 containing 889 genes.
Dataset 3:
Table 6 gives the functional annotation results of the triclusters extracted from Dataset 3. In Tricluster 1, the biological process function “Transport” has the lowest p-value as1.9E-10 with gene count 598 but the second cellular component function “Membrane” has the larger gene count 1203 with p-value 2.5E-10. In Tricluster 3, cellular component function “Nucleus” has the p-value 7.9E-10 with 1515 genes. In Tricluster 5, the function “Mitotic nuclear function” is the highest with 921 genes and 6.5E-6 as the p-value.
Biological Significance Comparison:
Fig. 7 shows the biological significance comparison of the proposed work with the existing algorithms. This is performed by calculating the number of genes in the triclusters which hit the TFBS analysis with the lowest p-value. The proposed PSO based triclustering algorithm has higher percentage of genes in the tricluster with lower p-value than the other triclustering algorithms. The inverse trend of genes hitting the TFBS analysis is observed with proposed work which has largest population at the lowest p-values whereas the other algorithms have increasing population with increasing p-values. For a very low p-value, the proposed algorithm has larger number of genes contained for it and it decreases has the p-values increases.
Figure 5: Biological Significance Comparison
CONCLUSION:
Clustering techniques are applied to analyze gene expression data from microarray experiments. Biclustering has the ability to mine subgroups of genes and conditions from the data set. Triclustering contains a subset of genes that contains information related to the behaviour of some genes from under some conditions over certain time periods. In this work, particle swarm optimization is employed to identify the triclusters in high dimensional data. The three real life high dimensional datasets are chosen for experimental analysis. Mean Square Residue of the triclusters is the objective function which is very lower for all the three datasets. The Biological significance analysis of the extracted triclusters of all the three datasets is analysed. Also, the biological significance analysis that are performed are compared with other triclustering algorithms in terms of the p-values obtained and the results prove that the proposed algorithm performs better than the existing triclustering algorithms
REFERENCES:
1. Gunavathi C, Premalatha K, Sivasubramanian K. A Survey on Feature Selection Methods in Microarray Gene Expression Data for Cancer Classification. Research J. Pharm. and Tech. 2017; 10(5): 1395-1401.
2. Nallakaruppan M. K, P. Ilango, N. Deepa, Anand Muthukumarappan. Clustering of Wireless Sensor Network Data. Research J. Pharm. and Tech. 2017; 10(1): 73-82.
3. Senthil Kumar T, Boselin Prabhu S R, S. Sophia. The Impact of Clustering Mechanism in Dense Wireless Sensor Network. Research J. Engineering and Tech. 7(1): 19-23.
4. Meenakshi K, Safa M, Karthick T, Sivaranjani N. A Novel Study of Machine Learning Algorithms for Classifying Health Care Data. Research J. Pharm. and Tech. 2017; 10(5): 1429-1432.
5. Rakesh Kumar Soni, Shrikant B. Burje. Tumor Detection Algorithm Using Boundary Based Approach. Int. J. Tech. 1(2): 101-106.
6. B. Lokapriya Nandan, Velin Marita Sequeira, Kanika Verma, K. Ramanathan. Exploring beta-tubulin-protein interactome using computational techniques. Research J. Pharm. and Tech. 8(12): 1679-1684.
7. Rajeshri Lanjewar, Tripti Sharma. A Determining Approach for Customer Behavior Analysis Using K-Mean Clustering. Int. J. Tech. 2011; 1(2): 125-129.
8. Santanu Halder, Abul Hasnat, Debotosh Bhattacharjee, Mita Nasipuri. A Novel Low Space Image Storing and Reconstruction Method by K-Means Clustering Algorithm. Int. J. Tech. 2014; 4(1): 186-196.
9. Cheng Y and Church GM. Biclustering of expression data. Proceedings of International Conference on Intelligent Systems Molecular Biology, 2000; 93-103.
10. Kumaran U, Neelu Khare, A Sai Suraj. Privacy Preserving in Data Mining Technical: A Review. Research J. Pharm. and Tech 2016; 9(11): 2023-2026.
11. Zhao L and Zaki MJ. TRICLUSTER: an effective algorithm for mining coherent clusters in 3D microarray data. Proceedings of the 2005 ACM SIGMOD international conference on Management of data. 2015; 694–705.
12. Feng J, Barbano PE and Mishra B. Time-frequency feature detection for timecourse microarray data. Proceedings of the ACM Symposium on Applied Computing. 2004; 128-132.
13. Zhao L and Zaki MJ. TRICLUSTER: an effective algorithm for mining coherent clusters in 3D microarray data. Proceedings of the 2005 ACM SIGMOD international conference on Management of data. 2015; 694–705.
14. Jiang H. Zhou S, Guan J and Zheng Y. gTRICLUSTER: A More General and Effective 3D Clustering Algorithm for Gene-Sample-Time Microarray Data. Data Mining for Biomedical Applications.2006; 3916: 48–59.
15. Bhar A, Haubrock M, Mukhopadhyay A, Maulik U, Bandyopadhyay S and Wingender E. δ-TRIMAX: Extracting triclusters and analyzing coregulation in time series gene expression data. Springer. 2012; 7534: 165–177.
16. Aviles DG, Escudero CR, Alvarez FM and Riquelme JC. TriGen: A genetic algorithm to mine triclusters in temporal gene expression data. Neurocomputing. 2014; 132: 42–53.
17. Bhar A, Haubrock M, Mukhopadhyay A, Maulik U, Bandyopadhyay S and Wingender E. Multiobjective triclustering of time-series transcriptome data reveals key genes of biological processes. BMC Bioinformatics. 2015; 16.
18. G. Manikandan. A Comparative Analysis on the Applicability of Various Mutation Types for Achieving Privacy in medical Data Mining. Research J. Pharm. and Tech. 2017; 10(8): 2451-2455.
19. Kennedy J and Eberhart RC. Particle Swarm Optimization. Proceedings of the 1995 IEEE International Conference on Neural Networks. 1995; 1942-1948.
20. Kennedy J and Eberhart RC. Swarm Intelligence. Academic Press, London. 2001.
21. Eberhart RC, Simpson P and Dobbins R. Computational Intelligence PC Tools, Academic Press. 1996.
22. Eberhart, R., Kennedy, J., “A new optimizer using particle swarm theory”, MHS’95. Proceedings of the Sixth International Symposium on MicroMachine and Human Science. pp. 39–43, 1995.
23. Shi Y, Eberhart R. A modified particle swarm optimizer. 1998 IEEE International Conference on Evolutionary Computation Proceedings, IEEE World Congress on Computational Intelligence. 1998; 69–73.
24. Swathypriyadharsini P and Premalatha k. TrioCuckoo: A Multi Objective Cuckoo Search Algorithm for Triclustering Microarray Gene Expression Data. Journal of Information Science and Engineering. 2018; 34(6): 1617-1631.
25. Bhar, A., Haubrock, M., Mukhopadhyay, A., Maulik, U., Bandyopadhyay, S. and Wingender, E., “Coexpression and coregulation analysis of time-series gene expression data in estrogen-induced breast cancer cell,” Algorithms Mol Biol., Vol. 8, 2013.
26. Huang DW, Sherman BT and Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols. 2009; 4(1): 44-57.
Received on 26.12.2018 Modified on 05.02.2019
Accepted on 10.03.2019 © RJPT All right reserved
Research J. Pharm. and Tech. 2019; 12(5):2222-2228.
DOI: 10.5958/0974-360X.2019.00370.6