A Novel Study of Machine Learning Algorithms for Classifying Health Care Data
Meenakshi K*, Ms. Safa M, Mr. Karthick T, Ms. Sivaranjani N
Department of Information Technology, SRM University, Chennai, India
*Corresponding Author E-mail: meenakshi.k@ktr.srmuniv.ac.in
ABSTRACT:
Machine learning is a data analysis technique that offers a flexible way of learning from available data, so that the actions that follow can be predicted accurately. Many machine learning approaches exist, and they differ in their behavior, their working procedure, and the inputs and outputs they process. This analysis discusses the methodologies used to learn from data and the applications that have adopted them. It gives a detailed view of the working procedure of research works implemented by various authors, together with an overview of the merits and demerits of the machine learning techniques proposed earlier. The main goal of this analysis is to identify the machine learning approach that leads to accurate learning with a low false positive rate. The analysis concludes with a comparison of the approaches reviewed and indicates the approach that can lead to more accurate learning of health care data.
KEYWORDS: Machine learning, Classification, Clustering, Predicting pattern, Accuracy.
INTRODUCTION:
Machine learning is an artificial intelligence (AI) technique that gives computers the ability to learn without being explicitly programmed [1]. Such programs can change when they are exposed to new data. The process of machine learning is similar to that of data mining: both search through data to look for patterns. However, instead of extracting data for human comprehension, as is the case in data mining applications, machine learning uses that data to detect patterns and adjust program actions accordingly [2].
Machine learning algorithms are often categorized as supervised or unsupervised. Supervised algorithms apply what has been learned from labeled data in the past to new data. Unsupervised algorithms draw inferences from datasets without labels. Because of new computing technologies, machine learning today is not like machine learning of the past [3]. It was born from pattern recognition and from the idea that computers can learn without being programmed to perform specific tasks; researchers interested in artificial intelligence wanted to see whether computers could learn from data. Such systems learn from previous computations to generate correct, specific decisions and results. It is a science that is not new, but one that is gaining fresh momentum.
The rest of this paper is organized as follows. Section 2 briefly describes research works that fall under the different machine learning categories. Section 3 compares these approaches in terms of their merits and demerits. Section 4 concludes the analysis.
Review of Machine Learning approaches:
Machine learning is a data analysis technique used by many real-world applications to predict accurate results from available information. Machine learning techniques fall into three types, namely supervised, unsupervised, and reinforcement learning, based on the level of information available for learning. Various researchers have used machine learning approaches in their applications for accurate learning of results. This section discusses the methods that these researchers have introduced by adapting techniques from each category.
Discussion on Supervised Learning Algorithms:
Cardiovascular disease prognosis is carried out in [4] using the supervised learning algorithms Support Vector Machine (SVM) and Naïve Bayes. The work also provides a technique for improving the accuracy of the proposed classifier models through feature selection. Patient data were collected from the Department of Computing of Goldsmiths, University of London. Naïve Bayes gave the best result before attribute selection, but after controlled and careful feature selection SVM turned out to be the best classifier.
Chronic kidney disease (CKD), a condition in which the kidneys become damaged and cannot filter toxic wastes from the body, has become a global health issue and an area of concern. It is analyzed in [5] using the classification techniques Naïve Bayes and Artificial Neural Network (ANN). The reported results show Naïve Bayes as the more accurate classifier, with 100% accuracy compared with 72.73% for ANN.
A fast mode decision algorithm is proposed for High Efficiency Video Coding (HEVC) intra coding [6]. The coding unit (CU) splitting decision is formulated as a binary classification problem, and a logistic regression classifier is introduced to terminate the splitting decision process early. The classifier uses a set of coefficients produced by an offline training scheme. Efficient and computationally friendly features are extracted with an F-score approach for different quantization parameters (QPs) and CU depth levels.
The k-Nearest Neighbor (kNN) classification algorithm is integrated with Ant Colony Optimization (ACO) to predict the likelihood of heart disease in [7]. The analysis is performed in two phases. In the first phase, kNN classification is used to classify the test data. In the second phase, ACO is used to initialize the population and search for the optimized solution.
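The first phase described for [7] can be illustrated with a minimal kNN sketch. It assumes Euclidean distance, majority voting, and k = 3, none of which are stated in the source; the function name knn_predict and the sample records are hypothetical, and the ACO phase is not shown.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest_labels = [train_y[i] for i in order[:k]]
    # Majority vote over the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature records (illustrative values, not from the cited dataset).
train_x = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_y = ["negative", "negative", "negative", "positive", "positive", "positive"]

if __name__ == "__main__":
    print(knn_predict(train_x, train_y, (1.0, 1.0)))
    print(knn_predict(train_x, train_y, (3.0, 3.0)))
```

Because every prediction scans the whole training set, a sketch like this is only practical for small data sets; larger ones would normally use a spatial index or a library implementation.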
The dataset used in [7] concerns Streptococcus pyogenes, the bacterium that causes rheumatic fever, also known as acute rheumatic fever (ARF). Compared with other existing algorithms, the integrated approach predicts the presence of Streptococcus pyogenes with the lowest error rate and better accuracy.
The classification techniques Support Vector Machine (SVM) and Random Forest (RF) are used to learn, classify, and compare cancer disease data [8]. Results from the two techniques are compared across different data sets. Classification accuracy varies with the probabilistic estimate and the kernel function used. SVM with a radial basis function kernel gives much better results, and in some cases the results are comparable with those of Random Forest.
An Alzheimer’s disease prediction model is developed to help medical professionals predict the status of the disease from patients’ medical data [9]. The sample data have five attributes: gender, age, genetic causes, brain injury, and vascular disease. The training set contains values for seventeen patients, representing seventeen medical cases. Decision tree induction is performed on the sample data to create the corresponding decision tree.
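Decision tree induction, as used in [9], repeatedly chooses the attribute that best separates the classes. The sketch below shows one common selection criterion, information gain. The source does not say which criterion the authors used, so this is an assumption. The attribute names echo two of the five attributes listed above, while the records, labels, and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one categorical attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Attribute with the largest information gain, i.e. the next split to make."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical records for illustration only (not the seventeen cases of the cited work).
rows = [
    {"genetic_causes": "yes", "brain_injury": "no"},
    {"genetic_causes": "yes", "brain_injury": "yes"},
    {"genetic_causes": "no", "brain_injury": "no"},
    {"genetic_causes": "no", "brain_injury": "yes"},
]
labels = ["positive", "positive", "negative", "negative"]

if __name__ == "__main__":
    print(best_attribute(rows, labels, ["genetic_causes", "brain_injury"]))
```

A full induction procedure would apply best_attribute recursively to each resulting subset until the subsets are pure or no attributes remain.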
Discussion on Unsupervised Learning Algorithms:
Unsupervised learning algorithms work on training sets that contain no labeled data, which makes them more complex and more exploratory in nature. They can uncover unknown and different groupings, whereas supervised learning algorithms rely only on existing label information and cannot predict new labels. Many types of unsupervised learning algorithm have been used by researchers in different applications. This section gives an overview of the unsupervised learning algorithms used by various authors.
Disease prediction is carried out with a clustering algorithm in [10], which proposes a hybrid model based on the K-means algorithm. In the first stage, datasets collected from the UCI repository are cleaned by deleting all examples with missing values. In the second stage, the Best First search algorithm and Correlation-based Feature Selection (CFS) are used to select relevant features. In the third stage, the resulting binary-class datasets are clustered into two segments using K-means, and incorrectly clustered samples are eliminated. Finally, the correctly clustered samples from the previous stage are used to train 12 different classifiers to build the final classifier.
Thirty human stool microbiome samples are analyzed in [11] using a hierarchical clustering approach, in which enterotypes are classified accurately at different levels. Hierarchical clustering is a widely used unsupervised clustering method that has been applied to many problems in gene expression analysis. The hierarchical agglomerative algorithm used the Euclidean metric to calculate dissimilarities between samples, and distances among the clustered samples were calculated with average linkage.
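The agglomerative procedure with Euclidean dissimilarity and average linkage, as described for [11], can be sketched as follows. This is a naive illustration under stated assumptions, not the authors' implementation. The function names, the stopping rule (a target number of clusters), and the sample points are hypothetical.

```python
import math
from itertools import combinations


def average_linkage(cluster_a, cluster_b, points):
    """Mean Euclidean distance over all cross-cluster pairs of samples."""
    total = sum(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per sample
    while len(clusters) > n_clusters:
        # combinations yields index pairs (a, b) with a < b.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], points),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because b > a
    return clusters


# Hypothetical two-dimensional samples for illustration only.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]

if __name__ == "__main__":
    print(agglomerate(points, 2))
```

The sketch recomputes every pairwise linkage at each merge, which is acceptable for a few dozen samples such as the thirty mentioned above but scales poorly beyond that.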
A prediction system for detecting heart disease is developed using a multilayer perceptron neural network [12]. The system accepts 13 clinical features as input and is trained with the back-propagation algorithm. It predicts heart disease with a highest reported accuracy of 98%, which the authors take as evidence that it is better and more efficient than other systems.
A disease memory (DM) framework is proposed in [13], which extracts integrated features by modeling the relationships among risk factors (RFs). Variations of the DM can model the characteristics of patients with and without the disease by training deep networks on different samples. The experimental results show that the proposed framework can successfully predict bone disease.
The expectation maximization (EM) algorithm is used in [14] to identify the parameters of a dynamic model with time-varying delay for type 1 diabetes mellitus. Identification accuracy and glucose prediction are compared between the EM algorithm and the least squares algorithm, and the comparison shows the practicability and efficiency of the EM algorithm.
“ICA gene shaving”, where ICA stands for independent component analysis, is a novel method for analyzing gene expression and copy number data [15]. The authors investigate the properties of the method by analyzing both simulated and real data, and demonstrate its robustness to noise using simulated data.
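The alternation between expectation and maximization steps that underlies the EM algorithm mentioned for [14] can be illustrated on a much simpler problem. The sketch below fits a two-component one-dimensional Gaussian mixture. It is a generic textbook illustration and does not reproduce the dynamic glucose model of the cited work. The function names, initial values, variance floor, and data are all assumptions.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_gaussians(data, means, iterations=50, min_var=1e-3):
    """Fit a two-component 1-D Gaussian mixture with EM from the given initial means."""
    weights = [0.5, 0.5]
    variances = [1.0, 1.0]
    means = list(means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each observation.
        resp = []
        for x in data:
            dens = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, min_var)  # floor keeps the density well defined
    return weights, means, variances


# Hypothetical observations drawn around two levels, for illustration only.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

if __name__ == "__main__":
    print(em_two_gaussians(data, means=(0.0, 6.0)))
```

The sketch assumes well-separated data and sensible starting means; it has no safeguard against a component that receives no responsibility at all.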
Discussion on Reinforcement learning algorithms:
Reinforcement learning algorithms are preferred by many researchers for real-world applications because they take dynamic behavior into account while learning. Supervised and unsupervised algorithms consider only statically available data, which cannot support the many real-world applications in which data keep growing. This section discusses the reinforcement learning algorithms used in different research works.
Lag-1 and lag-2 stochastic models are developed in [16] to provide insulin sensitivity (SI) predictions based on a set of identified, time-varying SI data for a neonatal intensive care cohort. The model provides prediction estimators with greater, more conservative coverage than expected from the probability bounds. The lag-2 effects did not improve the coverage proportion, and greater coverage overestimation in regions of higher data density pointed to the variance estimator based on local data density as a likely source of the overestimation. Modifying the data density estimator by introducing a constant scaling factor showed that appropriate coverage was obtained at approximately 10%–50% of the original value.
Using Monte Carlo simulation and real data, [17] estimates that about 2.5% of the sample might be candidates for the most advanced stages of type-2 diabetes, manifested as nephropathy or necrosis, and that about 1% of the sample might be highly susceptible to cardiovascular attack. The sample is characterized by low monthly income, little education on improving lifestyle, and a lack of contact with health specialists, among other factors. The results of this simulation might serve to reconfigure ongoing public health schemes that aim to reduce diabetes complications and to extend, even minimally, the lifetime of type-2 diabetes patients belonging to vulnerable groups.
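The coverage proportion discussed above for the stochastic models of [16] is the fraction of observed values that fall inside their predicted intervals. A minimal sketch of that calculation is given below. The function name and the numbers are hypothetical, and the way the cited work builds its intervals is not reproduced.

```python
def coverage_proportion(observed, lower, upper):
    """Fraction of observations that fall inside their predicted interval."""
    inside = sum(1 for y, lo, hi in zip(observed, lower, upper) if lo <= y <= hi)
    return inside / len(observed)


# Hypothetical observations and interval bounds, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 1.5, 3.5, 3.0]
upper = [1.5, 2.5, 4.5, 5.0]

if __name__ == "__main__":
    nominal = 0.90  # assumed nominal level of the intervals
    empirical = coverage_proportion(observed, lower, upper)
    # Empirical coverage above the nominal level indicates conservative intervals.
    print(empirical, empirical > nominal)
```

Comparing the empirical value with the nominal level shows whether the intervals are conservative (coverage above nominal) or too narrow (coverage below nominal).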
Comparison Evaluation of Machine Learning approaches:
This section compares the machine learning approaches reviewed above by listing their merits and demerits.
Table 1: Comparison evaluation in supervised learning
AUTHOR | METHOD | MERITS | DEMERITS
Sabab, S. A., et al. [4] | SVM, Naïve Bayes | Improves the accuracy of each model by removing some lower-ranked attributes | Requires more related attributes to provide an exact classification result
Kunwar, V., et al. [5] | ANN, Naïve Bayes | Shows that the Naïve Bayes algorithm is more accurate than the ANN algorithm | Cannot provide better results when the data set contains multiple features
Hu, Q., et al. [6] | Logistic regression classifier | Reduced computational complexity; high signal quality | Large volumes of data cannot be analyzed well because of the complex structure of the algorithm
Rajathi, S., et al. [7] | kNN with ACO algorithm | Higher accuracy; better prediction rate | Finding the optimal solution is difficult when little labeled data is available, because kNN relies on neighbor distances and cannot predict such cases accurately
Ahmed, K. A., et al. [8] | SVM, Random Forest | SVM provides more accurate results for different kernels | Random Forest does not hold up against SVM with varying kernels; higher computational complexity
Dana, A. D., et al. [9] | Decision tree classification | Increased precision and accuracy rate | Requires multiple processing cycles to construct the decision tree; more complex for large volumes of sample data with no labels
Table 2: Comparison evaluation in unsupervised learning
AUTHOR | METHOD | MERITS | DEMERITS
Sumana, B. V., et al. [10] | Cascaded K-means clustering | Considerably increases prediction accuracy; supports many data sets efficiently and gives better results than classifiers | Requires more computational overhead to complete the task
Chen, T. F., et al. [11] | Hierarchical agglomerative algorithm | Improved classification accuracy; handles large volumes of data flexibly | Further sub-clustering is not possible in this research
Sonawane, J. S., et al. [12] | Multilayer perceptron neural network | Highest accuracy; can predict multiple possible labels present in the training data | Requires more training data to provide accurate classification results for test data
Li, H., et al. [13] | Disease memory (DM) framework | Improves prediction performance; shows great potential for selecting informative RFs for bone diseases | Considers only statistically significant features
Zeng, F., et al. [14] | Expectation maximization algorithm | Higher efficiency; improved prediction rate | Does not support dynamic growth of data; higher computational complexity
Sheng, J., et al. [15] | ICA gene shaving | Performs well even in the presence of more noise; higher accuracy | Requires more processing cycles to complete the task
Table 3: Comparison evaluation in reinforcement learning
AUTHOR | METHOD | MERITS | DEMERITS
Le Compte, A. J., et al. [16] | Stochastic models | Obtains the desired prediction and glycaemic control performance | Continuous change in the data might cause performance degradation
Nieto-Chaupis, H. [17] | Monte Carlo simulation | Reduced diabetic complications through accurate prediction of disease | Higher time complexity; higher memory consumption
CONCLUSION:
This analysis briefly evaluated and discussed machine learning approaches used to analyze and predict valuable information from health care data. Algorithms from the different categories of machine learning were reviewed, and the research works were evaluated in terms of their merits and demerits. The evaluation shows that many algorithms have been proposed in the supervised and unsupervised learning categories, but only a few research works draw on reinforcement learning. As the world changes, the increased use of online applications leads to dynamic growth of data, which traditional supervised and unsupervised learning algorithms cannot support efficiently. The analysis therefore indicates a need for novel approaches that can support the dynamic growth of data, and more research based on reinforcement learning algorithms would be valuable.
REFERENCES:
1. Witten, I. H., Frank, E., Hall, M. A., and Pal, C. J. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.
2. Demšar, J., Zupan, B., Leban, G., and Curk, T. Orange: From experimental machine learning to interactive data mining. In European Conference on Principles of Data Mining and Knowledge Discovery, September 2004, pp. 537-539.
3. Bose, I., and Mahapatra, R. K. Business data mining: a machine learning perspective. Information and Management, 2001, pp. 211-225.
4. Sabab, S. A., Munshi, M. A. R., and Pritom, A. I. Cardiovascular Disease Prognosis Using Effective Classification and Feature Selection Technique.
5. Kunwar, V., Chandel, K., Sabitha, A. S., and Bansal, A. Chronic Kidney Disease analysis using data mining classification techniques. In Cloud System and Big Data Engineering (Confluence), 2016 6th International Conference, pp. 300-305.
6. Hu, Q., Shi, Z., Zhang, X., and Gao, Z. Fast HEVC intra mode decision based on logistic regression classification. In Broadband Multimedia Systems and Broadcasting (BMSB), 2016 IEEE International Symposium on, pp. 1-4.
7. Rajathi, S., and Radhamani, G. Prediction and analysis of Rheumatic heart disease using kNN classification with ACO. In Data Mining and Advanced Computing (SAPIENCE), March 2016, pp. 68-73.
8. Ahmed, K. A., Aljahdali, S., Hundewale, N., and Ahmed, K. I. Cancer disease prediction with support vector machine and random forest classification techniques. In Computational Intelligence and Cybernetics (CyberneticsCom), July 2012, pp. 16-19.
9. Dana, A. D., and Alashqur, A. Using decision tree classification to assist in the prediction of Alzheimer's disease. In Computer Science and Information Technology (CSIT), 2014 6th International Conference, March 2014, pp. 122-126.
10. Sumana, B. V., and Santhanam, T. Prediction of diseases by cascading clustering and classification. In Advances in Electronics, Computers and Communications (ICAECC), October 2014, pp. 1-8.
11. Chen, T. F., Chen, R. M., Tsai, J. J., and Hu, R. M. Fine Classification of Human Gut Microbiota by Using Hierarchical Clustering Approach. In Bioinformatics and Bioengineering (BIBE), October 2016, pp. 109-112.
12. Sonawane, J. S., and Patil, D. R. Prediction of heart disease using multilayer perceptron neural network. In Information Communication and Embedded Systems (ICICES), February 2014, pp. 1-6.
13. Li, H., Li, X., Zhang, Y., Ramanathan, M., and Zhang, A. A generative framework for prediction and informative risk factor selection of bone diseases. In Bioinformatics and Biomedicine (BIBM), December 2013, pp. 554-559.
14. Zeng, F., and Wang, Y. Dynamic model with time varying delay for type 1 diabetes mellitus identified by using expectation maximization algorithm. In Control Conference (CCC), July 2016, pp. 9376-9381.
15. Sheng, J., Deng, H. W., Calhoun, V., and Wang, Y. P. Integrated analysis of gene expression and copy number data on gene shaving using independent component analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 2011, pp. 1568-1579.
16. Le Compte, A. J., Lee, D. S., Chase, J. G., Lin, J., Lynn, A., and Shaw, G. M. Blood glucose prediction using stochastic modeling in neonatal intensive care. IEEE Transactions on Biomedical Engineering, 2010, pp. 509-518.
17. Nieto-Chaupis, H. Monte Carlo Simulation for Prediction of Worsening Conditions of Type-2 Diabetes Patients at Peri-Urban Zones of Lima City.
Received on 13.03.2017 Modified on 06.04.2017
Accepted on 24.04.2017 © RJPT All rights reserved
Research J. Pharm. and Tech. 2017; 10(5): 1429-1432.
DOI: 10.5958/0974-360X.2017.00253.0