A Survey on Machine Learning Algorithms and finding the best out there for the considered seven Medical Data Sets Scenario


R. M. Balajee*, K. Venkatesh

Department of Computer Science and Engineering, Vel Tech Rangarajan Dr. Sagunthala R & D Institute of Science and Technology, Avadi, Chennai.

*Corresponding Author E-mail: balajee.rm@gmail.com



Today, researchers apply many machine learning algorithms to the data available. Each algorithm has its own characteristics, and its efficiency can only be judged by testing it on a specific data set, because the efficiency of every algorithm varies with the nature of the data set. Research based on medical science is especially useful to society in critical situations, so we take seven medical data set scenarios for our research work. We present a detailed survey of a variety of machine learning algorithms and related techniques, such as SVM, Naïve Bayes, Decision Tree, Random Forest, K-Means Clustering, Partition Algorithm, Bayesian Algorithm, Hierarchical Algorithm, Missing Values, Low Variance, Principal Component Analysis and Rough Set Theory, over the seven medical data set scenarios taken for study, and we report which algorithms are good for which kind of medical records.


KEYWORDS: Machine Learning Algorithms, SVM, Naïve Bayes, Decision Tree, Random Forest, K-Means Clustering, Partition Algorithm, Bayesian Algorithm, Hierarchical Algorithm, Missing Values, Low Variance, Principal Component Analysis, Rough Set Theory, Seven Medical Data Sets Scenario.




Medical records may come in different formats, such as the following:

1.    Symptom disease relationship

2.    Particular patient’s medical file

3.    Particular doctor’s treatment style

4.    Handwritten prescriptions

5.    Formatted database records

6.    Current data collected from sensor devices

7.    Voice and video based records


Symptom Disease Relationship:

Diseases are identified by their symptoms, except for new diseases whose symptoms are not yet clearly established. It is therefore always important in the medical field to analyze the relationship between symptoms and diseases. The following is one example of a symptom and disease chart.




Table 1: Symptoms and Disease Relationship



Particular Patient’s Medical File:

A patient may keep his or her medical file for more than two years, so the file may contain details of various health issues and the tablets prescribed for them. This data can be taken for research as a separate concern.


Particular Doctor’s Treatment Style:

Every doctor has his or her own treatment style, even for a common disease, so it needs to be observed for further analysis and to predict the successive treatment style for a particular disease.


Handwritten Prescription:

Handwritten prescriptions need a text extraction based machine learning algorithm to extract the text before any further study can be done. Separate algorithms exist for this purpose.


Formatted Database Records:

These are very easy to retrieve compared to unstructured data. The data is already in an electronic format, so the only focus needed is on a mining algorithm that can do the job efficiently for the selected data set.


Current Data Collected from the Sensor Devices:

These data are small: at most a few MB if the record is an image file, and otherwise in the range of KB for test data. The algorithm here should be able to judge the accuracy of the measured data, rather than retrieve data.


Voice and Video based Records:

These kinds of records are also tough to handle, because the voice needs to be converted to text and a data set then needs to be created from the extracted data. The created data set is subjected to analysis afterwards.



To identify the appropriate algorithm for the seven kinds of data already discussed in the introduction section, we assume the data sets are formed as follows:


Data Set 1:

The data set for the symptom disease relationship is formed with the name DS-SDR.


Data Set 2:  

The data from different patients are stored in a separate data set called DS-P.


Data Set 3:

Every individual doctor’s prescriptions are collected with the doctor’s name as the key and the prescriptions as the values (documents). This data set is named DS-DS.
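The key and value layout described for DS-DS can be sketched as below. This is only an illustration under stated assumptions: the doctor names, the prescription texts and the helper `add_prescription` are hypothetical placeholders, not part of the surveyed data.

```python
from collections import defaultdict

# Hypothetical DS-DS layout: doctor's name (key) -> prescription documents (values).
# All names and texts below are invented placeholders.
ds_ds = defaultdict(list)

def add_prescription(store, doctor, document):
    """Append one prescription document under the doctor's key."""
    store[doctor].append(document)

add_prescription(ds_ds, "Doctor A", "prescription text 1")
add_prescription(ds_ds, "Doctor A", "prescription text 2")
add_prescription(ds_ds, "Doctor B", "prescription text 3")

# Show how many documents are stored per doctor
for doctor, documents in sorted(ds_ds.items()):
    print(doctor, len(documents))
```

Grouping the documents under one key per doctor keeps each doctor's treatment style available as a single unit for later analysis.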


Data Set 4:

Randomly collected handwritten prescriptions are stored in the data set called DS-HWP.


Data Set 5:

Already stored details of patients and the medical history of those patients, held as formatted database records, form the data set called DS-FR.




Data Set 6:

These are small data sets in practice, because each will contain only some 20 to 100 entries for a patient. These data sets are transferred to the server then and there. We name these data sets DS-SD.


Data Set 7:

For voice and video based records, we need to identify the voice, separate it into tokens and then pass it through the MapReduce process. Initially these data are stored under the data set name DS-VV.


In a machine learning based data increment model, there are two ways to proceed: one is data or parameter extension and the other is replacement [13]. Here we take the data/parameter extension process.


The DS-SDR data set is used to predict the disease as the final result. The input given to the machine learning algorithm is the symptoms of those diseases. There are many machine learning algorithms [1], which fall under three major classifications: supervised learning, unsupervised learning and reinforcement learning.


The supervised learning algorithms include:

1.    SVM (Support Vector Machine): It may be I-SVM or LASVM [2],

2.    Decision Tree and

3.    Naïve Bayes.


Unsupervised algorithms are of two types:

1.    Cluster Analysis and

a.    Bayesian Algorithm

b.    Hierarchical Algorithm

c.    Partition Algorithm

d.    K-Means Clustering Algorithm (each data point is assigned to its nearest cluster centroid)

2.    Dimensionality Reduction

a.    Missing Values

b.    Low Variance

c.    Random Forest

d.    Principal Component Analysis
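Two of the dimensionality reduction filters listed above, missing values and low variance, can be sketched as follows. This is a minimal illustration under stated assumptions: the column names, the sample readings and the thresholds `max_missing` and `min_variance` are hypothetical and are not taken from the surveyed work.

```python
from statistics import pvariance

def column_stats(rows, column):
    """Return (fraction of missing entries, variance of the present entries)."""
    values = [row[column] for row in rows]
    present = [v for v in values if v is not None]
    missing_ratio = 1 - len(present) / len(values)
    variance = pvariance(present) if len(present) > 1 else 0.0
    return missing_ratio, variance

def select_columns(rows, max_missing=0.5, min_variance=1e-3):
    """Keep columns that are mostly filled in and that actually vary."""
    kept = []
    for column in rows[0]:
        missing_ratio, variance = column_stats(rows, column)
        if missing_ratio <= max_missing and variance >= min_variance:
            kept.append(column)
    return kept

# Hypothetical records: "ward" never varies, "glucose" is mostly missing
rows = [
    {"temperature": 36.8, "pulse": 72, "ward": 1, "glucose": None},
    {"temperature": 38.9, "pulse": 95, "ward": 1, "glucose": None},
    {"temperature": 37.2, "pulse": 80, "ward": 1, "glucose": 110},
    {"temperature": 39.4, "pulse": 101, "ward": 1, "glucose": None},
]
print(select_columns(rows))  # ['temperature', 'pulse']
```

A column that is mostly empty or nearly constant carries little information for the learner, so dropping it reduces the number of dimensions without a transformation step such as Principal Component Analysis.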


Predictive voice analysis in call centers, used for sentiment analysis, is one example of a reinforcement algorithm.


The data sets follow one of the following database styles:

1.    Stored as Document:

Here too we consider keys and values, but the values are documents, which may be video, voice, images or text. The data sets DS-P, DS-DS, DS-HWP, DS-SD (partially) and DS-VV are of this type.


2.    Stored as Key-Value Pair:

The value here is of string data type. The data set DS-SD (partially) comes under this type.


3.    Graph Structure Datasets:

A data set here links to other related data sets, and it may follow any other data set architecture internally. It resembles the product recommendation technique used in e-shopping web sites.


4.    Field or Attributes based Datasets:

These data sets can be separated out of an SQL formatted relational database. The columns can be filtered out separately as data sets.


5.    SQL Structure:

These are the well-known and widely used relational formatted databases. The data set DS-SDR comes under this category.


The DS-SDR data set holds various diseases as branches of different symptom combinations. As a similar scenario in the survey shows [4], this type of data set can be handled effectively by supervised learning algorithms, and especially by a decision tree, because its search complexity is Θ(log n) in most cases. In the worst case it may reach O(n), but this is a rare scenario. Here ‘n’ is the number of symptom nodes [3,5].
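The root-to-leaf search over symptom nodes can be illustrated with a small hand-built tree. This is a sketch under stated assumptions, not the method of the cited works: the `Node` class, the `predict` helper and the placeholder names `symptom_a` to `symptom_c` and `disease_1` to `disease_4` are hypothetical.

```python
class Node:
    """An internal node asks about one symptom; a leaf carries a disease label."""
    def __init__(self, symptom=None, yes=None, no=None, disease=None):
        self.symptom = symptom
        self.yes = yes
        self.no = no
        self.disease = disease

def predict(node, symptoms):
    """Walk from the root to a leaf; return (disease, symptom nodes visited)."""
    visited = 0
    while node.disease is None:
        visited += 1
        # A symptom that is not reported is treated as absent
        node = node.yes if symptoms.get(node.symptom, False) else node.no
    return node.disease, visited

# Hypothetical balanced tree with three symptom nodes and four disease leaves
tree = Node(
    symptom="symptom_a",
    yes=Node(symptom="symptom_b",
             yes=Node(disease="disease_1"), no=Node(disease="disease_2")),
    no=Node(symptom="symptom_c",
            yes=Node(disease="disease_3"), no=Node(disease="disease_4")),
)

print(predict(tree, {"symptom_a": True, "symptom_b": False}))  # ('disease_2', 2)
```

In this balanced tree a prediction visits 2 of the 3 symptom nodes, which matches the logarithmic behaviour; if the same nodes were arranged as a single chain, a prediction could visit all of them, which is the O(n) worst case.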


The MapReduce technique can be applied to database style 1 (Stored as Document): it takes several documents and tokenizes the strings to form key-value pairs [7,9,10]. These pairs are then shuffled and sorted on the basis of the key. The key shuffle is needed because, when we try to sort the values in ascending order and the input happens to arrive in descending order, insertion sort hits its worst case and the time complexity to sort becomes O(n²), where ‘n’ is the total number of elements or keys [6,12]. The sorted keys are passed to the reducer (reduce block), where the corresponding query or condition is applied to get the result.
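The map, shuffle and sort, and reduce stages can be sketched in a single process as below. This is an illustration only: a real MapReduce job runs distributed across machines, the three function names are hypothetical, the sample documents are invented, and the reducer here is assumed to be a simple count per token.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    """Tokenize each document and emit one (token, 1) pair per token."""
    for text in documents:
        for token in text.lower().split():
            yield token, 1

def shuffle_and_sort(pairs):
    """Sort the pairs by key so equal keys are adjacent, then group them."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    """Apply the reducer (assumed here to be a sum) to each key's values."""
    return {key: sum(values) for key, values in grouped}

# Invented sample documents standing in for stored records
documents = ["fever cough", "cough headache", "fever cough fatigue"]
counts = reduce_phase(shuffle_and_sort(map_phase(documents)))
print(counts)  # {'cough': 3, 'fatigue': 1, 'fever': 2, 'headache': 1}
```

The grouping step only works because the pairs are sorted by key first, which is why the shuffle and sort stage sits between the mapper and the reducer.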


Fig. 1: Sentimental Analysis by Machine Learning Algorithm over User Opinions


When considering medical records, it is important to track the patient’s disease, the medicine prescribed, the recovery period and the patient’s opinion of the recovery process. The patients’ opinions provide the source for the feedback and correction process. The opinions need to be collected and analyzed by predicting their sentiment. For this sentiment analysis, the iterative decision tree based algorithm performs better than other algorithms such as SVM, Maximum Entropy, Naive Bayes and Random Forest [8].


The data sets DS-P and DS-DS hold individuals’ records and prescription sheets, so the rule set can be written clearly after analyzing the overall data and requirements. When the rule set is clear and the data cannot form a hierarchical structure, SVM classification is effective. The SVM classification produces its output on the basis of the rule set formed over the data sets.
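As a rough illustration of SVM classification, the sketch below trains a linear SVM by sub-gradient descent on the hinge loss. Everything specific in it is an assumption: the source does not state the features, the kernel or the training procedure, so the two numeric features, the labels +1 and -1, the hyperparameters and the function names are hypothetical.

```python
def train_linear_svm(samples, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Minimize hinge loss plus an L2 penalty with per-sample sub-gradient steps."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # The L2 penalty shrinks the weights on every step
            weights = [w - learning_rate * reg * w for w in weights]
            if margin < 1:
                # The sample is inside the margin: move the boundary away from it
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias

def classify(weights, bias, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1

# Hypothetical, linearly separable feature vectors extracted from records
samples = [[2, 2], [3, 3], [2.5, 3], [-2, -2], [-3, -1], [-1, -3]]
labels = [1, 1, 1, -1, -1, -1]

weights, bias = train_linear_svm(samples, labels)
print([classify(weights, bias, x) for x in samples])
```

On data this cleanly separated the learned hyperplane classifies every training record correctly; real prescription records would first need a feature extraction step that is outside this sketch.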


The SVM algorithm becomes slow on large data sets of medical prescriptions, since most prescriptions are image based and text classification needs to be done to extract the data. To overcome this, rough set theory needs to be used over the data set [11,15]. The extraction of text from images depends on the quality of the text. If the text quality is poor, a proper algorithm needs to be used to extract the text from the image; for this process, a polynomial smooth support vector machine can be used to produce better efficiency [14].


The data set DS-HWP holds randomly collected prescriptions, so understanding the data set well enough to form a clear rule set is difficult, and using the data set efficiently while giving equal priority to each record is very difficult. Giving equal priority is important because we know nothing about each prescription, since they are randomly collected. In such a situation, the random forest technique is more effective than other machine learning techniques [1]. With the random forest technique, every record gets equal priority in processing and contributes to the final report/result.
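The equal-priority idea can be illustrated with a heavily simplified random forest: every tree is trained on a bootstrap sample in which each record is drawn with the same probability, and the trees vote on the result. This is a sketch under stated assumptions, not a full implementation: each tree is reduced to a one-split decision stump, and the feature values, the labels "A" and "B", the seed and the function names are hypothetical.

```python
import random
from collections import Counter

def train_stump(samples, labels, feature):
    """Pick the threshold on one feature that misclassifies the fewest samples."""
    best = None
    for threshold in sorted({x[feature] for x in samples}):
        left = [y for x, y in zip(samples, labels) if x[feature] <= threshold]
        right = [y for x, y in zip(samples, labels) if x[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return feature, threshold, left_label, right_label

def train_forest(samples, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: every record has the same chance of being drawn
        picks = [rng.randrange(len(samples)) for _ in samples]
        feature = rng.randrange(len(samples[0]))
        forest.append(train_stump([samples[i] for i in picks],
                                  [labels[i] for i in picks], feature))
    return forest

def predict(forest, x):
    """Return the majority vote of all stumps."""
    votes = [left if x[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical numeric features for two kinds of prescription
samples = [[1, 1], [2, 1], [1, 2], [2, 2], [8, 9], [9, 8], [8, 8], [9, 9]]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

forest = train_forest(samples, labels)
print(predict(forest, [1, 1]), predict(forest, [9, 9]))
```

Because no record is favoured when the bootstrap samples are drawn, and the final answer is a vote across all trees, no single prescription dominates the result.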

Fig. 2: MapReduce Block Diagram



Machine learning is a wide area with many algorithms for various processes. Even though many algorithms can do the job required, we still need the best algorithm in order to improve the efficiency of retrieval from the data set. From our survey, we conclude that the following algorithms are the best to use for medical data sets in the current state of the art in the domain: decision tree for disease prediction from the input symptoms, iterative decision tree for recursive feedback or opinion analysis, SVM classification for individual patients’ and doctors’ records, MapReduce for the overall mapping and filtering process, the random forest technique for randomly collected prescriptions stored in data sets, and finally polynomial smooth support vector machine for text extraction.



1.     Desarkar A, Das A. Big-Data Analytics, Machine Learning Algorithms and Scalable/Parallel/Distributed Algorithms. In Internet of Things and Big Data Technologies for Next Generation Healthcare 2017 (pp. 159-197). Springer, Cham.

2.     Rejab FB, Nouira K, Trabelsi A. Incremental support vector machines for monitoring systems in intensive care unit. In Science and Information Conference (SAI), 2013 2013 Oct 7 (pp. 496-501). IEEE.

3.     Al-Rawi A, Lansari A, Bouslama F. A new non-recursive algorithm for binary search tree traversal. In Electronics, Circuits and Systems, 2003. ICECS 2003. Proceedings of the 2003 10th IEEE International Conference on 2003 Dec 14 (Vol. 2, pp. 770-773). IEEE.

4.     Li N, Zhao L, Chen AX, Meng QW, Zhang GF. A new heuristic of the decision tree induction. In Machine Learning and Cybernetics, 2009 International Conference on 2009 Jul 12 (Vol. 3, pp. 1659-1664). IEEE.

5.     Al-Furajh I, Aluru S, Goil S, Ranka S. Parallel construction of multidimensional binary search trees. IEEE Transactions on Parallel and Distributed Systems. 2000 Feb;11(2):136-48.

6.     Nenwani K, Mane V, Bharne S. Enhancing adaptability of Insertion sort through 2-Way expansion. In Confluence The Next Generation Information Technology Summit (Confluence), 2014 5th International Conference- 2014 Sep 25 (pp. 843-847). IEEE.

7.     Xu X, Tang M. A new approach to the cloud-based heterogeneous MapReduce placement problem. IEEE Transactions on Services Computing. 2016 Nov 1;9(6):862-71.

8.     Hegde R, Seema S. Aspect based feature extraction and sentiment classification of review data sets using Incremental machine learning algorithm. In Advances in Electrical, Electronics, Information, Communication and Bio-Informatics (AEEICB), 2017 Third International Conference on 2017 Feb 27 (pp. 122-125). IEEE.

9.     Rattanaopas K, Kaewkeeree S. Improving Hadoop MapReduce performance with data compression: A study using wordcount job. In Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), 2017 14th International Conference on 2017 Jun 27 (pp. 564-567). IEEE.

10.   Merla P, Liang Y. Data analysis using hadoop MapReduce environment. In Big Data (Big Data), 2017 IEEE International Conference on 2017 Dec 11 (pp. 4783-4785). IEEE.

11.   Zhuo W, Lili C. The algorithm of text classification based on rough set and support vector machine. In Future Computer and Communication (ICFCC), 2010 2nd International Conference on 2010 May 21 (Vol. 1, pp. V1-365). IEEE.

12.   Min W. Analysis on 2-element insertion sort algorithm. In Computer Design and Applications (ICCDA), 2010 International Conference on 2010 Jun 25 (Vol. 1, pp. V1-143). IEEE.

13.   Outrata J. Boolean factor analysis for data preprocessing in machine learning. In Machine Learning and Applications (ICMLA), 2010 Ninth International Conference on 2010 Dec 12 (pp. 899-902). IEEE.

14.   Pu DM, Gao DQ, Yuan YB. A dynamic data correction algorithm based on polynomial smooth support vector machine. In Machine Learning and Cybernetics (ICMLC), 2016 International Conference on 2016 Jul 10 (Vol. 2, pp. 820-824). IEEE.

15.   Liu Z, Li Y. A new heuristic algorithm of rules generation based on rough sets. In Business and Information Management, 2008. ISBIM'08. International Seminar on 2008 Dec 19 (Vol. 1, pp. 291-294). IEEE.









Received on 04.02.2019          Modified on 14.03.2019

Accepted on 21.04.2019        © RJPT All rights reserved

Research J. Pharm. and Tech. 2019; 12(6):3059-3062.

DOI: 10.5958/0974-360X.2019.00518.3