Predicting Lifetime of Heart Attack Patients using Improved C4.5 Classification Algorithm

 

Dr. R. Jothikumar1*, Dr. S. Susi2, Dr. N. Sivakumar3, Mr. P. S. Ramesh4

1Assistant Professor, Department of CSE, APCE, Kalavai.

2Assistant Professor, School of Management, C. Abdul Hakeem College of Engg. and Tech.

3,4Assistant Professor, SITE (SG), VIT University.

*Corresponding Author E-mail: rinfotech.jothi@gmail.com

 

ABSTRACT:

C4.5 is a widely used machine learning algorithm applicable in a variety of domains. It is a type of decision tree classification algorithm used by researchers to predict future outcomes by analyzing past historical data. In this paper, the C4.5 classification algorithm is implemented to estimate the survival of patients affected by heart attack. The echocardiogram dataset from the UCI repository, with 132 instances and 12 attributes, was used for the experimental analysis. This dataset contains missing values which may affect prediction accuracy; to overcome this drawback, they are replaced using the binning method. The echocardiogram dataset was run through the C4.5 classifier in four different modes and the accuracies were measured. Mode-1: applying the echocardiogram dataset with missing values to the C4.5 classification algorithm. Mode-2: applying the echocardiogram dataset after removing missing values. Mode-3: applying the dataset with removed missing values to C4.5 considering the node splitting criterion only. Mode-4: applying the dataset with removed missing values to C4.5 considering both node splitting and tree pruning. The accuracies and other related metrics were measured in each mode to predict the possible survival of heart attack patients, so that patients can determine in advance whether or not to undergo expensive treatments. Mode-4 performs better than the earlier modes, which shows that the attribute split criterion and tree pruning improve accuracy.

 

KEYWORDS: Heart Disease, Survival Rate, C4.5 Classification Algorithm, Data Mining, Attribute Split and Tree Pruning.

 

 


I. INTRODUCTION:

Disease diagnosis is an important application where data mining tools produce useful results1. The medical industry gathers enormous quantities of patient information which, regrettably, is not used for mining to uncover hidden knowledge for current decision making. Concealed patterns and associations therefore frequently remain undetected. Advanced data mining methods can help remedy this state2. Diabetes, smoking habits, high blood pressure, age, obesity and alcohol consumption have a direct impact on heart attack.

 

Coronary heart disease (CHD) is one of the main causes of frailty in adults as well as one of the main causes of death in the developed countries3. Classification algorithms have attracted significant interest in both the machine learning and the data mining research areas. Among the many classification techniques, the C4.5 method4,5 merits special mention for numerous reasons6. Heart disease remains the foremost cause of death universally. It is difficult for medical doctors to forecast heart disease, as it is a complex task that requires knowledge and experience7. The most influential attributes were identified and their values were used in building the C4.5 decision tree.

 

Knowing the classes of a set of instances, the algorithm learns how the vectors of attribute values behave, in order to estimate the classes of new, similar instances. One approach to doing this is through decision trees (DTs). The attributes denote the internal nodes of the tree at root, branch and leaf (class) levels. The top level of the tree is the root node and is used to take actions. Each node is then split recursively based on the outcome of a test condition. The final result is a decision tree in which each branch represents a possible situation and its outcome. In this implementation the outcomes are deceased or alive. The echocardiogram dataset, after removal of missing values, is given as input to the C4.5 algorithm in different modes. Ten-fold cross validation was used in these implementations, where the dataset is partitioned into a training set for learning and a test set to find the class label under which each record falls. The class labels are the outcome of the C4.5 decision tree. The architecture of the proposed system is given in Fig.1, and a small illustrative sketch of this cross-validation setup follows.

 

Fig.1: Architecture of the proposed system.

 
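To make the setup concrete, the sketch below runs a 10-fold cross-validated decision tree on the UCI echocardiogram file. This is an illustration only, not the authors' code: scikit-learn's DecisionTreeClassifier implements CART (used here with the entropy criterion) rather than C4.5 itself, and the file name, column names and simple median fill are assumptions.

```python
# Minimal sketch (assumptions noted above): 10-fold CV of a decision tree.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed local copy of the UCI file; '?' marks missing values in that file.
cols = ["survival", "still_alive", "age", "pericardial_effusion",
        "fractional_shortening", "epss", "lvdd", "wall_motion_score",
        "wall_motion_index", "mult", "name", "group", "alive_at_1"]
data = pd.read_csv("echocardiogram.data", names=cols, na_values="?")

features = ["age", "pericardial_effusion", "fractional_shortening",
            "epss", "lvdd", "wall_motion_index"]
X = data[features].fillna(data[features].median())   # crude fill, for illustration
y = data["alive_at_1"].fillna(0).astype(int)         # 0 = deceased/unknown, 1 = alive

clf = DecisionTreeClassifier(criterion="entropy")    # information-gain-style splits
scores = cross_val_score(clf, X, y, cv=10)           # 10-fold cross validation
print(f"mean accuracy: {scores.mean():.4f}")
```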

II. DATA SET:

The healthcare environment is still 'information rich' but 'knowledge poor'8. The echocardiogram dataset from the UCI database is used in this paper. This dataset has 132 instances with 12 attributes, of which only eight attributes were applied for prediction; the remaining four attributes were not used in the implementation. The patients were affected by heart attacks at some point in the past; some may still be alive and some may not. The two variables survival and still-alive are combined to identify whether a patient survived for at least one year following the heart attack.

 

The problem addressed by previous investigators was to forecast from the other variables whether or not the patient will survive at least one year. The most difficult part of this problem is correctly predicting that the patient will NOT survive. This dataset contains missing, blank and inconsistent values, and the attributes are continuous. Initially, in Mode-1, the dataset is experimented with the actual missing values retained.

 

Data preprocessing completes the empty attribute values by eradicating noisy values and filling in accurate data values. In the later modes, Mode-2, Mode-3 and Mode-4, the necessary pre-processing steps are done to clean the dataset before mining, which improves the performance of the mining process. The attributes and their descriptions are listed in Table 1.

 

 


 

Table 1: List of Attributes Used

S. No | Attribute | Description
1 | Age_Heart_Attack | The age of the patient at the time of the heart attack
2 | Pericardial_Effusion | Binary value relating to fluid around the heart
3 | Fractional_Shortening | A measure of contractility around the heart; lower values indicate abnormal conditions
4 | EPSS (E-Point Septal Separation) | Another measure of contractility; higher values indicate abnormal conditions
5 | LVDD (Left Ventricular End-Diastolic Dimension) | Larger values are abnormal
6 | Wall_Motion_Index | The number of moving segments of the left ventricle divided by the number of segments seen
7 | Alive@One | 0 indicates deceased or unknown; 1 indicates alive at one year
8 | Survival and Still-Alive | SURVIVAL is the number of months the patient survived after the heart attack; STILL-ALIVE is a binary variable (0 = deceased, 1 = still alive)

 


A.     Decision Tree Construction:

A classification algorithm9,10 aims to produce a precise classifier by examining the instances of a training data set, so as to determine the class of unknown data samples. Classification algorithms have attracted significant interest in the machine learning and data mining areas11. C4.5 is a popular decision tree algorithm due to its efficiency and wide-ranging features12.

 

A DT is a significant asset of data mining and inductive learning, typically used to build classifiers and prediction models13. Machine learning and modern data mining approaches are suitable for forecasting and categorizing heart disease14. Diverse data mining procedures have been used to assist medical specialists in detecting heart disease15. The C4.5 classification algorithm used in this paper supports numeric continuous-valued attributes, missing attribute values and pruning. The C4.5 algorithm is given below.

 

Source of Input:

1)      Training dataset D, containing a set of training observations and their associated class values.

2)      Attribute list represented as A, a bunch of candidate attributes.

3)      The designated splitting criterion and technique.

 

Output: A well-constructed DT.

 

Algorithm:

1        Create a node N_d.

2        If all tuples in the training dataset have the same class output value C, then return N_d as a leaf node labeled with C.

3        If attribute list is empty, then return N_d as leaf node labeled with majority class output value in training dataset.

4        Apply the selected splitting criteria method to the training dataset in order to find the "best" splitting criterion attribute.

5        Label node N_d with the splitting criterion attribute.

6        Remove the splitting criterion attribute from the attribute list.

7        For each value j of the splitting criterion attribute:

a)      Let Dj be the observations in training dataset satisfying attribute value j.

b)    If Dj is empty (no observations), then attach a leaf node labeled with the majority class output value to node N_d.

c)    Else attach the node returned by Generate_Decision_Tree(Dj, attribute list, selected splitting criteria method) to node N_d.

8        End for.

9        Return node N_d.

 

C4.5 uses information gain as the splitting criterion, and tree construction stops once all instances fall under a single value of the target feature or when the information gain is no longer greater than zero. The steps involved in calculating entropy and information gain are given below.

Step 1: Compute entropy and info gain for each attribute.

Step 2: Sort the attributes in ascending order of entropy, i.e., in descending order of information gain.

Step 3: While the attribute list is not empty or length(attribute list) > threshold, choose the first attribute as the root of the subtree/tree.

Step 4: Remove the attribute from the sorted list.

Step 5: Go to Step 3.
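The pseudocode above can be condensed into a short recursive implementation. The following is a minimal sketch assuming discrete-valued attributes stored in dictionaries; the function names and toy data are illustrative, not the paper's code.

```python
# Sketch of C4.5-style tree construction with entropy/information-gain splits.
import math
from collections import Counter

def entropy(labels):
    # E(S) = -sum p_i * log2(p_i) over the class proportions in S (eqn 1 below).
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Gain(S, A): entropy before the split minus the weighted entropy after it.
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    after = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - after

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:                      # step 2: pure node -> leaf
        return labels[0]
    if not attributes:                             # step 3: no attributes -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))   # steps 4-5
    remaining = [a for a in attributes if a != best]                   # step 6
    node = {"attribute": best, "children": {}}
    for value in set(row[best] for row in rows):   # step 7: branch per observed value
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        node["children"][value] = build_tree(sub_rows, sub_labels, remaining)
    return node                                    # step 9

# Toy usage with made-up attribute names:
rows = [{"effusion": 0, "wmi_high": 1}, {"effusion": 1, "wmi_high": 1},
        {"effusion": 0, "wmi_high": 0}, {"effusion": 1, "wmi_high": 0}]
labels = ["alive", "deceased", "alive", "alive"]
print(build_tree(rows, labels, ["effusion", "wmi_high"]))
```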

III. CHALLENGES:

Decision making is the basis of medical diagnosis and treatment16. When the instances of a problem have a set of attributes and a set of values for each attribute, decision trees can be a good choice. While it is easier when the values are finite and disjoint from each other, it is also possible to handle real-valued attributes. Finite value sets are preferred because making a decision requires arriving at some leaf in the tree, and having a finite number of possible labels makes it easy to simply label a leaf with a value. The diagnosis is often made based on the doctor's experience and knowledge, which can lead to unwanted results and excessive medical costs for treatments provided to patients17. Decision trees naturally represent disjunctive descriptions, and are preferred because it is very simple to translate between the tree and such descriptions. Decision trees are fairly robust to errors in the training data, such as a mislabeled value for an attribute, so noisy data is typically not much of a problem. Training data may also contain missing attribute values; decision trees can still be used in situations where some values are not known for some attributes of the instances in the training set. In decision tree construction there are two main parameters: the splitting criterion used and the pruning method18.

 

A.     MISSING VALUES:

The echocardiogram dataset contains many missing values: the attribute ALIVE-AT-ONE has 43.2% missing values, EPSS 10.6% and LVDD 7.5%. In total, 66.3% of the rows have at least one missing attribute value.
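A missing-value profile of this kind can be recomputed directly from the raw file, for example with pandas; the file name is an assumption, and '?' marks missing entries in the UCI file.

```python
# Sketch: reproduce the missing-value profile of the echocardiogram file.
import pandas as pd

data = pd.read_csv("echocardiogram.data", header=None, na_values="?")
print(data.isna().mean().mul(100).round(1))           # % missing per attribute
rows_hit = data.isna().any(axis=1).mean() * 100       # rows with >= 1 missing value
print(f"rows with at least one missing value: {rows_hit:.1f}%")
```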

 

B.     MISSING VALUES AND REMOVAL:

The amount of missing values suggests that the data is not fit for prediction as-is. Fortunately, the C4.5 classification algorithm is robust to noisy data. In this paper the missing values are replaced using the concept of binning: the binning result of each attribute is taken as the substitute for its missing values. The substituted values were calculated statically, after all the data had been read in from the database, rather than dynamically while the data was still being read.

 

The binning method is preferred for three main reasons. First, it is a particularly simple method that does not require a lot of computation. There is no need to consider the data changing, because a C4.5 tree cannot back up once it has made a splitting decision without regenerating the tree. Second, this method gives the real average value of the particular attribute over the whole dataset. Third, the computation time is minimal with good efficiency.
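A minimal sketch of this static substitution is shown below, assuming equal-width bins; taking the mean of the most populated bin as "the binning result" is one reading of the method, and the column index for EPSS follows the UCI file layout. Neither is confirmed by the paper.

```python
# Sketch of static binning substitution: bin the observed values once, after
# the whole file is read, and fill NaNs with a bin-derived representative value.
import pandas as pd

def binned_fill(series, n_bins=3):
    observed = series.dropna()
    bins = pd.cut(observed, bins=n_bins)                  # equal-width bins
    bin_means = observed.groupby(bins, observed=True).mean()
    fill_value = bin_means[bins.value_counts().idxmax()]  # mean of the fullest bin
    return series.fillna(fill_value)

data = pd.read_csv("echocardiogram.data", header=None, na_values="?")
data[5] = binned_fill(data[5])                            # e.g. column 5 = EPSS
```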

 

C.     NODE SPLITTING CRITERION:

Entropy:

Entropy refers to the randomness of the data and ranges from 0 to 1. A decision tree is formed by repeatedly splitting the data into subsets which form the children nodes of a parent. Entropy is the measure of impurity or chaos in the data. If all elements in a set of data belong to the same class, the entropy is zero; if the elements are evenly mixed, the entropy is one. The C4.5 algorithm uses this measure of entropy over each attribute's set of values to determine the best gain, i.e., the expected reduction in entropy due to splitting on the attribute, which is the difference between the entropies before and after splitting on the attribute.

 

Entropy is calculated by the equation (1)

 

E(S) = -(p+) * log2(p+) - (p-) * log2(p-)   - eqn (1), where

 

"S" represents the set, "p+" is the proportion of elements in the set "S" with positive values, and "p-" is the proportion of elements with negative values.
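For instance, with a hypothetical split of 132 patients into 88 positive and 44 negative cases (illustrative counts, not the dataset's actual class mix), eqn (1) evaluates as follows:

```python
# Worked check of eqn (1) on illustrative counts: 88 positive, 44 negative.
import math

p_pos, p_neg = 88 / 132, 44 / 132                  # class proportions in S
E = -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)
print(round(E, 3))                                 # 0.918: mixed, so well above 0
```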

 

Information Gain:

In decision trees, nodes are created by singling out an attribute. At each level of the tree, the "best" attribute is the one creating the maximum reduction in entropy, called the information gain. C4.5 aims to create leaf nodes with homogeneous data, which means it has to choose the attribute that fulfils this requirement the most. C4.5 calculates the gain of the individual attributes; the attribute with the highest gain results in nodes with the smallest entropy. The way of calculating the gain is given in equation (2) below.

 

Gain(S, A) = Entropy(S) - Σv ((|Sv| / |S|) * Entropy(Sv))   - eqn (2), where

 

'S' is the set and 'A' is the attribute. 'Sv' is the subset of 'S' where attribute 'A' has value 'v'. '|S|' is the number of elements in set 'S' and '|Sv|' is the number of elements in subset 'Sv'.

 

C4.5 chooses the attribute with the highest gain to create a node in the decision tree. If the resulting subsets do not have entropy equal to zero, it chooses one of the remaining attributes to create further nodes, until all the subsets are homogeneous.
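This selection can be checked numerically. The sketch below applies eqn (2) to a hypothetical binary attribute splitting the illustrative 132-patient set used above; all counts are made up for the example.

```python
# Worked application of eqn (2) to a hypothetical binary attribute A that
# splits S (88 positive / 44 negative) into |S0| = 99 and |S1| = 33.
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(p * math.log2(p) for p in (pos / total, neg / total) if p > 0)

E_S  = entropy(88, 44)             # entropy before splitting
E_S0 = entropy(77, 22)             # subset with A = 0
E_S1 = entropy(11, 22)             # subset with A = 1
gain = E_S - (99 / 132) * E_S0 - (33 / 132) * E_S1
print(round(gain, 3))              # C4.5 favours the attribute with the largest gain
```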

 

D.     OVERFITTING AND PRUNING:

A major issue with a C4.5 tree is that it can be built to fit the training data "too perfectly". While a decision tree may perfectly classify the training set, it can be tuned to the training data in such a way that it does not perform well on unseen real-world instances. This issue is known as overfitting the training data, and is something one should be aware of when using decision trees.

 

Overfitting can significantly decrease the accuracy of a decision tree on real-world instances. For example, in an experiment using ID3 performed on noisy training data, the final tree was found to be 10%-20% less accurate due to overfitting.

 

One method for dealing with overfitting in decision trees is known as reduced error pruning. This algorithm goes through the entire tree and removes any nodes (including the subtree of each such node) whose removal has no negative effect on the accuracy of the decision tree, turning the subtree into a leaf node with the most common label. Removing subtrees from a tree is known as pruning the tree. Removing these redundant subtrees results in a less specific decision tree that performs the same as the original tree. The hope is that, because it is less specific, it is not as fitted to the training data, so unknown instances will be classified more accurately. Instead of using the test set to decide how accurate a subtree is, a separate validation set, taken from the training set, is used.
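A compact sketch of reduced error pruning over the dictionary tree from the construction sketch is given below. It assumes each internal node also records its majority training class under a "majority" key (an addition to the earlier sketch); this is illustrative, not the paper's implementation.

```python
# Sketch of reduced error pruning against a held-out validation set.
def classify(node, row):
    # Walk the dict tree; leaves are plain labels. Unseen attribute values
    # fall back to the node's stored majority class.
    while isinstance(node, dict):
        node = node["children"].get(row[node["attribute"]], node["majority"])
    return node

def accuracy(tree, rows, labels):
    return sum(classify(tree, r) == l for r, l in zip(rows, labels)) / len(labels)

def reduced_error_prune(root, node, val_rows, val_labels):
    # Post-order traversal: try collapsing each internal child to its majority
    # leaf, and keep the collapse whenever validation accuracy does not drop.
    if not isinstance(node, dict):
        return
    for value, child in list(node["children"].items()):
        reduced_error_prune(root, child, val_rows, val_labels)
        if isinstance(child, dict):
            before = accuracy(root, val_rows, val_labels)
            node["children"][value] = child["majority"]      # tentative prune
            if accuracy(root, val_rows, val_labels) < before:
                node["children"][value] = child              # revert: pruning hurt
```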

 

Rule post-pruning is a method used to find high-accuracy hypotheses. This method also prunes the tree in the hope of reducing overfitting. The steps used in this technique are as follows, with a small sketch of the rule-extraction step after the list:

 

1.      Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing overfitting to occur.

2.      Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.

3.      Prune each rule by removing any preconditions whose removal improves its estimated accuracy.

4.      Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
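Step 2 can be sketched over the dictionary tree used earlier: each root-to-leaf path becomes one IF-THEN rule. The tree literal below is a toy example, not output from the paper's experiments.

```python
# Sketch of rule extraction: one rule per root-to-leaf path of the dict tree.
def extract_rules(node, path=()):
    # Leaf: one rule whose preconditions are the attribute tests on the path.
    if not isinstance(node, dict):
        return [(path, node)]
    rules = []
    for value, child in node["children"].items():
        rules += extract_rules(child, path + ((node["attribute"], value),))
    return rules

# Toy tree in the dict form used by the earlier sketches:
tree = {"attribute": "effusion", "children": {
        0: "alive",
        1: {"attribute": "wmi_high", "children": {0: "alive", 1: "deceased"}}}}
for conds, label in extract_rules(tree):
    print("IF", " AND ".join(f"{a} = {v}" for a, v in conds) or "TRUE", "THEN", label)
```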

 

IV.  RESULTS AND DISCUSSION:

The measures obtained from the different modes of implementation are given in Table 2. In Mode-1, the echocardiogram dataset with 132 instances and 12 attributes was applied directly to the C4.5 classification algorithm. 55.303% of instances were correctly classified and 44.697% were classified incorrectly. The other measures are: kappa statistic -0.0125, mean absolute error 0.4673, root mean squared error 0.5884, relative absolute error 99.9268%, root relative squared error 121.6952%, coverage of cases at the 0.95 level 84.8485%, and mean relative region size at the 0.95 level 83.7121%.

 

In Mode-2, the echocardiogram dataset with 132 instances and 12 attributes was applied to the C4.5 classification algorithm after replacing missing values. 58.3333% of instances were correctly classified and 41.6667% were classified incorrectly. The other measures are: kappa statistic -0.0157, mean absolute error 0.4755, root mean squared error 0.5295, relative absolute error 101.7017%, root relative squared error 109.5771%, coverage of cases at the 0.95 level 94.697%, and mean relative region size at the 0.95 level 95.0758%.


 

Table 2: Measures Obtained from the Implementation

Measures | Mode-1 | Mode-2 | Mode-3 | Mode-4
Correctly classified | 55.303% | 58.3333% | 59.0909% | 62.1212%
Incorrectly classified | 44.697% | 41.6667% | 40.9091% | 37.8788%
Kappa statistic | -0.0125 | -0.0157 | 0.0609 | 0.1226
Mean absolute error | 0.4673 | 0.4755 | 0.4617 | 0.4203
Root mean squared error | 0.5884 | 0.5295 | 0.5784 | 0.5474
Relative absolute error | 99.9268% | 101.7017% | 98.7556% | 89.8449%
Root relative squared error | 121.6952% | 109.5771% | 119.6832% | 113.1586%
Coverage of cases (0.95 level) | 84.8485% | 94.697% | 87.1212% | 89.3939%
Mean rel. region size (0.95 level) | 83.7121% | 95.0758% | 86.7424% | 85.2273%
Total number of instances | 132 | 132 | 132 | 132

 

Table 3: Accuracies and Error Rate

S. No | Mode | Dataset | Approach | Accuracy
1 | Mode-1 | Echocardiogram dataset (with missing values) | C4.5 algorithm | 55.303%
2 | Mode-2 | Echocardiogram dataset (missing values removed) | C4.5 algorithm | 58.3333%
3 | Mode-3 | Echocardiogram dataset (missing values removed) | C4.5 algorithm with node splitting criterion | 59.0909%
4 | Mode-4 | Echocardiogram dataset (missing values removed) | C4.5 algorithm with node splitting criterion and tree pruning | 62.1212%

 


In Mode-3, the echocardiogram dataset with 132 instances and 12 attributes was applied to the C4.5 classification algorithm considering the attribute split criterion. 59.0909% of instances were correctly classified and 40.9091% were classified incorrectly. The other measures are: kappa statistic 0.0609, mean absolute error 0.4617, root mean squared error 0.5784, relative absolute error 98.7556%, root relative squared error 119.6832%, coverage of cases at the 0.95 level 87.1212%, and mean relative region size at the 0.95 level 86.7424%.

 

In Mode-4, the echocardiogram dataset with 132 instances and 12 attributes was applied to the C4.5 classification algorithm considering both the attribute split criterion and tree pruning. 62.1212% of instances were correctly classified and 37.8788% were classified incorrectly. The other measures are: kappa statistic 0.1226, mean absolute error 0.4203, root mean squared error 0.5474, relative absolute error 89.8449%, root relative squared error 113.1586%, coverage of cases at the 0.95 level 89.3939%, and mean relative region size at the 0.95 level 85.2273%.
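Measures of this kind (accuracy, kappa statistic, mean absolute error) are standard and can be reproduced from any classifier's cross-validated predictions. The sketch below uses scikit-learn with placeholder labels, not the paper's actual outputs.

```python
# Sketch: computing the headline measures from predicted vs. true labels.
from sklearn.metrics import accuracy_score, cohen_kappa_score, mean_absolute_error

y_true = [1, 0, 1, 1, 0, 1, 0, 0]          # placeholder ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]          # placeholder model predictions

print("Correctly classified:", accuracy_score(y_true, y_pred) * 100, "%")
print("Kappa statistic:     ", cohen_kappa_score(y_true, y_pred))
print("Mean absolute error: ", mean_absolute_error(y_true, y_pred))
```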

 

From Table 3, it is found that Mode-4, i.e., the echocardiogram dataset implemented with the C4.5 classification algorithm considering removal of missing values, the attribute split criterion and tree pruning, gives the best accuracy of 62.1212%. This accuracy is better than that of Mode-1, Mode-2 and Mode-3. Fig.2 illustrates the accuracies of the different modes of the echocardiogram dataset with the improved C4.5 classification algorithm.

 

 

Fig.2: Accuracies of the different modes of the echocardiogram dataset with the improved C4.5 classification algorithm.

 

V.    CONCLUSION:

In this paper, the C4.5 algorithm is implemented on the echocardiogram dataset. The mechanisms for different methods of node splitting, missing attribute substitution and tree pruning were explored. The results of the echocardiogram dataset with the C4.5 classification algorithm show an accuracy of 55.303% with missing values. By removing missing values and applying the node splitting and tree pruning concepts, the accuracy improved markedly to 62.1212%. After performing data cleaning to remove missing values, the result is further improved by the node splitting mechanism and the tree pruning concept.

VI. REFERENCES:

1        Jothikumar R and Sivabalan RV. Performance Analysis on Accuracies of Heart Disease Prediction System Using Weka by Classification Techniques. AJBAS. 9(7); 2015: 741-749.

2        Palaniappan S and Awang R. Intelligent heart disease prediction system using data mining techniques. In Computer Systems and Applications. 2008: 108-115.

3        Karaolis MA et al. Assessment of the risk factors of coronary heart events based on data mining with decision trees. Information Technology in Biomedicine. 14(3); 2010: 559-566.

4        Quinlan JR. C4.5: Programs for Machine Learning. Morgan Kaufmann. 1993.

5        Quinlan JR. Improved Use of continuous Attributes in C4.5. Journal of Artificial Intelligence Research. 4; 1996: 77-90.

6        Ruggieri S. Efficient C4.5 [classification algorithm]. IEEE Transactions on Knowledge and Data Engineering. 14(2); 2002: 438-444.

7        Masethe HD and Masethe MA. Prediction of heart disease using classification algorithms. In Proceedings of the World Congress on Engineering and Computer Science. 2; 2014: 2224.

8        Jothikumar R et al. Data Cleaning Using Weka for Effective Data Mining in Health Care Industries. International Journal of Applied Engineering Research. 10(30); 2015: 22786-22794.

9        Mehta M et al. SLIQ: A fast scalable classifier for data mining. Advances in Database Technology—EDBT'96. 1996: 18-32.

10      Wang M et al. Scalable mining for classification rules in relational databases. In Database Engineering and Applications Symposium. Proceedings. IDEAS'98. International. IEEE. 1998: 58-67.

11      Ruggieri S. Efficient C4.5 [classification algorithm]. IEEE Transactions on Knowledge and Data Engineering. 14(2); 2002: 438-444.

12      Idri A and Kadi I. Evaluating a decision making system for cardiovascular dysautonomias diagnosis. SpringerPlus. 5(1); 2016: 81.

13      Cao R and Xu L. Improved C4.5 algorithm for the analysis of sales. In Web Information Systems and Applications Conference. WISA 2009. IEEE. 2009: 173-176.

14      Jabbar MA et al. Alternating decision trees for early diagnosis of heart disease. In Circuits, Communication, Control and Computing (I4C). IEEE. 2014: 322-328.

15      Shouman M et al. Using data mining techniques in heart disease diagnosis and treatment. In Electronics, Communications and Computers (JEC-ECC), IEEE. 2012: 173-177.

16      Sekiya T et al. The use of modified constellation graph method for computer-aided classification of congenital heart diseases. IEEE Transactions on Biomedical Engineering. 38(8); 1991: 814-820.

17      JothiKumar R et al. Accuracies of j48 weka classifier with different supervised weka filters for predicting heart diseases. ARPN Journal of Engineering and Applied Sciences. 10(17); 2015: 7788 – 7793.

18      Mahmood AM and Kuppa MR. Early detection of clinical parameters in heart disease by improved decision tree algorithm. In Information Technology for Real World Problems (VCON). IEEE. 2010: 24-29.

 

 

 

 

 

 

 

 

Received on 01.11.2017           Modified on 10.12.2017

Accepted on 25.01.2018          © RJPT All right reserved

Research J. Pharm. and Tech 2018; 11(5):1951-1956.

DOI: 10.5958/0974-360X.2018.00362.1