Classification of Medicines using Naive Bayes Classifier

 

Baisani Indraja1, Annapurani K2

1Student, Department of CSE, SRM University, Kattankulathur-603 203

2Asst. Professor, Department of CSE, SRM University, Kattankulathur-603 203

*Corresponding Author E-mail: bindrajareddy@gmail.com, annapoorani.k@ktr.srmuniv.ac.in

 

ABSTRACT:

In data mining, classification is a function that assigns items to groups based on their similarities. Classification techniques are used in many real-time applications. In the pharmaceutical industry, every new medicine must be approved by the Food and Drug Administration (FDA) before it is brought to market. The FDA depends entirely on manual operation, which lengthens the time needed to approve new medicines. By using a classification technique, the FDA can automate the procedure and reduce the time required for approval. Six chemical properties are used to classify unknown medicines: relative molecular mass, number of hydrogen bond donors, number of hydrogen bond acceptors, polar surface area, hydrophobic constant and number of flexible rotation keys. The drug category and these six properties are collected from the U.S. National Library of Medicine to create a new database. By applying the Naive Bayes algorithm to the collected data, the target class of an unknown medicine can be predicted. The fundamental purpose of this paper is to provide an analysis of classification methods for predicting unknown medicines. It also discusses the computation procedure of a binary classification system for fever and typhoid medicines.

 

KEYWORDS: Data Mining, Classification, Feature extraction, Naive Bayes Algorithm, Similarity between medicines

 

 


1. INTRODUCTION:

The pharmaceutical industry produces two types of drugs: brand-name and generic medicines. Both are subject to a variety of laws and regulations that govern the testing and marketing of drugs. Small pharmaceutical companies that develop generic drugs often lack the financial support to research new drugs and cannot afford a long wait for approval. With the proposed classification, existing drug data can be used to identify unknown drugs, which can shorten the approval time. Drug datasets of various categories are collected from the U.S. National Library of Medicine along with their chemical properties.

 

 

The existing system [1] uses a new algorithm based on the K-nearest neighbor (KNN) algorithm to divide the drug data according to their similarities. Although that classification method performs well in terms of accuracy, it cannot handle issues such as overfitting, because it depends on the chemical properties of the available training data, which are limited. The first step in classifying unknown drugs is therefore to create a new database with a larger number of drug datasets. To avoid overfitting, the following chemical properties are collected for each drug: relative molecular mass, polar surface area, number of hydrogen bond donors and acceptors, hydrophobic constant and rotatable bond count. This paper uses the Naive Bayes classifier to classify the drugs used for fever and typhoid based on these six chemical properties. The dataset is created from drugs used for a variety of diseases.

 

 

2. RELATED WORK:

To implement a classification process, a large amount of data is collected to avoid the overfitting problem. Along with the medicines, the chemical features that matter most for classification are also collected, so a clear study of these chemical properties is needed. In [2], medicine concentration and composition are analyzed using least squares support vector machines (LS-SVMs) with Gaussian kernel functions, based on the observation that similar substances with the same composition have the same absorbency. MLR performs better on the training data but gives unsatisfactory results on the test data when compared with LS-SVM. The author of [1] states that different features have a varying influence on the similarity of medicines, and calculates a weighting parameter 𝜴 to reflect the importance of each feature. The gradient descent method is used to solve the standard optimization problem that arises in this process. After 𝜴 is calculated, the similarity of medicines is computed with the k-nearest neighbor algorithm and the medicines are categorized; the results show 77.7% accuracy, which is higher than that of a decision tree. In this way, a novel drug classification algorithm based on the K-NN algorithm is proposed.

The next major task is the selection of an appropriate algorithm. Many algorithms are used for the same kind of problem in data mining, and [3] provides guidance on choosing one that gives better results. The algorithm selection problem is to find an appropriate model that can solve a particular problem. In meta-learning, previous experiments and the experience gained from them are used to improve automatic learning, which helps to give better results for an existing system. Based on the type of problem, one of the candidate algorithms is chosen for the dataset. The basic concepts of SVM and margin-maximizing hyperplanes are explained in [4], with experiments on heart, diabetes, shuttle and satellite data.

 

The kernel selection of SVM is also discussed, together with the main features of the RBF kernel. The RBF kernel nonlinearly maps samples into a higher-dimensional space, involves fewer mathematical difficulties and has fewer hyperparameters. A mathematical tool called rough set is discussed, which can analyze fuzzy and incomplete data of all types and extract knowledge from it, even when the knowledge is incomplete or uncertain. The experimental results show that the right combination of cost parameter and kernel function gives the most effective classification for the given data. The Tanimoto similarity measure proposed in [5] has proven to be one of the most effective ways of calculating the similarity between chemicals, and it provides powerful information for finding relations between drugs. It is computed from the SMILES representation of the chemical structure and offers an innovative approach to drug-target prediction. A generalized approach [6] to Naive Bayes classification using fuzzy concepts is proposed; through domain-dependent constraints, it helps to predict classes even with incomplete or partially correct information. It is illustrated by a case study in which banks must decide whether to extend credit to customers, representing both risk scores and predicted risk for sample datasets. By applying mining techniques together with statistical and fuzzy concepts, decisions can be made more accurately even with uncertain information.

 

SVM classification using MATLAB is proposed in [7] for the binary classification of educational data. The Weka tool is applied to the datasets, and the system predicts the placement of students. The work can be extended to the posterior probabilities of the SVM classifier, and more accurate results are obtained by an appropriate selection of parameter values. Classification of paintings by school and artist using different classifiers, such as SVM, Naive Bayes and the Fisher linear discriminant classifier, is presented in [8].

 

The most accurate results are obtained by the SVM classifier with a linear kernel. SVM is more effective than the other classifiers when color moments and mixed features are used. The Naive Bayes classifier is an efficient classification model [9] that is easy to learn and gives high accuracy in various domains. The work focuses on learning an optimal Naive Bayes classifier and addresses its drawbacks, such as decreasing accuracy when the attributes are not independent and the handling of non-parametric continuous attributes. The Naive Bayes algorithm is used for multiclass classification, and the major issue of running out of floating-point range is addressed. With the addition of Laplace smoothing, the Naive Bayes algorithm shows impressive results compared with AdaBoost and the Expectation Maximization algorithm. Future work can be carried out on different scenarios. The paper [10] deals mainly with a Naive Bayes classification algorithm based on the Poisson distribution model. The Poisson distribution describes the number of times a random event happens per unit time. A series of comparative experiments is conducted on Chinese datasets using a new method derived from the combination of the Poisson distribution and Naive Bayes; the system obtains high accuracy even on a small sample set. A Naive Bayes classification algorithm for uncertain data is proposed in [11], in which the value of each data item is represented by a probability distribution function (pdf). When handling uncertain data, the class-conditional probability estimation problem is solved by extending the kernel density estimation method. Experiments on many UCI datasets show that the proposed model gives better accuracy by using the pdf information of uncertain data, and that the formula-based approach has many advantages.

 

3. DESIGN OF MEDICAL CLASSIFICATION SYSTEM:

The chemical properties of drugs are collected as features, and the drugs are assigned to diseases accordingly. In this paper only two classes, fever and typhoid, are considered; their drugs differ only slightly in chemical properties. Once the data are collected, the Naive Bayes classifier is applied to assign each drug to a disease by computing the mean, standard deviation and probability of each property. The drug is classified according to these probabilities; since only the fever and typhoid classes are considered, this is a binary classification. The overall design of the medical classification system for the two classes is shown in Figure 1. A new database of 120 drug datasets is created from the U.S. National Library of Medicine (NIH).

 

 

Figure 1: Block diagram of medical classification system

 

i)  Feature selection:

Different features lead to different results when predicting drug classes, so feature selection is an important task in classification. Here six chemical properties are selected as features for finding the similarity between medicines: relative molecular mass, number of hydrogen bond acceptors, number of hydrogen bond donors, number of flexible rotation keys (rotatable bond count), polar surface area and hydrophobic constant.

 

ii)   Dataset:

The most commonly used drugs are collected for various diseases, such as fever, typhoid, diabetes, fungal infections, asthma and heart disease. For each drug, the six chemical properties (relative molecular mass, number of hydrogen bond acceptors, number of hydrogen bond donors, number of flexible rotation keys, polar surface area and hydrophobic constant) are collected from the U.S. National Library of Medicine (NIH).

 

 

iii)  Database creation:

The collected information about the six features and the category of each drug is used to create a new database, which consists of 120 drug datasets. Classification algorithms are applied to this database to find the similarity between medicines.
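
As an illustration of what one entry in such a database might look like, the sketch below stores the six properties and the drug category in a small Python record and groups the feature vectors by category. The class name, field names and all numeric values are hypothetical placeholders, not entries from the database described here; the units noted in the comments are likewise assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrugRecord:
    """One database entry: drug name, category and six chemical properties.

    Field names are hypothetical; units in the comments are assumptions.
    """
    name: str
    category: str                    # e.g. "fever" or "typhoid"
    relative_molecular_mass: float   # assumed g/mol
    h_bond_donors: int
    h_bond_acceptors: int
    rotatable_bonds: int             # "flexible rotation keys"
    polar_surface_area: float        # assumed square angstroms
    hydrophobic_constant: float      # assumed to be a logP-style value

    def features(self):
        """Return the six properties as a tuple in a fixed order."""
        return (
            self.relative_molecular_mass,
            self.h_bond_donors,
            self.h_bond_acceptors,
            self.rotatable_bonds,
            self.polar_surface_area,
            self.hydrophobic_constant,
        )


def group_by_category(records):
    """Map each drug category to the list of its feature tuples."""
    grouped = {}
    for record in records:
        grouped.setdefault(record.category, []).append(record.features())
    return grouped


# Synthetic placeholder rows, NOT real measurements.
SAMPLE_RECORDS = [
    DrugRecord("drug_a", "fever", 150.0, 1, 2, 1, 50.0, 0.5),
    DrugRecord("drug_b", "fever", 180.0, 1, 4, 3, 60.0, 1.0),
    DrugRecord("drug_c", "typhoid", 350.0, 3, 7, 4, 90.0, 0.0),
]

if __name__ == "__main__":
    for category, rows in group_by_category(SAMPLE_RECORDS).items():
        print(category, len(rows))
```

Keeping the features in a fixed order makes it straightforward to compute per-feature statistics for each class later on.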

 

4. IMPLEMENTATION OF MEDICAL CLASSIFICATION SYSTEM:

When computing the Naive Bayes algorithm, the fever and typhoid classes are treated as the positive and negative classes respectively.

 

i)     Applying Naive Bayes algorithm:

The Naive Bayes algorithm is applied to the database to predict whether an unknown drug belongs to the fever class or the typhoid class.

 

1. First, retrieve all the attribute values of the fever drugs from the database.

2. Calculate the sum of the relative molecular mass values over all the fever drugs.

3. Compute the mean of the relative molecular mass for the fever medicines using equation (1), where n is the number of fever drugs.

4. Compute the standard deviation of the relative molecular mass attribute for the fever class.

5. Compute the probability density function for the relative molecular mass attribute using equations (2) and (3), where t is the relative molecular mass of the unknown drug.

6. Repeat these steps for the remaining five features and calculate the probability of each feature.

7. Compute the total probability of the fever class as the product of the six feature probabilities:

$$P(\text{fever}) = P_{\text{relative molecular mass}} \times P_{\text{no. of hydrogen bond donors}} \times P_{\text{no. of hydrogen bond acceptors}} \times P_{\text{polar surface area}} \times P_{\text{no. of flexible rotation keys}} \times P_{\text{hydrophobic constant}}$$

By following the same procedure, the probability of the typhoid class is calculated.
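
For reference, the per-feature quantities named in the steps above (mean, standard deviation and probability density) can be written out under the usual Gaussian Naive Bayes formulation. This is a sketch under stated assumptions rather than the authors' exact notation: the symbols $x_i$, $\mu$ and $\sigma$ are introduced here for illustration, the Gaussian form of the density is assumed, and the $n-1$ denominator in the standard deviation is one common choice.

```latex
% Assumed reconstruction for one feature of the fever class.
% x_i : value of the feature for the i-th fever drug (symbol introduced here)
% n   : number of fever drugs
% t   : value of the same feature for the unknown drug
\begin{align*}
\mu &= \frac{1}{n}\sum_{i=1}^{n} x_i \\
\sigma &= \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\mu\right)^2} \\
P(t \mid \text{fever}) &= \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left(-\frac{(t-\mu)^2}{2\sigma^2}\right)
\end{align*}
```

The same three expressions apply to each of the six features and, with the typhoid drugs in place of the fever drugs, to the typhoid class.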

ii)     Ranking the probabilities:

The probability of the unknown drug X occurring in the fever class, P(fever), is compared with the probability of it occurring in the typhoid class, P(typhoid).

If the derived probability for fever is greater than that for typhoid, the unknown drug belongs to the fever class; otherwise it belongs to the typhoid class.
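
The whole procedure, from per-class statistics to the final comparison, can be sketched in a few lines of Python. This is a minimal illustration that assumes a Gaussian density for every feature and, like the steps described above, multiplies the feature probabilities without a class prior. All function names and the training rows are hypothetical; the numbers are synthetic placeholders, not values from the database used in this paper.

```python
import math
from statistics import mean, stdev


def gaussian_pdf(t, mu, sigma):
    """Gaussian density of value t for a feature with mean mu and std sigma."""
    exponent = -((t - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * sigma)


def fit_class(rows):
    """Return one (mean, sample std) pair per feature for the rows of a class."""
    columns = zip(*rows)  # transpose: one tuple per feature
    return [(mean(col), stdev(col)) for col in columns]


def class_probability(x, params):
    """Product of the per-feature densities of x under one class."""
    p = 1.0
    for t, (mu, sigma) in zip(x, params):
        p *= gaussian_pdf(t, mu, sigma)
    return p


def classify(x, fever_params, typhoid_params):
    """Pick the class with the larger product; ties go to typhoid."""
    p_fever = class_probability(x, fever_params)
    p_typhoid = class_probability(x, typhoid_params)
    label = "fever" if p_fever > p_typhoid else "typhoid"
    return label, p_fever, p_typhoid


# Synthetic placeholder rows (mass, donors, acceptors, rotation keys,
# polar surface area, hydrophobic constant). NOT real drug data.
FEVER_ROWS = [
    (150.0, 1, 2, 1, 50.0, 0.5),
    (180.0, 1, 4, 3, 60.0, 1.0),
    (210.0, 2, 3, 2, 70.0, 1.5),
]
TYPHOID_ROWS = [
    (320.0, 2, 6, 3, 75.0, -0.5),
    (350.0, 3, 7, 4, 90.0, 0.0),
    (380.0, 4, 8, 5, 105.0, 0.5),
]

FEVER_PARAMS = fit_class(FEVER_ROWS)
TYPHOID_PARAMS = fit_class(TYPHOID_ROWS)

if __name__ == "__main__":
    unknown = (175.0, 1, 3, 2, 58.0, 0.9)
    print(classify(unknown, FEVER_PARAMS, TYPHOID_PARAMS))
```

With many features the product of densities can become very small, so implementations often sum the logarithms of the densities instead; the comparison between the two classes is unchanged.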

 

5. EXPERIMENTAL RESULTS:

The implemented system gives the computation results shown in Figure 2 when the input is Aspirin, a medicine used to reduce fever.

Figure 2: Output screen of medical classification system

The obtained probability of the fever class is greater than that of the typhoid class; therefore Aspirin belongs to the fever class. The proposed system classifies the given data into two classes, fever and typhoid, and obtains an accuracy, recall and precision of 80%, 83% and 72% respectively.

Figure 3: Experimental results of implemented system
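
Accuracy, recall and precision for a binary classifier follow from the counts of true and false positives and negatives, with fever as the positive class and typhoid as the negative class. The sketch below shows the standard definitions; the function name is hypothetical and the counts in the example are made-up placeholders chosen for easy arithmetic, not the confusion matrix behind the figures reported in this paper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    tp: fever drugs predicted as fever
    fp: typhoid drugs predicted as fever
    fn: fever drugs predicted as typhoid
    tn: typhoid drugs predicted as typhoid
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


if __name__ == "__main__":
    # Placeholder counts for illustration only.
    accuracy, precision, recall = binary_metrics(tp=8, fp=2, fn=2, tn=8)
    print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```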

 

 

 

6. CONCLUSION:

Classification is a data mining technique that predicts predetermined classes and is useful in solving many real-time problems. This paper explains the Naive Bayes classification method and its use in the classification of drugs. The proposed system is implemented using Naive Bayes to classify medicines used against two diseases, fever and typhoid. In experiments with the implemented medical classification system, the accuracy obtained is 80%, the recall is 83% and the precision is 72%. A clear understanding of the new database is needed for effective classification of medicines. The computation of the Naive Bayes algorithm for binary classification is explained in detail. Future work can extend the system from binary classification to a larger number of diseases, and other algorithms such as SVM can be used for classifying medicines.

 

 

 

7. REFERENCES:

1. D. G. Huang et al., Chemical Medicine Classification through Chemical Properties Analysis, IEEE Access, 5, pp. 1618-1623, 2017.

2. Xin-Chen et al., Medicine Composition Concentration Analysis Based on Least Square Support Vector Machine, Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, 18-21 August 2005.

3. N. Pise et al., Algorithm selection for classification problems, In SAI Computing Conference (SAI), 2016 (pp. 203-211), IEEE, 2016.

4. Durgesh K. S. et al., Data classification using support vector machine, Journal of Theoretical and Applied Information Technology, 12(1), pp. 1-7, JATIT, 2010.

5. D. Galeano et al., Drug targets prediction using chemical similarity, In Computing Conference (CLEI), 2016 XLII Latin American (pp. 17), IEEE, 2016.

6. P. R. Krishna et al., A new approach to mining fuzzy databases using nearest neighbor classification by exploiting attribute hierarchies, International Journal of Intelligent Systems, 19(12), pp. 1277-1290, 2004.

7. G. Pratiyush et al., Classifying Educational Data Using Support Vector Machines: A Supervised Data Mining Technique, Indian Journal of Science and Technology, 9(34), 2016.

8. C. Liu et al., Classification of traditional Chinese paintings based on supervised learning methods, In Signal Processing, Communications and Computing (ICSPCC), 2014 IEEE International Conference (pp. 641-644), IEEE, 2014.

9. Adi A. O. et al., Classification of 20 News Group with Naive Bayes Classifier, In Signal Processing and Communications Applications Conference (SIU), 2014 22nd (pp. 2150-2153), IEEE, April 2014.

10. Y. Huang et al., Naive Bayes classification algorithm based on small sample set, In Cloud Computing and Intelligence Systems (CCIS), 2011 IEEE International Conference on (pp. 34-39), IEEE, 2011.

11. J. Ren et al., Naive Bayes classification of uncertain data, In Data Mining, 2009. ICDM'09. Ninth IEEE International Conference on (pp. 944-949), IEEE, 2009.

 

 

 

 

 

 

Received on 30.12.2017           Modified on 20.02.2018

Accepted on 05.03.2018          © RJPT All rights reserved

Research J. Pharm. and Tech 2018; 11(5):1940-1944.

DOI: 10.5958/0974-360X.2018.00360.8