Predictive Analysis of Drug side effects using Computational Intelligence Methods
Siddharth Reddy1, Perepi Rajarajeswari2*, Rithin Sai Kommineni3
1U.G Student, School of Computer Science and Engineering,
Vellore Institute of Technology, Vellore, Tamil Nadu India.
2Associate Professor, Department of Software Systems,
School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India.
3U.G Student, School of Computer Science and Engineering,
Vellore Institute of Technology, Vellore, Tamil Nadu India.
*Corresponding Author E-mail: siddharth.sanna1@gmail.com, rajacse77@gmail.com, rithinsaikommineni@gmail.com
ABSTRACT:
Drug evaluation and safety play a crucial role in the development and use of therapeutically effective medications. Traditionally, randomized controlled trials have been the gold standard for assessing drug efficacy and safety. However, these trials often enroll a limited number of participants who meet specific eligibility criteria and who may not fully represent the diversity of the target population. Despite ongoing efforts to predict toxicity, accurately forecasting drug side-effects remains difficult. In this research paper, we propose an approach that leverages side-information sources and compares state-of-the-art machine learning techniques to enhance prediction accuracy. A data analysis pipeline is implemented to obtain relevant side-information for the prediction task. The prediction problem is formulated as a machine learning task to predict side effects for new drugs. We compare the prediction accuracies of linear and non-linear machine learning methods across ten different side-effects.
KEYWORDS: Machine Learning, Side-effect Prediction, Supervised Learning.
1. INTRODUCTION:
Post-marketing drug surveillance can take different forms, including systems established by government regulators and by public or private organizations. Research initiatives and reports on drug events focus on drug side effects reported through various websites1. Historically, methodologies for identifying adverse events and side effects have relied on statistical measures and categorical variables, often overlooking the sentiment expressed by users through their actual reviews.
However, with the advent of deep learning-based methods, post-market drug surveillance has advanced significantly. These methods leverage techniques such as the extraction of temporal events, performed procedures and social circumstances to identify adverse events and predict side effects. Additionally, they have been used to analyze drug compounds and molecular compositions for predicting adverse side effects. Leveraging these reviews presents an opportunity to detect and monitor drug side effects more comprehensively.
Nevertheless, extracting sentiment from online medical reviews presents several challenges. User reviews in online forums often lack conventional structure and may be written by individuals without medical knowledge, which limits the meaningful extraction of information. Moreover, many review websites employ numerical ratings as a means of quantifying sentiment. However, the interpretation of these ratings can vary among users, potentially introducing biases. In some cases, users may rate all qualities as highly important, resulting in overtly positive ratings that may not accurately reflect the drug's true efficacy or side effect profile. Such biases can lead to a skewed perception of certain drugs, particularly addictive ones, and hinder the usefulness of ratings for different population subgroups. Despite these challenges, publicly available information on the Internet provides a valuable resource for gaining deep insights from user reviews of drugs. Unlike other medical data, these reviews are not filtered through medical professionals and are freely available while preserving patient confidentiality. The motivation for this article stems from the pressing need to enhance the evaluation of drug safety and efficacy. Post-marketing drug surveillance systems have been established to address these limitations, but there is still room for improvement.
This research work leverages machine learning and natural language processing techniques to analyze user-generated drug reviews and extract valuable insights regarding drug effectiveness and side effects. By tapping into the vast amount of publicly available information on the internet, this research aims to enhance the understanding of drug experiences and sentiments expressed by users. It seeks to contribute to the field of drug evaluation by developing a predictive model that can classify user ratings based on textual reviews, identifying contingencies and distinguishing overtly positive or negative scores. Through this analysis, the research aims to provide healthcare professionals and regulatory authorities with a deeper understanding of drug safety profiles and to help optimize decision-making processes regarding drug usage and regulation2.
By utilizing advanced computational techniques, the objective of this work is to overcome the limitations of traditional methodologies and uncover valuable insights that may not have been captured through statistical measures alone. The ultimate goal is to enhance drug safety, improve patient outcomes, and contribute to the development of more effective and reliable medications3.
In summary, the contribution of this research paper lies in addressing the limitations of traditional drug evaluation methods and harnessing machine learning and natural language processing to extract meaningful information from user-generated drug reviews. By doing so, we aim to advance drug safety evaluation and provide valuable insights for healthcare professionals, regulatory authorities, and the pharmaceutical industry.
The main research question addressed by this research work is whether the chemical descriptors (known as fingerprints) of drugs and drug indications provide valuable insights into the reported common side effects.
The paper is organized as follows: related work is presented in Section 2, and Section 3 describes the proposed methodology. Section 4 gives the results of this research work, and Section 5 concludes the paper.
2. RELATED WORK:
To address this gap, the study introduces Galen OWL4, a semantic-enabled online framework designed to assist healthcare professionals in accessing detailed drug information. The framework recommends medications for patients based on their specific conditions, allergies, and potential drug interactions. Clinical information is integrated by converting clinical data and terminology to ontological terms using global standards such as ICD-10 and UNII. Leilei Sun conducted an analysis of large-scale treatment records to determine the optimal treatment prescription for patients. An efficient semantic clustering algorithm was used to estimate similarities between treatment records and to assess the effectiveness of the suggested treatment. Based on demographic information and medical complications, this framework can recommend the best treatment regimens for new patients. Electronic medical records (EMR) collected from multiple clinics were used for testing, and the results showed an improvement in the cure rate. Another study by Mohammad Mehedi Hassan et al. introduced a cloud-assisted drug recommendation system (CADRE) that suggests drugs based on patient symptoms and related prescriptions. The initial approach was based on collaborative filtering techniques, but due to limitations such as computational cost, cold start, and data sparsity, the model shifted to a cloud-assisted approach using tensor decomposition to enhance the quality of medication recommendations. Jiugang Li and colleagues5 approached the task from the perspective of sentiment analysis, and their results were superior to those of traditional models such as SVM and standard RNN. A risk level classification method was proposed to assess the patient's immunity, considering factors such as hypertension, alcohol addiction, and other risk factors6. A web-based prototype system was developed to assist doctors in selecting first-line drugs. Xiaohong Jiang et al. compared three different algorithms for the medication recommendation module and selected one on the basis of its high accuracy, efficiency and scalability. An error-checking system was also proposed to ensure diagnosis accuracy and service quality. In summary, the studies mentioned in this section contribute to the development of drug recommendation frameworks by incorporating sentiment analysis, semantic clustering, collaborative filtering, and machine learning techniques. These advancements aim to enhance the accuracy, efficiency, and personalization of drug recommendations based on patient-specific information.
3. PROPOSED METHODOLOGY:
The proposed system is a standalone software application designed to predict drug side-effects using machine learning algorithms. It serves as a tool within the pharmaceutical domain, assisting researchers, data scientists, and healthcare professionals in evaluating the potential side-effects of drugs7. The system operates independently, providing an efficient and accurate prediction mechanism that complements existing drug development processes. It integrates into the drug research workflow, allowing for easy incorporation of its prediction capabilities. The system supports diverse data sources, enabling the use of drug information, disease indications, and structural features for side-effect prediction. It is scalable to handle large datasets and accommodate an increasing number of drugs and side-effects. With a user-friendly interface, the system facilitates easy interaction, data input, model selection, and result visualization. Regular maintenance and upgrades ensure the system remains up-to-date with the latest advancements in machine learning and drug research8. It prioritizes compliance with regulatory standards and data security measures to protect sensitive information. Overall, the system enhances prediction accuracy, streamlines the evaluation process, and supports informed decision-making in drug development9.
The system is designed under the following assumptions:
1. The dataset used for training the machine learning models is representative and accurately labelled.
2. The side-effects predicted by the models are based on correlations and patterns found in the training data and may not capture all possible side-effects or account for rare occurrences.
3. The system assumes that the input drug information, disease indications, and structural features are reliable and up-to-date.
4. The machine learning algorithms used in the system assume that the data is well pre-processed and does not contain significant noise or missing values.
The system has the following dependencies:
1. The system depends on a robust and efficient machine learning framework or library to implement the linear regression, linear SVM, and random forest algorithms.
2. The accuracy of the predictions relies on the availability of relevant and accurate side-information sources such as drug information databases and disease indication databases.
3. The system relies on a stable and secure computing environment with sufficient computational resources to train and run the machine learning models effectively.
Fig 1. Architecture diagram
The proposed work has three modules, as follows:
· Data preprocessing
· Normalizing numeric data
· Partitioning dataset
1) Data pre-processing is the act of preparing raw data for use in a machine learning model. It is the first and most crucial step in developing a machine learning model: the data must be cleaned and formatted before any operation is performed on it. We therefore apply a data pre-processing task for this purpose.
It consists of the following steps:
• Acquiring the dataset.
• Importing libraries.
• Importing datasets.
• Identifying missing data.
• Coding categorical data.
• Splitting the dataset into training and test sets.
• Feature scaling.
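Two of the steps listed above, identifying missing data and splitting the dataset, can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the record fields (`drug`, `review`, `rating`), the helper names, and the 25% test fraction are assumptions made for the example and are not taken from the original pipeline.

```python
import random


def find_missing(rows, fields):
    """Return the indices of rows in which any required field is None or empty."""
    return [
        i for i, row in enumerate(rows)
        if any(row.get(f) in (None, "") for f in fields)
    ]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records; the columns of the real dataset may differ.
reviews = [
    {"drug": "A", "review": "worked well", "rating": 9},
    {"drug": "B", "review": "", "rating": 3},
    {"drug": "C", "review": "mild headache", "rating": 6},
    {"drug": "D", "review": "no effect", "rating": None},
    {"drug": "E", "review": "helped a lot", "rating": 8},
    {"drug": "F", "review": "felt dizzy", "rating": 4},
]

# Drop incomplete rows, then hold out a quarter of the clean rows for testing.
missing = find_missing(reviews, ["review", "rating"])
clean = [row for i, row in enumerate(reviews) if i not in missing]
train, test = train_test_split(clean, test_fraction=0.25, seed=42)
```

Fixing the seed keeps the split reproducible, so that later accuracy figures refer to the same held-out rows.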
2) Normalizing numeric data is a common technique in data preparation for machine learning. Its purpose is to bring the values of numeric columns in a dataset to a shared scale without distorting differences in their ranges of values. Normalization is not required for every dataset; it is needed only when the features have differing ranges.
In machine learning, normalization involves transforming the data to fit within a specific range, such as [0,1], or onto the unit sphere. This process can be beneficial for certain machine learning algorithms, especially those that rely on calculating Euclidean distance.
Standardization is another related concept that involves transforming data to have zero mean and unit variance. This is useful when the data is assumed to follow a normal distribution, as in the case of linear discriminant analysis (LDA).
Normalization and standardization are particularly useful when working with linear models and interpreting the significance of their coefficients. For example, if one variable has values in the hundreds while another variable has values in the 0.01 range, then without normalization or standardization the coefficients produced by a logistic regression model would differ by orders of magnitude simply because of the difference in scale, so their sizes could not be compared directly as measures of importance.
Fig 2. Normalizing numeric data
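The two transformations described above can be illustrated with a short sketch. It rescales values to the range [0, 1] (min-max normalization) and to zero mean and unit variance (standardization). The function names are illustrative assumptions; the sample values are the review-length and useful-count columns of the encoded sample shown later in Section 4.1.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


review_lengths = [372, 335, 637, 338, 32]
useful_counts = [20, 13, 5, 57, 5]

scaled_lengths = min_max_scale(review_lengths)
standardized_counts = standardize(useful_counts)
```

After either transformation the two columns occupy comparable ranges, which is the property that distance-based methods and coefficient comparisons rely on.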
3) Data partitioning: It involves distributing data across multiple tables, disks, or sites to enhance query processing performance and improve database manageability. It can improve query processing in two ways: firstly, by determining in advance which partitions are not needed for a particular query, and secondly, by enabling parallel access to different partitions across multiple disks or sites.
This parallelism enhances I/O performance and can lead to faster query execution. Moreover, data partitioning enhances database manageability by enabling operations such as backup and recovery on specific partition subsets and facilitating loading operations for historical data.
1. Creating training and test datasets: The available data is divided into two sets – one for training a model and the other for evaluating its performance.
2. Training a model on data: Machine learning algorithms are used to build a model that can learn from the training dataset and make predictions based on the provided data.
3. Evaluating the model: The trained model is tested using the separate test dataset to assess its accuracy and performance in predicting outcomes.
Accuracy = (Sum of the diagonal elements of the confusion matrix, top-left to bottom-right / Sum of all elements of the confusion matrix) × 100
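As a worked example of the accuracy formula, the sketch below computes the percentage of correct predictions from a confusion matrix whose rows are true classes and whose columns are predicted classes. The 3-class matrix is hypothetical and is not a result from this study.

```python
def accuracy_from_confusion(matrix):
    """Return accuracy in percent: diagonal sum divided by the sum of all entries."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Hypothetical confusion matrix for three classes.
confusion = [
    [50, 3, 2],
    [4, 40, 6],
    [1, 5, 39],
]

# 129 correct predictions out of 150 samples.
accuracy = accuracy_from_confusion(confusion)
```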
4. RESULTS AND DISCUSSION:
4. 1 PREPROCESSING STAGE:
From the dataset we acquired10, the following inferences could be made in order to separate the data into useful, clean data and anomalies. This was done with the help of box-plots, pie charts and other statistical tools in order to help visualize the data.
Here we see a box plot of the conditions the patients faced vs the length of the review in words.
Another important facet is the rating of the drugs that patients have used and been prescribed to deal with the disease. The patient rating of each drug on a 1-10 scale is plotted against the drug name as a bar chart in fig 4. This is an essential step, as the model algorithm11 takes these values as input to predict the drug based on past user experience. The box plot described above is fig 3.
Fig 4. Patient average rating vs drug
We can see from the distribution below that the review length density peaks between 600 and 700 characters, with some outliers past that11. In this research, we have used a maximum of 800 characters for a single review in order to maintain efficiency, as shown below in fig 5.
The data has to be transformed into numerical values so that the model can process it. This can be done with the help of the LabelEncoder function. The usefulness of each review is also represented numerically12, as shown below in fig 6.
| Index | Drug name | Rating | Useful count | Review-length |
| 72016 | 457 | 7 | 20 | 372 |
| 143086 | 413 | 6 | 13 | 335 |
| 29091 | 329 | 2 | 5 | 637 |
| 122667 | 28 | 10 | 57 | 338 |
| 121323 | 267 | 4 | 5 | 32 |
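The integer codes in the drug name column above are the kind of output that label encoding produces. The sketch below mimics the usual behaviour of a label encoder, which assigns consecutive integers to the sorted set of distinct labels, using only the standard library. The drug names and function names are hypothetical and serve only to illustrate the mapping.

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(labels))
    return {label: code for code, label in enumerate(classes)}


def transform(labels, mapping):
    """Replace every label by its integer code."""
    return [mapping[label] for label in labels]


# Hypothetical drug names, not taken from the dataset.
drug_names = ["Ibuprofen", "Aspirin", "Metformin", "Aspirin"]

mapping = fit_label_encoder(drug_names)
codes = transform(drug_names, mapping)
```

The codes carry no ordering information about the drugs themselves; they are identifiers that let the model consume a categorical column.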
4.2. Algorithmic Implementation Results:
4.2.1. Logistic Regression:
This research work focuses on solving a multiclass classification problem with a total of 10 classes. The main objective is to classify or predict the class label for a new data point13. Initially, we employed a logistic regression model for this task, which yielded an accuracy of 79.83%, as shown below in fig 7. To evaluate the model's performance, we also generated a confusion matrix, providing insights into the classification results.
The confusion matrix showing the accuracy on train and test data for logistic regression with BOW is shown below in fig 8.
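The BOW (bag-of-words) featurization referred to above turns each review into a vector of word counts over a fixed vocabulary. A minimal standard-library sketch follows; the tokenization rule, the function names, and the two example reviews are assumptions for illustration and do not reproduce the exact featurization used in the experiments.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Assign a column index to every distinct word, in sorted order."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, n in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = n
    return vector


# Hypothetical reviews.
docs = ["No side effects, no nausea", "Severe nausea and headache"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
```

Word order is discarded in this representation, which is why it pairs naturally with linear classifiers that weight each word independently.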
4.2.2 Linear Support vector machine algorithm:
The Linear SVM algorithm produced a result very similar to that of logistic regression with BOW, with an accuracy of 79.8% on test data and 92% on train data. The accuracy versus hyperparameter graph is shown below in fig 9.
Fig 9. Accuracy VS Hyperparameter.
The confusion matrix along with the accuracies for Linear SVM with BOW is shown in fig 10.
Train accuracy: 91.5 Test accuracy: 79.7
4.2.3 Random Forest Algorithm:
With logistic regression, we observed some misclassified data points, as depicted in Figure and as indicated by the confusion matrix. In an attempt to improve the results, we implemented a Linear SVM model with hyperparameter tuning. However, the SVM model did not yield significant improvements, and we obtained a similar accuracy of 79.8%. Since relying solely on linear models might not be sufficient, we decided to explore tree-based models. The random forest model, however, did not perform substantially differently from the linear models, yielding an accuracy of 80.1%, as shown below in fig 11. We therefore continue to explore alternative approaches to enhance the classification performance.
The confusion matrix along with the train and test accuracies of random forests with BOW is shown in fig 12.
4.3. Experimental Analysis:
After thorough experimentation and evaluation, we arrived at a promising solution. The random forest model slightly outperformed the linear models, achieving an accuracy of about 80%. Based on these results, we conclude that the random forest model is the preferred choice for our research. Figure 13 below shows the comparative performance of the models. We have therefore decided to deploy the system using the random forest model, considering its slightly better performance in comparison to the other two models.
| Index | Featurization | Model | Train accuracy | Test accuracy |
| 0 | BOW | Logistic Regression | 0.9801 | 0.7973 |
| 1 | BOW | Linear SVM | 0.9187 | 0.7985 |
| 2 | BOW | Random Forest | 1.0000 | 0.8055 |
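A comparison of this kind could be produced along the following lines with scikit-learn. This is only a sketch under stated assumptions: the toy two-class corpus, the split ratio, and all hyperparameter values are invented for illustration, and the original experiments used ten classes and real reviews, so the accuracies it yields are not comparable to the table above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: label 1 for favourable reviews, 0 for unfavourable ones.
good = ["great relief", "worked well", "no side effects", "felt much better"]
bad = ["severe nausea", "bad headache", "felt dizzy", "no relief at all"]
texts = [f"{a} and {b}" for a in good for b in good]
texts += [f"{a} and {b}" for a in bad for b in bad]
labels = [1] * 16 + [0] * 16

# Hold out a stratified quarter of the reviews for testing.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit the bag-of-words vocabulary on the training reviews only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear SVM": LinearSVC(C=1.0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Record (train accuracy, test accuracy) for each model.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (
        accuracy_score(y_train, model.predict(X_train)),
        accuracy_score(y_test, model.predict(X_test)),
    )
```

Reporting train and test accuracy side by side, as the table does, makes it easy to see how far each model's fit to the training data exceeds its performance on unseen reviews.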
5. CONCLUSION AND FUTURE WORK:
Machine learning-based methods have breathed new life into the field of drug development. These methods have found applications in various stages of the process, including target identification, lead compound selection, synthesis, and protein-ligand interactions. Machine learning algorithms enable enhanced data querying, analysis, and generation, changing the way drug discovery is approached. One prominent application of machine learning is target identification, where existing omics and medical data are analyzed and explored to identify potential targets. In this research work, we have achieved a prediction accuracy of 80.1% using the random forest model. Overall, the integration of machine learning in drug development holds promise for accelerating the discovery of new drugs and improving their efficacy. With continued research and advancements in ML techniques, further progress can be expected in this field. However, it should be noted that the current study assumes independence among side-effects, and future research should explore modelling the joint relationships between multiple side-effects to improve predictions and gain a deeper understanding of the underlying biological mechanisms. Some researchers have focused on IoT-based healthcare systems14,15,16,17,18,19. Others have focused on drugs for various health issues20,21.
6. CONFLICT OF INTEREST:
No conflict of Interest.
7. REFERENCES:
1. Wittich CM, Burkle CM, Lanier WL et al. Medication errors: an overview for clinicians. Mayo Clin Proc. 2014; 89(8). https://doi.org/10.1016/j.mayocp.2014.05.007
2. Chen, M. R., and Wang, H. F et al. The reason and prevention of hospital medication errors. Practical Journal of Clinical Medicine. 2023; 11(5). https://doi.org/10.15680/IJIRCCE.2023.1105001
3. Paralakhemundi et al. Power and Embedded System (SCOPES), 2016. https://doi.org/10.1109/SCOPES.2016.7955684
4. Y. Bao and X. Jiang et al. An intelligent medicine recommender system framework. IEEE 11th Conference on Industrial Electronics and Applications, 2016. https://doi.org/10.1109
5. Shimada K, Takada H, Mitsuyama S, et al. Drug-recommendation system for patients with infectious diseases. AMIA Annu Symp Proc. 2005. https://pmc.ncbi.nlm.nih.gov/articles/PMC1560833/
6. Galeano, Alberto Paccanaro et al. Machine learning prediction of side effects for drugs in clinical trials. Cell Reports Methods. 2022; 12(2). https://doi.org/10.1016/j.crmeth.2022.10035
7. A. Poleksic, L. Xie et al. Predicting serious rare adverse reactions of novel chemicals. Bioinformatics. 2018; 34. https://doi.org/10.1093/bioinformatics/bty193
8. D.S. Wishart, Y.D. Feunang, A.C. Guo, E.J. Lo, A. Marcu, J.R. Grant, T. Sajed, D. Johnson, C. Li, Z. Sayeeda, et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2018; 46. https://doi.org/10.1093/nar/gkx1037
9. S.d.S. Santos, M. Torres, D. Galeano, M.D.M. Sánchez, L. Cernuzzi, A. Paccanaro et al. Machine learning and network medicine approaches for drug repositioning for covid-19. Patterns. 2022; 3. https://doi.org/10.1016/j.patter.2021.100396
10. Galeano, D., Paccanaro et al. A Machine Learning Prediction of Side effects for Drugs in Clinical Trials. Mendeley. 2022; 2(12). https://doi.org/10.1016/j.crmeth.2022.100358
11. Zixiao Jin, Minhui Wang, Xiao Zheng, Jiajia Chen, Chang Tang et al. Drug side effects prediction via cross attention learning and feature aggregation. Expert Systems with Applications. 2024; 248. https://doi.org/10.1016/j.eswa.2024.123346
12. Ding Y. et al. Identification of drug-side effect association via multiple information integration with centered kernel alignment. Neuro Computing. 2019; 23(6). https://doi.org/10.1109/JBHI.2018.2883834
13. Berlin J.A. et al. Adverse event detection in drug development: recommendations and obligations beyond phase 3. American Journal of Public Health. 2008; 98(8). https://doi.org/10.2105/AJPH.2007.124537
14. Manisha Chandrakar, V. K. Patle et al. Security Issue in IoT Based Architecture for Health Care System. 2020; 11(2). https://doi.org/10.5958/2321-581X.2020.00016.1
15. Hema Malini S et al. Efficient Cloaked Face Recognition Methodology throughout The Covid-19 Pandemic. 2021; 12(3). https://doi.org/10.52711/2321-581X.2021.00014
16. Yogesh Devaraj et al. Implications of Emulating a Dermatologist: A Study of Topical medication usage for dermatoses prescribed by Non-Dermatologists in a rural area. 2024; 22. https://doi.org/10.52711/0974-360X.2024.00236
17. Mesi Leorita, Zullies Ikawati, Agung Endro Nugroho, Ismail Setyopranoto et al. Comparison of the Efficacy and Tolerability of Candesartan Cilexetil between Hypertension patients of Muna and Tolaki Ethnicity. 2024; 17(4). https://doi.org/10.52711/0974-360X.2024.00238
18. Shreya Bhatia et al. Ayurvedic Management of Refractory Atopic Dermatitis. Case Report. 2024; 17(4). https://doi.org/10.52711/0974-360X.2024.00239
19. Jaymin Patel, Kaushika Patel, Shreeraj Shah et al. Methacrylic Acid Co-Polymers: Crucial agents for the Colon Targeted Oral Drug Delivery System. https://doi.org/10.52711/0974-360X.2024.00242
20. Lin Mosbah Katramiz, Doaa Kamal Alkhlaidi, Muneeb Ahsan, Dujana Mostafa Hamed et al. Physicians’ Perceptions regarding the Role of Vitamin D in COVID-19 Management: A Qualitative Study. 2024; 11(2). https://doi.org/10.52711/0974-360X.2024.00245
21. Grishma Patel, Rajnikant Maradia, Tejal Soni, Bhanubhai Suhagia, Dhananjay Meshram et al. Development and Validation of UV Spectrophotometric Method for Simultaneous Estimation of some SGLT-2 and DPP-4 inhibitor in Bulk and Pharmaceutical Dosage Form. https://doi.org/10.52711/0974-360X.2024.00254
Received on 08.05.2024 Revised on 10.09.2024 Accepted on 16.11.2024 Published on 12.06.2025 Available online from June 14, 2025 Research J. Pharmacy and Technology. 2025;18(6):2459-2465. DOI: 10.52711/0974-360X.2025.00351 © RJPT All right reserved
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.