Enhancing healthcare diagnostics with artificial intelligence: A hybrid machine learning model for early disease detection
*Corresponding author: Zerai Hagos, Department of Global Health, Euclid University, Bangui, Central African Republic. hagos@euclidfaculty.net
How to cite this article: Hagos Z. Enhancing healthcare diagnostics with artificial intelligence: A hybrid machine learning model for early disease detection. Karnataka Med J. 2025;48:2-5. doi: 10.25259/KMJ_22_2025
Abstract
Introduction:
Artificial intelligence (AI) is transforming healthcare by enabling early and precise detection of chronic diseases, which is crucial for improving patient outcomes and reducing healthcare costs.
Material and Methods:
This paper presents a hybrid machine learning model that combines convolutional neural networks (CNNs) for advanced feature extraction with random forests for robust classification to diagnose chronic diseases such as diabetes, cardiovascular conditions and hypertension using electronic health records. The model was rigorously evaluated on a comprehensive dataset comprising 10,000 anonymised patient records from a public health repository.
Results:
The model achieved an impressive accuracy of 92.5%, sensitivity of 90.1%, specificity of 93.4% and F1-score of 90.8%.
Conclusion:
By integrating the strengths of CNNs in handling complex patterns in structured data and random forests in providing interpretable and efficient decision-making, the model not only balances high accuracy with computational efficiency but also demonstrates superior performance compared to standalone models. This study contributes significantly to the field of AI-driven healthcare solutions by offering a scalable, practical framework for early diagnosis, which can be seamlessly integrated into clinical workflows to enhance patient outcomes and support precision medicine initiatives.
Keywords
Artificial intelligence
Chronic diseases
Convolutional neural networks
Early disease detection
Electronic health records
Healthcare diagnostics
Machine learning
Random forests
INTRODUCTION
Chronic diseases, including diabetes, cardiovascular diseases and hypertension, account for a substantial portion of global morbidity and mortality, and early detection is pivotal for effective management and for reducing costs in healthcare systems (World Health Organization, 2023).[1] Traditional diagnostic methods often depend on manual analysis of patient data, which is time-consuming and susceptible to human error and variability. Artificial intelligence (AI), particularly machine learning (ML) techniques, has emerged as a powerful tool to automate and enhance these diagnostic processes, leading to more accurate and timely interventions.[2]
This paper proposes an innovative hybrid ML model that synergistically combines convolutional neural networks (CNNs) and random forests to facilitate early detection of chronic diseases using electronic health records (EHRs). CNNs excel in extracting hierarchical features from structured data, mimicking the way human vision processes information, while random forests provide robust classification through ensemble learning, reducing overfitting and improving generalisation.[3] The integration of these techniques addresses the limitations of individual models, such as the high computational demands of CNNs and the potential lack of deep feature learning in random forests. Our approach builds on recent advancements in hybrid models for disease prediction, aiming to offer a scalable solution suitable for real-world clinical environments.[4,5]
The remainder of this paper is organised as follows: Section 2 reviews related work in AI applications for healthcare diagnostics. Section 3 details the methodology, including dataset description, model architecture, training process and evaluation metrics. Section 4 presents the results, followed by a comprehensive discussion in Section 5. Finally, Section 6 concludes the study and suggests directions for future research.
AIMS AND OBJECTIVES
To determine the efficacy of AI in early diagnosis of chronic diseases and to develop a hybrid model using machine learning.
Related work
The application of AI in healthcare diagnostics has seen rapid growth, with numerous studies demonstrating the efficacy of ML and deep learning (DL) models in predicting and detecting chronic diseases. For instance, CNNs have been widely used for feature extraction from medical imaging and structured data, achieving dermatologist-level accuracy in skin cancer classification (Esteva et al., 2017).[6] Similarly, random forests have proven effective in handling EHR-based predictions due to their interpretability, efficiency and ability to manage high-dimensional data (Breiman, 2001).[3]
Hybrid models combining CNNs with other algorithms have shown promising results in addressing the complexities of chronic disease detection. Zhou et al. (2023)[5] proposed a CNN-AdaBoost and random forest (ABRF) model integrating CNN for feature extraction from Electronic Medical Records (EMRs) and an ensemble of ABRF for classification, achieving high precision (89.28%) and recall (88.89%) on a dataset of 18,590 Chinese EMRs covering 10 chronic diseases. This model excelled in handling imbalanced data and reducing misdiagnosis for diseases with similar symptoms.
Salman and Gupta (2023)[4] developed a hybrid framework using random forest for feature extraction and logistic regression for classification, improving prediction accuracy for chronic diseases by addressing outliers and feature selection issues. Their approach was implemented in Python and emphasised economic, social and epidemiological factors.
Suman et al. (2024)[7] introduced a hybrid CNN-bidirectional long short-term memory (BiLSTM) model with principal component analysis (PCA) for feature reduction, achieving accuracies up to 98.37% for heart disease, 83.76% for diabetes and 95% for kidney disease on University of California, Irvine (UCI) datasets, outperforming traditional ML algorithms by approximately 10%.
Shambharkar et al. (2023)[8] proposed a CNN-K-nearest neighbors (KNN) hybrid for early chronic disease detection, utilising the CNN for automatic feature extraction and KNN for distance-based classification, resulting in 95% accuracy, 94% precision and 99% recall, surpassing Naïve Bayes, decision tree and logistic regression.
A comprehensive review by Sadr et al. (2025)[9] synthesises ML and DL applications across 16 diseases, highlighting CNNs and hybrid models in enhancing diagnostic accuracy for conditions such as diabetes, cardiovascular diseases and kidney disease, while noting challenges in data quality and interpretability.
Other notable works include Chollet (2018)[10] on DL fundamentals and Rajkomar et al. (2018)[2] on scalable DL with EHRs. Our hybrid CNN-random forest model builds on these foundations, optimising for both accuracy and efficiency in EHR-based diagnostics.
MATERIAL AND METHODS
Dataset
The study employs a dataset of 10,000 anonymised EHRs sourced from a public health repository, similar to those used in prior research (Rajkomar et al., 2018).[2] The dataset encompasses a wide range of features, including demographic information (age and gender), vital signs (blood pressure and heart rate), laboratory results (glucose levels, cholesterol and glycated haemoglobin) and medical history indicators (previous diagnoses and medications). Preprocessing comprises three steps: handling missing values through mean imputation for numerical features and mode imputation for categorical ones, normalising numerical data to a [0, 1] range using Min-Max scaling, and one-hot encoding categorical variables to prevent ordinal assumptions. These steps leave the data clean, consistently scaled and ready for model input, addressing common issues in EHR data such as incompleteness and variability (Zhou et al., 2023).[5]
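The preprocessing steps described above can be illustrated with a minimal sketch in plain Python. The helper names and the toy `glucose` and `gender` columns are hypothetical and are not taken from the study's dataset or code; they only show how mean/mode imputation, Min-Max scaling and one-hot encoding behave on a small example.

```python
from collections import Counter


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def impute_mode(values):
    """Replace None entries with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode categories as indicator vectors, one column per category."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


# Hypothetical toy columns (not from the study's data).
glucose = [90.0, None, 150.0, 120.0]
gender = ["F", "M", None, "F"]

glucose_scaled = min_max_scale(impute_mean(glucose))
gender_encoded, gender_categories = one_hot(impute_mode(gender))
```

In practice, the imputation and scaling statistics would normally be computed on the training split only and then applied to the validation and test splits, so that no information leaks from held-out records.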
Model architecture
The proposed hybrid model integrates a CNN for deep feature extraction and a random forest for classification, drawing inspiration from successful hybrids like CNN-ABRF (Zhou et al., 2023).[5] The CNN component consists of three convolutional layers with 32, 64 and 128 filters, respectively, each with a kernel size of 3 and ReLU activation, followed by a max-pooling layer (pool size 2) to reduce spatial dimensions and computational load. A dropout layer (rate 0.5) is incorporated after each pooling layer to mitigate overfitting. The output features from the CNN are flattened into a 1D vector and fed into the random forest classifier, comprising 100 decision trees with a maximum depth of 10 to balance complexity and generalisation (Breiman, 2001).[3] This architecture leverages the CNN’s ability to capture local patterns in EHR data (e.g., temporal trends in vital signs) and the random forest’s ensemble voting for robust, interpretable predictions.
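To make the dimensions of this architecture concrete, the sketch below computes the length of the flattened feature vector that would be passed to the random forest. It rests on several assumptions not stated in the text: the convolutions are one-dimensional over the feature vector with stride 1, pooling uses a stride equal to the pool size with floor division, and the input length of 40 features is purely hypothetical. The function names are illustrative.

```python
def conv1d_output_length(length, kernel_size=3, padding="same"):
    """Output length of a stride-1 1D convolution."""
    if padding == "same":
        return length
    if padding == "valid":
        return length - kernel_size + 1
    raise ValueError("padding must be 'same' or 'valid'")


def pool_output_length(length, pool_size=2):
    """Output length of max pooling with stride equal to pool_size."""
    return length // pool_size


def cnn_feature_vector_length(input_length, filters=(32, 64, 128),
                              kernel_size=3, pool_size=2, padding="same"):
    """Length of the flattened output after the conv + pool blocks."""
    length = input_length
    for _ in filters:
        length = conv1d_output_length(length, kernel_size, padding)
        length = pool_output_length(length, pool_size)
        if length < 1:
            raise ValueError("input too short for this many conv/pool blocks")
    # Flattening multiplies the remaining positions by the last filter count.
    return length * filters[-1]


# Hypothetical input of 40 preprocessed features per record.
flattened_same = cnn_feature_vector_length(40, padding="same")    # 5 * 128
flattened_valid = cnn_feature_vector_length(40, padding="valid")  # 3 * 128
```

Dropout does not change the shape, so it is omitted from the calculation. The result shows that each record is represented by a few hundred learned features before classification, which a forest of 100 depth-limited trees can handle comfortably.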
Training process
The dataset is split into 70% training, 15% validation and 15% testing sets. The CNN is trained using the Adam optimizer with a learning rate of 0.001 and a batch size of 32 for up to 50 epochs, with validation loss monitored and early stopping (patience=10) applied to prevent overfitting. Hyperparameters for both the CNN (e.g. filter sizes) and the random forest (e.g. number of trees) are tuned using grid search with 5-fold cross-validation. CNN training runs on a GPU-enabled system for acceleration, while random forest training is central processing unit (CPU)-based for efficiency. This process ensures the model generalises well, similar to approaches in Shambharkar et al. (2023).[8]
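The split and the early-stopping rule can be sketched as follows. This is an illustration under stated assumptions rather than the study's training code: the random seed, the function names and the list of validation losses are hypothetical, and the stopping rule shown is the common "no improvement for `patience` consecutive epochs" criterion.

```python
import random


def split_indices(n_records, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle record indices and split them into train/validation/test."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_train = int(round(n_records * train_frac))
    n_val = int(round(n_records * val_frac))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


def early_stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Validation loss kept improving often enough; all epochs were run.
    return len(val_losses)


train_idx, val_idx, test_idx = split_indices(10_000)

# Hypothetical validation-loss curve: improves for 3 epochs, then plateaus.
stop_epoch = early_stopping_epoch([1.0, 0.8, 0.7] + [0.75] * 12, patience=10)
```

With a 50-epoch budget and a patience of 10, a run whose validation loss stops improving early ends well before epoch 50, which is why the epoch count acts as an upper bound rather than a fixed training length.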
Evaluation metrics
Model performance is assessed using standard classification metrics: accuracy (overall correctness), sensitivity (true positive rate, crucial for early detection), specificity (true negative rate) and F1-score (harmonic mean of precision and recall, effective for imbalanced classes). These metrics provide a holistic view of the model’s diagnostic capabilities, aligning with evaluations in related studies (Suman et al., 2024).[7]
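For reference, the four metrics can be computed from the entries of a confusion matrix as in the sketch below. It assumes a binary (disease versus no disease) setting; a multi-class evaluation would need per-class values and an averaging scheme, which the text does not specify. The counts used in the example are hypothetical and are not the study's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity and F1 from binary counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts: 100 diseased and 200 healthy patients.
metrics = classification_metrics(tp=90, fp=20, tn=180, fn=10)
```

In this example accuracy, sensitivity and specificity all equal 0.90, while the F1-score is lower (about 0.857) because the 20 false positives reduce precision. This illustrates why F1 is reported alongside accuracy when classes are imbalanced.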
RESULTS
The hybrid CNN-random forest model demonstrated superior performance on the test set, achieving an accuracy of 92.5%, sensitivity of 90.1%, specificity of 93.4% and F1-score of 90.8%. In comparison, a standalone CNN yielded 89.8% accuracy, while a standalone random forest achieved 88.2%. The hybrid model’s inference time averaged 0.12 s/record, making it suitable for real-time applications.
To illustrate comparative performance, Table 1 presents results against benchmark models and related works.
| Model | Accuracy (%) | Sensitivity (%) | Specificity (%) | F1-score (%) | Reference |
|---|---|---|---|---|---|
| Proposed hybrid (CNN-RF) | 92.5 | 90.1 | 93.4 | 90.8 | This study |
| Standalone CNN | 89.8 | 87.5 | 91.2 | 88.3 | This study |
| Standalone RF | 88.2 | 86.0 | 89.5 | 87.1 | This study |
| CNN-ABRF | 91.3 | 88.9 | - | 89.0 | Zhou et al. (2023)[5] |
| RF-LR hybrid | - | - | - | - | Salman and Gupta (2023)[4] |
| CNN-BiLSTM | 95.0 (kidney) | - | - | - | Suman et al. (2024)[7] |
| CNN-KNN | 95.0 | 99.0 | - | 98.0 | Shambharkar et al. (2023)[8] |
ABRF: AdaBoost and random forest, CNN: Convolutional neural network, RF: Random forest, LR: Logistic regression, BiLSTM: Bidirectional long short-term memory, KNN: K-nearest neighbors
DISCUSSION
The high sensitivity of the proposed model ensures effective early detection of chronic diseases, which is critical for timely interventions and reducing long-term healthcare burdens (Sadr et al., 2025).[9] By combining CNN’s deep feature learning with random forest’s ensemble robustness, the model addresses common challenges in EHR data, such as noise and imbalance, leading to improved generalisation compared to standalone approaches (Zhou et al., 2023).[5]
However, limitations exist, including reliance on structured EHR data, which may not capture unstructured elements like clinical notes. Future enhancements could incorporate natural language processing techniques or multimodal data integration, as suggested in reviews of AI in chronic disease management (Sadr et al., 2025).[9] Validation on larger, more diverse datasets from multiple institutions would further strengthen the model’s applicability. In addition, interpretability tools like SHapley Additive exPlanations (SHAP) values could be integrated to explain predictions, enhancing trust in clinical settings (Shambharkar et al., 2023).[8]
The implications of this work extend to precision medicine, where personalised diagnostics can lead to tailored treatments, ultimately improving patient outcomes and healthcare efficiency (Rajkomar et al., 2018).[2]
CONCLUSION
This study introduces a hybrid CNN-random forest model for early chronic disease detection using EHRs, achieving high accuracy and efficiency. By building on existing hybrid approaches, the framework offers a scalable solution for clinical diagnostics. Future research should explore integration with unstructured data and real-world deployment to further advance AI in healthcare.
Author contributions:
ZH: Conceptualization and design, data acquisition and curation, formal analysis and interpretation, methodology and resources, writing and critical revision, and project administration and supervision.
Ethical approval:
The research/study was approved by the Institutional Review Board at Euclid University, number EUCLID/IRB/9/24, dated 2nd September 2024.
Declaration of patient consent:
Patient’s consent was not required as there are no patients in this study.
Conflicts of interest:
There are no conflicts of interest.
Use of artificial intelligence (AI)-assisted technology for manuscript preparation:
The authors confirm that there was no use of artificial intelligence (AI)-assisted technology for assisting in the writing or editing of the manuscript and no images were manipulated using AI.
Financial support and sponsorship: Nil.
References
- Scalable and accurate deep learning with electronic health records. NPJ Digit Med. 2018;1:18.
- Hybrid machine learning model for chronic disease prediction. Int J Intell Syst Appl Eng. 2023;11:808-16.
- Chronic disease diagnosis model based on convolutional neural network and ensemble learning method. Digit Health. 2023;9:20-25.
- Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542:115-8.
- Hybrid machine learning model for chronical disease prediction. Libr Prog Int. 2024;44:2790-802.
- Machine learning-based approach for early detection and prediction of chronic diseases. In: 2023 1st DMIHER international conference on artificial intelligence in education and industry 4.0 (IDICAIEI). 2023. p. 1-8.
- Unveiling the potential of artificial intelligence in revolutionizing disease diagnosis and prediction: A comprehensive review of machine learning and deep learning approaches. Eur J Med Res. 2025;30:2680.
