Class-Imbalance Aware Machine Learning for CKD Detection and Risk Assessment

Ashish Kumar, Ram Kinkar Pandey, Prabhat Kumar Srivastava

Abstract

Chronic kidney disease (CKD) is a growing global health concern and a significant contributor to morbidity and mortality. Early detection and management of CKD and its complications, such as hyperkalemia, are vital to improving patient outcomes, and machine learning techniques have recently shown promise for CKD diagnosis and prognosis. In this study, we develop and evaluate a machine learning approach for predicting the presence of CKD from patient data, and we identify key risk factors among clinical, lifestyle, and laboratory features. We apply an ensemble Random Forest (RF) classifier to a dataset of 1,659 individuals (91.9% CKD patients, 8.1% non-CKD) to distinguish CKD from non-CKD status. The model achieved a high overall accuracy (approximately 92%) and correctly identified all CKD cases (100% sensitivity), but its specificity was limited because of class imbalance. We address this imbalance with techniques such as class weighting, which modestly improved the detection of non-CKD cases. The most influential predictors of CKD in our data were serum creatinine, proteinuria (protein in urine), and glomerular filtration rate (GFR), in line with medical knowledge of kidney function. Certain clinical symptoms (e.g., itching and muscle cramps) also emerged as important indicators of CKD. We further discuss our findings in the context of recent literature, including a related study that attained 99.8% CKD prediction accuracy using RF with feature selection [4], and another that employed an XGBoost model to predict hyperkalemia (a dangerous CKD complication) with an AUC of 0.867, outperforming clinicians [5]. Our results underscore the potential of machine learning models to support early CKD diagnosis and complication risk forecasting, while highlighting the challenges posed by imbalanced datasets and the need for careful feature consideration.
The best-performing unbalanced logistic regression model achieved an AUC of 0.804 and an accuracy of 92.5%, while a probability-weighted voting ensemble raised accuracy to 93.7%. Adjusting the decision threshold of the logistic model further increased accuracy to 94.0% with perfect recall for CKD cases. We conclude that ensemble learning methods, when combined with domain-specific feature insights, can provide highly accurate and clinically useful decision support for CKD management, though further work is needed to improve model generalizability and the detection of rare outcomes.
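
To make the effect of the reported class imbalance concrete, the following is a minimal sketch rather than the authors' pipeline. It computes the metrics of a trivial classifier that labels every individual as CKD, and then derives inverse-frequency ("balanced") class weights, one common form of the class weighting mentioned above. The class counts (1,525 CKD, 134 non-CKD) are an assumption reconstructed from the rounded percentages in the abstract, and the names `Confusion` and `balanced_class_weights` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts, with CKD treated as the positive class."""
    tp: int  # CKD predicted as CKD
    fn: int  # CKD predicted as non-CKD
    tn: int  # non-CKD predicted as non-CKD
    fp: int  # non-CKD predicted as CKD

    def accuracy(self) -> float:
        total = self.tp + self.fn + self.tn + self.fp
        return (self.tp + self.tn) / total

    def sensitivity(self) -> float:
        # Recall on CKD cases
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Recall on non-CKD cases
        return self.tn / (self.tn + self.fp)

    def balanced_accuracy(self) -> float:
        # Mean of the per-class recalls; insensitive to class prevalence
        return (self.sensitivity() + self.specificity()) / 2


def balanced_class_weights(counts: dict) -> dict:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# ASSUMPTION: counts reconstructed from 91.9% / 8.1% of 1,659 individuals;
# the abstract does not state the exact per-class counts.
N_CKD, N_NON_CKD = 1525, 134

# Trivial baseline: predict "CKD" for everyone.
always_ckd = Confusion(tp=N_CKD, fn=0, tn=0, fp=N_NON_CKD)

# Weights that make each class contribute equally to a weighted loss.
weights = balanced_class_weights({"ckd": N_CKD, "non_ckd": N_NON_CKD})

if __name__ == "__main__":
    print(f"accuracy          = {always_ckd.accuracy():.3f}")
    print(f"sensitivity       = {always_ckd.sensitivity():.3f}")
    print(f"specificity       = {always_ckd.specificity():.3f}")
    print(f"balanced accuracy = {always_ckd.balanced_accuracy():.3f}")
    print(f"class weights     = {weights}")
```

Under these assumed counts, the trivial baseline already reaches roughly 91.9% accuracy with 100% sensitivity and 0% specificity, which is why accuracy alone says little on this dataset and why metrics such as specificity, balanced accuracy, or AUC are worth reporting alongside it. The inverse-frequency weights give each non-CKD sample roughly eleven times the weight of a CKD sample.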
