The Effect of Combined Synthetic Tabular Data Generated Using CTGAN Model with Actual Data on Performance of DHF, Varicella, and COVID-19 Recognition Model

Husni Iskandar Pohan, Sutoto, Yaya Heryadi, Harjanto Prabowo

Abstract

Several illnesses spread quickly: dengue hemorrhagic fever (DHF) is transmitted by mosquitoes, COVID-19 spreads through respiratory droplets and contact with contaminated surfaces, and varicella spreads by direct contact. The transmission rate of these diseases can be reduced if medical services identify them early. However, the performance of prediction models based on machine learning is limited by the availability of labelled patient datasets. This study presents empirical evidence that synthetic data, generated from actual medical records, can improve the performance of such a prediction model. A Decision Tree model trained on a mixed synthetic and actual dataset achieved 91.98% average accuracy, which is higher than that of the model trained on the real dataset only. Model interpretation using Shapley Additive Explanations (SHAP), which has the advantage of measuring the overall dominant features, indicates that the five most important features are Thrombocyte, Temp, Cough, Spot, and Nauseous.
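To illustrate what a Shapley-based feature ranking measures, the following minimal sketch computes exact Shapley values by enumerating feature subsets for a toy scoring rule. It is not the authors' code and does not use the SHAP library, which relies on faster model-specific or sampling-based estimators. The function names, the scoring rule `toy_risk_score`, the three binary features, and the all-zero baseline are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x.

    A feature that is 'absent' from a coalition takes its baseline value.
    Cost grows as 2^n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(subset):
        # Build an input where only the features in `subset` come from x.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude feature i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = set(coalition)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def global_importance(predict, rows, baseline):
    """Mean absolute Shapley value per feature across several rows."""
    totals = [0.0] * len(baseline)
    for x in rows:
        for i, p in enumerate(shapley_values(predict, x, baseline)):
            totals[i] += abs(p)
    return [t / len(rows) for t in totals]


# Hypothetical scoring rule over three binary indicators
# (low thrombocyte count, fever, cough). Not taken from the paper.
def toy_risk_score(z):
    low_thrombocyte, fever, cough = z
    return 2.0 * low_thrombocyte + 1.0 * fever + 0.5 * low_thrombocyte * cough


baseline = [0, 0, 0]  # assumed reference patient with no findings
patient = [1, 1, 1]
phi = shapley_values(toy_risk_score, patient, baseline)
importance = global_importance(toy_risk_score, [[1, 1, 1], [0, 1, 0]], baseline)
```

The per-feature contributions in `phi` sum to the difference between the score for `patient` and the score for `baseline`. Averaging absolute contributions over many patients, as `global_importance` does, gives the kind of overall ranking of dominant features that the abstract describes.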
