Hybrid Model for Assamese Document Classification using Doc2vec for feature extraction

Chayanika Talukdar, Shikhar Kumar Sarma

Abstract

Document-level categorization is challenging because long texts contain a large number of words that often point to conflicting categories. The task is particularly relevant to the vast amount of unorganized digitized text produced as a side effect of the exponential growth of the internet. Many text classification studies have been carried out using various machine learning and deep learning techniques, but mainly for short text. In this study, we categorize Assamese documents, a subject that has remained largely unexplored. We propose a hybrid model that combines the advantages of two of the most popular deep learning models, the CNN and the LSTM. Doc2vec is used to convert documents into numeric vectors at three dimensionalities: 100, 128, and 300. When evaluated on a prepared data set of 780 Assamese documents, the model achieved an accuracy of 96.5% and an F1-score of 96% with 300-dimensional vectors.
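To make the data flow of such a hybrid concrete, the following is a minimal NumPy sketch of a single forward pass, assuming a convolution-then-LSTM ordering. The abstract does not specify the layer order, filter count, kernel width, pooling size, hidden size, or number of categories, so every one of those values, and every function name here, is a hypothetical choice for illustration. The weights are random and untrained, and a random 300-dimensional vector stands in for a real Doc2vec embedding.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, kernels, bias):
    """Valid 1-D convolution over a (length, channels) sequence, then ReLU."""
    width = kernels.shape[0]                  # kernels: (width, in_ch, out_ch)
    steps = x.shape[0] - width + 1
    out = np.empty((steps, kernels.shape[2]))
    for t in range(steps):
        window = x[t:t + width]               # (width, in_ch)
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0.0)

def max_pool1d(x, size):
    """Non-overlapping max pooling along the length axis."""
    steps = x.shape[0] // size
    return x[:steps * size].reshape(steps, size, x.shape[1]).max(axis=1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_state(x, W, U, b):
    """Run a standard LSTM cell over the sequence; return the final hidden state."""
    hidden = U.shape[0]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x_t in x:
        z = x_t @ W + h @ U + b               # all four gates at once: (4 * hidden,)
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def init_params(n_classes, filters=16, kernel=5, hidden=32):
    """Random, untrained weights; all sizes are illustrative assumptions."""
    return {
        "conv_w": rng.normal(0.0, 0.1, (kernel, 1, filters)),
        "conv_b": np.zeros(filters),
        "lstm_w": rng.normal(0.0, 0.1, (filters, 4 * hidden)),
        "lstm_u": rng.normal(0.0, 0.1, (hidden, 4 * hidden)),
        "lstm_b": np.zeros(4 * hidden),
        "out_w": rng.normal(0.0, 0.1, (hidden, n_classes)),
        "out_b": np.zeros(n_classes),
    }

def classify(doc_vector, params, pool=4):
    """Document vector -> CNN features -> LSTM summary -> class probabilities."""
    x = doc_vector.reshape(-1, 1)             # treat the vector as a 1-channel sequence
    x = conv1d_relu(x, params["conv_w"], params["conv_b"])
    x = max_pool1d(x, pool)
    h = lstm_last_state(x, params["lstm_w"], params["lstm_u"], params["lstm_b"])
    return softmax(h @ params["out_w"] + params["out_b"])

# Stand-in for a 300-dimensional Doc2vec embedding of one document.
doc_vector = rng.normal(size=300)
# n_classes=4 is a placeholder; the abstract does not state the category count.
probs = classify(doc_vector, init_params(n_classes=4))
```

In this arrangement the convolution extracts local patterns from the document vector and the LSTM summarizes them in order before the final softmax. A trained model would learn these weights from labeled documents rather than sampling them at random.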