Semantic Similarity Calculation Based on BERT


Denghui Yang, Dengyun Zhu, Hailong Gai, Fucheng Wan

Abstract

The exploration of semantic similarity is a fundamental aspect of natural language processing, as it aids in understanding the meaning and usage of the vocabulary of a language. The advent of pre-trained language models has significantly simplified research in this field. This article examines how the pre-trained language model BERT can be used to calculate the semantic similarity between Chinese words. For this study, we first trained our own model on top of the bert-base-chinese pre-trained model, which allowed us to obtain a word embedding for each word; these embeddings served as the basis for calculating semantic similarity. Word embeddings are vector representations of words that capture a word's meaning and context, making it possible to measure the semantic similarity between words. We then ran a series of experiments to assess how effectively the BERT model handles semantic similarity tasks in Chinese. The results were encouraging: the BERT model performed remarkably well on these tasks and outperformed traditional methods in both performance and generalization capability. This study therefore underscores the potential of the BERT model for natural language processing, particularly for Chinese, and highlights its capacity to accurately calculate semantic similarity, paving the way for its wider adoption in related fields.
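For concreteness, the snippet below is a minimal sketch of how word-level semantic similarity can be computed from bert-base-chinese using the Hugging Face transformers library. The pooling strategy (mean of the last-layer hidden states over the word's characters) and the cosine-similarity measure are illustrative assumptions; the exact fine-tuning and embedding-extraction procedure used in the paper may differ.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load the bert-base-chinese pre-trained tokenizer and encoder.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def word_embedding(word: str) -> torch.Tensor:
    """Return a single vector for a Chinese word.

    Assumption: mean-pool the last-layer hidden states of the word's
    characters, dropping the [CLS] and [SEP] special tokens.
    """
    inputs = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state[0]      # shape: (seq_len, 768)
    return hidden[1:-1].mean(dim=0)            # drop [CLS] and [SEP]

def semantic_similarity(w1: str, w2: str) -> float:
    """Cosine similarity between two word embeddings, in [-1, 1]."""
    v1, v2 = word_embedding(w1), word_embedding(w2)
    return torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()

if __name__ == "__main__":
    print(semantic_similarity("医生", "护士"))  # related words: higher score
    print(semantic_similarity("医生", "苹果"))  # unrelated words: lower score
```

In this sketch, semantically related word pairs are expected to yield higher cosine scores than unrelated pairs; the absolute values depend on the pooling choice and on any fine-tuning applied to the model.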

Article Details

Section: Articles
Author Biography


1 Denghui Yang

2 Dengyun Zhu

3 Hailong Gai

4,* Fucheng Wan

1 Key Laboratory of Linguistic and Cultural Computing, Ministry of Education, Northwest Minzu University, Lanzhou, Gansu 730030, China

2 Key Laboratory of Linguistic and Cultural Computing, Ministry of Education, Northwest Minzu University, Lanzhou, Gansu 730030, China; Key Laboratory of China's Ethnic Languages and Intelligent Processing of Gansu Province, Northwest Minzu University, Lanzhou, Gansu, China

3 Key Laboratory of Linguistic and Cultural Computing, Ministry of Education, Northwest Minzu University, Lanzhou, Gansu 730030, China

4 Key Laboratory of China's Ethnic Languages and Intelligent Processing of Gansu Province, Northwest Minzu University, Lanzhou, Gansu, China

*Corresponding author: Fucheng Wan

Copyright © JES 2024 online: journal.esrgroups.org
