Analysis of Human Voice for Speaker Recognition: Concepts and Advancement

Khushboo Jha, Aruna Jain, Sumit Srivastava

Abstract

Human voice, or speech, is a contactless, non-invasive biometric trait for human recognition that is easy to use, has low computational complexity, and is inexpensive to implement. Speaker recognition (SR), which uses speech as its central premise, has been an established approach for decades. Its broad range of applications, such as forensic speech verification by law enforcement authorities to identify culprits and access control for mobile banking and mobile shopping, has made it an attractive area of research. The ease of use and dependability of SR can also help people with disabilities securely access and benefit from digital-era services. In addition, the emergence of numerous deep learning methods for feature extraction and classification has helped SR make substantial progress. This paper presents a comprehensive study of the progression of SR over the decades up to the present, including its integration with Blockchain and the associated challenges. It covers most of the factors that influence SR performance: the fundamentals and structure of SR, speech pre-processing techniques, speech features, feature extraction techniques, traditional and neural network-based classification techniques, and deep learning-based SR toolkits. As a result, in this digital Blockchain era, it should help in designing robust and reliable recognition-based services for mankind.

Author Biography

[1] Khushboo Jha

Aruna Jain

Sumit Srivastava

[1] Department of Computer Science & Engineering, Birla Institute of Technology, Mesra, Ranchi-835215, India
