Self-Supervised Learning for Speech Recognition: A Review

Humam Khalid Jameel; Assad H. Thary Al Ghrairi; Mohammed M. Neamah

doi:10.65204/djes.v3i2.782

Authors

Humam Khalid Jameel, Al-Nahrain University
Assad Thary, Al-Nahrain University, https://orcid.org/0000-0002-1006-1582
Mohammed Neamah, Mustansiriyah University

DOI:

https://doi.org/10.65204/djes.v3i2.782

Keywords:

APC, Self-Supervised Learning, Automatic Speech Recognition, Datasets, SSL

Abstract

Deep supervised learning algorithms usually need a large amount of labeled data to work well, but collecting and annotating such data can be costly and time-consuming. Self-supervised learning (SSL), a subclass of unsupervised learning, seeks to extract discriminative features from unlabeled data without the need for human-annotated labels. SSL has recently attracted considerable attention, which has prompted the creation of many associated algorithms. Nonetheless, comprehensive studies that clarify the relationships among SSL variants and trace their development remain scarce. Automatic speech recognition (ASR) has advanced significantly in recent years thanks to a variety of deep learning methods. Because deep learning methods rely heavily on data, this review also covers a variety of online speech datasets in detail. Our investigation addresses each aspect that could affect the performance of an ASR system. We therefore believe that this work is a suitable starting point for scholars interested in ASR research.
