Self-Supervised Learning for Speech Recognition: A Review
DOI:
https://doi.org/10.65204/djes.v3i2.782
Keywords:
APC, Self-Supervised Learning, Automatic Speech Recognition, Datasets, SSL
Abstract
Deep supervised learning algorithms usually need large amounts of labeled data to work well, but gathering and annotating such data can be costly and time-consuming. Self-supervised learning (SSL), a subclass of unsupervised learning, seeks to extract discriminative features from unlabeled data without human-annotated labels. SSL has recently attracted considerable attention, which has prompted the creation of many related algorithms. Nonetheless, comprehensive studies that clarify the relationships among SSL variants and trace their development remain scarce. Automatic speech recognition (ASR) has advanced significantly in recent years thanks to a variety of deep learning methods. Because deep learning methods rely heavily on data, this review also covers a variety of online speech datasets in detail. Our investigation includes each aspect that could affect the performance of an ASR system. We therefore believe this work is a suitable starting point for scholars interested in ASR research.
Published
Versions
- 2026-06-18 (2)
- 2026-06-17 (1)