February 7, 2022 (updated 10 Feb 2022 1:38pm)

Meta Improves Speech Recognition Accuracy with Lip-Reading AI

Concept: Meta has introduced Audio-Visual Hidden Unit BERT (AV-HuBERT), a speech recognition framework that understands speech by analyzing both the sound and the movement of the speaker’s lips. Meta claims that AV-HuBERT is 75% more accurate than other audiovisual speech recognition systems trained on the same number of transcriptions.

Nature of Disruption: AV-HuBERT relies on unsupervised, or self-supervised, machine learning (ML). The multimodal framework learns to perceive language from a combination of audio and lip-movement inputs. In supervised learning, by contrast, algorithms such as those from DeepMind are trained on labeled example data until they can determine the underlying correlations between the examples and certain outputs. Self-supervised learning instead classifies unlabeled data by analyzing it and learning from its inherent structure. Meta claims that the framework can also capture complex correlations between the two data types by merging visual cues, such as the movement of the lips and teeth during speech, with audio information. According to Meta, AV-HuBERT recognizes a person’s speech 50% better than audio-only models when loud music or noise is playing in the background. When voice and background noise are equally loud, AV-HuBERT achieves a word error rate (WER) of 3.2%, compared to 25.5% for the previous best multimodal model. Meta adds that AV-HuBERT uses only a tenth of the labeled data, making it potentially useful for languages with limited audio data.
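For readers unfamiliar with the metric, the word error rate (WER) figures quoted above can be illustrated with a short sketch. WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference transcript, divided by the number of words in the reference. The function names and the sample sentences below are hypothetical and chosen only for illustration; this is not Meta's evaluation code, and real benchmarks usually normalize text (case, punctuation) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missed)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the `baseline` error rate."""
    return (baseline - improved) / baseline


# Hypothetical transcript pair: one substitution ("sat" -> "sit")
# and one deletion (the second "the") against six reference words.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sit on mat'):.3f}")  # 0.333

# The two WER figures quoted in the article, compared directly.
print(f"{relative_reduction(25.5, 3.2):.1%}")  # 87.5%
```

Applied to the two figures in the article, the drop from 25.5% to 3.2% works out to a relative error reduction of roughly 87%. That number is simple arithmetic on the quoted values, not a separate claim made by Meta.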

Outlook: According to Meta, because AV-HuBERT requires less labeled data for training, it could open new opportunities for building conversational models for low-resource languages such as Susu in the Niger-Congo family. It could also be used to develop speech recognition systems for people with speech impairments, to detect deepfakes, and to generate realistic lip motions for virtual reality avatars. In the future, AV-HuBERT could improve the performance of speech recognition technologies in noisy everyday situations, such as at a party or in a crowded street market. The technique could also benefit smartphone assistants, AR glasses, and smart speakers with cameras.

This article was originally published on Verdict.co.uk
