Meta Improves Speech Recognition Accuracy with Lip-Reading AI

Concept: Meta has introduced Audio-Visual Hidden Unit BERT (AV-HuBERT), a speech recognition framework that understands speech by analyzing both the sound and the movement of the speaker’s lips. Meta claims that AV-HuBERT is 75% more accurate than other audiovisual speech recognition systems trained on the same number of transcriptions.

Nature of Disruption: AV-HuBERT relies on self-supervised (sometimes called unsupervised) machine learning. Supervised learning algorithms are trained on labeled example data until they can determine the underlying correlations between the examples and certain outputs. A self-supervised system, by contrast, can classify unlabeled data by analyzing it and learning from its inherent structure. AV-HuBERT is also multimodal: it learns to detect language from a combination of audio and lip-movement inputs. Meta claims that, by merging visual cues such as the movement of the lips and teeth during speech with audio information, the framework can capture complex correlations between the two data types. According to Meta, AV-HuBERT recognizes a person’s speech 50% better than audio-only models when loud music or noise is playing in the background. When speech and background noise are equally loud, AV-HuBERT achieves a word error rate (WER) of 3.2%, compared to 25.5% for the previous best multimodal model. Meta also says that AV-HuBERT needs only a tenth of the labeled data, making it potentially useful for languages with limited audio data.
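
For readers unfamiliar with the metric, the word error rate figures quoted above can be made concrete with a short sketch. It uses the standard definition of WER (word-level substitutions, deletions, and insertions divided by the number of words in the reference transcript). It is not Meta's evaluation code, and the function names `word_error_rate` and `relative_reduction` are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (word missing from hypothesis)
                curr[j - 1] + 1,                  # insertion (extra word in hypothesis)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the `baseline` error rate."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Illustrative transcripts, not taken from any benchmark.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    # Figures quoted in the article: 25.5% WER versus 3.2% WER.
    print(relative_reduction(25.5, 3.2))
```

Applied to the figures in the article, going from a 25.5% WER to a 3.2% WER corresponds to a relative error reduction of roughly 87%.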

Outlook: According to Meta, because AV-HuBERT requires less labeled data for training, it could open new opportunities for building conversational models for low-resource languages such as Susu in the Niger-Congo family. It could also be used to develop speech recognition systems for people with speech impairments, to detect deepfakes, and to generate realistic lip motions for virtual reality avatars. In the future, AV-HuBERT could improve the performance of speech recognition technologies in noisy everyday situations, such as at a party or in a crowded street market. The technique could also benefit smartphone assistants, AR glasses, and smart speakers with cameras.

This article was originally published in Verdict.co.uk
