Broadband wired and wireless internet connections are getting faster all the time. But in many parts of the world, high-speed internet access is unavailable, unreliable, or both. So researchers continue to find ways to use compression to save bandwidth while transmitting audio, video, images, and other content that typically uses a lot of data.

A few years ago developers behind the open source Opus audio codec announced a breakthrough that allowed for high-quality audio to be delivered using as little as 6 kilobits per second. Now Google has unveiled a new codec called Lyra that can sound as good or better at just 3 kbps.

That could enable high-quality audio communications on some of the slowest networks. The only catch? Lyra is specifically optimized for speech, so it probably won’t be much use for music streaming or other services.
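To put those bitrates in perspective, here’s a rough back-of-the-envelope sketch of how much data one minute of speech would take at 3 kbps versus 6 kbps. The function name is made up for illustration, and the math assumes a constant bitrate with 1 kbps equal to 1,000 bits per second, ignoring any packet or network overhead (which would add to the real-world totals).

```python
def bytes_per_minute(kbps: float) -> float:
    """Size in bytes of one minute of audio at a constant bitrate.

    Assumes 1 kbps = 1000 bits per second and ignores packet/container overhead.
    """
    bits = kbps * 1000 * 60  # bits sent in 60 seconds
    return bits / 8          # 8 bits per byte


# Bitrates quoted in the article: Lyra at 3 kbps, Opus at 6 kbps
for name, kbps in [("Lyra", 3), ("Opus", 6)]:
    size_kb = bytes_per_minute(kbps) / 1000
    print(f"{name} at {kbps} kbps: {size_kb:.1f} kB per minute")

# Prints:
# Lyra at 3 kbps: 22.5 kB per minute
# Opus at 6 kbps: 45.0 kB per minute
```

In other words, halving the bitrate halves the amount of data needed for a call of the same length, which is what matters on a slow or metered connection.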

Google developed the Lyra speech codec using “traditional codec techniques while leveraging advances in machine learning (ML) with models trained on thousands of hours of data.”

In other words, Google trained its artificial intelligence system to know what it sounds like when people are speaking by having the system examine thousands of hours’ worth of recordings of people talking. The model was trained using speakers of more than 70 different languages, so the results shouldn’t be limited to speech in English or any other specific language.

Researchers say the Lyra codec recreates speech signals using a generative model that can generate multiple signals at different frequency ranges at the same time and combine them into a single output signal. It does this efficiently enough that the model can run either on a cloud server or on a local device, including mid-range smartphones.

So how well does it work? Pretty well.

Google made some samples available to demonstrate how Lyra sounds compared with an original source recording and two other codecs: Opus running at 6 kbps and Speex at 3 kbps. You can hear the samples below.

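[Audio samples, Clean Speech: the original recording, plus versions encoded with Lyra at 3 kbps, Opus at 6 kbps, and Speex at 3 kbps]
[Audio samples, Noisy Environment: the original recording, plus versions encoded with Lyra at 3 kbps, Opus at 6 kbps, and Speex at 3 kbps]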

According to Google, most listeners in a crowdsourced test judged the Lyra samples to sound the most like the original source recording, and I’d tend to agree. It’s likely that the sound quality would be even better at higher bit rates.

Google says it’s already beginning to roll out Lyra for use in its own Duo voice and video chat software. It’s unclear if or when Lyra will be available for use in third-party apps.

via CNX Software

Join the Conversation

3 Comments

  1. When they mention how machine learning is being used, I wonder if they are using machine learning during the encoding step itself? Or are they saying that they only used machine learning to learn the best way to encode the data (as a research step), and no actual machine learning is being used in the day to day use?

    I’m mostly interested because I’m wondering how lightweight the hardware requirements are for encoding/decoding. They mention this being used in a server scenario, but I wonder if this could be encoded on something cheap like a Rasp Pi or smaller.

  2. Researchers says the Lyra codec recreates speech signals using a generative model that can generate multiple signals at different frequency ranges at the same time and output them into a single signal in an efficient manner that allows the model to run…

    Brad, it’s just my personal curiosity but I’m curious if you actually understand the above segment or you just copy-pasted it from the source? 🙂 It’s absolutely no problem if you don’t understand it of course; me almost neither. But it’s interesting how this part of tech journalism works. Or how it is supposed to work.

    1. Eh, kind of. This came closer to just rephrasing the source than I’d normally like because I figured that was safer than doing a crappy job of putting it all in my own words and running the risk of getting it wrong.