Using AI to explore the future of news audio

February 25th, 2021

Radio reaches more Americans every week than any other platform. Public radio stations in the United States employ more than 3,000 local journalists, who create audio news reports every day about the communities they serve. But news audio is where newspaper articles were in the 1990s: hard to find, and difficult to sort by topic, source, relevance, or recency. News audio cannot afford to delay improving its discoverability.

KQED is the most-listened-to public radio station in the United States, and one of the largest news organizations in the Bay Area. In partnership with Google, KQED and KUNGFU.AI, an AI services provider and leader in applied machine learning, ran a series of tests on KQED’s audio to determine how we might reduce both the errors in our news audio transcripts and the time it takes to publish them, and ultimately make radio news audio more findable.

“One of the pillars of the Google News Initiative is incubating new approaches to difficult problems,” said David Stoller, Partner Lead for News & Publishing at Google. “Once complete, this technology and associated best practices will be openly shared, greatly expanding the anticipated impact.”

What makes finding audio so much harder?

For news audio to be searched or sorted, the speech must first be converted to text. This added step is trickier than it seems, and it currently puts news audio at a disadvantage when it comes to being found quickly and accurately. Transcription takes time, effort, and bandwidth from newsrooms, none of which are in abundance these days. Even though there have been great advances in speech-to-text, the bar for accuracy in news is very high. As someone who works to make KQED’s reporting widely available, I find it frustrating when KQED’s audio isn’t prominent in search engines and news aggregators.

The challenge of correctly identifying who, what and where

For our tests, KQED and KUNGFU.AI applied the latest speech-to-text tools to a collection of KQED’s news audio. News stories try to address the “five Ws”: who, what, when, where and why. Unfortunately, because AI typically lacks the context in which the speech was made (e.g., the identity of the speaker or the location of the story), one of the most difficult challenges of automated speech-to-text is correctly identifying the proper nouns that answer these questions, known as named entities. KQED’s local news audio is rich in references to named entities: topics, people, places, and organizations that are specific to the Bay Area. Speakers use acronyms like “CHP” for California Highway Patrol and local shorthand like “the Peninsula” for the area spanning San Francisco to San Jose. These are more difficult for artificial intelligence to identify.

When named entities aren’t understood, machine learning models make their best guess at what was said. For example, in our test, “The Asia Foundation” was incorrectly transcribed as “age of Foundations” and “misgendered” was incorrectly transcribed as “Miss Gendered.” For news publishers, these are not just transcription errors, but editorial problems that change the meaning of a story and can cause embarrassment for the news outlet. This means people have to go in and correct these transcriptions, which is expensive to do for every audio segment. Without transcriptions, search engines can’t find these stories, limiting the amount of quality local news people can find online.
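As an illustration only, known misrecognitions could be kept in a small glossary and applied to new transcripts before a human reviews them. The Python sketch below assumes a hypothetical CORRECTIONS table and a hypothetical apply_corrections helper; neither comes from the tools used in the tests, and the two glossary entries are simply the errors mentioned above.

```python
import re

# Hypothetical glossary: known misrecognition -> correct wording.
# The two entries are the errors reported in the article.
CORRECTIONS = {
    "age of Foundations": "The Asia Foundation",
    "Miss Gendered": "misgendered",
}


def apply_corrections(transcript, corrections):
    """Replace each known misrecognition with its corrected form.

    Matching is case-insensitive and limited to whole words or phrases.
    """
    for wrong, right in corrections.items():
        pattern = re.compile(r"\b" + re.escape(wrong) + r"\b", re.IGNORECASE)
        # A function is used as the replacement so that `right` is
        # inserted literally, without backslash interpretation.
        transcript = pattern.sub(lambda match, right=right: right, transcript)
    return transcript


if __name__ == "__main__":
    raw = "She received a grant from age of Foundations last year."
    print(apply_corrections(raw, CORRECTIONS))
```

A glossary like this only catches errors that have been seen before, so it would complement human review rather than replace it.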

An illustration showing a proposed new process for audio transcription, in which a human’s corrections to the first version of a transcript help make future transcriptions clearer and more accurate.

A machine learning ↔ human ↔ machine learning feedback loop

At KQED, our editors can correct common machine learning errors in our transcripts. But right now, the machine learning model isn’t learning from its mistakes. We’re beginning to test out a feedback loop in which newsrooms could identify common transcription errors to improve the machine learning model.
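One way to picture the first half of such a feedback loop is to compare each machine transcript with its editor-corrected version and count which substitutions keep recurring. The Python sketch below is an assumption about how that could be done, not a description of the system being tested; the function names extract_corrections and common_corrections are hypothetical, and it relies only on the standard library’s difflib.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_corrections(machine, edited):
    """Return (machine_phrase, edited_phrase) pairs the editor replaced."""
    m_words, e_words = machine.split(), edited.split()
    matcher = SequenceMatcher(a=m_words, b=e_words, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks a run of machine words swapped for editor words.
        if tag == "replace":
            pairs.append((" ".join(m_words[i1:i2]), " ".join(e_words[j1:j2])))
    return pairs


def common_corrections(transcript_pairs, min_count=2):
    """Count corrections across many transcripts and keep the recurring ones."""
    counts = Counter()
    for machine, edited in transcript_pairs:
        counts.update(extract_corrections(machine, edited))
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    # Illustrative sentences only; the misrecognition is the one cited earlier.
    history = [
        ("a grant from age of Foundations today",
         "a grant from The Asia Foundation today"),
        ("age of Foundations said",
         "The Asia Foundation said"),
    ]
    for (wrong, right), n in common_corrections(history):
        print(f"{wrong!r} -> {right!r} ({n} times)")
```

The recurring pairs could then be reviewed by editors and used as hints or training examples for the speech-to-text model; how a given model accepts such feedback depends on the tool and is not covered here.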

We’re confident that in the near future, improvements in these speech-to-text models will help convert audio to text faster, ultimately helping people find audio news more effectively.