Can Automatic Speech Recognition Keep Up with Buzzwords? Following in the Footsteps of Gen Z
“That’s giving main character energy.”
“It’s a soft launch, don’t overthink it.”
“We need to circle back and touch base.”
Thanks to globalization and the spread of the Internet, memes and popular vocabulary have become a major part of how young people communicate. These are buzzwords: trendy, ever-evolving terms that dominate TikTok, YouTube, and even LinkedIn. From quiet quitting to girl math, these phrases spread fast and evolve even faster. Young people in different countries are constantly coining new vocabulary unique to their online interactions.
These words and expressions reflect specific cultural, humorous, and social phenomena. They’re highly geographically rooted and time-sensitive. However, traditional ASR (Automatic Speech Recognition) systems often struggle to keep up, failing to accurately identify this fast-changing language. The result? Communication gaps and misunderstandings in cross-cultural online spaces.
What exactly is ASR?
ASR systems are like students: they can only recognise what they’ve been taught. Traditional ASR models learn from large datasets, often collected years ago. But buzzwords? They’re more like memes—they pop up overnight.
Words like brand deals, Spotify Wrapped, or girl math weren’t in those training datasets. ASR systems struggle with these Out-of-Vocabulary (OOV) words, just like a language learner hearing slang for the first time.
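A toy sketch makes the OOV problem concrete (the vocabulary and function below are invented for illustration, and real ASR systems work on subword units, which softens but doesn't solve this):

```python
# Toy sketch, not a real ASR pipeline: a vocabulary frozen at training
# time has no entry for words coined afterwards. Everything here is invented.
TRAINING_VOCAB = {"girl", "math", "quiet", "quitting", "brand", "deals"}

def transcribe_words(utterance: str) -> list[str]:
    """Map each spoken word to itself if known, else to an <unk> placeholder."""
    return [w if w in TRAINING_VOCAB else "<unk>" for w in utterance.lower().split()]

print(transcribe_words("girl math"))        # ['girl', 'math'] -- seen in training
print(transcribe_words("spotify wrapped"))  # ['<unk>', '<unk>'] -- coined later, OOV
```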
How is ASR used in social media?
Speech recognition is now everywhere. YouTube offers live subtitles, and video creators can generate captions automatically during editing. Tools like Vizard.ai allow users to upload a video and instantly generate subtitles using AI-powered speech recognition, with customisation options for font, colour, and animation.
I tested a Chinese editing app called 剪映 (Jianying) to compare its speech recognition output with the original subtitles provided by the author. Here's what I found:
- “Ick” became “egg.”
- “Icky-AK” was transcribed as nonsense.
- “Gyat” turned into “get.”
- “Ice Spice” showed up as “I spy.”
That’s what happens when ASR systems don’t get the memo on what’s trendy. They try—but it’s like your uncle trying to decode Gen Z slang: well-meaning, but completely off.
Why does improving ASR accuracy matter?
My goal is to enhance ASR systems so they can better recognise modern buzzwords. This would help young people from different countries understand one another’s online language and cultural context, reducing miscommunication.
It would also make content creation more efficient: video creators wouldn't need to spend time correcting auto-generated subtitles. And it's not just about convenience: accurate captions have been reported to increase YouTube views by 7.32%!
So… How can ASR models keep up?
Researchers are now giving ASR a new skill: contextual understanding.
Some have combined BERT (a powerful language model) with deep clustering, which groups similar buzzwords together. This helps ASR understand that terms like Rizz, Slay, and No cap belong to the same trendy clique. It works especially well for Chinese names and phrases.
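As a rough sketch of that idea (the model name, the tiny term list, and the cluster count are my own choices, not the researchers'), one could embed each term with BERT and cluster the vectors:

```python
# A minimal sketch: embed terms with BERT, then cluster the embeddings.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

terms = ["rizz", "slay", "no cap", "touch base", "circle back"]

with torch.no_grad():
    batch = tokenizer(terms, padding=True, return_tensors="pt")
    # Mean-pool the last hidden states into one vector per term.
    hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    embeddings = (hidden * mask).sum(1) / mask.sum(1)

clusters = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings.numpy())
for term, label in zip(terms, clusters):
    print(f"cluster {label}: {term}")
```

The hope is that embedding-space neighbours capture the “trendy clique,” so slang terms land in one group and office jargon in another.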
But there’s a problem — it’s too slow for real-time captions on platforms like YouTube or TikTok.
So, I’m exploring BERT + LP Adaptation to improve performance.
Think of BERT as a brain that remembers every conversation to understand the current one. It adds context by analysing surrounding words—so it knows that “soft launch” isn't about rockets.
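You can watch BERT lean on context with a quick masked-word test (a minimal sketch; the model choice and sentences are mine, and the off-the-shelf model shown here predates most of this slang):

```python
# BERT's fill-mask objective: the surrounding words steer what it predicts.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# Same "soft [MASK]" slot, two very different contexts.
for sentence in [
    "She posted one photo as a soft [MASK] of her new relationship.",
    "The spacecraft fired its engines for a soft [MASK] on the moon.",
]:
    top = fill(sentence)[0]  # highest-scoring guess for the masked word
    print(f"{top['token_str']!r} <- {sentence}")
```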
But BERT has a drawback: it’s slow and requires a lot of training before it can be used.
That’s where LP Adaptation comes in. It’s like a fast learner who picks up new slang the moment it appears in the group chat. LP lets ASR systems adapt in real time, learning new buzzwords without needing to be re-trained from scratch.
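The post leaves LP Adaptation's internals open, so here is the flavour of the idea via a stand-in technique, contextual biasing: re-score the ASR decoder's candidate transcripts against a buzzword list that can be updated the moment a new term appears. Every name and number below is made up for illustration.

```python
# Illustrative stand-in, not LP Adaptation itself: shallow re-scoring of an
# ASR decoder's n-best hypotheses against an editable buzzword list.
BUZZWORDS = {"gyat", "ick", "ice spice", "rizz"}  # can be updated live
BOOST = 2.0  # reward per matched buzzword (made-up tuning value)

def rescore(hypotheses: list[tuple[str, float]]) -> str:
    """Return the transcript with the best buzzword-boosted score."""
    def boosted(text: str, base: float) -> float:
        return base + BOOST * sum(term in text.lower() for term in BUZZWORDS)
    return max(hypotheses, key=lambda h: boosted(*h))[0]

# A made-up n-best list a decoder might emit for one audio clip.
print(rescore([("I spy is trending", -1.2), ("Ice Spice is trending", -1.5)]))
# -> 'Ice Spice is trending': the buzzword boost outweighs the raw score gap
```

Because the buzzword list is plain data rather than model weights, picking up tomorrow's slang is a quick edit, not a retraining run: that's the agility the group-chat analogy is pointing at.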
This is what I’m working on: combining BERT’s intelligence with LP’s agility. Like a smart friend who also has street smarts.