Automatic keyword spotting in speech
Key words: Acoustic environments, speech recognition, auditory streaming, automatic speech recognition
Themes: healthy aging, sustainable society, energy
Human listeners often hardly notice the difference between recognizing speech in studio conditions and spotting keywords in a normal day-to-day acoustic environment. They can also reliably detect keywords in a mixture of speech and non-speech sounds. They do this by separating sounds into signal components (basic perceptual units) and assigning each component to the correct source, a process called auditory streaming. Furthermore, humans can remove the effects of room acoustics from these perceptual units. At this stage CPSP can take care of estimating the signal components, but additional work is required to remove the effects of room acoustics from them. In addition, existing keyword spotting strategies must be adapted to accept combinations of signal components as their basic input.
Although this is ignored in traditional Automatic Speech Recognition (ASR) systems, a great deal of cognition is required to extend the operating domain of even a very simple keyword spotting system to an environment in which, acoustically, anything can happen at any time. This project searches for the kind of knowledge from cognitive science (especially linguistics and psycholinguistics) that is essential for making the transition from a closed operating domain to open domains.
Initially, this sub-project will focus on adapting a number of research databases for robust ASR. The focus will then broaden to the detection and recognition of a few keywords (10 to 20) in unknown and unrestricted input: a task for which many commercial applications, but no solutions, exist. The system's performance will be compared to human performance in similar environments. We focus on building a system that can be expected to function reliably in most of the acoustic environments we normally reside in, for example living rooms, bedrooms, classrooms, the street, and a car. Further care will be taken to ensure that the number of detectable words can scale and that high-level linguistic information can be included in the future. This ensures that the developed technologies can be scaled up to full-fledged ASR systems for unconstrained acoustic environments.
Participating researchers: 2
Research programme: Sensory Cognition
Research Institute: Bernoulli Institute
Faculty: Faculty of Mathematics and Natural Science
Graduate school: Graduate School of Science and Engineering (GSSE)
Collaboration: Acoustical Imaging and Sound Control Group of Delft University of Technology
Last modified: 03 March 2020 2.51 p.m.