Audio Analytic writes:
These play a key role in achieving world-leading machine learning (ML) results. In previous posts we have talked about the importance of innovating in the data collection space, but another element – which is also often underestimated – is the critical importance of the loss function.
What is the loss function?
The loss function is a key element of DNN training. It sets the definition of ‘correct behaviour’ for the network, by defining a target to reach through training. However, this target definition needs to be appropriate for the considered application. A good, application-specific loss function is essential for all applications of ML, but for this post let us focus on the role of loss functions in the context of sound recognition.
In more technical terms, a loss function measures how close a network output is to the desired output. The slope, or derivative, of the function is used to guide the backpropagation training process by gradient descent: it tells the network which direction to go in the search for an optimal configuration.
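As a minimal sketch of this idea, the snippet below trains a one-weight 'network' by gradient descent on a squared-error loss. The toy model, the learning rate and the function names are illustrative assumptions, not anything taken from a production system.

```python
def loss(w, x, target):
    """Squared error between the toy network output (w * x) and the target."""
    return (w * x - target) ** 2


def loss_gradient(w, x, target):
    """Slope of the squared error with respect to the weight w."""
    return 2.0 * (w * x - target) * x


def train(w, x, target, learning_rate=0.1, steps=50):
    """Plain gradient descent: repeatedly step against the slope of the loss."""
    for _ in range(steps):
        w -= learning_rate * loss_gradient(w, x, target)
    return w


# Start from w = 0 and let the loss guide w towards the value that maps x to target.
w_trained = train(w=0.0, x=1.0, target=2.0)
```

The sign of the gradient tells the weight which way to move and its magnitude tells it how far, which is all that 'which direction to go' means here.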
As such, the loss function has a crucial impact on building a useful network: while the network architecture defines the degrees of freedom that a DNN can achieve, without training guided by a loss function the network per se does not know what to do. Think about your legs for example: they have articulations and muscles which are able to achieve a variety of movements, but we only learn to walk, run or leap through training – the legs are the architecture, the goal of moving forward without falling is defined by the loss function.
Cross-entropy is the classic default, and the simplest, loss function used in supervised machine learning to solve classification problems, such as deciding if an image is a cat or a dog, or deciding if an audio stream contains a baby cry or a smoke alarm. At each training iteration, cross-entropy compares the true probabilities of the data, for example [100% baby cry, 0% smoke alarm, 0% any other sound], against the probabilities output by the network at that training step, for example [70% baby cry, 10% smoke alarm, 20% any other sound]. It then gradually guides the network towards achieving better detections, i.e., closer and closer to [100%, 0%, 0%] as the training progresses. With cross-entropy this is done independently for each training example.
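To make the comparison concrete, here is a short sketch of the standard cross-entropy formula applied to the example probabilities above. The second, 'later' prediction is an invented value used only to show the loss shrinking as the output approaches [100%, 0%, 0%].

```python
import math


def cross_entropy(true_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(
        t * math.log(max(p, eps))  # eps guards against log(0)
        for t, p in zip(true_probs, predicted_probs)
    )


# Class order: [baby cry, smoke alarm, any other sound]
target = [1.0, 0.0, 0.0]

early = cross_entropy(target, [0.7, 0.1, 0.2])     # the example from the text
later = cross_entropy(target, [0.95, 0.02, 0.03])  # hypothetical later training step

print(f"early: {early:.4f}, later: {later:.4f}")
```

With a one-hot target the sum collapses to minus the logarithm of the probability given to the correct class, so the loss for a single example carries no information about neighbouring frames.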
Unfortunately, cross-entropy is a very crude loss function: it only looks at training data point-by-point in the hope that the model will learn all the characteristics of the data after many examples. Often, this is not the case. In the absence of better guidance, only some characteristics end up making a significant impact during training. More sophisticated families of loss functions are required to represent training constraints that are more useful than a simple comparison between point-wise scores and labels.
When it comes to sound recognition, several specific characteristics of the task need to be taken into account when designing loss functions which will guide the training more accurately:
Sound events are structured across time, and that time can be anything from milliseconds to hours. The tones of various sounds are described in terms of frames (groups of samples) which are typically tens of milliseconds long. If sound detection decisions were made on frames only, in a single year you could have up to 3 billion decisions to determine if a range of specific sounds has occurred. With an error rate of just 1% you would be making roughly 30 million errors in a year trying to classify a single sound. It is therefore important to represent the temporal structure of sound events in the definition of what the network should recognise: some sounds are short, like a glass break, while others can happen over a long period of time, like a baby crying or the combination of a complex set of sounds that suggest that a person is leaving their home.
Sounds can happen randomly and are not constrained by a model like language. When thinking of sound recognition, the first application that comes to mind is automatic speech recognition. However, the landscape of sound is much broader and more chaotic than just speech. Indeed, speech is deliberately structured to convey information and follows certain rules, whereas sounds can come from many seemingly independent and uncontrollable sources. Therefore, in addition to exploiting temporal constraints at the scale of sound events, the training must be guided by wider knowledge about the contrast between sound event classes, which is also contributed via the loss function.
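The frame-level figures quoted for a year of continuous listening can be reproduced with a few lines of arithmetic. This sketch assumes non-overlapping 10 ms frames and a 365-day year; both are assumptions, since the text only says that frames are typically tens of milliseconds long.

```python
FRAME_MS = 10                            # assumed frame length in milliseconds
SECONDS_PER_YEAR = 365 * 24 * 60 * 60    # assumed 365-day year

# One decision per frame, all year round.
frames_per_year = SECONDS_PER_YEAR * 1000 // FRAME_MS

# A 1% frame-level error rate means one wrong decision in every hundred frames.
errors_per_year = frames_per_year // 100

print(f"{frames_per_year:,} frame decisions, {errors_per_year:,} errors")
```

Longer or overlapping frames change the totals, but not the conclusion that frame-by-frame decisions multiply even small error rates into very large error counts.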
Designing a loss function for the challenges of sound
Sounds are not isolated instantaneous points. They are sequences of acoustic data points, with particular durations and temporal patterns. Our researchers set about designing the optimal loss function framework for sound that could cope with the specific challenges. That loss function framework was recently patented.
As part of our research, we identified three fundamental constraints that should be included in a loss function framework built for sound recognition:
1. Sufficient detection during a sound event
Think about the sound of a baby crying. Over the duration of this ‘audio event’ there are frames of audio when there is no sound coming from the baby (typically because it is filling its lungs to cry again). Our patented loss function accounts for the discontinuity inherent to sound events. The system doesn’t need to train itself to classify all the frames correctly; instead, it needs to classify enough of them correctly so that its decision is robust at the scale of whole sound events. In other words, the loss function tolerates a proportion of misclassifications at the frame level so that the system becomes more confident of its decision when it looks at multiple frames across whole sound events.
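The patented formulation is not described in this post, so the following is only a hypothetical illustration of the general idea: an event-level term that stops penalising the network once 'enough' of the event is detected, contrasted with a frame-wise cross-entropy that punishes every silent frame. The function names, the 0.6 threshold and the frame probabilities are all invented for the example.

```python
import math


def framewise_cross_entropy(frame_probs, eps=1e-12):
    """Mean per-frame cross-entropy when every frame is labelled as the target class."""
    return sum(-math.log(max(p, eps)) for p in frame_probs) / len(frame_probs)


def event_sufficiency_loss(frame_probs, required_level=0.6):
    """Hypothetical event-level term: zero once the average target-class
    probability over the whole event reaches required_level."""
    mean_prob = sum(frame_probs) / len(frame_probs)
    return max(0.0, required_level - mean_prob)


# Target-class probabilities over one baby-cry event; the two low values
# stand for frames where the baby is breathing in.
cry_with_pauses = [0.9, 0.95, 0.1, 0.9, 0.85, 0.05, 0.9, 0.9]
weak_detection = [0.3] * 8

print(framewise_cross_entropy(cry_with_pauses))  # dominated by the two pause frames
print(event_sufficiency_loss(cry_with_pauses))   # 0.0: the event is detected well enough
print(event_sufficiency_loss(weak_detection))    # positive: detection is too weak overall
```

In this toy setting the frame-wise loss keeps pushing the network to label silence as crying, while the event-level term only asks for a sufficient share of confident frames.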
2. Minimise indecisiveness
A system that changes its classification every audio frame also changes its error every frame. Our loss function encourages the system to make decisions which are more stable and consistent across time. Training models to behave in this way results in more confident and thus better performing models.
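One common way to encourage stable decisions, shown here purely as an assumed illustration and not as the authors' method, is to add a penalty on how much the output probability changes from one frame to the next. The function name and the example sequences are hypothetical.

```python
def instability_penalty(frame_probs):
    """Mean squared change in target-class probability between consecutive frames.

    Expects at least two frames.
    """
    changes = [(b - a) ** 2 for a, b in zip(frame_probs, frame_probs[1:])]
    return sum(changes) / len(changes)


flickering = [0.9, 0.1, 0.9, 0.1, 0.9]   # decision flips on every frame
steady = [0.8, 0.85, 0.9, 0.9, 0.85]     # decision stays consistent over time

print(instability_penalty(flickering))
print(instability_penalty(steady))
```

Added to a classification loss with a suitable weight, a term like this makes a flickering output more expensive than a steady one even when their frame-level accuracy is similar.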
3. Avoid false positives and cross triggers
From the standpoint of user experience, our product development studies have shown that the cost of false alarms on user opinion is higher for cross-triggers than for generic false alarms. For example, a confusion between baby cry and the specific smoke alarm class is perceived more negatively than confusion between baby cry and the unspecific “any other sound” class. This is encoded in our loss function by putting extra weight on cross-triggers, which leads the network to avoid confusing one target class with another. Therefore, the network learns to produce sound detection decisions which are ‘disciplined’ to satisfy user experience criteria. In contrast, networks which learn from crude and non-specialist loss functions like cross-entropy may produce decisions which are unsatisfactory for end users.
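A simple way to picture 'extra weight on cross-triggers' is a cost-weighted variant of cross-entropy. The sketch below is an assumption made for illustration: the weights (3.0 and 1.0), the class ordering and the function names are invented, and the actual weighting used in the patented framework is not given in this post.

```python
import math

OTHER = 2                    # index of the unspecific "any other sound" class
CROSS_TRIGGER_WEIGHT = 3.0   # hypothetical weight: target class mistaken for another target class
GENERIC_WEIGHT = 1.0         # hypothetical weight: any confusion involving "other"


def confusion_weight(true_index, wrong_index):
    """Heavier weight when both the true and the wrong class are specific target classes."""
    if true_index != OTHER and wrong_index != OTHER:
        return CROSS_TRIGGER_WEIGHT
    return GENERIC_WEIGHT


def weighted_loss(true_index, probs, eps=1e-12):
    """Cross-entropy on the true class plus weighted penalties on probability
    mass assigned to each wrong class."""
    loss = -math.log(max(probs[true_index], eps))
    for k, p in enumerate(probs):
        if k != true_index:
            loss -= confusion_weight(true_index, k) * math.log(max(1.0 - p, eps))
    return loss


# Class order: [baby cry, smoke alarm, any other sound]; the true class is baby cry.
leaks_to_smoke_alarm = [0.6, 0.3, 0.1]   # cross-trigger risk
leaks_to_other = [0.6, 0.1, 0.3]         # generic confusion

print(weighted_loss(0, leaks_to_smoke_alarm))
print(weighted_loss(0, leaks_to_other))
```

Plain cross-entropy scores both predictions identically because each gives 0.6 to the correct class; the weighted version charges more for the one that leaks probability towards the smoke alarm class.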
To summarise, ‘vanilla’ loss functions such as cross-entropy ignore several key characteristics of sound events that our loss function framework is built around:
- accounting for the true definition of events as interrupted acoustic sequences of a certain duration,
- across which stable decisions must be made at the level of events rather than frames,
- and accounting for user experience criteria in the performance that the network is trained to deliver.
As such, sound recognition systems trained with cross-entropy cannot claim that they are performing sound event recognition: they merely do tone recognition at the frame level. In contrast, our system is trained to deliver true sound event recognition and user satisfaction by design on everything from computationally-constrained earbuds to smartphones and smart speakers. As a result, our compact system is performing much better in the real world than the systems based on cross-entropy.
DNNs do what they are taught to do, and the role of the loss function framework is precisely to define the rules of behaviour that the network is being taught: it is better for a network to learn the behaviour that users demand than the one defined by a default mathematical function.