Content moderation for voice is very different from moderation for text. Not only is voice a different data type than text, but voice moderation is also still in the early stages of adoption. It is less developed than other types of content moderation, and the technology behind it is still maturing.
Trust & Safety leaders at online platforms will likely come across a number of unfamiliar terms when investigating voice moderation solutions. Here’s a quick guide to voice moderation terms that may be helpful if you find yourself in this position!
*.wav files (‘wave’ files) are a common file format for storing audio recordings (much like a Word document is stored as a *.doc file). The format was created by IBM and Microsoft in 1991, and the files are named ‘wave’ files because sound is a wave.
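As a minimal sketch, Python's standard-library wave module can read the header of a *.wav file and report the properties defined in the rest of this guide. The helper name describe_wav and the in-memory silent clip are assumptions made for illustration; in practice you would pass the path of a real recording.

```python
import io
import wave

def describe_wav(source):
    """Return basic header fields of a WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,  # sample width is reported in bytes
            "duration_s": wav.getnframes() / wav.getframerate(),
        }

# Build a one-second silent mono clip in memory so the example needs no file on disk.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)       # mono
    wav.setsampwidth(2)       # 2 bytes = 16-bit samples
    wav.setframerate(16000)   # 16 kHz
    wav.writeframes(b"\x00\x00" * 16000)

buffer.seek(0)
info = describe_wav(buffer)
print(info)
```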
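Because speech/sound is a wave, it is continuous and has an infinite number of data points. It would be impossible to make an absolutely complete recording, and even a near-complete one would take up an enormous amount of storage space. Instead, a digital recording captures the wave as a series of measurements, called samples, taken at regular intervals.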
‘Sample Rate’ refers to the number of samples captured per second in a digital audio recording. The higher the sample rate, the more precisely the original audio is captured. While there are a variety of sample rates in use, the magic number for acceptable speech quality seems to be 16 kHz, or 16,000 samples per second.
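To make the idea concrete, here is a small sketch that samples the same 440 Hz tone at two sample rates. The function name sample_tone and the chosen tone are illustrative assumptions; the point is only that doubling the sample rate doubles the number of data points captured for the same stretch of audio.

```python
import math

def sample_tone(frequency_hz, sample_rate_hz, duration_s):
    """Sample a sine wave at evenly spaced instants; amplitudes fall in [-1, 1]."""
    n_samples = round(sample_rate_hz * duration_s)
    return [
        math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]

low = sample_tone(440, 8000, 0.01)    # 10 ms at 8 kHz
high = sample_tone(440, 16000, 0.01)  # 10 ms at 16 kHz
print(len(low), len(high))  # 80 160
```

A sample rate can only represent frequencies up to half its value, so 16 kHz audio covers content up to 8 kHz, which is where most of the information in speech sits.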
Bit rate is similar to sample rate: it, too, has an impact on the quality of an audio file. It describes how much data is transmitted per second from one location to the next. Bit rate is usually expressed in bits per second (bps), kilobits per second (kbps), or megabits per second (Mbps).
The higher the sample rate and bit rate, the better the quality of the audio recording. However, higher-quality recordings are also larger, taking up more storage space.
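For uncompressed audio, the trade-off between quality and size can be worked out with simple arithmetic: bit rate is the sample rate multiplied by the bit depth and the number of channels (both described below). The sketch here assumes uncompressed PCM audio; the function names are made up for illustration, and compressed formats will come out smaller.

```python
def pcm_bit_rate(sample_rate_hz, bit_depth, channels):
    """Bits per second for uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels

def pcm_size_bytes(sample_rate_hz, bit_depth, channels, duration_s):
    """Approximate payload size in bytes, ignoring the small file header."""
    return pcm_bit_rate(sample_rate_hz, bit_depth, channels) * duration_s // 8

# One minute of 16 kHz, 16-bit, mono speech.
print(pcm_bit_rate(16000, 16, 1))        # 256000 bits per second (256 kbps)
print(pcm_size_bytes(16000, 16, 1, 60))  # 1920000 bytes, a little under 2 MB

# One minute of 44.1 kHz, 16-bit, stereo audio for comparison.
print(pcm_size_bytes(44100, 16, 2, 60))  # 10584000 bytes
```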
Bit Depth (Dynamic Range)
Bit depth refers to the number of bits used to store each sample. Think of color depth in an image: the more bits available for each pixel, the more shades the image can represent. Similarly, the greater the number of bits in an audio sample, the more precisely the loudness of that sample is captured, and the wider the dynamic range (the gap between the quietest and loudest sounds) of the recording.
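The effect of bit depth can be sketched with a few lines of arithmetic. Each extra bit doubles the number of loudness levels a sample can take, which works out to roughly 6 dB of theoretical dynamic range per bit. The function names below are illustrative assumptions, and real recordings rarely reach the theoretical figure.

```python
import math

def quantization_levels(bit_depth):
    """Number of distinct values a sample of this bit depth can take."""
    return 2 ** bit_depth

def dynamic_range_db(bit_depth):
    """Theoretical dynamic range in decibels (about 6.02 dB per bit)."""
    return 20 * math.log10(2 ** bit_depth)

def quantize(sample, bit_depth):
    """Round a sample in [-1.0, 1.0] to the nearest level a signed integer of this depth can store."""
    max_int = 2 ** (bit_depth - 1) - 1
    return round(sample * max_int) / max_int

print(quantization_levels(8), quantization_levels(16))  # 256 65536
print(round(dynamic_range_db(16), 1))                   # 96.3

# The same sample stored at 8 and 16 bits: the 16-bit version lands closer to the original.
error_8 = abs(quantize(0.3, 8) - 0.3)
error_16 = abs(quantize(0.3, 16) - 0.3)
print(error_16 < error_8)  # True
```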
Channels (Mono or Stereo)
Mono recordings are made on a single channel, while stereo recordings use two channels. A stereo recording may be of higher quality, and it may be easier to tell who is speaking.
On the other hand, for data analytics, even if you send a file that has been recorded in stereo, it will typically be analyzed as a single channel, because the channels are averaged together before analysis. Additionally, recording on more channels means more processing and more storage.
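Averaging channels is often called downmixing. Here is a minimal sketch of the idea, assuming the two channels are plain lists of samples of equal length; the function name downmix_to_mono is hypothetical, and a given analytics product may combine channels differently.

```python
def downmix_to_mono(left, right):
    """Average two channels sample by sample into a single mono channel."""
    if len(left) != len(right):
        raise ValueError("channels must contain the same number of samples")
    return [(l + r) / 2 for l, r in zip(left, right)]

# One speaker is only on the left channel, another only on the right.
left = [0.5, 0.0, 0.5, 0.0]
right = [0.0, -0.5, 0.0, -0.5]

mono = downmix_to_mono(left, right)
print(mono)  # [0.25, -0.25, 0.25, -0.25]
```

After the downmix both voices sit in the same channel, so any benefit of knowing which channel a speaker was on is gone by the time the audio is analyzed.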
Imagine a voice chat on a gaming channel. There will likely be multiple users, each with their own ‘stream.’ For the purposes of Trust & Safety, it is better to keep each user in their own stream to eliminate any confusion, even when conversations overlap one another. Analyzing concurrent streams overlaid on top of one another leaves room for misunderstanding or misattribution of behaviors. Consistent, accurate enforcement of your community guidelines requires that your Trust & Safety team can tell who said what.
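The attribution problem can be illustrated with a toy example. Everything here is hypothetical: the user IDs, the flag_streams helper, and especially the too_loud detector, which is a stand-in for a real moderation model and only checks peak volume.

```python
def flag_streams(streams, detector):
    """Run the detector on each user's own stream; return the IDs that were flagged."""
    return [user_id for user_id, samples in streams.items() if detector(samples)]

def too_loud(samples, threshold=0.8):
    # Placeholder detector for illustration only: flags a stream whose peak exceeds the threshold.
    return max((abs(s) for s in samples), default=0.0) > threshold

streams = {
    "user_a": [0.1, -0.2, 0.1],
    "user_b": [0.9, -0.95, 0.7],
}

# Separate streams: the flag points directly at one user.
print(flag_streams(streams, too_loud))  # ['user_b']

# Overlaid streams: the detector still fires, but nothing says who was responsible.
mixed = [a + b for a, b in zip(streams["user_a"], streams["user_b"])]
print(too_loud(mixed))  # True
```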
Inference is the term for the stage at which a trained AI model is applied to new data to make predictions. AI content moderation solutions train their algorithms on massive amounts of data; at inference, the trained model uses what it has learned to assess new content, which minimizes the time and effort it takes to analyze a data set.
For a voice chat solution, the inference rate refers to how often the analysis runs against the audio. The more frequent the analysis, the more accurate the results tend to be; however, a higher inference rate requires more processing.
Step time is the interval at which audio snapshots are taken and sent to inference for analysis. Step time can have an impact on the quality of voice analysis. When choosing a step time, it is important to remember that overlap between consecutive audio snapshots is critical so that context and continuity can be maintained.
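The relationship between step time, overlap, and processing load can be sketched as a sliding window over the audio. The snapshot length (window_s) is an assumption introduced for this example, as are the function names; the takeaway is that snapshots overlap whenever the step time is shorter than the snapshot, and that a shorter step time means more snapshots to analyze.

```python
def window_starts(total_samples, window_samples, step_samples):
    """Start index of every full snapshot when advancing step_samples at a time."""
    if window_samples <= 0 or step_samples <= 0:
        raise ValueError("window and step must be positive")
    return list(range(0, total_samples - window_samples + 1, step_samples))

def snapshots(samples, sample_rate_hz, window_s, step_s):
    """Cut audio into snapshots of window_s seconds, taken every step_s seconds."""
    window = round(window_s * sample_rate_hz)
    step = round(step_s * sample_rate_hz)
    starts = window_starts(len(samples), window, step)
    return [samples[start:start + window] for start in starts]

audio = [0.0] * (10 * 16000)  # ten seconds of 16 kHz audio

no_overlap = snapshots(audio, 16000, window_s=2.0, step_s=2.0)
one_second_overlap = snapshots(audio, 16000, window_s=2.0, step_s=1.0)
print(len(no_overlap), len(one_second_overlap))  # 5 9
```

With a 2-second snapshot and a 1-second step, every moment of audio (apart from the very start and end) is seen twice, so a phrase that straddles a snapshot boundary still appears whole in the next one. The cost is nearly twice as many inference calls.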
Voice chat moderation is a new frontier for platforms – based on technology that is still in development. As voice chat becomes more commonly adopted across different platforms, the need for Trust & Safety solutions that encompass voice chat will become greater. Hopefully, a familiarity with the terms used to describe voice chat moderation can help Trust & Safety officers make the best decisions for their users, platforms, and businesses.