One of my pet projects involves detecting *speech* in audio signals. I am not interested in actual
*voice recognition* but rather want to detect whether a signal is *probably* a person speaking or
something else. Since this is a pet project and I have no knowledge of digital signal processing, I
wanted to discover things on my own. This post contains my very preliminary results.

# Processing audio signals with Python

For simplicity, I only tackle uncompressed `WAV` files for now. SciPy is my friend here:

```
import scipy.io.wavfile

data = scipy.io.wavfile.read( "Test.wav" )[1]
```

`data` now contains the amplitude information of the signal. To make any calculations on the signal
more robust, I chunk the signal without overlaps into portions of equal length, which I will refer
to as *frames*. Each frame describes a short amount (say 10ms) of audio. This makes it possible to
calculate statistics per frame and obtain distributions of said statistics for the complete
recording. For proper signal processing, I should probably use some sort of windowing approach.
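The chunking step itself is straightforward. Here is a minimal sketch, assuming a sampling rate of 44.1 kHz and 10 ms frames (both values chosen for illustration; `signal` is just a dummy array standing in for real audio data):

```
import numpy

sampleRate       = 44100   # assumed sampling rate in Hz
frameSizeInS     = 0.01    # 10 ms per frame
samplesPerFrame  = int(sampleRate * frameSizeInS)  # 441 samples

signal = numpy.zeros(44100)  # one second of "silence" as dummy data

# Chunk the signal into non-overlapping frames of equal length
frames = [ signal[i:i + samplesPerFrame]
           for i in range(0, len(signal), samplesPerFrame) ]
```

One second of audio at 44.1 kHz thus yields 100 frames of 441 samples each; the last frame of a real recording may be shorter if the signal length is not a multiple of the frame size.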

# Time-domain statistics

For this post, I shall only use some features in the *time domain* of the signal. I am well aware
that Fourier analysis exists, but my knowledge about these techniques is too sparse at the
moment—I know about high- and low-pass filters and the theory behind algorithms such as FFT,
but that's about it.

While browsing some books about audio analysis, I stumbled across some statistics that seemed worthwhile (because they are easy to implement):

- Short-term energy
- Zero-crossing rate
- Entropy of energy

I only have a cursory understanding of them (for now) so I hope that I got the terminology right.

## Short-term energy

The short-term energy of a frame is defined as the sum of the squared absolute values of the amplitudes, normalized by the frame length:

```
def shortTermEnergy(frame):
    return sum( [ abs(x)**2 for x in frame ] ) / len(frame)
```
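As a quick sanity check on a made-up frame: `[1, -1, 2]` has squared magnitudes 1, 1, and 4, so its short-term energy should be 6/3 = 2:

```
def shortTermEnergy(frame):
    return sum( [ abs(x)**2 for x in frame ] ) / len(frame)

print(shortTermEnergy([1, -1, 2]))  # 2.0
```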

## Zero-crossing rate

The zero-crossing rate counts how often the amplitude values of the frame change their sign. This value is then normalized by the length of the frame:

```
def zeroCrossingRate(frame):
    signs = numpy.sign(frame)
    signs[signs == 0] = -1
    return len(numpy.where(numpy.diff(signs))[0]) / len(frame)
```
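For example, the (made-up) frame `[1, -1, 1, -1]` changes sign three times over four samples, so its zero-crossing rate is 3/4:

```
import numpy

def zeroCrossingRate(frame):
    signs = numpy.sign(frame)
    signs[signs == 0] = -1
    return len(numpy.where(numpy.diff(signs))[0]) / len(frame)

print(zeroCrossingRate(numpy.array([1, -1, 1, -1])))  # 0.75
```

Mapping exact zeros to `-1` avoids counting a zero sample as its own crossing.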

## Entropy of energy

By further subdividing each frame into a set of sub-frames, we can calculate their respective
short-term energies and treat them as *probabilities*, thus permitting us to calculate their
entropy:

```
def chunks(l, k):
    for i in range(0, len(l), k):
        yield l[i:i+k]

def entropyOfEnergy(frame, numSubFrames):
    lenSubFrame = int(numpy.floor(len(frame) / numSubFrames))
    shortFrames = list(chunks(frame, lenSubFrame))
    energy      = [ shortTermEnergy(s) for s in shortFrames ]
    totalEnergy = sum(energy)
    energy      = [ e / totalEnergy for e in energy ]

    entropy = 0.0
    for e in energy:
        if e != 0:
            entropy = entropy - e * numpy.log2(e)

    return entropy
```
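To see why low entropy indicates concentrated energy, compare two made-up frames: one whose energy is spread evenly across its sub-frames, and one where a single sub-frame carries all of it. With four sub-frames, the evenly spread frame attains the maximum entropy of log2(4) = 2 bits, while the concentrated frame attains 0:

```
import numpy

def shortTermEnergy(frame):
    return sum( [ abs(x)**2 for x in frame ] ) / len(frame)

def chunks(l, k):
    for i in range(0, len(l), k):
        yield l[i:i+k]

def entropyOfEnergy(frame, numSubFrames):
    lenSubFrame = int(numpy.floor(len(frame) / numSubFrames))
    shortFrames = list(chunks(frame, lenSubFrame))
    energy      = [ shortTermEnergy(s) for s in shortFrames ]
    totalEnergy = sum(energy)
    energy      = [ e / totalEnergy for e in energy ]
    entropy = 0.0
    for e in energy:
        if e != 0:
            entropy = entropy - e * numpy.log2(e)
    return entropy

flat  = [1] * 8                    # energy spread evenly over sub-frames
spiky = [4, 0, 0, 0, 0, 0, 0, 0]   # all energy in the first sub-frame

print(entropyOfEnergy(flat, 4))   # 2.0
print(entropyOfEnergy(spiky, 4))  # 0.0
```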

# Combining everything

My brief literature survey suggested that these three statistics suffice for a simple speech detector. More precisely, we need to take a look at the following quantities:

- The *coefficient of variation* of the short-term energy. Speech tends to have higher values here than non-speech. Some experimentation with a few speech/music files shows that a threshold of `1.0` works reasonably well for discriminating between both classes.
- The standard deviation of the zero-crossing rate. Again, speech tends to have higher values here. A threshold of `0.05` seems to work.
- The minimum entropy of energy. This is where speech usually has *lower* values than music. Also, different genres of music seem to exhibit different distributions here—this seems pretty cool to me. Here, I used a threshold of `2.5` to decide whether an audio file contains speech.

These thresholds are in no way backed by anything other than my fiddling with a handful of audio files. I hope this is interesting and useful to someone, and that I will find the time to do more with it! Thus, no dedicated git repository for this one (at least for now), but rather the complete source code. Enjoy:

```
#!/usr/bin/env python3
#
# Released under the HOT-BEVERAGE-OF-MY-CHOICE LICENSE: Bastian Rieck wrote
# this script. As long as you retain this notice, you can do whatever you want
# with it. If we meet some day, and you feel like it, you can buy me a hot
# beverage of my choice in return.

import numpy
import scipy.io.wavfile
import scipy.stats
import sys

def chunks(l, k):
    """
    Yields chunks of size k from a given list.
    """
    for i in range(0, len(l), k):
        yield l[i:i+k]

def shortTermEnergy(frame):
    """
    Calculates the short-term energy of an audio frame. The energy value is
    normalized using the length of the frame to make it independent of said
    quantity.
    """
    return sum( [ abs(x)**2 for x in frame ] ) / len(frame)

def rateSampleByVariation(chunks):
    """
    Rates an audio sample using the coefficient of variation of its short-term
    energy.
    """
    energy = [ shortTermEnergy(chunk) for chunk in chunks ]
    return scipy.stats.variation(energy)

def zeroCrossingRate(frame):
    """
    Calculates the zero-crossing rate of an audio frame.
    """
    signs = numpy.sign(frame)
    signs[signs == 0] = -1
    return len(numpy.where(numpy.diff(signs))[0]) / len(frame)

def rateSampleByCrossingRate(chunks):
    """
    Rates an audio sample using the standard deviation of its zero-crossing
    rate.
    """
    zcr = [ zeroCrossingRate(chunk) for chunk in chunks ]
    return numpy.std(zcr)

def entropyOfEnergy(frame, numSubFrames):
    """
    Calculates the entropy of energy of an audio frame. For this, the frame is
    partitioned into a number of sub-frames.
    """
    lenSubFrame = int(numpy.floor(len(frame) / numSubFrames))
    shortFrames = list(chunks(frame, lenSubFrame))
    energy      = [ shortTermEnergy(s) for s in shortFrames ]
    totalEnergy = sum(energy)
    energy      = [ e / totalEnergy for e in energy ]

    entropy = 0.0
    for e in energy:
        if e != 0:
            entropy = entropy - e * numpy.log2(e)

    return entropy

def rateSampleByEntropy(chunks):
    """
    Rates an audio sample using its minimum entropy.
    """
    entropy = [ entropyOfEnergy(chunk, 20) for chunk in chunks ]
    return numpy.min(entropy)

#
# main
#

frameSizeInS = 0.01   # Frame size in seconds (here: 10 ms); used to collate
                      # the raw samples accordingly
frequency    = 44100  # Frequency of the input data

numSamplesPerFrame = int(frequency * frameSizeInS)

data        = scipy.io.wavfile.read( sys.argv[1] )
chunkedData = list(chunks(list(data[1]), numSamplesPerFrame))

variation = rateSampleByVariation(chunkedData)
zcr       = rateSampleByCrossingRate(chunkedData)
entropy   = rateSampleByEntropy(chunkedData)

print("Coefficient of variation  = %f\n"
      "Standard deviation of ZCR = %f\n"
      "Minimum entropy           = %f" % (variation, zcr, entropy))

if variation >= 1.0:
    print("Coefficient of variation suggests that the sample contains speech")
if zcr >= 0.05:
    print("Standard deviation of ZCR suggests that the sample contains speech")
if entropy < 2.5:
    print("Minimum entropy suggests that the sample contains speech")
```