One of my pet projects involves detecting speech in audio signals. I am not interested in actual voice recognition but rather want to detect whether a signal is probably a person speaking or something else. Since this is a pet project and I have no knowledge of digital signal processing, I wanted to discover things on my own. This post contains my very preliminary results.

Processing audio signals with Python

For simplicity, I only tackle uncompressed WAV files for now. SciPy is my friend here:

data = scipy.io.wavfile.read( "Test.wav" )[1]

data now contains the amplitude information of the signal. To make any calculations on the signal more robust, I chunk the signal without overlaps into portions of equal length (only the last one may be shorter), which I will refer to as frames. Each frame describes a short span (say 10 ms) of audio. This makes it possible to calculate statistics per frame and obtain distributions of said statistics for the complete recording. For proper signal processing, I should probably use some sort of windowing approach.
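The framing step described above can be sketched as follows. This is only a minimal illustration under a few assumptions: the signal is mono and already available as a NumPy array, and splitIntoFrames with its parameters is a hypothetical helper, not part of the script at the end of this post. Unlike the script, it drops trailing samples that do not fill a complete frame.

```python
import numpy

def splitIntoFrames(signal, sampleRate, frameSeconds=0.01):
  """
  Splits a 1-D signal into non-overlapping frames of equal length.
  Hypothetical helper for illustration only.
  """
  samplesPerFrame = int(round(sampleRate * frameSeconds))
  if samplesPerFrame < 1:
    raise ValueError("Frame is shorter than one sample")

  numFrames = len(signal) // samplesPerFrame

  # Drop the incomplete tail so that every frame has the same length
  usable = numpy.asarray(signal[:numFrames * samplesPerFrame], dtype=float)
  return usable.reshape(numFrames, samplesPerFrame)

# One second of a synthetic 440 Hz tone at 44.1 kHz, purely for illustration
rate   = 44100
t      = numpy.arange(rate) / rate
frames = splitIntoFrames(numpy.sin(2 * numpy.pi * 440 * t), rate)

print(frames.shape)  # (100, 441)
```

Each row of frames is one 10 ms frame, so per-frame statistics can be computed row by row.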

Time-domain statistics

For this post, I shall only use some features in the time domain of the signal. I am well aware that Fourier analysis exists, but my knowledge about these techniques is too sparse at the moment—I know about high- and low-pass filters and the theory behind algorithms such as FFT, but that's about it.

By browsing some books about audio analysis, I stumbled upon some statistics that seemed worthwhile (because they are easy to implement):

  • Short-term energy
  • Zero-crossing rate
  • Entropy of energy

I only have a cursory understanding of them (for now) so I hope that I got the terminology right.

Short-term energy

The short-term energy of a frame is defined as the sum of the squared absolute values of the amplitudes, normalized by the frame length:

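def shortTermEnergy(frame):
  # Convert to float first: squaring 16-bit integer samples can overflow
  samples = numpy.asarray(frame, dtype=float)
  return numpy.sum(numpy.abs(samples)**2) / len(frame)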
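As a quick plausibility check of the definition above: a sine wave with amplitude A that covers a whole number of periods should have a short-term energy of A²/2. The sketch below uses a vectorized variant with a hypothetical name (shortTermEnergyOf) and a synthetic frame; it is meant as an illustration, not as a replacement for the function above.

```python
import numpy

def shortTermEnergyOf(frame):
  """
  Mean of the squared samples. Samples are converted to float so that
  integer WAV data cannot overflow when squared.
  """
  samples = numpy.asarray(frame, dtype=numpy.float64)
  return float(numpy.mean(samples ** 2))

# Synthetic frame: 441 samples containing exactly three sine periods
amplitude = 1000.0
n         = numpy.arange(441)
sineFrame = amplitude * numpy.sin(2 * numpy.pi * 3 * n / 441)
energy    = shortTermEnergyOf(sineFrame)

print(numpy.isclose(energy, amplitude ** 2 / 2))  # True
```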

Zero-crossing rate

The zero-crossing rate counts how often the amplitude values of the frame change their sign. This value is then normalized by the length of the frame:

def zeroCrossingRate(frame):
  signs             = numpy.sign(frame)
  signs[signs == 0] = -1

  return len(numpy.where(numpy.diff(signs))[0])/len(frame)
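To see what the function above measures, here is a small self-contained restatement (under the hypothetical name zeroCrossingRateOf) applied to two toy frames. Note that a frame of N samples can have at most N - 1 sign changes, so the rate never quite reaches 1.

```python
import numpy

def zeroCrossingRateOf(frame):
  """
  Number of sign changes between neighbouring samples, divided by the
  frame length. Exact zeros are treated as negative, as in the post.
  """
  signs             = numpy.sign(numpy.asarray(frame, dtype=float))
  signs[signs == 0] = -1

  return numpy.count_nonzero(numpy.diff(signs)) / len(frame)

alternating = [1, -1, 1, -1]  # Sign flips between every pair of samples
constant    = [5, 5, 5, 5]    # Never changes sign

print(zeroCrossingRateOf(alternating))  # 0.75
print(zeroCrossingRateOf(constant))     # 0.0
```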

Entropy of energy

By further subdividing each frame into a set of sub-frames, we can calculate their respective short-term energies, normalize them by their sum, and treat the results as probabilities, thus permitting us to calculate their entropy:

def chunks(l, k):
  for i in range(0, len(l), k):
    yield l[i:i+k]

def entropyOfEnergy(frame, numSubFrames):
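  # Use at least one sample per sub-frame: a frame shorter than numSubFrames
  # (such as the last one) would otherwise give range() a step of zero
  lenSubFrame = max(1, len(frame) // numSubFrames)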
  shortFrames = list(chunks(frame, lenSubFrame))
  energy      = [ shortTermEnergy(s) for s in shortFrames ]
  totalEnergy = sum(energy)
  energy      = [ e / totalEnergy for e in energy ]

  entropy = 0.0
  for e in energy:
    if e != 0:
      entropy = entropy - e * numpy.log2(e)

  return entropy
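Two extreme cases help to interpret the values this function returns. The sketch below is a vectorized variant under a hypothetical name (entropyOfEnergyOf) with two deliberate assumptions that differ from the code above: leftover samples are ignored so that exactly numSubFrames sub-frames remain, and a completely silent frame is assigned an entropy of 0.

```python
import numpy

def entropyOfEnergyOf(frame, numSubFrames):
  samples = numpy.asarray(frame, dtype=float)
  subLen  = len(samples) // numSubFrames
  if subLen == 0:
    raise ValueError("Frame is too short for the requested sub-frames")

  # Assumption: ignore leftover samples to obtain equally-sized sub-frames
  subFrames = samples[:subLen * numSubFrames].reshape(numSubFrames, subLen)
  energies  = numpy.mean(subFrames ** 2, axis=1)
  total     = energies.sum()
  if total == 0:
    return 0.0  # Assumption: a silent frame gets entropy 0

  p = energies / total
  p = p[p > 0]  # Terms with zero probability contribute nothing
  return float(numpy.sum(p * numpy.log2(1.0 / p)))

steady     = numpy.ones(80)   # Energy spread evenly over all sub-frames
burst      = numpy.zeros(80)  # Energy concentrated in the first sub-frame
burst[:10] = 1.0

print(entropyOfEnergyOf(steady, 8))  # 3.0
print(entropyOfEnergyOf(burst, 8))   # 0.0
```

Evenly spread energy gives the maximum value, log2 of the number of sub-frames, while a single burst gives 0. With 20 equally-sized sub-frames, the upper bound is therefore log2(20), roughly 4.32.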

Combining everything

My brief literature survey suggested that these three statistics can be used quite simply to detect speech. More precisely, we need to take a look at the following quantities, each computed over all frames of a recording:

  • The coefficient of variation of the short-term energy. Speech tends to have higher values here than non-speech. Some experimentation with a few speech/music files shows that a threshold of 1.0 works acceptably for discriminating between both classes: values of 1.0 or more suggest speech.
  • The standard deviation of the zero-crossing rate. Again, speech tends to have higher values here. A threshold of 0.05 seems to work: values of 0.05 or more suggest speech.
  • The minimum entropy of energy. This is where speech usually has lower values than music. Also, different genres of music seem to exhibit different distributions here, which seems pretty cool to me. Here, I used a threshold of 2.5: values below it suggest that an audio file contains speech.
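The aggregation behind these three quantities can be sketched as follows. The function name summarizeStatistics and the input values are hypothetical: the inputs are toy numbers chosen only to exercise the code, not measurements from any audio file. The coefficient of variation is computed as standard deviation divided by mean, which is what scipy.stats.variation does with its default settings.

```python
import numpy

def summarizeStatistics(frameEnergy, frameZcr, frameEntropy):
  """
  Aggregates per-frame statistics and compares them with the thresholds
  from the list above. Returns (value, suggestsSpeech) pairs.
  """
  energy     = numpy.asarray(frameEnergy, dtype=float)
  variation  = float(numpy.std(energy) / numpy.mean(energy))
  zcrStd     = float(numpy.std(numpy.asarray(frameZcr, dtype=float)))
  minEntropy = float(numpy.min(frameEntropy))

  return {
    "variation": (variation,  variation  >= 1.0),
    "zcr":       (zcrStd,     zcrStd     >= 0.05),
    "entropy":   (minEntropy, minEntropy <  2.5),
  }

# Toy per-frame values, chosen only to exercise the code; not measurements
result = summarizeStatistics(
  frameEnergy  = [0.0, 0.0, 0.0, 4.0],
  frameZcr     = [0.1, 0.1, 0.1, 0.1],
  frameEntropy = [3.0, 2.0, 4.0, 3.5],
)

for name, (value, suggestsSpeech) in result.items():
  print(name, value, suggestsSpeech)
```

Keeping the three indicators separate, rather than merging them into a single verdict, mirrors the fact that each threshold was found by experimentation on its own.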

This is in no way backed by anything other than me trying and fiddling with some audio files. I hope this is interesting and useful to someone. I also hope that I will find the time to do more with it! Thus, no special git repository for this one (at least for now), but rather the complete source code. Enjoy:

#!/usr/bin/env python3
#
# Released under the HOT-BEVERAGE-OF-MY-CHOICE LICENSE: Bastian Rieck wrote
# this script. As long you retain this notice, you can do whatever you want
# with it. If we meet some day, and you feel like it, you can buy me a hot
# beverage of my choice in return.

import numpy
import scipy.io.wavfile
import scipy.stats
import sys

def chunks(l, k):
  """
  Yields chunks of size k from a given list.
  """
  for i in range(0, len(l), k):
    yield l[i:i+k]

def shortTermEnergy(frame):
  """
  Calculates the short-term energy of an audio frame. The energy value is
  normalized using the length of the frame to make it independent of said
  quantity.
  """
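  # Convert to float first: squaring 16-bit integer samples can overflow
  samples = numpy.asarray(frame, dtype=float)
  return numpy.sum(numpy.abs(samples)**2) / len(frame)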

def rateSampleByVariation(chunks):
  """
  Rates an audio sample using the coefficient of variation of its short-term
  energy.
  """
  energy = [ shortTermEnergy(chunk) for chunk in chunks ]
  return scipy.stats.variation(energy)

def zeroCrossingRate(frame):
  """
  Calculates the zero-crossing rate of an audio frame.
  """
  signs             = numpy.sign(frame)
  signs[signs == 0] = -1

  return len(numpy.where(numpy.diff(signs))[0])/len(frame)

def rateSampleByCrossingRate(chunks):
  """
  Rates an audio sample using the standard deviation of its zero-crossing rate.
  """
  zcr = [ zeroCrossingRate(chunk) for chunk in chunks ]
  return numpy.std(zcr)

def entropyOfEnergy(frame, numSubFrames):
  """
  Calculates the entropy of energy of an audio frame. For this, the frame is
  partitioned into a number of sub-frames.
  """
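  # Use at least one sample per sub-frame: a frame shorter than numSubFrames
  # (such as the last one) would otherwise give range() a step of zero
  lenSubFrame = max(1, len(frame) // numSubFrames)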
  shortFrames = list(chunks(frame, lenSubFrame))
  energy      = [ shortTermEnergy(s) for s in shortFrames ]
  totalEnergy = sum(energy)
  energy      = [ e / totalEnergy for e in energy ]

  entropy = 0.0
  for e in energy:
    if e != 0:
      entropy = entropy - e * numpy.log2(e)

  return entropy

def rateSampleByEntropy(chunks):
  """
  Rates an audio sample using its minimum entropy.
  """
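  entropy = [ entropyOfEnergy(chunk, 20) for chunk in chunks ]
  # Completely silent frames have zero total energy, which makes their
  # entropy NaN (0/0). Skip them when looking for the minimum.
  return numpy.nanmin(entropy)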

#
# main
#

# Frame size in seconds (0.01 s = 10 ms). Will use this quantity to collate
# the raw samples accordingly.
frameSizeInSeconds = 0.01

# Use the sampling rate stored in the file instead of assuming 44.1 kHz.
# The rest of the script assumes a mono file, i.e. a 1-D array of samples.
frequency, samples = scipy.io.wavfile.read( sys.argv[1] )
numSamplesPerFrame = int(frequency * frameSizeInSeconds)

chunkedData = list(chunks(list(samples), numSamplesPerFrame))

variation = rateSampleByVariation(chunkedData)
zcr       = rateSampleByCrossingRate(chunkedData)
entropy   = rateSampleByEntropy(chunkedData)

print("Coefficient of variation  = %f\n"
      "Standard deviation of ZCR = %f\n"
      "Minimum entropy           = %f" % (variation, zcr, entropy) )

if variation >= 1.0:
  print("Coefficient of variation suggests that the sample contains speech")

if zcr >= 0.05:
  print("Standard deviation of ZCR suggests that the sample contains speech")

if entropy < 2.5:
  print("Minimum entropy suggests that the sample contains speech")