While learning about principal component analysis for the first time, I also encountered the intriguing world of eigenfaces. Briefly put, back in the mists of time (at least in computer science terms), Matthew Turk and Alex Pentland came up with an ingenious approach to face recognition and classification. At its core, the method consists of nothing but an eigenvalue decomposition. Let me briefly outline the steps:

  1. Acquire a set of images showing faces. I am assuming that the images are grey-scale because this means that we do not have to think about different colour channels.
  2. Treat each digital image as a vector in a high-dimensional space by “flattening” it. An image of dimensions 192 × 64 pixels becomes a 12288-dimensional vector, for example.
  3. Calculate the mean face of the data set and subtract it from all images. This corresponds to centring the distribution of faces.
  4. Put all vectors into a matrix A. The result has as many rows as a face vector has dimensions, and as many columns as there are faces.
  5. Calculate the eigenvalue decomposition of the sample covariance matrix. This step hides a clever trick: we do not decompose the covariance matrix of the original data, which would be a square matrix with the row dimension of the data, for example 12288 × 12288. Instead, we swap the order of multiplication and decompose the much smaller matrix A^T A, which only has as many rows and columns as there are faces. It can be shown that multiplying an eigenvector of this small matrix with the data matrix yields an eigenvector of the large covariance matrix: if A^T A v = λv, then A A^T (Av) = λ(Av). (A short code sketch of these steps follows after the list.)
  6. Of course, we do not use all eigenvalues and eigenvectors, but only as many as are needed to reach, say, 95% of the explained variance. This discards a surprisingly large amount of data.
  7. Given the reduced eigensystem, we can project any new face onto the (reduced) basis of eigenvectors. This allows us to embed it in what Turk and Pentland call the “face space”.
  8. The projection permits us to decide whether something is a face at all by checking whether it is sufficiently close to the other vectors in “face space”. Furthermore, it permits face classification by assigning a new face to its nearest known neighbour (or to its nearest class of faces). These operations only require basic vector operations, so they are very cheap.
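Here is a minimal sketch of steps 2 to 7 in Python with NumPy. The names (compute_eigenfaces, project, images, variance_threshold) are mine and do not come from the repository mentioned below; the matrix images is assumed to contain one flattened grey-scale face per column.

import numpy as np

def compute_eigenfaces(images, variance_threshold=0.95):
  """Sketch of steps 2-7: `images` is an (n_pixels, n_faces) matrix whose
  columns are flattened grey-scale faces. Returns the mean face and a
  matrix whose columns are the retained eigenfaces."""
  # Step 3: centre the data by subtracting the mean face.
  mean_face = images.mean(axis=1, keepdims=True)
  A = images - mean_face

  # Step 5: decompose the small (n_faces x n_faces) matrix A^T A instead
  # of the huge (n_pixels x n_pixels) covariance matrix A A^T.
  eigenvalues, eigenvectors = np.linalg.eigh(A.T @ A)

  # eigh returns ascending eigenvalues; sort them in descending order and
  # guard against tiny negative values caused by numerical error.
  order        = np.argsort(eigenvalues)[::-1]
  eigenvalues  = np.clip(eigenvalues[order], 0.0, None)
  eigenvectors = eigenvectors[:, order]

  # Step 6: keep only enough eigenvectors to explain 95% of the variance.
  explained = np.cumsum(eigenvalues) / np.sum(eigenvalues)
  k         = int(np.searchsorted(explained, variance_threshold)) + 1

  # Map each small eigenvector v back to an eigenvector Av of A A^T and
  # normalize it; every column of the result is one eigenface.
  eigenfaces  = A @ eigenvectors[:, :k]
  eigenfaces /= np.linalg.norm(eigenfaces, axis=0)
  return mean_face, eigenfaces

def project(face, mean_face, eigenfaces):
  # Step 7: coordinates of a flattened face (an (n_pixels, 1) column
  # vector) with respect to the reduced basis, i.e. its "face space"
  # embedding.
  return eigenfaces.T @ (face - mean_face)

The crucial point is the shape of A.T @ A: for, say, 165 faces of 12288 pixels each, it is a 165 × 165 matrix rather than a 12288 × 12288 one.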

Such a pretty neat algorithm deserves a neat implementation. I tried to write one myself in Python. You can find the code in the Eigenfaces repository on my GitHub profile. As always, it is licensed under an MIT license. Note that the script currently only supports reconstructing faces: it randomly partitions the input data into training and test data, performs the eigensystem calculations on the training data, and attempts a reconstruction of the test data.
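To illustrate what the reconstruction boils down to, here is another small sketch. It reuses the hypothetical mean_face, eigenfaces, and face quantities from the sketch above; the actual script may organise this differently.

def reconstruct(face, mean_face, eigenfaces):
  # Project the centred face onto the retained eigenfaces and map the
  # resulting coordinates back into pixel space. The result is the closest
  # approximation of the face within the reduced eigenbasis.
  weights = eigenfaces.T @ (face - mean_face)
  return mean_face + eigenfaces @ weights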

I used a classical database of faces, the Yale faces data, which has been preprocessed by some nice folks at MIT. In the repository, I only included the original and cropped variants of the faces—the preprocessed data set contains much more.

The results are interesting and highlight some of the problems with the approach. Since the eigendecomposition is completely agnostic about facial features, it may “learn” the wrong thing. Input data needs to be chosen carefully and should ideally already be cropped. Otherwise, the reconstruction quality is not great. For example, this happens when I train on 100 out of 165 images with 95% explained variance:

Example reconstruction on original data

We can see that our reconstruction (left) mostly captures the uninteresting stuff, namely the position of the head, as well as some lighting details, in comparison to the original (right). Here’s a similar example with the cropped faces:

Example reconstruction on cropped data

This version is much better. Of course, it also contains the ghostly remains of the glasses, but all in all, the reconstruction is more true to the original.

For such a conceptually simple algorithm, these are rather nice results. It is very intriguing to see how well even simple classifiers can perform. If you want to play with the algorithm yourself, please grab a copy of the repository.

Posted Sunday afternoon, September 13th, 2015

I recently downloaded the Enron corpus to try to visualize some interesting features. As a first step, I wanted to analyse how many swear words are used in the corpus and find out whether some users have a “saltier” vocabulary than others.

Processing the data

I am only using the simplest procedure for walking through the data: First, I defined a list of offensive words (with a little help from Wikipedia, of course). Second, I used a basic Python script to process each e-mail, tokenize it, and check the words against the list of profanity. The counts are then tallied and printed out. Here’s a rough draft of the script:

#!/usr/bin/env python

import collections
import email.parser
import nltk
import os

directory           = "maildir/"
swearWordsPerPerson = collections.Counter()
swearWordsTotal     = collections.Counter()
swearWords          = set()

# Only parse the headers of each e-mail; the remaining text is kept as the
# raw payload.
parser = email.parser.HeaderParser()

# Read the list of swear words, one word per line.
with open("Swear_words.txt") as f:
  for line in f:
    swearWords.add( line.rstrip() )

for root, _, files in os.walk(directory):
  # The second path component of "maildir/<person>/..." identifies the
  # owner of the mailbox.
  p = os.path.normpath(root)
  p = p.split(os.sep)

  person = ""
  if len(p) >= 2:
    person = p[1]

  print("Processing '%s'..." % person)

  for name in files:
    path = os.path.join(root, name)
    with open(path, errors="ignore", encoding="utf-8") as f:
      # Parse the headers only, so that words occurring in them are not
      # counted, and keep the rest of the message as the body.
      text  = f.read()
      text  = parser.parsestr(text, headersonly=True).get_payload()
      words = nltk.word_tokenize(text.lower())
      for word in words:
        if word in swearWords:
          swearWordsPerPerson[person] += 1
          swearWordsTotal[word]       += 1

  print("...finished")
  print("%s: %d" % (person, swearWordsPerPerson[person]))

print("THE SWEARERS\n")

for person, numSwears in swearWordsPerPerson.most_common():
  print("  %s: %d" % (person, numSwears))

print("")
print("THE SWEARS\n")

for swear, number in swearWordsTotal.most_common():
  print("  %s: %d" % (swear, number))

The conversion of text from each e-mail looks needlessly elaborate, but it is required to ensure that words in the e-mail headers are not counted. Of course, I am not trying to detect any fancy things such as messages that have been forwarded multiple times. Furthermore, I have excluded some words from the list of swears in order not to trigger any false positives. One of the executives is called “Dick Jenkins”, and some of his co-workers refer to him by his first name, so…

The swearers

First, let’s look at a histogram of the absolute number of swears:

Histogram of the absolute number of swears

We can see that there are only a few people who seem to use many swears. However, this is not normalized against the number of e-mails each person received. Let’s do this next!
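For reference, here is a minimal sketch of how such a normalization could be computed. It assumes a hypothetical mailsPerPerson counter that is incremented once per processed e-mail, which the draft script above does not track:

# Hypothetical: mailsPerPerson is a collections.Counter that would be
# incremented once for every e-mail processed in the main loop above.
relativeSwears = {
  person: swearWordsPerPerson[person] / mailsPerPerson[person]
  for person in mailsPerPerson if mailsPerPerson[person] > 0
}

for person, ratio in sorted(relativeSwears.items(), key=lambda item: item[1], reverse=True):
  print("  %s: %.2f" % (person, ratio))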

Histogram of the relative number of swears

The pattern is similar, but the outliers are now “normalized” in a sense. If we focus on a normalized swear count of at least 0.15 (with a unit of “swears per e-mail”), we are left with the following persons:

lucci-p   0.33
lenhart-m 0.28
brawner-s 0.27
quigley-d 0.19
ring-a    0.18
dorland-c 0.15

All the remaining 144 (!) persons use less than 0.15 swear words per e-mail.

The swears

So, what are the top ten swear words in all e-mails? Without further ado, here is a partially censored list (I don’t want to lose my family-friendly rating):

hell   2033
a**    1724
sh*t   1583
sex    1521
balls  1235
d*mn    915
b*tch   711
f*ck    690
butt    615
crap    612

Conclusion

Does this mean that the Enron people like to swear? Unfortunately, no. First of all, the context of a word is extremely important. In some cases, hell might be a misspelling of he’ll. Second, the swear counts may be inflated by people forwarding (or replying to) e-mails without removing the old content.

Upon further inspection, it turns out that most of the swear words are used in personal communications. People are forwarding the usual dirty jokes among their friends and co-workers, who in turn reply with a salty joke themselves.

The take-away messages from this experiment are:

  1. Figuring out words in context is hard.
  2. Analysing a corpus such as the Enron e-mails is also hard.
  3. Natural language is even harder.
Posted at midnight, September 28th, 2015