While learning about principal component analysis
for the first time, I also encountered the intriguing world of *eigenfaces*. Briefly put, back in
the mists of time—at least in computer science terms—Matthew
Turk and Alex Pentland came up
with an ingenious approach towards face recognition and classification. At it’s core, the
method consists of nothing but an eigenvalue decomposition. Let me briefly outline the steps:

- Acquire a set of images showing faces. I am assuming that the images are grey-scale because this means that we do not have to think about different colour channels.
- Treat each digital image as a vector in a high-dimensional space by “flattening” the image vector. An image of dimensions 192 ×64 pixels becomes a 12288-dimensional vector, for example.
- Calculate the
*mean face*of the data set and subtract it from all images. This corresponds to centring the distribution of faces. - Put all vectors into a matrix. The result is a matrix with as many rows as the dimensions of the faces, and as many columns as there are faces.
- Calculate the eigenvalue decomposition of the sample covariance matrix. This step hides a clever
trick: We do not calculate the sample covariance matrix of the original matrix, which would be a
square matrix with the row dimension of the original data, for example 12288 ×12288.
Instead, we calculate the
*transpose*of this matrix, which uses the column dimension of the original data. This matrix will hence only have as many rows and columns as there are faces in the data. It can be shown that an any eigenvector of this matrix multiplied with the original matrix yields an eigenvector of the large covariance matrix. - Of course, we do not use
*all*eigenvalues and eigenvectors but rather only as many until we have reached, say, 95% of explained variance. This removes a surprisingly large amount of data. - If we have the reduced eigensystem, we can project any
*new*face to the (reduced) basis of eigenvectors. This allows us to embed them in what Turk and Pentland call the “face space”. - The projection permits us to decide whether something is a face at all by checking whether it is
sufficiently close to the other vectors in “face space”. Furthermore, it permits
*face classification*by assigning a known face to its nearest neighbour (or to its nearest class of faces). These operations only require basic vector multiplication operations, so they are very cheap.

Such a pretty neat algorithm deserves a neat implementation. I tried to write one myself in Python.
You can find the code in the *Eigenfaces* repository on my GitHub
profile. As always, it is licensed under an MIT
license. Note that the script currently only supports the *reconstruction* aspect of faces. It will
randomly partition input data into training and test data, perform the eigensystem calculations on
the training data, and attempt the reconstruction on the test data.

I used a classical database of faces, the *Yale faces* data, which has been preprocessed by some
nice folks at MIT. In the
repository, I only included the original and cropped variants of the faces—the preprocessed
data set contains much more.

The results are interesting and highlight some of problems with the approach. Since the eigendecomposition is completely agnostic concerning any sorts of facial features, it may “learn” the wrong thing. Input data needs to be chosen carefully and should ideally be already cropped. Otherwise, the reconstruction quality is not very great. For example, this happens when I train on 100 out of 165 images with 95% explained variance:

We can see that our reconstruction (left) captures mostly the uninteresting stuff, namely the
position of the head, as well as some lighting details, in comparison to the original (right).
Here”s a similar example with the *cropped* faces:

This version is much better. Of course, it also contains the ghostly remains of the glasses, but all in all, the reconstruction is more true to the original.

For such a conceptually simple algorithm, these are rather nice results. It is very intriguing to see how well even simple classifiers can perform. If you want to play with the algorithm yourself, please grab a copy of the repository.

I recently downloaded the Enron corpus to try to visualize
some interesting features. As a first step, I wanted to analyse how many *swear words* are used
in the corpus. I wanted to find out if there are some users with a “saltier“ language
than others.

# Processing the data

I am only using the simplest procedure for walking through the data: First, I defined a list of offensive words (with a little help from Wikipedia, of course). Second, I used a basic Python script to process each e-mail, tokenize it, and check the words against the list of profanity. The counts are then tallied and printed out. Here’s a rough draft of the script:

```
#!/usr/bin/env python
import collections
import email
import hashlib
import nltk
import os
directory = "maildir/"
swearWordsPerPerson = collections.Counter()
swearWordsTotal = collections.Counter()
swearWords = set()
parser = email.parser.HeaderParser()
with open("Swear_words.txt") as f:
for line in f:
swearWords.add( line.rstrip() )
for root, _, files in os.walk(directory):
p = os.path.normpath(root)
p = p.split(os.sep)
person = ""
if len(p) >= 2:
person = p[1]
print("Processing '%s'..." % person)
for name in files:
path = os.path.join(root,name)
with open(path, errors="ignore", encoding="utf-8") as f:
text = f.read()
text = parser.parsestr(text, headersonly=True).get_payload()
text = text.encode("utf-8")
words = nltk.word_tokenize(text.decode("utf-8").lower())
for index,word in enumerate(words):
if word in swearWords:
swearWordsPerPerson[person] += 1
swearWordsTotal[word] += 1
print("...finished")
print("%s: %d" % (person, swearWordsPerPerson[person]))
print("THE SWEARERS\n")
for person, numSwears in swearWordsPerPerson.most_common():
print(" %s: %d" % (person, numSwears))
print("")
print("THE SWEARS\n")
for swear, number in swearWordsTotal.most_common():
print(" %s: %d" % (swear, number))
```

The conversion of text from each e-mail looks needlessly elaborate, but it is required to ensure that words in the e-mail headers are not counted. Of course, I am not trying to detect any fancy things such as messages that have been forwarded multiple times. Furthermore, I have excluded some words from the list of swears in order not to trigger any false positives. One of the executives is called “Dick Jenkins”, and some of his co-workers refer to him by his first name, so…

# The swearers

First, let’s look at a histogram of the *absolute* number of swears:

We can see that there are only few people who seem to use many swears. However, this is not normalized against the number of e-mails each person received. Let’s do this next!

The pattern is similar, but the outliers are now “normalized” in a sense. If we focus on a normalized swear count of larger than 0.25 (with a unit of “swears per e-mail”), we are left with the following persons:

```
lucci-p 0.33
lenhart-m 0.28
brawner-s 0.27
quigley-d 0.19
ring-a 0.18
dorland-c 0.15
```

All the remaining 144 (!) persons use less than 0.15 swear words per e-mail.

# The swears

So, what are the top ten swear words in all e-mails? Without further ado, here is a partially censored list (I don’t want to lose my family-friendly rating):

```
hell 2033
a**: 1724
sh*t 1583
sex 1521
balls 1235
d*mn 915
b*tch 711
f*ck 690
butt 615
crap 612
```

# Conclusion

Does this mean that the Enron people like to swear? Unfortunately, no. First of all, the context of
a word is extremely important. In some cases, *hell* might be a misspelling of *he’ll*.
Second, the swear count might be multiplied by people forwarding (or replying) e-mails without
removing the old content.

Upon further inspection, it turns out that most of the swear words are used in personal communications. People are forwarding the usual dirty jokes among their friends and co-workers, who in turn reply with a salty joke themselves.

The take-away messages from this experiment are:

- Figuring out words in context is hard.
- Analysing a corpus such as the Enron e-mails is also hard.
- Natural language is even harder.