This is the personal blog of Bastian Rieck. I add posts at irregular intervals about things I consider relevant. Send any feedback, comments etc. to bastian@annwfn.net.

Software developers are already familiar with the KISS principle. Roughly speaking, it refers to the old wisdom that simple solutions are to be preferred, while unnecessary complexity should be avoided. In machine learning, this means that one should increase the complexity iteratively, starting from simple models and—if necessary—working upwards to more complicated ones. I was recently reminded of and humbled by the wisdom of KISS while working on a novel topology-based kernel for machine learning.

Prelude: A Paper on Deep Graph Kernels

A recent paper by Yanardag and Vishwanathan introduced a way of combining graph kernels with deep learning techniques. The new framework basically yields a way of using graph kernels, such as graphlet kernels, in a regular deep learning setting.

The two researchers used some interesting data sets for their publication. Most notably, they extracted a set of co-occurrence networks (more about that in a minute) from Reddit, a content aggregation and discussion site. Reddit consists of different communities, the subreddits. Each subreddit deals with a different topic, ranging from archaeology to zoology. The posting style of these subreddits varies a lot. There are several subreddits that are based on a question–answer format, while others are more centred around individual discussions.

Yanardag and Vishwanathan hence crawled the top submissions from the subreddits IamA and AskReddit, both of which are based on questions and answers, as well as from TrollXChromosomes and atheism, which are discussion-based subreddits. From every submission, a graph was created by taking all the commenters of a thread as nodes and connecting two nodes by an edge if one user responds to the comment of another user. We can see that this is an extremely simple model—it represents only a fraction of the information available in every discussion. Nonetheless, there is some hope that qualitatively different behaviours will emerge. More precisely, the assumption of Yanardag and Vishwanathan is that there is some quantifiable difference between question–answer subreddits and discussion-based subreddits. Their paper aims to learn the correct classification for each thread. Hence, given a graph, we want to teach the computer to tell us whether the graph is more likely to arise from a question–answer subreddit or from a discussion-based one.
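To make this construction concrete, here is a minimal sketch of how such a discussion graph could be built with NetworkX. This is not the authors' code; the list of commenter–parent pairs is a made-up stand-in for the real thread data.

import networkx as nx

# Hypothetical input: one (commenter, parent commenter) pair per reply in a thread.
replies = [
  ('alice', 'bob'),
  ('carol', 'alice'),
  ('dave',  'bob'),
  ('alice', 'dave'),
]

G = nx.Graph()
for commenter, parent in replies:
  # Users become nodes; an undirected edge records that one user replied to the other.
  G.add_edge(commenter, parent)

print(G.number_of_nodes(), G.number_of_edges())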

The two researchers refer to this data set as REDDIT-BINARY. It is now available in a repository of benchmark data sets for graph kernels, graciously provided and lovingly curated by the CS department of Dortmund University.

Interlude: Looking at the Data

Prior to actually working with the data, let us first take a look at it in order to get a feel for its inherent patterns. To this end, I used Aleph, my library for topological data analysis, to read and convert the individual graphs of every discussion to the simpler GML format. See below if you are also interested in the data. Next, I used Gephi to obtain a force-directed visualization of some of the networks.

A look at a question–answer subreddit shows a central core structure of the data set, from which numerous strands—each corresponding most probably to smaller discussions—emerge:

A graph visualization of Q&A subreddits

The discussion-based subreddit graph, on the other hand, exhibits a larger depth, manifesting itself in a few long strands. Nonetheless, a similar central core structure is observable as well.

A graph visualization of discussion-based subreddits

Keep in mind that the selected examples are not necessarily representative—I merely picked two of the large graphs in the data set to obtain an idea of how the data looks.

A Complex Classification

A straightforward way to classify those networks would be to use graph kernels, such as the graphlet kernel. The basic idea behind these kernels is to measure a dissimilarity between two graphs by means of, for example, the presence or absence of certain subgraphs. At least this is the strategy pursued by the graphlet kernel. Other kernels may instead opt for comparing random walks on both graphs. A common theme of these kernels is that they are rather expensive to compute. Nonetheless, in many applications they are the only practical way of obtaining suitable dissimilarity information without having to solve the graph isomorphism problem, which is even more computationally expensive. Graph kernels are also quite flexible and, as the authors show, can even be integrated into a deep learning framework.
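As a toy illustration of the subgraph-counting idea (this is not the graphlet kernel from the paper, merely a small sketch in the same spirit), one could count a few three-node patterns per graph and compare the resulting feature vectors with a linear kernel:

import networkx as nx
import numpy as np

def tiny_graphlet_features(G):
  # Count two simple three-node patterns: triangles and wedges (paths of length two).
  triangles = sum(nx.triangles(G).values()) // 3
  wedges    = sum(d * (d - 1) // 2 for _, d in G.degree())
  return np.array([triangles, wedges], dtype=float)

def toy_kernel(G1, G2):
  # A linear kernel on the toy feature vectors; real graphlet kernels use much
  # richer subgraph counts and a proper normalisation.
  return float(np.dot(tiny_graphlet_features(G1), tiny_graphlet_features(G2)))

print(toy_kernel(nx.karate_club_graph(), nx.cycle_graph(10)))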

As for their performance, Yanardag and Vishwanathan report an accuracy of 77.34% (standard deviation 0.18) for graphlet kernels and 78.04% (standard deviation 0.39) for deep graphlet kernels, which is very good considering the baseline for random guessing is 50%.

A Simple Classification

In line with the KISS principle, I wanted to figure out alternative ways of achieving similar accuracy values with a better performance, if possible. This suggests a feature-based approach with features that can be calculated easily. There are numerous potential options for choosing these features. I figured that good choices include the average clustering coefficient of the graph, the average degree, the average shortest path length, the density, and the diameter of the graph. Among these, the average shortest path length and the diameter take the longest to compute because they essentially have to enumerate all shortest paths in the graph. So I only used the remaining three features, all of which can be computed quickly.

I used the excellent NetworkX package for Python to do most of the heavy lifting. Reading a graph and calculating its features is as easy as it gets:

import networkx as nx
import numpy as np

G = nx.read_gml(filename, label='id')

average_clustering_coefficient = nx.average_clustering(G)
average_degree                 = np.mean( [degree for _, degree in nx.degree(G)] )
density                        = nx.density(G)

I collect these values in a pandas.DataFrame to simplify their handling. Now we have to choose some classifiers. I selected decision trees, support vector machines, and logistic regression. Next, let us train and test each of these classifiers by means of 10-fold cross-validation.
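Before the training step, here is a rough sketch of the bookkeeping just mentioned. The GML/ directory and the Labels.txt file follow the repository layout described at the end of this post; the exact label format is an assumption on my part.

import glob

import networkx as nx
import numpy as np
import pandas as pd

rows = []
for filename in sorted(glob.glob('GML/*.gml')):
  G = nx.read_gml(filename, label='id')
  rows.append({
    'average_clustering_coefficient': nx.average_clustering(G),
    'average_degree'                : np.mean([d for _, d in nx.degree(G)]),
    'density'                       : nx.density(G),
  })

df     = pd.DataFrame(rows)
labels = np.loadtxt('Labels.txt')  # assumption: one numeric label per graph, in file order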

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X = df.values
y = labels

classifiers = [ DecisionTreeClassifier(), LinearSVC(), LogisticRegression() ]
for clf in classifiers:
  scores = cross_val_score(clf, X, y, cv=10)
  print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

As you can see, I am using the default values for every classifier—no grid search or other technique for finding better hyperparameters for now. Instead, I want to see how these classifiers perform out of the box.

Astonishing Results

The initial test resulted in the following accuracy values: 77.84% (standard deviation 0.16) for decision trees, 64.14% (standard deviation 0.45) for support vector machines, and 55.54% (standard deviation 0.49) for logistic regression. At least the first result is highly astonishing—without any adjustments to the classifier whatsoever, we obtain a performance that is on par with more complex learning strategies! Recall that Yanardag and Vishwanathan adjusted the hyperparameters during training, while this approach merely used the default values.

Let us thus focus on the decision tree classifier first and figure out why it performs the way it does. To this end, let us take a look at what the selected features look like and how important they are. To do this, I trained a single instance of the decision tree classifier, based on a standard train–test split:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

print(clf.feature_importances_)

This results in [ 0.18874114 0.34113588 0.47012298 ], meaning that all three features are somewhat important, with density accounting for almost 50% of the total feature importance—if you are unfamiliar with decision trees, the idea is to obtain nodes that are as “pure” as possible, meaning that there should be as few differences in class labels within a node as possible. The density attribute appears to be particularly important for splitting up impure nodes correctly.
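For reference, scikit-learn's DecisionTreeClassifier uses Gini impurity as its default splitting criterion, and the importance values above measure how much each feature contributes to reducing this impurity across the tree:

G = 1 - \sum_k p_k^2

Here, p_k denotes the fraction of samples of class k in a node; a perfectly pure node has G = 0.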

To get a better feeling of the feature space, let us briefly visualize it using principal component analysis:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

clf = PCA(n_components=2)
X_  = clf.fit_transform(X)

for label in set(labels):
  idx = (y == label)
  plt.scatter(X_[idx, 0], X_[idx, 1], label=label, alpha=0.25)

plt.legend()
plt.show()

This results in the following image:

A PCA visualization of the features used in the decision tree classifier

Every dot in the image represents an individual graph, while distances in the plot roughly correspond to distances between the feature vectors. The large amount of overlap between graphs with different labels thus suggests that the features are insufficient to separate the individual graphs. How does the decision tree manage to separate them, nonetheless? As it turns out, by creating a very deep and wide tree. To see this, let us export the decision tree classifier:

with open("/tmp/clf.txt", "w") as f:
  export_graphviz(clf, f)

Here is a small depiction of the resulting tree:

A small version of the resulting decision tree

There is also a larger variant of this tree for you to play around with. As you can see, the tree looks relatively complicated, so it is very much tailored to the given problem. That is not to say that the tree necessarily suffers from overfitting—we explicitly used cross-validation, so the reported accuracy is not merely an artefact of memorizing the training data. What it does show is that the simple feature space with three features, while yielding very good classification results, is not detecting the true underlying structure of the problem. The rules generated by the decision tree are artificial in the sense that they do not help us understand what makes the two groups of graphs different from each other.

Coda

So, where does this leave us? We saw that we were able to perform as well as state-of-the-art methods by merely picking a simple feature space, consisting of three features, and a simple decision tree classifier. I would say that any classifier of higher complexity should have to yield significant improvements over this simple baseline, both in terms of average accuracy and in terms of standard deviation, in order to justify its use. Hence, the KISS principle in machine learning: start simple and only add more complex building blocks, i.e. models, once you have established a useful baseline.

You can find both the code and the data in my repository on topological machine learning. To run the analysis described in this blog post, just execute

$ ./simple_network_analysis.py GML/????.gml Labels.txt

in the REDDIT-BINARY subfolder of the repository.

That is all for now, until next time!

Posted at teatime on Wednesday, October 18th, 2017

A warning upfront: this post is sort of an advertisement for Aleph, my C++ library for topological data analysis. In this blog post I do not want to cover any of the mathematical algorithms present in Aleph—I rather want to focus on a small but integral part of the project, viz. unit tests.

If you are not familiar with the concept of unit testing, the idea is (roughly) to write a small, self-sufficient test for every new piece of functionality that you write. Proponents of the methodology of test-driven development (TDD) even go so far as to require you to write the unit tests before you write your actual code. In this mindset, you first think of the results your code should achieve and which outputs you expect prior to writing any “real” code. I am putting the word real in quotation marks here because it may seem strange to focus on the tests before doing the heavy lifting.

However, this way of approaching software development may actually be quite beneficial, in particular if you are working on algorithms with a nice mathematical flavour. Here, thinking about the results you want to achieve with your code ensures that at least a few known examples are processed correctly by your code, making it more probable that the code will perform well in real-world scenarios.

When I started writing Aleph in 2016, I also wanted to add some unit tests, but I did not think that the size of the library warranted the inclusion of one of the big players, such as Google Test or Boost.Test. While arguably extremely powerful and teeming with more features than I could possibly imagine, they are also quite heavy and require non-trivial adjustments to any project.

Thus, in the best tradition of the not-invented-here syndrome, I decided to roll my own testing framework, based on pure CMake and a small dash of C++. My design decisions were rather simple:

  • Use CTest, the testing framework of CMake, to run the tests. This framework is rather simple and just uses the return value of a unit test program to decide whether the test worked correctly.
  • Provide a set of routines to check the correctness of certain calculations within a unit test, throwing an error if something unexpected happened.
  • Collect unit tests for the “larger” parts of the project in a single executable program.

Yes, you read that right—my approach actually opts for throwing an error in order to crash the unit test program. Bear with me, though, for I think that this is actually a rather sane way of approaching unit tests. After all, if a test fails, I am usually not interested in whether other parts of a test program—that may potentially depend on previous calculations—run through or not. As a consequence, adding a unit test to Aleph is as simple as adding the following lines to a CMakeLists.txt file, located in the tests subdirectory of the project:

ADD_EXECUTABLE( test_io_gml test_io_gml.cc )
ADD_TEST(            io_gml test_io_gml    )

In the main CMakeLists.txt, I added the following lines:

ENABLE_TESTING()
ADD_SUBDIRECTORY( tests )

So far, so good. A test now looks like this:

#include <tests/Base.hh>

void testBasic()
{
  // do some nice calculation; store the results in `foo` and `bar`,
  // respectively (placeholder values shown here so that the example is
  // self-contained)
  double foo = 2.0;
  double bar = 1.0;

  ALEPH_ASSERT_THROW( foo != bar );
  ALEPH_ASSERT_EQUAL( foo, 2.0 );
  ALEPH_ASSERT_EQUAL( bar, 1.0 );
}

void testAdvanced()
{
  // a more advanced test
}

int main(int, char**)
{
  testBasic();
  testAdvanced();
}

That is basically the whole recipe for a simple unit test. Upon execution, main() will ensure that all larger-scale test routines, i.e. testBasic() and testAdvanced(), are called. Within each of these routines, the calls to the corresponding macros—more on that in a minute—ensure that conditions are met, or that certain values are equal to other values. Otherwise, an error will be thrown, the test program will abort, and CTest will report the test as failed.

So, what do the macros look like? Here is a copy from the current version of Aleph:

#define ALEPH_ASSERT_THROW( condition )                             \
{                                                                   \
  if( !( condition ) )                                              \
  {                                                                 \
    throw std::runtime_error(   std::string( __FILE__ )             \
                              + std::string( ":" )                  \
                              + std::to_string( __LINE__ )          \
                              + std::string( " in " )               \
                              + std::string( __PRETTY_FUNCTION__ )  \
    );                                                              \
  }                                                                 \
}

#define ALEPH_ASSERT_EQUAL( x, y )                                  \
{                                                                   \
  if( ( x ) != ( y ) )                                              \
  {                                                                 \
    throw std::runtime_error(   std::string( __FILE__ )             \
                              + std::string( ":" )                  \
                              + std::to_string( __LINE__ )          \
                              + std::string( " in " )               \
                              + std::string( __PRETTY_FUNCTION__ )  \
                              + std::string( ": " )                 \
                              + std::to_string( ( x ) )             \
                              + std::string( " != " )               \
                              + std::to_string( ( y ) )             \
    );                                                              \
  }                                                                 \
}

Pretty simple, I would say. The ALEPH_ASSERT_EQUAL macro actually tries to convert the corresponding values to strings, which may not always work. Of course, you could use more complicated string conversion routines, as Boost.Test does. For now, though, these macros are sufficient to make up the unit test framework of Aleph, which, at the time of writing, encompasses more than 4000 lines of code.

The only remaining question is how this framework is used in practice. By calling ENABLE_TESTING(), CMake actually exposes a new target called test. Hence, in order to run those tests, a simple make test is sufficient in the build directory. This is what the result may look like:

$ make test
Running tests...
Test project /home/bastian/Projects/Aleph/build
      Start  1: barycentric_subdivision
 1/36 Test  #1: barycentric_subdivision ............   Passed    0.00 sec
      Start  2: beta_skeleton

      [...]

34/36 Test #34: union_find .........................   Passed    0.00 sec
      Start 35: witness_complex
35/36 Test #35: witness_complex ....................   Passed    1.82 sec
      Start 36: python_integration
36/36 Test #36: python_integration .................   Passed    0.07 sec

100% tests passed, 0 tests failed out of 36

Total Test time (real) =   5.74 sec

In addition to being rather lean, this framework can easily be integrated into an existing Travis CI workflow by adding

- make test

as an additional step to the script target in your .travis.yml file.

If you are interested in using this testing framework, please take a look at the following files:

That is all for now, until next time—may your unit tests always work the way you expect them to!

Posted Sunday night, October 15th, 2017

Many applications require the selection of a subset of objects from a larger set. For example, suppose you are down-sampling a large data set by choosing the indices that you want to keep at random. If you are satisfied with obtaining duplicate indices (in essence, if you are sampling with replacement), this is a trivial matter in most programming languages. Python certainly makes this absurdly easy:

import random

def sample(k,n):
  return [random.choice( [x for x in range(n)] ) for _ in range(k)]

But what about sampling without replacement? More precisely, what if you require the selection of k indices from an array of n values without choosing duplicate values but still want the probability of choosing a certain element to be uniform? Well, if you use Python, you are in luck again:

import random

def sample_without_replacement(k,n):
  return random.sample([x for x in range(n)], k)

But what about other programming languages that may not have similar builtin functionality, such as C++? In this case, a very simple and elegant technique is given by none other than the great Donald E. Knuth. The following algorithm was originally given in The Art of Computer Programming, Volume 2, Section 3.4.2 on p. 137. Briefly put, it works like this:

  1. Set t = 0 and m = 0. Here, m represents the number of items selected so far, t is the number of items that have already been traversed, n is the number of items we want to select, and N is the total number of items.

  2. Generate a random number U that is uniformly distributed between zero and one.

  3. If (N-t)*U >= n-m, skip the current item by increasing t by 1 and going back to step 2. Else, select the current item by increasing both t and m by 1. Afterwards, either go back to step 2 or stop the algorithm (once m = n items have been selected).

A very basic implementation of this algorithm (in C++) is given in the gist below:
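The embedded gist itself is not reproduced here. As a rough illustration of the same idea (purely a sketch, and in Python rather than C++), the selection loop might look like this. Note that, to match the earlier snippets, k is the number of items to select and n the total number of items, so they play the roles of n and N in the description above.

import random

def sample_without_replacement_knuth(k, n):
  # Select k indices from range(n) uniformly at random, without replacement.
  selected = []
  t = 0                      # number of items traversed so far
  m = 0                      # number of items selected so far
  while m < k:
    U = random.random()      # uniformly distributed in [0, 1)
    if (n - t) * U >= k - m:
      t += 1                 # skip the current item
    else:
      selected.append(t)     # select the current item
      t += 1
      m += 1
  return selected

print(sample_without_replacement_knuth(5, 20))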

I do not know about you, but my gut reaction upon seeing this algorithm for the first time was “How can this possibly work?” So, let us briefly prove the algorithm to be correct. Basically, we have to show that every item has the same probability of being chosen. The key insight is to realize that the selection probability has to change depending on the number of elements that have already been selected. More precisely, we need to determine the probability with which item t+1 must be chosen if m items have already been selected among the first t. This requires some combinatorics. There are $\binom{N-t}{n-m}$ ways of completing a selection of n items out of N, given that m items have already been picked among the first t; this is just the number of ways of choosing the remaining n-m items from the remaining N-t ones. Out of these, we are interested in all the ones that contain item t+1. But this is easy, because we can just take that item as given and count the remaining combinations as $\binom{N-t-1}{n-m-1}$, i.e. the number of ways to choose n-m-1 items out of N-t-1 ones. The quotient of these two numbers is exactly the probability with which we must choose item t+1 if we want every item to be chosen with uniform probability. This quotient turns out to be $\frac{n-m}{N-t}$, which looks familiar. Finally, note that since U is uniformly distributed between zero and one, the selection condition (N-t)*U < n-m in the algorithm is satisfied with exactly this probability. Consequently, this method samples without replacement in the required manner!

If you are interested in or require a more efficient algorithm, please refer to the paper An Efficient Algorithm for Sequential Random Sampling by Jeffrey Scott Vitter in ACM Transactions on Mathematical Software, Volume 13, Issue 1, pp. 58–67, for more details. The paper should be available free of charge from the link provided above, but there is also a mirror available, thanks to the good folks at INRIA, who require their publications to be kept in an open archive. Time permitting, I might provide a second blog post and implementation about the more efficient algorithm.

That is all for now, until next time!

Posted late Sunday evening, October 8th, 2017

In previous posts, I wrote about a brief analysis of U.S. Presidential Inauguration Speeches and how to extend this analysis using tf–idf. In this post, I want to present an extended analysis based on sentiment analysis. Sentiment analysis encompasses a class of techniques for detecting whether sentences are meant negatively, neutrally, or positively.

Depicting sentiment over time

Since every inauguration speech has a beginning and an end, it forms a natural time series. Hence, I first calculated the sentiment scores for every sentence in a speech and assigned each sentence an artificial time parameter, scaled between 0 and 1 over the course of the speech. This yields a nice sentiment curve plot, in which the abscissa denotes the time of the speech, and the ordinate denotes the sentiment of a given sentence—with values close to +1 meaning that the sentence is extremely positive, 0 meaning that the sentence is neutral, and -1 meaning that the sentence is extremely negative.
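A rough sketch of this per-sentence calculation, using the TextBlob module that is mentioned in the Code section below (the plotting itself is omitted here), might look as follows:

import numpy as np
from textblob import TextBlob

def sentiment_curve(text):
  # Return (time, polarity) pairs for every sentence of a speech.
  sentences  = TextBlob(text).sentences
  polarities = [s.sentiment.polarity for s in sentences]  # values in [-1, +1]
  times      = np.linspace(0, 1, num=len(polarities))     # artificial time parameter
  return list(zip(times, polarities))

speech = 'We will achieve great things together. But dark times may lie ahead.'
for t, polarity in sentiment_curve(speech):
  print('%.2f %+.3f' % (t, polarity))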

Here are some example visualizations of the last three inauguration speeches. Positive sentences are shown in green, while negative ones are shown in red. I fill the area between the curve and the neutral line with colour in order to show the patterns and ‘rhythm’ of the different speeches. A black line indicates the mean sentiment over the speech.

Sentiment curve for Barack Obama (2009)

Sentiment curve for Barack Obama (2013)

Sentiment curve for Donald J. Trump (2017)

It is interesting to see that Obama’s first speech appears to be more subdued and neutral than the subsequent speeches, which exhibit more peaks and thus a larger variability between extremely positive and extremely negative sentiment.

If you want to compare the different sentiment curves for your favourite presidents, you may do so—I have prepared a large visualization that combines all sentiment curves. Watching the appearance and disappearance of negative sentiments over time is quite fascinating. The 1945 speech by Roosevelt is a striking example of evoking very negative imagery for a prolonged period of time.

Comparing mean sentiments

For comparing individual presidents, the curves are all well and good, but I was also interested in the global sentiment of a speech and how it evolves over time. To this end, I calculated the mean (average) sentiment over every speech and depicted it over time. This works because sentiment scores are always bounded between -1 and +1, making their comparison very easy. Here is the average sentiment of a speech, plotted over time:

Average sentiment of a speech over time

We can see an interesting pattern: after the Second World War, speeches become more positive on average. They remain that way until the first inauguration speech of Barack Obama, which, as I noted above, is somewhat subdued again. Afterwards, they pick up steam. Donald Trump’s speech evokes more positive sentiments, on average, than most of the speeches since 1945.

All in all, I think that this is a nice tool to assess patterns in speeches. If you want, go take a look at the individual sentiment curves, which are stored on GitHub. Maybe you pick up something interesting.

Code

I used the intriguing TextBlob Python module for this analysis. All visualizations are done using gnuplot. As usual, the scripts and data files—as well as the output—are stored in the GitHub repository for the project.

Have fun!

Posted at teatime on Sunday, September 17th, 2017

Previously, I wrote about a brief analysis of U.S. Presidential Inauguration Speeches. I have since extended the analysis slightly using tf–idf.

For those of you who are unaware of this technique, it refers to a statistical method for assessing the importance of certain words in a corpus of documents. Without going into too much detail, the method determines which words are most relevant for a given document of the corpus, yielding a high-dimensional vector whose entries refer to the common vocabulary of the corpus.
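In one common formulation (there are several variants, and this textbook version is not necessarily the exact weighting used here), the score of a term t in a document d of a corpus D is

\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{|D|}{|\{d' \in D : t \in d'\}|}

where tf(t, d) counts how often t occurs in d. A term thus scores high if it is frequent in a given speech but rare in the remaining speeches.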

I picked the five most relevant words for every speech to essentially obtain an extremely pithy summary of words that are relevant for the given speech only. Since there are 59 speeches in total, I first decided to do a small visualization of the last eight presidents only, starting from George H.W. Bush in 1989. Here are the results; the x-axis represents the time, while the y-axis shows the five most important words, scaled by their relative frequency in the speech:

A visualization of the relative importance of words in the last eight inauguration speeches of U.S. presidents

So, what can we see in this visualization? Here are some of my thoughts.

  • The don in the speech of George H.W. Bush is a shortened form of don’t. The algorithm picked up on its frequent use in the speech, which contains beautiful imagery such as this:

For the first time in this century, for the first time in perhaps all history, man does not have to invent a system by which to live. We don't have to talk late into the night about which form of government is better. We don't have to wrest justice from the kings. We only have to summon it from within ourselves.

  • It is also interesting to note that the new breeze George H.W. Bush is talking about is detected as a unique feature of his speech.

  • The speeches of Bill Clinton allude to the change that started after the end of the Cold War, as well as the promises that arise in the new century to come.

  • The second speech of George W. Bush tries to rally Americans in the War on Terror. Liberty and freedom are part of the underlying theme, these of course being American ideals that are worth fighting for.

  • With Barack Obama, a sort of rebirth takes place. He speaks to the new generation, expressing his hope that America becomes a new nation, and reminds everyone that today—not tomorrow—is the day to address these challenges. In his second speech, the great journey towards equality is presented to the Americans, making it clear that change does not stop.

  • With Donald Trump, the narrative changes. The important words are now the dreams of people, such as the hope that they will find new jobs. It is interesting to note that only the speeches at a time of crisis or abrupt change (Cold War or the War on Terror) exhibit the same occurrence of the words America and American. Maybe Donald Trump is trying to invoke a connection to these events in his speech?

These are only my random thoughts—I think it is fascinating that tf–idf is capable of picking up on these themes in such a reliable manner. Maybe a more competent person wants to comment on these findings? I look forward to any feedback you might have. In the meantime, please enjoy a variant of the visualization above that contains all speeches of all presidents so far. You will have to scroll around a lot for this. By the way, I have added the code to the GitHub repository. You may be particularly interested in tf_idf_analysis.py, which demonstrates the tf–idf analysis of all speeches. Moreover, I added a gnuplot script that demonstrates how to create the visualizations attached to this blog post.
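If you only want the gist of such an analysis without opening the repository, a rough sketch with scikit-learn might look as follows. This is not tf_idf_analysis.py; the file layout is the YYYY_Name.txt scheme described in the previous post, and everything else is an assumption on my part.

import glob

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Map, e.g., '1789_George_Washington' to the full text of the speech.
speeches = {}
for filename in sorted(glob.glob('*.txt')):
  with open(filename) as f:
    speeches[filename[:-4]] = f.read()

vectorizer    = TfidfVectorizer(stop_words='english')
X             = vectorizer.fit_transform(speeches.values()).toarray()
index_to_word = {i: w for w, i in vectorizer.vocabulary_.items()}

for name, row in zip(speeches.keys(), X):
  top_five = np.argsort(row)[::-1][:5]  # columns with the highest tf-idf scores
  print(name, [index_to_word[i] for i in top_five])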

Posted late Monday evening, September 11th, 2017

In late 2015, I started fiddling with libclang in order to analyse abstract syntax trees. After multiple queries reached me via e-mail, I decided to create a public repository of my experiments.

The repository is called libclang-experiments and resides on my GitHub account. It contains the code for the first blog post about libclang, which presented some basic routines for walking an abstract syntax tree, as well as code for the second blog post, which dealt with counting the extents of a function.

Note that since libclang is somewhat fiddly to use, the repository also provides a CMake module for actually finding the library on your computer. Please refer to the CMakeLists.txt of the repository for usage information—and do not hesitate to contact me, either via e-mail or by opening an issue on GitHub if you have any questions.

Posted Monday afternoon, September 11th, 2017

The current political climate in the United States appears to be somewhat heated, to say the least. The current president, Donald J. Trump, is considered by many to be the product of an ever-declining political culture, in which many people feel increasingly disenfranchised and hopeless.

I was wondering whether these changes in political culture would show up as patterns—and where better to look for patterns than in the inauguration speeches of U.S. presidents? These speeches are meant to give a grand vision of the new term. Ideally, they should unite but also rally the voters to support the new administration and look forward to the subsequent years.

In this post, I am going to show you how to obtain all inauguration speeches, perform a brief language analysis on them, and report some interesting results. Onwards!

Getting the data

Thanks to the good folks of Wikisource, obtaining all speeches is quite easy. There is a special category for all speeches, and since the formatting of every speech contains (more or less) the same tags, it is a simple exercise in using BeautifulSoup to obtain and store all the speeches. See my Python script for more details. As a result, each speech is stored in the format of YYYY_Name.txt. We thus have 1789_George_Washington.txt, for example. Other than that, the text files are left as-is. I did not make any attempts at extracting more information from them.
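As an illustration of the scraping step, here is a generic sketch, not my actual script; the URL and the choice of tags are assumptions about the page structure and will likely need adjusting.

import requests
from bs4 import BeautifulSoup

# Hypothetical URL of a single speech page on Wikisource.
url  = 'https://en.wikisource.org/wiki/George_Washington%27s_First_Inaugural_Address'
html = requests.get(url).text

soup       = BeautifulSoup(html, 'html.parser')
paragraphs = soup.find_all('p')  # assumption: the speech text lives in <p> tags
text       = '\n'.join(p.get_text().strip() for p in paragraphs)

with open('1789_George_Washington.txt', 'w') as f:
  f.write(text)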

Analysing the data

The simplest form of analysis that one might apply to these speeches involves basic word counting or tokenization in general. I will go a small step further and use stemming to reduce every word to its root form.

This procedure gives us a way to answer the following questions:

  • How long is the speech (in sentences)?
  • How long is the speech (in words)?
  • What is the average sentence length?
  • What is the number of unique words?
  • What is the number of unique lemmas?

The last question is worth explaining. A lemma in the sense of NLP (natural language processing) denotes the canonical form of words in a corpus. For example, run, runs, ran, and running belong to the same lemma, viz. run.

I stole this example from the Wikipedia page on lemma; please refer to it for more details.

The reason for counting lemmas is that they give us a rough estimate of the complexity of a text. If a text contains many unique lemmas, it is likely to be more complex than a text with fewer unique lemmas.

I am aware that this is not the linguistically correct way of complexity analysis, but it gives us a qualitative overview of a text without delving deeper into its vocabulary.
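To make these statistics concrete, here is a rough sketch of how they could be computed. The choice of NLTK is an assumption on my part, and the post above mentions stemming, whereas this sketch uses a WordNet lemmatizer for the lemma count, so the numbers will not match the original script exactly.

import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads: nltk.download('punkt'); nltk.download('wordnet')

def speech_statistics(text):
  sentences  = nltk.sent_tokenize(text)
  words      = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
  lemmatizer = WordNetLemmatizer()
  return {
    'sentences'              : len(sentences),
    'words'                  : len(words),
    'average_sentence_length': len(words) / len(sentences),
    'unique_words'           : len(set(words)),
    'unique_lemmas'          : len({lemmatizer.lemmatize(w) for w in words}),
  }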

Consequently, I wrote another script to analyse the speeches and store the statistics mentioned above in files. It is time for a quick analysis!

Results

Let’s first take a look at the average sentence length of all speeches. You can hover over each data point to see the name of the president giving the speech.

A plot of the average sentence length, in words, of all presidential inauguration speeches

We can see that—alas—the average length of a sentence is declining. This is not necessarily a bad thing, as shorter sentences are often thought to be understood better by readers and listeners. Given that metric, there are a few interesting outliers, such as John Adams, whose speech contains a real beast of a sentence:

On this subject it might become me better to be silent or to speak with diffidence; but as something may be expected, the occasion, I hope, will be admitted as an apology if I venture to say that if a preference, upon principle, of a free republican government, formed upon long and serious reflection, after a diligent and impartial inquiry after truth; if an attachment to the Constitution of the United States, and a conscientious determination to support it until it shall be altered by the judgments and wishes of the people, expressed in the mode prescribed in it; if a respectful attention to the constitutions of the individual States and a constant caution and delicacy toward the State governments; if an equal and impartial regard to the rights, interest, honor, and happiness of all the States in the Union, without preference or regard to a northern or southern, an eastern or western, position, their various political opinions on unessential points or their personal attachments; if a love of virtuous men of all parties and denominations; if a love of science and letters and a wish to patronize every rational effort to encourage schools, colleges, universities, academies, and every institution for propagating knowledge, virtue, and religion among all classes of the people, not only for their benign influence on the happiness of life in all its stages and classes, and of society in all its forms, but as the only means of preserving our Constitution from its natural enemies, the spirit of sophistry, the spirit of party, the spirit of intrigue, the profligacy of corruption, and the pestilence of foreign influence, which is the angel of destruction to elective governments; if a love of equal laws, of justice, and humanity in the interior administration; if an inclination to improve agriculture, commerce, and manufacturers for necessity, convenience, and defense; if a spirit of equity and humanity toward the aboriginal nations of America, and a disposition to meliorate their condition by inclining them to be more friendly to us, and our citizens to be more friendly to them; if an inflexible determination to maintain peace and inviolable faith with all nations, and that system of neutrality and impartiality among the belligerent powers of Europe which has been adopted by this Government and so solemnly sanctioned by both Houses of Congress and applauded by the legislatures of the States and the public opinion, until it shall be otherwise ordained by Congress; if a personal esteem for the French nation, formed in a residence of seven years chiefly among them, and a sincere desire to preserve the friendship which has been so much for the honor and interest of both nations; if, while the conscious honor and integrity of the people of America and the internal sentiment of their own power and energies must be preserved, an earnest endeavor to investigate every just cause and remove every colorable pretense of complaint; if an intention to pursue by amicable negotiation a reparation for the injuries that have been committed on the commerce of our fellow-citizens by whatever nation, and if success can not be obtained, to lay the facts before the Legislature, that they may consider what further measures the honor and interest of the Government and its constituents demand; if a resolution to do justice as far as may depend upon me, at all times and to all nations, and maintain peace, friendship, and benevolence with all the world; if an unshaken confidence in the honor, spirit, and resources of the 
American people, on which I have so often hazarded my all and never been deceived; if elevated ideas of the high destinies of this country and of my own duties toward it, founded on a knowledge of the moral principles and intellectual improvements of the people deeply engraven on my mind in early life, and not obscured but exalted by experience and age; and, with humble reverence, I feel it to be my duty to add, if a veneration for the religion of a people who profess and call themselves Christians, and a fixed resolution to consider a decent respect for Christianity among the best recommendations for the public service, can enable me in any degree to comply with your wishes, it shall be my strenuous endeavor that this sagacious injunction of the two Houses shall not be without effect.

Compare this to the second inauguration speech of George Washington, which is the briefest one, while still using longer sentences than any of the presidents starting their term in the 20th or the 21st century.

What about the active vocabulary of the presidents? To this end, let us take a look at the number of unique lemmas in a speech. Note that these are raw counts, so they will also give us an indication of how long a speech is. Again, hover over a data point to display the name.

Unique lemmas of all presidential inauguration speeches

Here, there is not so much of a clear trend but rather some interesting outliers. According to this metric, most presidents in the 20th century and onwards use about the same number of unique lemmas in a speech. This means that the speeches have roughly the same length and also the same complexity, provided you buy my argument that the number of unique lemmas is something interesting to consider. We can see that Donald J. Trump is not really an outlier in this regard; both Gerald Ford—a Republican—in 1974 and Jimmy Carter—a Democrat—in 1977 gave speeches that are rated approximately the same.

Interestingly, some other patterns arise. First, we see that the second inauguration speech of George Washington is really short and sweet. There are fewer than 200 unique lemmas. Second, the inauguration speech of William Henry Harrison is obviously extremely long and convoluted, more so than any other speech (so far; who knows what the future holds?). We can also see that the fourth inauguration speech of Franklin D. Roosevelt in 1945 really is on point. His message is short and simple. Afterwards, things start to get more complicated again.

Conclusion

What can we make of this? First, it is interesting to see that these simple qualitative analysis steps only give us a very narrow picture of what is happening in a speech. At least the oratory culture of the United States seems to be doing well. Most presidents use about the same complexity for their speeches. The number of unique lemmas has decreased in modern times, though. This may be an artefact of the stemming method: as language changes over time, a simple stemmer that is trained on modern speeches will have its troubles when analysing older ones.

I hope that this provided some insights. You are welcome to play with the data and do your own analysis. Please take a look at the GitHub repository for more information.

Posted Sunday afternoon, August 20th, 2017

Much has been written about the sorry state of academic publishing. Luckily for me, the visualization community works with a number of publishers that are very sane, welcoming, and considerate with respect to personal copies of papers. All of them permit the free use of pre-prints on a personal website as long as it contains a text similar to the following:

Author's Copy. Find the definitive version at example.com/doi/12345

After doing this manually for a couple of papers, I finally decided to solve it directly in LaTeX. This is what I came up with: first, you need to add \usepackage[absolute]{textpos} to the preamble of your document. Next, add the following lines:

\setlength{\TPHorizModule}{\paperwidth}
\setlength{\TPVertModule} {\paperheight}

They ensure that the subsequent calls for placing text work with respect to the current paper size. In order to place a copyright notice at the top of your paper, you only have to do the following:

\begin{textblock*}{20cm}(0.5cm,0.5cm)
  Author's copy. To appear in IEEE Transactions on Visualization and Computer Graphics.
\end{textblock*}

Place this somewhere in your document, for example directly after \begin{document}, and you should be good to go.

Happy publishing!

Posted Thursday evening, August 3rd, 2017

In a previous article, I discussed the distribution of Betti numbers of triangulations of 2-manifolds. This article now discusses the code. Coincidentally, it also demonstrates how to use Aleph, my (forthcoming) library for exploring persistent homology and its uses.

Aleph is header-only, so it can be readily deployed in your own projects. In the following, I am assuming that you have installed Aleph, following the official instructions. There are several tutorials available that cover different parts of the persistent homology calculation process. For our purpose, viz. the calculation of homology of triangulated spaces, no large machinery is needed. We start with some includes and using directives:

#include <aleph/topology/io/LexicographicTriangulation.hh>

#include <aleph/persistentHomology/Calculation.hh>

#include <aleph/topology/Simplex.hh>
#include <aleph/topology/SimplicialComplex.hh>

#include <iostream>
#include <string>
#include <vector>

using DataType          = bool;
using VertexType        = unsigned short;

using Simplex           = aleph::topology::Simplex<DataType, VertexType>;
using SimplicialComplex = aleph::topology::SimplicialComplex<Simplex>;

Nothing fancy has happened so far. We set up a simplex class, which is the basic data type in most applications, as well as a simplicial complex, which is the basic storage class for multiple simplices. The only interesting thing is our choice of DataType and VertexType. Since our simplices need not store any additional data except for their vertices, we select bool as a data type in order to make the class smaller. This could potentially be achieved with EBCO (the empty base class optimization) as well, but I have not yet had time to test it adequately. In addition to the data type, we use unsigned short as the vertex type—the triangulations that we want to analyse only feature 9 or 10 vertices, so unsigned short is the best solution for storing vertex identifiers.

Next, we need some I/O code to read a lexicographic triangulation:

int main(int argc, char* argv[])
{
  if( argc <= 1 )
    return -1;

  std::string filename = argv[1];
  std::vector<SimplicialComplex> simplicialComplexes;

  aleph::topology::io::LexicographicTriangulationReader reader;
  reader( filename, simplicialComplexes );

  for( auto&& K : simplicialComplexes )
  {
    K.createMissingFaces();
    K.sort();
  }
}

Here, we used the LexicographicTriangulationReader, a reader class that supports reading files in the format defined by Frank H. Lutz. However, this format only provides the top-level simplices of a triangulation. Hence, for a 2-manifold, only the 2-simplices are specified. For calculating homology groups, however, all simplices are required. Luckily, the SimplicialComplex class offers a method for just this purpose. By calling createMissingFaces(), all missing faces of the simplicial complex are calculated and added to the simplicial complex. Afterwards, we use sort() to sort simplices in lexicographical order. This order is required to ensure that the homology groups can be calculated correctly—the calculation routines assume that the complex is being filtrated, so faces need to precede co-faces.

Having created and stored a list of simplicial complexes, we may now finally calculate their homology groups by adding the following code after the last for-loop:

for( auto&& K : simplicialComplexes )
{
  bool dualize                    = true;
  bool includeAllUnpairedCreators = true;

  auto diagrams
    = aleph::calculatePersistenceDiagrams( K,
                                          dualize,
                                          includeAllUnpairedCreators );
}

This code calls calculatePersistenceDiagrams(), which is usually employed to calculate, well, a set of persistence diagrams. The two flags dualize and includeAllUnpairedCreators also deserve some explanation. The first flag only instructs the convenience function as to whether the boundary matrix that is required for (persistent) homology calculations should be dualized or not. Dualization was shown to result in faster computations, so of course we want to use it as well. The second parameter depends on our particular application scenario. Normally, the persistent homology calculation ignores all creator simplices in the top dimension. The reason for this is simple: if we expand a neighbourhood graph to a Vietoris–Rips complex, the top-level simplices are an artefact of the expansion process. Most of these simplices cannot be paired with higher-dimensional simplices, hence they will appear to create a large number of high-dimensional holes. The persistent homology calculation convenience function hence ignores those simplices for the creation of persistence diagrams by default. For our application, however, keeping those simplices is actually the desired behaviour—the top-level simplices are well-defined and an integral part of the triangulation.

As a last step, we want to print the signature of Betti numbers for the given simplicial complex. This is easily achieved by adding a nested loop at the end of the for-loop:

for( auto&& diagram : diagrams )
{
  auto d = diagram.dimension();
  std::cout << "b_" << d << " = " << diagram.betti() << "\n";
}

This will give us all non-zero Betti numbers of the given triangulation. The output format is obviously not optimal—in a research setting, I like to use JSON for creating human-readable and machine-readable results. If you are interested in how this may look, please take a look at a more involved tool for dealing with triangulations. Be aware, however, that this tool is still a work in progress.

This concludes our little foray into Aleph. I hope that I piqued your interest! By the way, I am always looking for contributors…

Posted Sunday evening, July 30th, 2017

For a new (forthcoming) research project, I have to brush up my knowledge about homology groups and triangulations. This constitutes an excellent opportunity to extend Aleph, my (also forthcoming) library for exploring persistent homology and its uses.

As a quick example, I decided to calculate some statistics about the homology of 2-manifolds. Thanks to the work of Frank H. Lutz, a list of triangulations with up to 9 vertices, as well as a list of triangulations with 10 vertices, exist. After adding support for the appropriate data format in Aleph, I was able to easily read a family of triangulations and subsequently process them. Even though Aleph’s main purpose is the calculation of persistent homology, ordinary homology calculations work just as well—ordinary homology being a special case of persistent homology.

As a starting point for further mathematical ruminations, let us take a look at the distribution of Betti numbers of 2-manifolds. More precisely, we analyse the distribution of the first Betti number of the manifold—according to Poincaré duality, the zeroth and second Betti numbers have to be equal, and we know the zeroth Betti number to be 1 for a connected manifold.

Without further ado, here is a tabulation of first Betti numbers for triangulations with up to 9 vertices:

First Betti number   Number of occurrences
0                    73
1                    154
2                    313
3                    133
4                    37
5                    2

The results are interesting: most triangulations appear to have the homology of a torus, i.e. a first Betti number of 2, followed by the homology of the real projective plane with a first Betti number of 1. Betti numbers larger than 3 are extremely rare. Intuitively, this makes sense—the triangulation only consists of at most 9 vertices, so there are natural limits on how high the Betti number can become.
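As a quick sanity check (these are standard facts, not part of the original computation), the Euler–Poincaré formula ties the Betti numbers to the surfaces named above:

\chi = b_0 - b_1 + b_2

For the torus, \chi = 0, which matches b_0 = b_2 = 1 and b_1 = 2; for the real projective plane, \chi = 1, which matches b_1 = 1 when computing with Z/2 coefficients.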

For triangulations with 10 vertices, another distribution arises:

First Betti number   Number of occurrences
0                    233
1                    1210
2                    6571
3                    11784
4                    14522
5                    7050
6                    1042
7                    14

Here, the mode of Betti numbers is a value of 4, with 14522 occurrences out of 42426 triangulations. The homology type of this signature is the one of a genus-4 surface. Again, higher values get progressively less likely because only 10 vertices are permitted.

I am not yet sure what to make of this, but it sure is a nice test case and application scenario for Aleph. One of the next blog posts will give more details about this calculation, and its implementation using the library. Stay tuned!

Posted late Saturday night, July 30th, 2017