This is the personal blog of Bastian Rieck. I add posts at irregular intervals about things I consider relevant. Send any feedback, comments etc. to bastian@annwfn.net.

The political climate in the United States appears to be somewhat heated, to say the least. The current president, Donald J. Trump, is considered by many to be the product of an ever-declining political culture, in which many people feel increasingly disenfranchised and hopeless.

I was wondering whether these changes in political culture would show up as patterns—and where better to look for patterns than in the inauguration speeches of U.S. presidents? These speeches are meant to give a grand vision of the new term. Ideally, they should unite the country, but also rally voters to support the new administration and look forward to the subsequent years.

In this post, I am going to show you how to obtain all inauguration speeches, perform a brief language analysis on them, and report some interesting results. Onwards!

Getting the data

Thanks to the good folks of Wikisource, obtaining all speeches is quite easy. There is a special category for all speeches, and since the formatting of every speech contains (more or less) the same tags, it is a simple exercise in using BeautifulSoup to obtain and store all the speeches. See my Python script for more details. As a result, each speech is stored in the format of YYYY_Name.txt. We thus have 1789_George_Washington.txt, for example. Other than that, the text files are left as-is. I did not make any attempts at extracting more information from them.
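
To give a rough idea of the scraping step, a minimal sketch could look like the following. This is not the actual script linked above; the content container class is the standard MediaWiki one, but treat the selector, as well as the way year and name are supplied, as assumptions.

#!/usr/bin/env python3
#
# Minimal scraping sketch: download a single speech page and store its
# paragraphs as YYYY_Name.txt. The content selector is an assumption
# and may need adjusting for individual pages.
import sys

import requests
from bs4 import BeautifulSoup

def fetch_speech(url, year, name):
    response = requests.get(url)
    response.raise_for_status()

    soup    = BeautifulSoup(response.text, "html.parser")
    content = soup.find("div", class_="mw-parser-output") or soup

    paragraphs = [p.get_text(" ", strip=True) for p in content.find_all("p")]
    filename   = "{}_{}.txt".format(year, name.replace(" ", "_"))

    with open(filename, "w") as f:
        f.write("\n".join(p for p in paragraphs if p))

    return filename

if __name__ == "__main__":
    # Usage: fetch_speech.py URL YEAR "NAME OF PRESIDENT"
    print(fetch_speech(sys.argv[1], sys.argv[2], sys.argv[3]))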

Analysing the data

The simplest form of analysis that one might apply to these speeches involves basic word counting or tokenization in general. I will go a small step further and use stemming to reduce every word to its root form.

This procedure gives us a way to answer the following questions:

  • How long is the speech (in sentences)?
  • How long is the speech (in words)?
  • What is the average sentence length?
  • What is the number of unique words?
  • What is the number of unique lemmas?

The last question is worth explaining. A lemma, in the sense of NLP (natural language processing), denotes the canonical form of a word in a corpus. For example, run, runs, ran, and running belong to the same lemma, viz. run.

I stole this example from the Wikipedia page on lemma; please refer to it for more details.

The reason for counting lemmas is that they give us a rough estimate of the complexity of a text. If a text contains many unique lemmas, it is likely to be more complex than a text with fewer unique lemmas.

I am aware that this is not the linguistically correct way of complexity analysis, but it gives us a qualitative overview of a text without delving deeper into its vocabulary.

Consequently, I wrote another script to analyse the speeches and store the statistics mentioned above in files. It is time for a quick analysis!
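
To make this concrete, a minimal sketch of such an analysis could look as follows. This is not the actual script; it uses NLTK and substitutes stems for proper lemmas, which is in line with the rough-and-ready nature of the statistics.

#!/usr/bin/env python3
#
# Sketch of the per-speech statistics; requires NLTK and its 'punkt'
# tokenizer data. Stems serve as a stand-in for lemmas here.
import sys

import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for filename in sys.argv[1:]:
    with open(filename) as f:
        text = f.read()

    sentences = nltk.sent_tokenize(text)
    words     = [w.lower() for w in nltk.word_tokenize(text) if w.isalpha()]
    stems     = {stemmer.stem(w) for w in words}

    print(filename,
          len(sentences),                       # length in sentences
          len(words),                           # length in words
          len(words) / max(len(sentences), 1),  # average sentence length
          len(set(words)),                      # unique words
          len(stems))                           # unique stems ('lemmas')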

Results

Let’s first take a look at the average sentence length of all speeches. You can hover over each data point to see the name of the president giving the speech.

A plot of the average sentence length, in words, of all presidential inauguration speeches

We can see that—alas—the average length of a sentence is declining. This is not necessarily a bad thing, as shorter sentences are often thought to be understood better by readers and listeners. Given that metric, there are a few interesting outliers, such as John Adams, whose speech contains a real beast of a sentence:

On this subject it might become me better to be silent or to speak with diffidence; but as something may be expected, the occasion, I hope, will be admitted as an apology if I venture to say that if a preference, upon principle, of a free republican government, formed upon long and serious reflection, after a diligent and impartial inquiry after truth; if an attachment to the Constitution of the United States, and a conscientious determination to support it until it shall be altered by the judgments and wishes of the people, expressed in the mode prescribed in it; if a respectful attention to the constitutions of the individual States and a constant caution and delicacy toward the State governments; if an equal and impartial regard to the rights, interest, honor, and happiness of all the States in the Union, without preference or regard to a northern or southern, an eastern or western, position, their various political opinions on unessential points or their personal attachments; if a love of virtuous men of all parties and denominations; if a love of science and letters and a wish to patronize every rational effort to encourage schools, colleges, universities, academies, and every institution for propagating knowledge, virtue, and religion among all classes of the people, not only for their benign influence on the happiness of life in all its stages and classes, and of society in all its forms, but as the only means of preserving our Constitution from its natural enemies, the spirit of sophistry, the spirit of party, the spirit of intrigue, the profligacy of corruption, and the pestilence of foreign influence, which is the angel of destruction to elective governments; if a love of equal laws, of justice, and humanity in the interior administration; if an inclination to improve agriculture, commerce, and manufacturers for necessity, convenience, and defense; if a spirit of equity and humanity toward the aboriginal nations of America, and a disposition to meliorate their condition by inclining them to be more friendly to us, and our citizens to be more friendly to them; if an inflexible determination to maintain peace and inviolable faith with all nations, and that system of neutrality and impartiality among the belligerent powers of Europe which has been adopted by this Government and so solemnly sanctioned by both Houses of Congress and applauded by the legislatures of the States and the public opinion, until it shall be otherwise ordained by Congress; if a personal esteem for the French nation, formed in a residence of seven years chiefly among them, and a sincere desire to preserve the friendship which has been so much for the honor and interest of both nations; if, while the conscious honor and integrity of the people of America and the internal sentiment of their own power and energies must be preserved, an earnest endeavor to investigate every just cause and remove every colorable pretense of complaint; if an intention to pursue by amicable negotiation a reparation for the injuries that have been committed on the commerce of our fellow-citizens by whatever nation, and if success can not be obtained, to lay the facts before the Legislature, that they may consider what further measures the honor and interest of the Government and its constituents demand; if a resolution to do justice as far as may depend upon me, at all times and to all nations, and maintain peace, friendship, and benevolence with all the world; if an unshaken confidence in the honor, spirit, and resources of the 
American people, on which I have so often hazarded my all and never been deceived; if elevated ideas of the high destinies of this country and of my own duties toward it, founded on a knowledge of the moral principles and intellectual improvements of the people deeply engraven on my mind in early life, and not obscured but exalted by experience and age; and, with humble reverence, I feel it to be my duty to add, if a veneration for the religion of a people who profess and call themselves Christians, and a fixed resolution to consider a decent respect for Christianity among the best recommendations for the public service, can enable me in any degree to comply with your wishes, it shall be my strenuous endeavor that this sagacious injunction of the two Houses shall not be without effect.

Compare this to the second inauguration speech of George Washington, which is the briefest one of all, yet still uses longer sentences than the speech of any president starting a term in the 20th or the 21st century.

What about the active vocabulary of the presidents? To this end, let us take a look at the number of unique lemmas in a speech. Note that these are raw counts, so they will also give us an indication of how long a speech is. Again, hover over a data point to display the name.

Unique lemmas of all presidential inauguration speeches

Here, there is not so much a clear trend as some interesting outliers. According to this metric, most presidents from the 20th century onwards use about the same number of unique lemmas in a speech. This means that the speeches have roughly the same length and also the same complexity, provided you buy my argument that the number of unique lemmas is something interesting to consider. We can see that Donald J. Trump is not really an outlier in this regard; both Gerald Ford—a Republican—in 1974 and Jimmy Carter—a Democrat—in 1977 gave speeches that are rated approximately the same.

Interestingly, some other patterns arise. First, we see that the second inauguration speech of George Washington is really short and sweet. There are fewer than 200 unique lemmas. Second, the inauguration speech of William Henry Harrison is obviously extremely long and convoluted, more so than any other speech (so far; who knows what the future holds?). We can also see that the fourth inauguration speech of Franklin D. Roosevelt in 1945 really is on point. His message is short and simple. Afterwards, things start to get more complicated again.

Conclusion

What can we make of this? First, it is interesting to see that these simple qualitative analysis steps only give us a very narrow picture of what is happening in a speech. At least the oratory culture of the United States seems to be doing well: most presidents use about the same complexity for their speeches. The number of unique lemmas has decreased in modern times, though. This may be an artefact of the stemming method. As language changes over time, a simple stemmer that is trained on modern speeches will have its troubles when analysing older speeches.

I hope that this provided some insights. You are welcome to play with the data and do your own analysis. Please take a look at the GitHub repository for more information.

Posted Sunday afternoon, August 20th, 2017

Much has been written about the sorry state of academic publishing. Luckily for me, the visualization community works with a number of publishers that are very sane, welcoming, and considerate with respect to personal copies of papers. All of them permit the free use of pre-prints on a personal website, as long as the pre-print contains a notice similar to the following:

Author's Copy. Find the definitive version at example.com/doi/12345

After doing this manually for a couple of papers, I finally decided to solve it directly in LaTeX. This is what I came up with: first, you need to add \usepackage[absolute]{textpos} to the preamble of your document. Next, add the following lines:

\setlength{\TPHorizModule}{\paperwidth}
\setlength{\TPVertModule}{\paperheight}

They ensure that the subsequent calls for placing text work with respect to the current paper size. In order to place a copyright notice at the top of your paper, you only have to do the following:

\begin{textblock*}{20cm}(0.5cm,0.5cm)
  Author's copy. To appear in IEEE Transactions on Visualization and Computer Graphics.
\end{textblock*}

Place this somewhere in your document, for example directly after \begin{document}, and you should be good to go.

Happy publishing!

Posted Thursday evening, August 3rd, 2017

In a previous article, I discussed the distribution of Betti numbers of triangulations of 2-manifolds. This article now discusses the code. Coincidentally, it also demonstrates how to use Aleph, my (forthcoming) library for exploring persistent homology and its uses.

Aleph is header-only, so it can be readily deployed in your own projects. In the following, I am assuming that you have installed Aleph, following the official instructions. There are several tutorials available that cover different parts of the persistent homology calculation process. For our purpose, viz. the calculation of homology of triangulated spaces, no large machinery is needed. We start with some includes and using directives:

#include <aleph/topology/io/LexicographicTriangulation.hh>

#include <aleph/persistentHomology/Calculation.hh>

#include <aleph/topology/Simplex.hh>
#include <aleph/topology/SimplicialComplex.hh>

#include <iostream>
#include <string>
#include <vector>

using DataType          = bool;
using VertexType        = unsigned short;

using Simplex           = aleph::topology::Simplex<DataType, VertexType>;
using SimplicialComplex = aleph::topology::SimplicialComplex<Simplex>;

Nothing fancy has happened so far. We set up a simplex class, which is the basic data type in most applications, as well as a simplicial complex, which is the basic storage class for multiple simplices. The only interesting thing is our choice of DataType and VertexType. Since our simplices need not store any additional data except for their vertices, we select bool as a data type in order to make the class smaller. This could potentially be achieved with EBCO as well, but I have not yet had time to test it adequately. In addition to the data type, we use unsigned short as the vertex type—the triangulations that we want to analyse only feature 9 or 10 vertices, so unsigned short is the best solution for storing vertex identifiers.

Next, we need some I/O code to read a lexicographic triangulation:

int main(int argc, char* argv[])
{
  if( argc <= 1 )
    return -1;

  std::string filename = argv[1];
  std::vector<SimplicialComplex> simplicialComplexes;

  aleph::topology::io::LexicographicTriangulationReader reader;
  reader( filename, simplicialComplexes );

  for( auto&& K : simplicialComplexes )
  {
    K.createMissingFaces();
    K.sort();
  }
}

Here, we used the LexicographicTriangulationReader, a reader class that supports reading files in the format defined by Frank H. Lutz. However, this format only provides the top-level simplices of a triangulation. Hence, for a 2-manifold, only the 2-simplices are specified. For calculating homology groups, however, all simplices are required. Luckily, the SimplicialComplex class offers a method for just this purpose. By calling createMissingFaces(), all missing faces of the simplicial complex are calculated and added to the simplicial complex. Afterwards, we use sort() to sort simplices in lexicographical order. This order is required to ensure that the homology groups can be calculated correctly—the calculation routines assume that the complex is being filtrated, so faces need to precede co-faces.

Having created and stored a list of simplicial complexes, we may now finally calculate their homology groups by adding the following code after the last for-loop:

for( auto&& K : simplicialComplexes )
{
  bool dualize                    = true;
  bool includeAllUnpairedCreators = true;

  auto diagrams
    = aleph::calculatePersistenceDiagrams( K,
                                          dualize,
                                          includeAllUnpairedCreators );
}

This code calls calculatePersistenceDiagrams(), which is usually employed to calculate, well, a set of persistence diagrams. The two flags dualize and includeAllUnpairedCreators also deserve some explanation. The first flag only instructs the convenience function as to whether the boundary matrix that is required for (persistent) homology calculations should be dualized or not. Dualization was shown to result in faster computations, so of course we want to use it as well. The second parameter depends on our particular application scenario. Normally, the persistent homology calculation ignores all creator simplices in the top dimension. The reason for this is simple: if we expand a neighbourhood graph to a Vietoris–Rips complex, the top-level simplices are an artefact of the expansion process. Most of these simplices cannot be paired with higher-dimensional simplices, hence they will appear to create a large number of high-dimensional holes. The persistent homology calculation convenience function hence ignores those simplices for the creation of persistence diagrams by default. For our application, however, keeping those simplices is actually the desired behaviour—the top-level simplices are well-defined and an integral part of the triangulation.

As a last step, we want to print the signature of Betti numbers for the given simplicial complex. This is easily achieved by adding a nested loop inside the previous for-loop, directly after the calculation of the persistence diagrams:

for( auto&& diagram : diagrams )
{
  auto d = diagram.dimension();
  std::cout << "b_" << d << " = " << diagram.betti() << "\n";
}

This will give us all non-zero Betti numbers of the given triangulation. The output format is obviously not optimal—in a research setting, I like to use JSON for creating human-readable and machine-readable results. If you are interested in how this may look, please take a look at a more involved tool for dealing with triangulations. Be aware, however, that this tool is still a work in progress.

This concludes our little foray into Aleph. I hope that I piqued your interest! By the way, I am always looking for contributors…

Posted Sunday evening, July 30th, 2017

For a new (forthcoming) research project, I have to brush up my knowledge about homology groups and triangulations. This constitutes an excellent opportunity to extend Aleph, my (also forthcoming) library for exploring persistent homology and its uses.

As a quick example, I decided to calculate some statistics about the homology of 2-manifolds. Thanks to the work of Frank H. Lutz, lists of triangulations with up to 9 vertices, as well as with 10 vertices, are available. After adding support for the appropriate data format in Aleph, I was able to easily read a family of triangulations and subsequently process them. Even though Aleph’s main purpose is the calculation of persistent homology, ordinary homology calculations work just as well—ordinary homology being a special case of persistent homology.

As a starting point for further mathematical ruminations, let us take a look at the distribution of Betti numbers of 2-manifolds. More precisely, we analyse the distribution of the first Betti number of the manifold—according to Poincaré duality, the zeroth and second Betti numbers have to be equal, and we know the zeroth Betti number to be 1 for a connected manifold.
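
For reference, Poincaré duality for a closed n-manifold M can be stated as follows (with Z/2 coefficients, the usual choice in persistent homology computations, no orientability assumption is needed):

\[
  b_k(M) = b_{n-k}(M), \qquad 0 \le k \le n.
\]

For a closed connected surface, i.e. n = 2, this yields b_0 = b_2 = 1, which is why the first Betti number is the only informative quantity here.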

Without further ado, here is a tabulation of first Betti numbers for triangulations with up to 9 vertices:

First Betti number    Number of occurrences
0                     73
1                     154
2                     313
3                     133
4                     37
5                     2

The results are interesting: most triangulations appear to have the homology of a torus, i.e. a first Betti number of 2, followed by the homology of the real projective plane with a first Betti number of 1. Betti numbers larger than 3 are extremely rare. Intuitively, this makes sense—the triangulation only consists of at most 9 vertices, so there are natural limits on how high the Betti number can become.

For triangulations with 10 vertices, another distribution arises:

First Betti number    Number of occurrences
0                     233
1                     1210
2                     6571
3                     11784
4                     14522
5                     7050
6                     1042
7                     14

Here, the mode of Betti numbers is a value of 4, with 14522 occurrences out of 42426 triangulations. The homology type of this signature is the one of a genus-4 surface. Again, higher values get progressively less likely because only 10 vertices are permitted.

I am not yet sure what to make of this, but it sure is a nice test case and application scenario for Aleph. One of the next blog posts will give more details about this calculation, and its implementation using the library. Stay tuned!

Posted late Saturday night, July 30th, 2017

In this post, I want to briefly explain how to combine methods from topological data analysis and machine learning. Given a Shakespearean social network, I want to predict whether it is likely to be a comedy or not.

Preliminaries & terms

Let’s first discuss some terms: a Shakespearean social network is a co-occurrence network generated from an annotated corpus of Shakespeare’s plays. The basic idea is to calculate a graph from the play in which characters that co-occur in a scene are connected by an edge. The edge weight roughly indicates how many lines of text a character has—the idea being that less important characters should be connected by edges with smaller weights.
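
As an illustration of the construction, here is a hypothetical sketch using networkx. The actual networks are extracted from an annotated corpus and use the more refined weight schemes discussed below, so consider this a toy version only.

import itertools

import networkx as nx

def co_occurrence_network(scenes):
    """Builds a toy co-occurrence network. Each scene is a dictionary
    mapping a character to the number of lines spoken in that scene;
    characters sharing a scene are connected, and the weight of an
    edge accumulates the lines of both speakers."""
    G = nx.Graph()
    for scene in scenes:
        for u, v in itertools.combinations(sorted(scene), 2):
            weight = scene[u] + scene[v]
            if G.has_edge(u, v):
                G[u][v]["weight"] += weight
            else:
                G.add_edge(u, v, weight=weight)
    return G

# Two (made-up) scenes of a play
scenes = [
    {"Hamlet": 30, "Horatio": 10},
    {"Hamlet": 25, "Ophelia": 12, "Horatio": 3},
]
G = co_occurrence_network(scenes)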

I already wrote two posts about these networks, with the first one talking about the general ideas behind these networks and providing some basic statistics, and the second one talking about a more formal evaluation in the form of a short paper titled ‘Shall I compare thee to a network?’—Visualizing the Topological Structure of Shakespeare’s Plays.

The short paper focuses more on visualizing structural similarities between plays. Much has happened in the meantime, however. I started working on Aleph, a library for exploring persistent homology. Moreover, encouraged by previous publications (namely, my two papers about analysing clustering algorithms and dimensionality reduction methods), I started to delve a little bit deeper into machine learning.

In some sense, this post is a continuation of my work in that area.

Topological data analysis & persistent homology

Since I want to keep this post brief (I plan on writing up a more detailed explanation as a follow-up post), I merely focus on the basic idea behind my analysis. Using persistent homology, a method from topological data analysis, it is possible to measure the number of ‘holes’ in a network. Yes, you read that correctly. While many algorithms describe what is there, persistent homology focuses on what is missing. This paradigm is highly successful and has been applied in many different application domains so far—let me just mention Gunnar Carlsson’s paper Topological pattern recognition for point cloud data as one particularly striking example.

At the heart of persistent homology is the concept of a persistence diagram, a two-dimensional scatter plot that depicts scale information about each hole. For every dimension of an input data set, we have a separate persistence diagram. These diagrams are commonly taken as a feature descriptor that summarizes a data set. Their appeal lies in certain structural properties such as robustness. Furthermore, it is possible to calculate the Wasserstein distance between them. This makes it possible to compare two extremely complex objects, shapes, or spaces by a rather simple algorithm!
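
To illustrate the computation, here is a rough sketch of a p-Wasserstein distance between two persistence diagrams, written with SciPy's assignment solver. It uses Euclidean ground costs and assumes non-empty diagrams; it is meant purely as an illustration, since actual implementations typically use the L∞ distance and far more efficient algorithms.

import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein_distance(D1, D2, p=1):
    """p-Wasserstein distance between two persistence diagrams, given
    as arrays of (birth, death) pairs; points may also be matched to
    their projection onto the diagonal."""
    D1 = np.asarray(D1, dtype=float)
    D2 = np.asarray(D2, dtype=float)
    n, m = len(D1), len(D2)

    LARGE = 1e12
    C     = np.zeros((n + m, n + m))

    # Point-to-point costs
    C[:n, :m] = np.linalg.norm(D1[:, None, :] - D2[None, :, :], axis=-1) ** p

    # Point-to-diagonal costs: each point may only use its 'own' slot
    C[:n, m:] = LARGE
    C[:n, m:][np.arange(n), np.arange(n)] = (np.abs(D1[:, 1] - D1[:, 0]) / np.sqrt(2)) ** p
    C[n:, :m] = LARGE
    C[n:, :m][np.arange(m), np.arange(m)] = (np.abs(D2[:, 1] - D2[:, 0]) / np.sqrt(2)) ** p

    # Diagonal-to-diagonal matches remain free (zero cost)
    row, col = linear_sum_assignment(C)
    return C[row, col].sum() ** (1.0 / p)

# Example: two tiny diagrams
A = [(0.0, 1.0), (0.2, 0.6)]
B = [(0.0, 0.9)]
print(wasserstein_distance(A, B, p=2))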

Distances & kernels

In the aforementioned short paper, I used the Wasserstein distance between persistence diagrams in order to obtain a low-dimensional embedding of Shakespeare’s plays. Proximity in this embedding means that the co-occurrence networks of two plays have a similar structure. Among other things, one of the results of the short paper was that comedies such as A Midsummer Night’s Dream appear to form clusters, while tragedies such as Hamlet appear to be spread out more.

It is thus a natural question to ask whether it is possible to decide if a random Shakespeare play, described as a co-occurrence network, is a comedy or not. This seems like a perfect job for machine learning! Based on the successful usages of persistent homology in other contexts, it seemed natural to me to use it in order to assess the dissimilarity between the networks. There are other options out there for doing this—and it would be an interesting side-project to compare all of them—but seeing that most of my research centres on persistent homology one way or the other, I will stick with it.

In order to unleash the power of machine learning on these unsuspecting networks, we require the definition of a suitable kernel. A kernel function permits us to use supervised learning algorithms without ever having to specify coordinates in the feature space. Given the fact that the feature space is extremely hard to describe—notwithstanding current efforts to introduce vector-based representations of persistence diagrams, such as persistence images—kernel methods are a blessing. Thanks to scikit-learn, they can be used quite easily.

There are multiple options for calculating kernels. Here, I experimented a little bit with a simple Wasserstein kernel (obtained from negating the Wasserstein distance values) as well as persistence indicator functions (of which I hope to write more in a subsequent post).

I am leaving out the generation process of the kernel matrices for now, because I plan on describing them in a subsequent blog post: my library for calculating persistent homology is still subject to change and I do not want to prematurely write about APIs that are not yet ready.

Experiments & results

After generating the kernel matrices with Aleph, we can use support vector machines to test how easily we may detect comedies. Some caution is necessary, though: there are only 37 plays, of which 17 are comedies. A standard splitting of the data set into training and test data is thus not useful here because there are not enough samples. I thus opted for cross-validation techniques, more precisely leave-p-out cross-validation, with p up to 3.
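
Purely for illustration, the evaluation could look roughly like the following sketch. It assumes a precomputed matrix of pairwise Wasserstein distances and a binary label vector (1 = comedy), turns the distances into the ‘negation’ kernel mentioned above, and evaluates a support vector machine with leave-p-out cross-validation. This is not the exact script that produced the numbers below.

import numpy as np

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeavePOut
from sklearn.svm import SVC

def evaluate(D, y, p=2):
    """Leave-p-out evaluation of an SVM with a precomputed kernel.
    D: (n, n) matrix of pairwise distances, y: binary labels."""
    K = -np.asarray(D)          # 'negation' kernel; not necessarily PSD
    y = np.asarray(y)

    accuracies, aucs = [], []
    for train, test in LeavePOut(p).split(K):
        clf = SVC(kernel="precomputed")
        clf.fit(K[np.ix_(train, train)], y[train])

        K_test = K[np.ix_(test, train)]
        pred   = clf.predict(K_test)
        scores = clf.decision_function(K_test)

        accuracies.append(np.mean(pred == y[test]))
        if len(np.unique(y[test])) == 2:      # ROC AUC needs both classes
            aucs.append(roc_auc_score(y[test], scores))

    return np.mean(accuracies), np.mean(aucs)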

The results are as follows:

Kernel                                  Accuracy    Average ROC AUC
Persistence indicator functions, d=1    0.86        0.91
Persistence indicator functions, d=2    0.73        0.80
Persistence indicator functions, d=3    0.66        0.74
Wasserstein, d=1                        0.86        0.95
Wasserstein, d=2                        0.81        0.94
Wasserstein, d=3                        0.81        0.93

The entries in the third column refer to the area under the curve of the receiver operating characteristic and give an idea about how the classifier will perform in general. These values are to be taken with a grain of salt, however, because there are few samples of both classes. I selected the accuracy performance measure because there are only two classes and they are not highly skewed. Given that the probability of a random play being a comedy is about 0.46, it is nice to see that we are able to detect comedies rather well.

Nonetheless, the values in the table are somewhat surprising. First of all, from personal experience I would have expected a slightly better performance from the second Wasserstein distance—it is the one that is commonly used because its local costs resemble the basic Euclidean distance. This has the nice property of suppressing topological features with low persistence values. Since the first Wasserstein distance, which does not suppress such features, performs at least as well here, it appears that these low-persistence features are actually useful for classification purposes. Paul Bendich suspected something similar in his paper Persistent Homology Analysis of Brain Artery Trees:

That said, there is no guarantee of persistence being correlated with importance, just with reliability. Indeed, one of the findings in Section 5 is that dots of not-particularly-high persistence have the most distinguishing power in our specific application.

What now?

This was/is a fun project for me as it combines several interesting aspects. I look forward to trying out more kernels, such as the one proposed by Reininghaus et al. in A Stable Multi-Scale Kernel for Topological Machine Learning.

One of the results of this paper was that the Wasserstein distance is theoretically unsuitable as a kernel function. This is somewhat unfortunate because it still appears to work. I sincerely hope that the persistence indicator function proves to be a viable candidate for kernel methods because it can be calculated very quickly and, in contrast to persistence diagrams, permits the definition of a unique mean function.

As soon as time permits, I shall write a more detailed article about the technical details of this experiment. Please take a look at the GitHub repository that I created for this article. Here, you will find the raw data along with some scripts to reproduce the values reported in this blog post.

Have fun and feel free to drop me a line if you are interested in these (or related) methods. I will be more than happy to have an interesting discussion.

Posted Tuesday evening, May 16th, 2017

Scatterplot matrices are one of the most practical direct visualizations of unstructured data sets—provided that the dimensionality does not become too large. Despite their prominence in the visualization community, there are few non-commercial tools out there that are able to produce scatterplot matrices for publication-ready visualizations.

In this post, I want to briefly explain how to use a recent version of gnuplot, one of my favourite plotting tools, to create something that resembles a scatterplot matrix. In this example, I will use the famous Iris data set, which I am sure everyone is familiar with by now.

I prepared a slightly modified version of the data set, which makes processing it with gnuplot easier. To this end, I shortened the labels and inserted newlines after each block of identical labels. This makes using colours much easier.

The basic idea of the plot is to use the multiplot mode of gnuplot and create all plots in a nested loop. The diagonal elements are filled with a histogram of the corresponding attribute—which is of course also created using gnuplot. The only thing I am not proud of is the use of labels. I am using the x2label feature in order to put axis labels on top of a plot, but as gnuplot does not (yet) support a simpler mechanism for placing labels in multiple plots, I had to resort to a truly horrid way of getting this to work…

Without further ado, here’s the code, which you can find in the gist linked at the end of this post. And this is what the output looks like:

Scatterplot matrix of the Iris data set

You can find the code for this script as a gist on GitHub.

Posted late Wednesday evening, April 5th, 2017

Most of the time, PhD comics hits too close to home. Recently, they ran a strip showing the temporal evolution of a typical thesis document. As expected, the comic proves that efforts are “doubly redoubled” when the deadline approaches. Naturally, I wondered how my own thesis—still under review—fares in that regard.

I keep my thesis under version control because I am a paranoid and cranky person when it comes to data security and backups. Consequently, I required a script for reconstructing the changes I made to the thesis sources over time. Here’s what I came up with:
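
(The full script is available as a gist, linked at the end of this post; the following is a rough sketch of the idea rather than the script itself: walk over all commits and count whitespace-separated tokens in every .tex file.)

#!/usr/bin/env python3
#
# Sketch: print 'date word-count' for every commit of the current git
# repository, counting whitespace-separated tokens in all .tex files.
import subprocess

def git(*args):
    return subprocess.run(["git"] + list(args),
                          capture_output=True, text=True, check=True).stdout

def words_at(commit):
    total = 0
    for name in git("ls-tree", "-r", "--name-only", commit).splitlines():
        if name.endswith(".tex"):
            total += len(git("show", "{}:{}".format(commit, name)).split())
    return total

for line in git("log", "--reverse", "--format=%H %ad", "--date=short").splitlines():
    commit, date = line.split(maxsplit=1)
    print(date, words_at(commit))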

If I let this script run, with the output being stored in a file somewhere, I can use the venerable gnuplot to create a nice graph. This is what it looks like for my dissertation:

Word count in my thesis over time

Note that the script is unable to count “proper” words for now. Hence, changes in the LaTeX markup will also result in increasing the total number of words. I guess the slope of my own efforts is not too similar to the comic. At the same time, you can see that most of the content of the thesis was added in a comparatively short amount of time.

Feel free to apply this to your own documents. The code is available as a gist on GitHub. This is my first real gist—thanks to Conrad for the suggestion!

Posted late Thursday evening, February 9th, 2017

I recently had to convert several structured grids in VTK format for further processing. Since I could not find any other code for doing this, here is my quick and dirty approach:

#!/usr/bin/env python3

from PIL import Image

import numpy
import os
import re
import sys

reDimensions = re.compile( r'DIMENSIONS\s+(\d+)\s+(\d+)\s+(\d+)' )
reSpacing    = re.compile( r'SPACING\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)' )
rePointData  = re.compile( r'POINT_DATA\s+(\d+)' )
reTableData  = re.compile( r'(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)' )

for filename in sys.argv[1:]:
    with open( filename ) as f:
        width  = 0
        height = 0
        depth  = 0

        data = None # Contains matrix data
        ii   = 0    # Current data index for filling matrix

        # Stores only those coordinates (given as an x and y position)
        # that are nonzero. This permits to match adjacent time steps
        # in a very concise and sparse manner.
        nonZero = list()

        for line in f:
            if reDimensions.match(line):
                result = reDimensions.match(line)
                width  = int( result.group(1) )
                height = int( result.group(2) )
                depth  = int( result.group(3) )

                assert width * height * depth > 0, "Invalid dimensions"

                data = numpy.zeros( (height,width) )
            elif reSpacing.match(line):
                result = reSpacing.match(line)
                sx     = result.group(1)
                sy     = result.group(2)
                sz     = result.group(3)

                # TODO: I am ignoring the spacing, hence I am only
                # able to load uniform rectilinear grids.
                assert sx == sy and sy == sz, "Invalid spacing"
            elif rePointData.match(line):
                result = rePointData.match(line)
                num    = int( result.group(1) )

                assert num == width * height * depth, "Inconsistent dimensions"
            elif reTableData.match(line):
                result = reTableData.match(line) 
                d      = (u,v,w) = (
                                    float( result.group(1) ),
                                    float( result.group(2) ),
                                    float( result.group(3) )
                                   )

                for k in range(0,3):
                    ix     = ii % width
                    iy     = ii // width 
                    ii     = ii + 1

                    if d[k] > 0:
                        nonZero.append( (ix,iy) )

                    data[iy][ix] = d[k]

    # TODO: `data` is now a matrix corresponding to the dimensions of
    # the original grid. Do something with it...


    #
    # Store non-zero coordinates
    #

    outputCoordinates = "/tmp/"\
             + os.path.splitext( os.path.basename(filename) )[0]\
             + ".txt"

    with open(outputCoordinates, "w") as g:
        for (x,y) in nonZero:
            print("%d\t%d" % (x,y), file=g)

    #
    # Store image
    #

    outputImage =   "/tmp/"\
             + os.path.splitext( os.path.basename(filename) )[0]\
             + ".png"

    # Rescale to 8-bit grey values and save (an assumption; adjust as needed)
    if data is not None:
        image = Image.fromarray( ( 255 * data / max( data.max(), 1e-8 ) ).astype( numpy.uint8 ) )
        image.save( outputImage )

The code from above iterates over a list of VTK files (assumed to be in ASCII format) that contain POINT_DATA, i.e. a scalar value for each grid cell. Provided that the grid uses uniform spacing, the code is capable of parsing it and storing the POINT_DATA values in a numpy matrix. Moreover, the list of non-zero coordinates of the matrix is written to /tmp, along with a pixel-based image of the grid. Both outputs are highly useful for debugging and understanding the input.

Feel free to do more with this code—I am hereby releasing it into the public domain.

Posted Wednesday afternoon, February 8th, 2017

In a previous article, I have already talked about the motivation behind analysing social networks of Shakespeare’s plays. Since then, I have pursued this subject somewhat further. In particular, I took some time to formulate a more methodical analysis approach.

The result is a nice short paper, titled ‘Shall I compare thee to a network?’—Visualizing the Topological Structure of Shakespeare’s Plays. I was very honoured to present this at the 2016 Workshop on Visualization for the Digital Humanities at IEEE Vis 2016. It was a great venue with lots of interesting papers and I urge you to take a look at the list of submissions.

Our paper follows the basic strategy outlined in the previous blog post. Hence, a co-occurrence network between the characters of a play is extracted first. There are numerous ways of doing that, and each different way influences the structures that may be detected in a subsequent analysis step. In the paper, I present two different weight schemes. One is based on the amount of speech uttered by each character, the other one measures the point in time at which characters start to interact with each other.

The co-occurrence network is then subjected to a topological analysis based on persistent homology. Basically, we consider the network to grow in accordance with the weights. This permits us to treat it as a graph-based filtration, which in turn means that we may calculate persistence diagrams and, from that, derive a measure of dissimilarity. The interesting thing here is that this approach results in well-defined distances between individual plays—meaning that we can measure structural differences and structural similarities.
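
As a hypothetical sketch of the filtration idea (the concrete weight schemes are the ones described in the paper): assign every vertex the smallest weight among its incident edges, let every edge enter at its own weight, and sort so that faces precede co-faces.

import networkx as nx

def graph_filtration(G):
    """Returns a list of (simplex, weight) pairs, sorted by weight and
    dimension, for a weighted networkx graph G. This assumes that the
    network grows as the weight increases; for a scheme with decreasing
    weights, sort in reverse order instead."""
    simplices = []
    for v in G.nodes:
        weights = [G[v][u]["weight"] for u in G[v]]
        simplices.append(((v,), min(weights) if weights else 0.0))
    for u, v, data in G.edges(data=True):
        simplices.append(((u, v), data["weight"]))

    simplices.sort(key=lambda s: (s[1], len(s[0])))
    return simplices

# Example with a tiny weighted graph
G = nx.Graph()
G.add_edge("Romeo", "Juliet", weight=1.0)
G.add_edge("Romeo", "Mercutio", weight=2.0)
print(graph_filtration(G))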

The paper contains some results that appear to indicate the utility of this approach. Without further training, our algorithm is capable of figuring out that All’s Well That Ends Well is a comedy that is structurally extremely dissimilar to all remaining comedies. This is interesting, insofar as this play is considered to be one of Shakespeare’s problem plays.

In the paper, we describe more results of the same ilk. I have made the networks available on GitHub, where I also plan on uploading code for calculating the dissimilarity measure.

Posted late Monday evening, November 7th, 2016

Debugging is one of the primary skills every computer science student should acquire. I want to provide an easy introduction to this topic, starring C++ and gdb, the debugger of the GNU project. In the following, I assume that the reader is somewhat familiar with the basic concepts of C++ programming.

A simple example

Let’s assume that we want to transform a string into a palindrome and write the following program to do so:

#include <iostream>
#include <string>

std::string palindrome( std::string s )
{
  for( auto i = s.size() - 1; i >= 0; i-- )
  {
    if( s[i] != s[s.size() - i - 1] )
      s[s.size() - i - 1] = s[i];
  }

  return s;
}

int main()
{
  std::cout << palindrome( "Hello, World!" ) << std::endl;
}

We compile the program using

$ g++ -std=c++11 foo.cc

and run ./a.out just to find out that the program crashes with a segmentation fault. Bummer. Let’s recompile the program and prepare it for debugging. To this end, we just add the -g switch on the command-line for the compiler. This ensures that debug symbols are included. Loosely speaking, these symbols are required for the debugger to connect the binary executable with our source code.

So, we issue

$ g++ -std=c++11 -g foo.cc

and run gdb with our executable as an argument:

$ gdb a.out

This should result in output that is somewhat similar to this:

GNU gdb (GDB) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
---Type <return> to continue, or q <return> to quit---
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
---Type <return> to continue, or q <return> to quit---
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from a.out...done.
(gdb)

The most important thing is that gdb greets us with its own prompt, meaning that from now on, we directly interact with the debugger instead of running the program directly.

We first run the binary to find out what happens:

(gdb) run

This should yield an output similar to

Program received signal SIGSEGV, Segmentation fault.
0x0000000000400bd4 in palindrome (s="!dlroW World!") at foo.cc:8
8       if( s[i] != s[s.size() - i - 1] )

and send us right back to the prompt. We can see that our program caused a segmentation fault. This error usually indicates that we accessed memory that does not belong to our program somewhere along the way.

How can we find out more about what went wrong? We first take a look at the backtrace of the error. Roughly speaking, these are all the functions that are currently active—currently meaning at the time of the error.

To find out more about the backtrace, we type bt or backtrace:

(gdb) bt

This gives us lines like these:

#0  0x0000000000400bd4 in palindrome (s="!dlroW World!") at foo.cc:8
#1  0x0000000000400c65 in main () at foo.cc:17

Not too surprisingly, the last frame, i.e. frame #1, is the main function of our program. From there, our palindrome function was called. Each of the lines in this list is referred to as a frame. To see a specific frame—i.e. a certain function along with its code—we type frame NUMBER or f NUMBER.

Let us first take a look at the main function:

(gdb) frame 1

This gives us some code:

#1  0x0000000000400c65 in main () at foo.cc:17
17    std::cout << palindrome( "Hello, World!" ) << std::endl;

We can see the line number (17) and the exact call that is being executed. Furthermore, we see the next function call, viz. palindrome, which in turn causes the segmentation fault.

Since the line looks relatively safe, there must be an issue with the variables that are used to access the characters of the string. So, let’s get back to the first frame:

(gdb) frame 0
#0  0x0000000000400bd4 in palindrome (s="!dlroW World!") at foo.cc:8
8       if( s[i] != s[s.size() - i - 1] )

Again, the if condition seems relatively safe. To cause a segmentation fault, we must access some bad memory somewhere. Thus, let us take a look at the values of the variables at this point. To this end, we can use the print command or p for short.

Let us begin with the string itself:

(gdb) print s
$1 = "!dlroW World!"

This seems to be correct so far. There is certainly nothing wrong here. So let us try the loop variable next:

(gdb) print i
$2 = 18446744073709549580

Alright. I am fairly confident that our string is not that long. So, we have identified the culprit—something is going wrong with the loop counter. At this point, we could use the edit command to open the file in our favourite editor and add some std::cout statements.

Yes, I am serious. std::cout can be a very powerful debugging tool! As long as we give some serious thought to identifying potential places in our code that may cause an error, there is nothing wrong with sprinkling some printing statements over your code. Of course we could also set breakpoints and watch how the value of i changes over the course of the program. But this is more advanced stuff, so I will rather cover it in another article.

So, let us assume that we have added enough std::cout statements. Running the program with these statements quickly shows that we are traversing the string incorrectly. As s.size() is always unsigned, the loop variable i is unsigned as well, so the condition i >= 0 in the loop always evaluates to true. Once i reaches 0, the decrement wraps around to a huge value, which eventually leads to the out-of-bounds access. By traversing the string in a forward manner, e.g.

for( std::size_t i = 0; i < s.size(); i++ )

we finally get our program to work, resulting in the beautiful output of

Hello, ,olleH

Fun with make

If you have a Makefile—or if you are using CMake, for example—the debugger offers you a very cool integrated workflow. You can directly use edit to work on the code. Afterwards, you can type make to recompile the program, followed by run to check its behaviour again. For easily-fixable errors, this means that you never even have to leave the debugger. More about this in a subsequent article.

Conclusion

In the next article, we shall see how to combine debugging with make. Also, I want to show you the wonderful world of breakpoints. Until that time, here are the main lessons of debugging:

  • Try to figure out what causes the error
  • Always assume that your code is at fault
  • Do not be afraid of using std::cout or printf or whatever for debugging. Whatever works.
  • Always compile with extra warnings enabled. By compiling with -Wall -Wextra, the compiler already helps you catch a lot of potential issues. In our case, for example, the compiler would have complained with comparison of unsigned expression >= 0 is always true, which would have already caught the error before debugging.

Happy bug-fixing!

Posted late Wednesday evening, September 28th, 2016