This is the personal blog of Bastian Rieck. I add posts in irregular intervals about things I consider relevant. Send any feedback, comments etc. to bastian@annwfn.net.

Even though C++ is still my favourite programming language, I am always amazed by the sheer number of headers I have to include to get something done. Most of my source files could contain #include <vector> at the beginning, because this is most likely what I am going to include anyway. Other headers are not treated this way. In fact, I am displaying a wanton disregard for many of the other headers. For example, I have never consciously use the execution header in real-world code.

I thus started wondering how other projects fared in that regard. So I cloned several of the larger C++ repositories on GitHub. More precisely, I started with the following repositories:

In total, these projects comprise more than 2 million lines of code—a reasonably-sized sample, I would say. To figure out how these projects use headers, I first extracted all STL headers from all files and counted their occurrences. This resulted in the following histogram (the counts are relative):

Pretty interesting, I would say. This is a nice long-tail distribution for which a few headers are used much more often than the rest. In fact, for these repositories, only four headers make up more than 50% of the usage:

• vector
• string
• memory
• utility

For vector and string, this is not surprising. Virtually every C++ programmer uses vector for almost anything. The same goes for string. Similarly, memory is not so surprising as it contains the different smart pointer classes—most prominently, shared_ptr. The last one of the list, utility, was slightly unexpected for me. It contains things such as std::make_pair and std::move. At least the latter one is required for any class that does its own memory management.

At the tail of the distribution, the more exotic headers await. The stack header, for example, appears not to be used in these projects too often, while the future header comes in dead last. I must confess that I have not used in real-world projects so far because I did not yet have to deal with asynchronous operations. The lack of enthusiasm for the regex header is somewhat sad but maybe this is to be expected in a language that does not really encourage the use of regular expressions? Also, C++ regular expressions are said to perform worse than their counterparts in other languages. To what extent the unfamiliarity of C++ programmers with regular expressions might contribute to this, I cannot say.

Let’s delve into another aspect of the headers. In my code, I noticed that some headers are almost always used together. For example, if there is an algorithm header, there is often also a functional header. Extending this to projects, I thought that it might be interesting to analyse the co-occurrence patterns of STL headers. To this end, I counted how often pairs of headers are being included the same file. This naturally gives rise to a co-occurrence matrix, in which rows and columns indicate headers, and the value indicates how often those header occur in the same file. If headers are sorted by their counts, this results in a beautiful picture:

This matrix tells us something about the universality of certain headers. The vector header, for example, co-occurs with almost every other header to some extent because vectors are such fundamental data types. The typeinfo header, on the other hand, is so very specific that is only co-occurs with typeindex. In fact, the structure of the matrix, i.e. the many dark cells, indicates that many combinations are highly unlikely to occur “in the wild”.

Some of the combinations really tell a story, though. For example, queue is used in conjunction with thread (possibly to implement patterns for multi-thread environments), but also with stack (possibly to implement different traversal strategies of implicit or explicit graph structures in these projects). I also see another pattern of my own code, namely the pair unordered_map and unordered_set. I tend to require either both of them (the set for iteration and storage, the map for, well associating more information with individual objects) or none of them.

# Conclusion

As a next step, it would be interesting to see whether the co-occurrence of certain headers makes it possible to guess the domain of a C++ program, just like certain pairs of words (I guess I should rather speak of bigrams here, to use the NLP term) are more indicative of certain genres of literature. Treating code like literature would certainly make for an interesting art project.

The code for this project is available on GitHub. You only have to supply the repositories for scanning.

Happy coding, until next time!

Update (2018-04-06): Changed the title because I was using the term meta-analysis incorrectly. Thanks, HN!

Posted Tuesday evening, April 3rd, 2018 Tags:

This article details some of my experiences with Travis CI, aiming to make this service more accessible to those who—like me— do not consider themselves to be “proper” software developers.

# Prelude and history

I am by no means a professional software developer, but my job offers me the possibility to churn out some code more often than not. Naturally, I want my code to be used by other people—especially when said code is part of a publication. There is just one catch: code does not exist in a vacuum but is part of some ecosystem. Even when you are developing everything in Python, a language that arguably tries to hide a lot of complexity so that one is able to focus on the task at hand, your code is not an island. Save for extremely trivial programs, you will probably be using libraries, such as the trinity of data science, i.e. import numpy as np, import scipy as sp, and last, but certainly not least, import matplotlib.pyplot as plt.

Your users, however, might have radically different configurations—it is well possible that some have more recent versions of all packages, while others will still be running the first release of Ubuntu. Some of them, and this is really hard for me to admit, may not even be running the best Linux distribution in the world. Some of them may not even be running a Linux operating system!

Given this untidy state of affairs, how can you at least make some notion of credible attempt at pretending to support your code for more than your machine only? That is where continuous integration comes in! Originally a technique in test-driven development, CI is supposed to make your lifer a tad easier by ensuring that the changes you make do not stop the project in total from working. This is achieved by integrating your changes often into the current master branch of your repository, and checking that all changes you make do not break the build.

Professional software development companies set up their own build servers, thereby ensuring that the product still is built as per spec. How does this pertain exactly to us academics?

Most software developed in academia is held together by the same ingredients, viz. hope, the tears of Ph.D. students, and sheer faith. Code is supposed to work until the deadline and, preferably, until the paper has been accepted. For many students, git is something one has heard of in that pesky software engineering class. The full power of version control systems is often not appreciated, even by research group leaders. Let me describe you a nice scenario: Suppose you are working towards the important deadline. Suddenly, one person (Alice) in your team discovers that the main algorithm has a bug. “Woe is me”, you say. But Alice, being very good at what she does, fixes the bug and after a few tense minutes she announces that the results you report in the paper still hold. You submit the paper early and go home to sleep.

This is the kind of magic that continuous integration can bring to your projects if you care to use it. Read on if you are interested.

# The magic

In a nutshell, Travis CI “merely” provides a set of virtual machines (featuring different operating systems even!) on which you can build and run your code. And the best thing is that this works automatically, whenever you update your repository via git push. No more worrying about whether that small change you did might have changed all the calculations—instead, a rather blissful existence.

Of course, this only works if you invest some time in setting up your project. At its core, Travis looks for a file called .travis.yml in your repository. It is this file that allows you to configure the steps that Travis performs for each update. This is what a simple file may look like if your project uses CMake as its build system and has no additional dependencies:

language: cpp

os:
- linux

compiler:
- gcc

script:
- mkdir build
- cd build
- cmake ../
- make


After each git push, Travis will dutifully clone the repository and execute all steps that you provide in the script section. Here, this means creating a separate build directory and checking that the project can be built. This is not much, of course, and does not really help in catching problems with an algorithm, but it is a start. Suppose you want to ensure that your project is also compilable with clang. Just change the corresponding section:

compiler:
- clang
- gcc


Or suppose that you want to add Mac OS support:

os:
- linux
- osx


It really is that simple. Travis will now create a build matrix, consisting of two operating systems (Linux and MacOS X) and two compilers (clang and gcc).

# But I want more magic!

Coming back to the scenario above, how can Travis help in this case? Well, of course you have to provide the tests that Travis needs to run. For example, I like to create a special test build target for CMake and let it execute my unit tests for me. Unit tests range from banal checks of classes to longer programs that compare expected results of algorithms to current ones. You will have to take my word for it, but tests like this have helped me multiple times in the past for various publications. If you are interested in how they may look, please refer to Aleph, my library for experimenting with topological data analysis. I do not claim that the code is perfect, but the tests subdirectory contains numerous unit tests that may drive home the point I am trying to make here. The best thing is that the configuration file is not getting needlessly complicated. Due to the nice feaures of CMake, it is again sufficient to extend the script section:

script:
- mkdir build
- cd build
- cmake ..
- make
- CTEST_OUTPUT_ON_FAILURE=1 make test


The ugly last line is only necessary in order to force the testing harness of CMake to be a little more verbose. You can see the output of Travis for this project here, and you will see that the unit tests are always executed—giving me peace and tranquility to some extent.

# So?

That is the basic gist of Travis. I think that using such a service can be beneficial for academical projects. Not only does give you the confidence that your code is doing something right, it can also be used to promote your research—by providing a repository that is tested under different platforms, with different compilers, you make it easier for people to actually use your cool algorithm (and cite you, of course, lest we forget about that).

# Dependencies

Now you are probably huffing at this article because your program is more complicated and has some dependencies. Well, Travis has you covered to some extent.

Travis permits you to change the build environment to some extent. For example, Aleph uses Boost and Eigen. Since those exist as packages under Ubuntu, the default Linux distribution used by Travis, I can easily install them in .travis.yml:

language: cpp

os: linux
sudo: false

apt:
packages:
- libboost-dev
- libeigen3-dev


For Mac OS X, support for Homebrew is available, but the use is slightly more complex:

before_install:
- if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew install boost; fi - if [[ "$TRAVIS_OS_NAME" == "osx" ]]; then brew install eigen; fi


Overall, though, this works just fine.

# Troubles

I do not want to sound overly enthusiastic here. Adding support for multiple operating systems and configurations can be painful. Heck, I often have to commit multiple times because the build just breaks for some darn reason. It is a fine balance that one has to achieve here: code should be usable, but your time is not infinite.

Travis unfortunately makes my life sometimes harder than it has to be. While I am grateful for the free (free!) service they offer, some things just irk me:

• The default Ubuntu image is old. Like really, really old, viz. the version of Ubuntu from frigging 2014. As an Arch Linux user, I am stunned. Some of my time is thus spent correcting for some arcane problems with old package versions.

• The OS X images are strange: the update process of brew just stalls, and the same goes for the builds themselves sometimes. I get many e-mails that tell me that my build errored (which in Travis lingo does not mean that the build failed, but rather that some virtual machine could not be started).

• The number of available images is very small. Ideally, I would like to be able to check my software on even more variants of Ubuntu but also on other flavours of Linux. And what about BSD? One of my users may want to install the software while running an old version of OpenBSD on toaster, so where is the support for that?

• Do not get me started on supporting older and newer package versions…

Nonetheless, I am grateful for the Travis CI engineers. They are doing a heck of a job just so that I can pretend that my GitHub projects are actually useful to someone out there. Thank you (I really mean it)!

# Coda

All in all, Travis CI has been very beneficial for my projects. I sleep a little bit better knowing that I do not have to worry about breaking my algorithms just by changing the interface somewhere else. Even though it has it shortcomings—but nothing is perfect—I urge you to consider using it for your projects and your publications!

By the way: if you are a hardcore Bitbucket user, take a look at Pipelines. They provide similar functionality for those of us that are not living in the land of GitHub.

Happy coding/testing/integrating, until next time!

Posted late Tuesday evening, March 13th, 2018

ETH Zurich offers multiple wireless networks around campus. Our local IT people drilled into me that I should prefer to use the eth-5 SSID whenever possible. Obviously I had to figure out a way to support this profile for netctl. If you count yourself among the few, the happy few, the band of users if this fine piece of software, here is what you have to do:

• Create a new file /etc/netctl/WiFi_eth5
• Paste the following file, while taking care to replace $USER with your ETH username and $PASSWORD with your ETH network password (this is not your regular login password; I am talking about the other one), and $INTERFACE with the WiFi interface, e.g. wlp4s0 • Activate the new profile with netctl start WiFi_eth5 Description='ETH' Interface='$INTERFACE'
Connection='wireless'
IP='dhcp'
ESSID='eth-5'
Security='wpa-configsection'
WPAConfigSection=(
'ssid="eth-5"'
'key_mgmt=WPA-EAP'
'eap=PEAP'
'proto=WPA RSN'
'identity="$USER"' 'password="$PASSWORD"'
'phase2="auth=MSCHAPV2"'
)


Happy surfing, until next time!

Posted late Sunday evening, March 11th, 2018 Tags:

When I was writing my dissertation, I was planning on creating all images as vector graphics if possible. They scale better and give the document an altogether published look. Some of the figures containing meshes (of some 3D objects for an IEEE TVCG publication about the merits of topological data analysis) turned out to be problematic, though. The state-of-the-art tool for mesh analysis, the aptly named MeshLab, is not capable of exporting vector graphics. Luckily, the Inkscape vector graphics editor, which I already used for other graphics, shipped with a plugin for “rendering” meshes in Wavefront OBJ format.

There were only two snags with the plugin:

1. It seemed to be unmaintained or at the very least, somewhat undocumented.
2. It did not support colours.

I could not easily change the first point, but boy, do I love me some fancy colours, in particular when I am talking about the curvature of meshes, for example. So I added colour support for the plugin and re-christened it to the highly creative name of 3D Polyhedron with vertex colours. You can find the code on GitHub. In contrast to most of my projects, it is released under the GPL. This is due to the fact that I was modifying the code of the original plugin.

Usage instructions are available on GitHub, so I would rather end this brief post with an example of the output of the plugin.

Until next time!

Posted late Monday evening, March 5th, 2018 Tags:

After defending my thesis last year, I wanted to show my newfound admiration for all things LaTeX. Since I spent a lot of time getting the formatting just right for my purposes (following typographical advice of numerous sources), I decided to create some LaTeX packages for research-based documents.

# latex-mimore

latex-mimore is a minimal and modern template for reports, such as the ones you have to do for a seminar. You can also use the class for longer documents, such as a bachelor's thesis, but I would recommend using latex-mimosis, which I describe below. If you clone the repository and set up your LaTeX environment correctly, using the class is as easy as writing

\documentclass{mimore}


as the preamble to your document. Please take a look at the GitHub repository for more details. This is what a document looks like if formatted with latex-mimore.

# latex-mimosis

latex-mimosis is the bigger sibling of latex-mimore. It is meant for your Ph.D. dissertation thesis, your master's thesis, or your bachelor's thesis. Again, using it is as easy as adding

\documentclass{mimosis}


to your preamble. Please take a look at the GitHub repository for more details. This is what a document looks like if formatted with latex-mimosis.

# Design considerations

Both packages have been carefully crafted. They aim to be…

• clean: there is no LaTeX trickery involved; the page is neatly divided using standard typesetting practices. Adjustments to the defaults are documented and make sense. At least to me.
• minimal: there are no unnecessary adjustments of anything in there, no spurious decorations. The layout is inspired by Robert Bringhurst and his ideas about typography. You can also detect a hint of Edward Tufte in there, even though I am not such a big fan of the layout in his books; at least not for my own dissertation.
• modern: the template should be pleasing to the eye, without any of the cruft that is a remnant of typewriter times.

The templates are released under an MIT licence and I would love to hear your feedback about them. If anything is missing or can be improved, please open an issue in the corresponding repository.

Happy TeXing, until next time!

Posted at lunch time on Thursday, February 15th, 2018

If you are like me and a long-term git user, you will probably accumulate numerous remotes per repository (maybe one for your private server, one for work, and so on).

Pushing to all of them can get a bit tedious, so I defined a new commnad gpa, which is of course short for git push all:

git remote | xargs -L1 git push --all


Let us dissect this quickly:

• The need for git remote is obvious because it permits us to enumerate all remotes
• We need xargs -L1 in order to format the input to fit into one line
• Finally, the git push --all pushes all branches to the specified remote

I have alias gpa='git remote | xargs -L1 git push --all' in my ZSH configuration, so that I can use this command globally.

Happy committing, until next time!

Update: Conrad was kind enough to point out that one can just as well modify .gitconfig accordingly and make this command an alias of the git command:

[alias]
pushall = !git remote | xargs -L1 git push --all


Thanks!

Posted late Tuesday evening, February 13th, 2018 Tags:

I have taken up a new position as a postdoctoral researcher in the Machine Learning & Computational Biology Lab of Prof. Dr. Karsten Borgwardt at ETH Zürich. In hindsight—as is always the case—this now feels to be the logical move. During the last year, my research started to touch more on more upon concepts and issues in machine learning, and I let myself be drawn more and more into this rich and exciting field. I will of course try to apply my knowledge in scientific and information visualization in my new job as well and I hope that there will be many interesting opportunities for papers that span multiple fields.

Moreover, I am really excited to be part of a group that actively uses GitHub in their scientific work. Aleph, my topological data analysis framework, will finally have some interesting company. Until my own contributions start to make an appearance in the repositories of my lab, please take a look at the existing projects of the MLCB Lab on GitHub.

At the same time, it goes without saying that the views expressed on this website are entirely my own and have neither been reviewed nor endorsed by ETH Zurich.

Posted at lunch time on Wednesday, January 10th, 2018 Tags:

Software developers are already familiar with the KISS principle. Roughly speaking, it refers to the old wisdom that simple solutions are to be preferred, while unnecessary complexity should be avoided. In machine learning, this means that one should increase the complexity iteratively, starting from simple models and—if necessary—working upwards to more complicated ones. I was recently reminded and humbled by the wisdom of KISS while working on a novel topology-based kernel for machine learning.

# Prelude: A Paper on Deep Graph Kernels

A recent paper by Yanardag and Vishwanathan introduced a way of combining graph kernels with deep learning techniques. The new framework basically yields a way of using graph kernels, such as graphlet kernels in a regular deep learning setting.

The two researchers used some interesting data sets for their publication. Most notably, they extracted a set of co-occurrence networks (more about that in a minute) from Reddit, a content aggregation and discussion site. Reddit consists of different communities, the subreddits. Each subreddit deals with a different topic, ranging from archaeology to zoology. The posting style of these subreddits varies a lot. There are several subreddits that are based on a question–answer format, while others are more centred around individual discussions.

Yanardag and Vishwanathan hence crawled the top submissions from the subreddits IamA, AskReddit, both of which are based on questions and answers, as well as from TrollXChromosomes, and atheism, which are discussion-based subreddits. From every submission, a graph was created by taking all the commenters of a thread as nodes and connecting two nodes by an edge if one user responds to the comment of another user. We can see that this is an extremely simple model—it represents only a fraction of the information available in every discussion. Nonetheless, there is some hope that qualitatively different behaviours will emerge. More precisely, the assumption of Yanardag and Vishwanathan is that there is some quantifiable difference between question–answer subreddits and discussion-based subreddits. Their paper aims to learn the correct classification for each thread. Hence, given a graph, we want to teach the computer to tell us whether the graph is more likely to arise from a question–answer subreddit or from a discussion-based one.

The two researchers refer to this data set as REDDIT-BINARY. It is now available in a repository of benchmark data sets for graph kernels, gracefully provided and lovingly curated by the CS department of Dortmund University.

# Interlude: Looking at the Data

Prior to actually working with the data, let us first take a look at it in order to get a feel for its inherent patterns. To this end, I used Aleph, my library for topological data analysis read and convert the individual graphs of every discussion to the simpler GML format. See below if you are also interested in the data. Next, I used Gephi to obtain a force-directed visualization of some of the networks.

A look at a question–answer subreddit shows a central core structure of the data set, from which numerous strands—each corresponding most probably to smaller discussions—emerge:

The discussion-based subreddit graph, on the other hand, exhibits a larger depth, manifesting themselves in a few long strands. Nonetheless, a similar central core structure is observable as well.

Keep in mind that the selected examples are not necessarily representative—I merely picked two of the large graphs in the data set to obtain an idea of how the data looks.

# A Complex Classification

A straightforward way to classify those networks would be to use graph kernels, such as the graphlet kernels. The basic idea behind these kernels is to measure a dissimilarity between two graphs by means of, for example, the presence or absence of certain subgraphs. At least this is the strategy pursued by the graphlet kernel. Other kernels may instead opt for comparing random walks on both graphs. A common theme of these kernels is that they are rather expensive to compute. In many applications, they are the only hope of obtaining suitable dissimilarity information without having to solve the graph isomorphism problem, which is even more computationally expensive. Hence, graph kernels are often the only suitable way of assessing the dissimilarity between graphs. They are quite flexible and, as the authors show, can even be integrated into a deep learning framework.

As for their performance, Yanardag and Vishwanathan report an accuracy of 77.34% (standard deviation 0.18) for graphlet kernels and 78.04% (standard deviation 0.39) for deep graphlet kernels, which is very good considering the baseline for random guessing is 50%.

# A Simple Classification

In line with the KISS principle, I wanted to figure out alternative ways of achieving similar accuracy values with a better performance, if possible. This suggests a feature-based approach with features that can be calculated easily. There are numerous potential options for choosing these features. I figured that good choices include the average clustering coefficient of the graph, the average degree, the average shortest path length, the density, and the diameter of the graph. Among these, the average shortest path length and the diameter take longest to compute because they essentially have to enumerate all shortest paths in the graph. So I only used the remaining three features, all of which are computable in polynomial time.

I used the excellent NetworkX package for Python to do most of the heavy lifting. Reading a graph and calculating its features is as easy as it gets:

import networkx as nx

average_clustering_coefficient = nx.average_clustering(G)
average_degree                 = np.mean( [degree for _,degree in nx.degree(G) ] )
density                        = nx.density(G)


I collect these values in a pandas.DataFrame to simplify their handling. Now we have to choose some classifiers. I selected decision trees, support vector machines, and logistic regression. Next, let us train and test each of these classifiers by means of 10-fold cross-validation.

X = df.values
y = labels

classifiers = [ DecisionTreeClassifier(), LinearSVC(), LogisticRegression() ]
for clf in classifiers:
scores = cross_val_score(clf, X, y, cv=10)
print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))


As you can see, I am using the default values for every classifier—no grid search or other technique for finding better hyperparameters for now. Instead, I want to see how these classifiers perform out of the box.

# Astonishing Results

The initial test resulted in the following accuracy values: 77.84% (standard deviation 0.16) for decision trees, 64.14% (standard deviation 0.45) for support vector machines, and 55.54% (standard deviation 0.49) for logistic regression. At least the first result is highly astonishing—without any adjustments to the classifier whatsoever, we obtain a performance that is en par with more complex learning strategies! Recall that Yanardag and Vishwanathan adjusted the hyperparameters during training, while this approach merely used the default values.

Let us thus focus on the decision tree classifier first and figure out why it performs the way it did. To this end, let us take a look at what the selected features look like and how important they are. To do this, I just added a single training instance, based on standard test–train split of the decision tree classifier:

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.20)

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

print(clf.feature_importances_)


This results in [ 0.18874114 0.34113588 0.47012298 ], meaning that all three features are somewhat important, with density accounting for almost 50% of the purity of a node—if you are unfamiliar with decision trees, the idea is to obtain nodes that are as “pure” as possible, meaning that there should be as few differences in class labels as possible. The density attribute appears to be important for splitting up impure nodes correctly.

To get a better feeling of the feature space, let us briefly visualize it using principal component analysis:

clf = PCA(n_components=2)
X_  = clf.fit_transform(X)

for label in set(labels):
idx = y[0:,] == label
plt.scatter(X_[idx, 0], X_[idx, 1], label=label, alpha=0.25)

plt.legend()
plt.show()


This results in the following image:

Every dot in the image represents an individual graph, whereas distances in the plot roughly correspond to distances between features. The large amount of overlaps between graphs with different labels thus seems to suggest that the features are insufficient to separate the individual graphs. How does the decision tree manage to separate them, nonetheless? As it turns out, by creating a very deep and wide tree. To see this, let us export the decision tree classifier:

with open("/tmp/clf.txt", "w") as f:
export_graphviz(clf, f)


Here is a small depiction of the resulting tree:

There is also a larger variant of this tree for you to play around with. As you can see, the tree looks relatively complicated, so it is very much tailored to the given problem. That is not to say that the tree is necessarily suffering from overfitting—we explicitly used cross-validation to prevent this. What it does show is that the simple feature space with three features, while yielding very good classification results, is not detecting the true underlying structure of the problem. The rules generated by the decision tree are artificial in the sense that they do not help us understand what makes the two groups of graphs different from each other.

# Coda

So, where does this leave us? We saw that we were able to perform as well as state-of-the-art methods by merely picking a simple feature space, consisting of three features, and a simple decision tree classifier. I would say that any classifier with a higher complexity should yield significant improvements over this simple baseline, both in terms of average accuracy and in terms of standard deviation. Hence, the KISS principle in machine learning: start simple and only add more complex building blocks, i.e. models, if you have established a useful baseline.

You can find both the code and the data in my repository on topological machine learning. To run the analysis described in this blog post, just execute

$./simple_network_analysis.py GML/????.gml Labels.txt  in the REDDIT-BINARY subfolder of the repository. That is all for now, until next time! Posted at teatime on Wednesday, October 18th, 2017 Tags: A warning upfront: this post is sort of an advertisement for Aleph, my C++ library for topological data analysis. In this blog post I do not want to cover any of the mathematical algorithms present in Aleph—I rather want to focus on small but integral part of the project, viz. unit tests. If you are not familiar with the concept of unit testing, the idea is (roughly) to write a small, self-sufficient test for every new piece of functionality that you write. Proponents of the methodology of test-driven development (TDD) even go so far as to require you to write the unit tests before you write your actual code. In this mindset, you first think of the results your code should achieve and which outputs you expect prior to writing any “real” code. I am putting the word real in quotation marks here because it may seem strange to focus on the tests before doing the heavy lifting. However, this way of approaching software development may actually be quite beneficial, in particular if you are working on algorithms with a nice mathematical flavour. Here, thinking about the results you want to achieve with your code ensures that at least a few known examples are processed correctly by your code, making it more probable that the code will perform well in real-world scenarios. When I started writing Aleph in 2016, I also wanted to add some unit tests, but I did not think that the size of the library warranted the inclusion of one of the big players, such as Google Test or Boost.Test. While arguably extremely powerful and teeming with more features than I could possibly imagine, they are also quite heavy and require non-trivial adjustments to any project. Thus, in the best tradition of the not-invented-here-syndrome, I decided to roll my own testing framework, base on pure CMake and small dash of C++. My design decisions were rather simple: • Use CTest, the testing framework of CMake to run the tests. This framework is rather simple and just uses the return type of a unit test program to decide whether the test worked correctly. • Provide a set of routines to check the correctness of certain calculations within a unit test, throwing an error if something unexpected happened. • Collect unit tests for the “larger” parts of the project in a single executable program. Yes, you read that right—my approach actually opts for throwing an error in order to crash the unit test program. Bear with me, though, for I think that this is actually a rather sane way of approaching unit tests. After all, if the tests fails, I am usually not interested in whether other parts of a test program—that may potentially depend on previous calculations—run through or not. As a consequence, adding a unit test to Aleph is as simple as adding the following lines to a CMakeLists.txt file, located in the tests subdirectory of the project: ADD_EXECUTABLE( test_io_gml test_io_gml.cc ) ADD_TEST( io_gml test_io_gml )  While in the main CMakeLists.txt, I added the following lines: ENABLE_TESTING() ADD_SUBDIRECTORY( tests )  So far, so good. A test now looks like this: #include <tests/Base.hh> void testBasic() { // do some nice calculation; store the results in foo and bar, // respectively ALEPH_ASSERT_THROW( foo != bar ); ALEPH_ASSERT_EQUAL( foo, 2.0 ); ALEPH_ASSERT_EQUAL( bar, 1.0 ); } void testAdvanced() { // a more advanced test } int main(int, char**) { testBasic(); testAdvanced(); }  That is basically the whole recipe for a simple unit test. Upon execution, main() will ensure that all larger-scale test routines, i.e. testSimple() and testAdvanced() are called. Within each of these routines, the calls to the corresponding macros—more on that in a minute— ensure that conditions are met, or certain values are equal to other values. Else, an error will be thrown, the test will abort, and CMake will throw an error upon test execution. So, how do the macros look like? Here is a copy of the current version of Aleph: #define ALEPH_ASSERT_THROW( condition ) \ { \ if( !( condition ) ) \ { \ throw std::runtime_error( std::string( __FILE__ ) \ + std::string( ":" ) \ + std::to_string( __LINE__ ) \ + std::string( " in " ) \ + std::string( __PRETTY_FUNCTION__ ) \ ); \ } \ } #define ALEPH_ASSERT_EQUAL( x, y ) \ { \ if( ( x ) != ( y ) ) \ { \ throw std::runtime_error( std::string( __FILE__ ) \ + std::string( ":" ) \ + std::to_string( __LINE__ ) \ + std::string( " in " ) \ + std::string( __PRETTY_FUNCTION__ ) \ + std::string( ": " ) \ + std::to_string( ( x ) ) \ + std::string( " != " ) \ + std::to_string( ( y ) ) \ ); \ } \ }  Pretty simple, I would say. The ALEPH_ASSERT_EQUAL macro actually tries to convert the corresponding values to strings, which may not always work. Of course, you could use more complicated string conversion routines, as Boost.Test does. For now, though, these macros are sufficient to make up the unit test framework of Aleph, which at the time of me writing this, encompasses more than 4000 lines of code. The only remaining question is how this framework is used in practice. By setting ENABLE_TESTING(), CMake actually exposes a new target called test. Hence, in order to run those tests, a simple make test is sufficient in the build directory. This is what the result may look like: $ make test
Running tests...
Test project /home/bastian/Projects/Aleph/build
Start  1: barycentric_subdivision
1/36 Test  #1: barycentric_subdivision ............   Passed    0.00 sec
Start  2: beta_skeleton

[...]

34/36 Test #34: union_find .........................   Passed    0.00 sec
Start 35: witness_complex
35/36 Test #35: witness_complex ....................   Passed    1.82 sec
Start 36: python_integration
36/36 Test #36: python_integration .................   Passed    0.07 sec

100% tests passed, 0 tests failed out of 36

Total Test time (real) =   5.74 sec


In addition to being rather lean, this framework can easily be integrated into an existing Travis CI workflow by adding

- make test


as an additional step to the script target in your .travis.yml file.

If you are interested in using this testing framework, please take a look at the following files:

That is all for now, until next time—may your unit tests always work the way you expect them to!

Posted Sunday night, October 15th, 2017

Many applications require the selection of a subset of objects from a larger set. For example, suppose you are down-sampling a large data set by choosing the indices that you want to keep at random. If you are satisfied with obtaining duplicate indices (in essence, if you are sampling with replacement), this is a trivial matter in most programming language. Python certainly makes this absurdly easy:

import random

def sample(k,n):
return [random.choice( [x for x in range(n)] ) for _ in range(k)]


But what about sampling without replacement? More precisely, what if you require the selection of k indices from an array of n values without choosing duplicate values but still want the probability of choosing a certain element to be uniform? Well, if you use Python, you are in luck again:

import random

def sample_without_replacement(k,n):
return random.sample([x for x in range(n)], k)


But what about other programming languages that may not have similar builtin functionality, such as C++? In this case, a very simple and elegant technique is given by none other than the great Donald E. Knuth. The following algorithm originally was given in The Art of Computer Programming, Volume 2, Section 3.4.2 on p. 137. Briefly put, it works like this:

1. Set t = 0 and m = 0. m represents the number of items selected so far, while t is the total number of items that have already been traversed.

2. Generate a random number U that is uniformly distributed between zero and one.

3. If (N-t)*U >= n-m, skip the current item by increasing t by 1 and going back to step 2. Else, select the current item by increasing both t and m by 1. Afterwards, either go back to step 2 or stop the algorithm (if sufficiently many items have been sampled already).

A very basic implementation of this algorithm (in C++) is given in the gist below:

I do not know about you, but my gut reaction upon seeing this algorithm for the first time was How can this possibly work? So, let us briefly prove the algorithm to be correct. Basically, we have to show that every item has the same probability of being chosen. The key insight is to realize that the probability has to change depending on the number of elements that have already been selected. More precisely, we need to determine the probability of choosing item t+1 if m items have already been selected. This requires some combinatorics. There are ways of choosing n items from N such that m items are picked as the first t items. Or, equivalently, these are the number of potential permutations of the remaining n-m elements. Out of these, we are interested in all the ones that contain item t+1—but this is easy, because we can just take that item as granted and count the remaining combinations as , i.e. the number of ways to choose n-m-1 items out of N-t-1 ones. The quotient of these two numbers is exactly the probability with which we must choose item t+1 if we want to have a uniform probability for choosing a certain item. This quotient turns out to be , which looks familiar. Finally, note that since U was chosen to be uniformly distributed between zero and one, the condition (N-t)*U >= n-m in the algorithm is satisfied with the required probability. Consequently, this method samples without replacement in the required manner!

If you are interested or require a more efficient algorithm, please refer to the paper An Efficient Algorithm for Sequential Random Sampling by Jeffrey Scott Vitter in ACM Transactions on Mathematical Software, Volume 13, Issue 1, pp. 58–67, for more details. The paper should be available free-of-charge from the link provided above, but there is also a mirror available, thanks to the good folks at INRIA, who require their publications to be kept in an open archive. Time permitting, I might provide a second blog post and implementation about the more efficient algorithm.

That is all for now, until next time!

Posted late Sunday evening, October 8th, 2017 Tags: