Many applications require the selection of a subset of objects from
a larger set. For example, suppose you are down-sampling a large data
set by choosing the indices that you want to keep at random. If you are
satisfied with obtaining duplicate indices (in essence, if you are
sampling *with* replacement), this is a trivial matter in most
programming language. Python certainly makes this absurdly easy:

```
import random
def sample(k,n):
return [random.choice( [x for x in range(n)] ) for _ in range(k)]
```

But what about sampling *without* replacement? More precisely, what if
you require the selection of `k`

indices from an array of `n`

values
without choosing duplicate values but still want the probability of
choosing a certain element to be uniform? Well, if you use Python, you
are in luck again:

```
import random
def sample_without_replacement(k,n):
return random.sample([x for x in range(n)], k)
```

But what about other programming languages that may not have similar
builtin functionality, such as C++? In this case, a very simple and
elegant technique is given by none other than the great Donald E.
Knuth. The following
algorithm originally was given in *The Art of Computer Programming*,
Volume 2, Section 3.4.2 on p. 137. Briefly put, it works like this:

Set

`t = 0`

and`m = 0`

.`m`

represents the number of items selected so far, while`t`

is the total number of items that have already been traversed.Generate a random number

`U`

that is uniformly distributed between zero and one.If

`(N-t)*U >= n-m`

,*skip*the current item by increasing`t`

by 1 and going back to step 2. Else,*select*the current item by increasing both`t`

and`m`

by 1. Afterwards, either go back to step 2 or stop the algorithm (if sufficiently many items have been sampled already).

A very basic implementation of this algorithm (in C++) is given in the gist below:

I do not know about you, but my gut reaction upon seeing this algorithm
for the first time was *How can this possibly work*? So, let us briefly
prove the algorithm to be correct. Basically, we have to show that every
item has the same probability of being chosen. The key insight is to
realize that the probability has to change depending on the number of
elements that have already been selected. More precisely, we need to
determine the probability of choosing item `t+1`

if `m`

items have
already been selected. This requires some combinatorics. There are
ways of choosing `n`

items from `N`

such that `m`

items are picked as
the first `t`

items. Or, equivalently, these are the number of potential
permutations of the *remaining* `n-m`

elements. Out of these, we are
interested in all the ones that contain item `t+1`

—but this is
easy, because we can just take that item as granted and count the
*remaining* combinations as
, i.e. the number of ways to choose `n-m-1`

items out of `N-t-1`

ones. The quotient of these two numbers is exactly the probability with
which we must choose item `t+1`

if we want to have a uniform probability
for choosing a certain item. This quotient turns out to be
, which looks familiar. Finally, note that since `U`

was chosen to
be uniformly distributed between zero and one, the condition ```
(N-t)*U >=
n-m
```

in the algorithm is satisfied with the required probability.
Consequently, this method samples without replacement in the required
manner!

If you are interested or require a more efficient algorithm, please
refer to the paper *An Efficient Algorithm for Sequential Random
Sampling*
by Jeffrey Scott Vitter in *ACM Transactions on Mathematical Software*,
Volume 13, Issue 1, pp. 58–67, for more details. The paper
*should* be available free-of-charge from the link provided above, but
there is also
a mirror
available, thanks to the good folks at INRIA,
who require their publications to be kept in an open archive. Time
permitting, I might provide a second blog post and implementation about
the more efficient algorithm.

That is all for now, until next time!

A warning upfront: this post is sort of an advertisement for
Aleph, my C++ library for
topological data analysis. In this blog post I do not want to cover any
of the mathematical algorithms present in Aleph—I rather want to
focus on small but integral part of the project, viz. *unit tests*.

If
you are not familiar with the concept of unit testing, the idea
is (roughly) to write a small, self-sufficient test for every new
piece of functionality that you write. Proponents of the methodology of
*test-driven development* (TDD) even go so far as to require you to
write the unit tests *before* you write your actual code. In this
mindset, you first think of the results your code should achieve and
which outputs you expect prior to writing any “real” code.
I am putting the word *real* in quotation marks here because it may seem
strange to focus on the tests before doing the heavy lifting.

However, this way of approaching software development may actually be quite beneficial, in particular if you are working on algorithms with a nice mathematical flavour. Here, thinking about the results you want to achieve with your code ensures that at least a few known examples are processed correctly by your code, making it more probable that the code will perform well in real-world scenarios.

When I started writing Aleph in 2016, I also wanted to add some unit
tests, but I did not think that the size of the library warranted the
inclusion of one of the big players, such as Google
Test or
`Boost.Test`

. While
arguably extremely powerful and teeming with more features than I could
possibly imagine, they are also quite heavy and require non-trivial
adjustments to any project.

Thus, in the best tradition of the
*not-invented-here-syndrome*,
I decided to roll my own testing framework, base on pure `CMake`

and
small dash of `C++`

. My design decisions were rather simple:

- Use
`CTest`

, the testing framework of`CMake`

to run the tests. This framework is rather simple and just uses the return type of a unit test program to decide whether the test worked correctly. - Provide a set of routines to check the correctness of certain calculations within a unit test, throwing an error if something unexpected happened.
- Collect unit tests for the “larger” parts of the project in a single executable program.

Yes, you read that right—my approach actually opts for throwing an
error in order to crash the unit test program. Bear with me, though, for
I think that this is actually a rather sane way of approaching unit
tests. After all, if the tests fails, I am usually not interested in
whether other parts of a test program—that may potentially depend
on previous calculations—run through or not. As a consequence,
adding a unit test to Aleph is as simple as adding the following lines
to a `CMakeLists.txt`

file, located in the `tests`

subdirectory of the project:

```
ADD_EXECUTABLE( test_io_gml test_io_gml.cc )
ADD_TEST( io_gml test_io_gml )
```

While in the main `CMakeLists.txt`

, I added the following lines:

```
ENABLE_TESTING()
ADD_SUBDIRECTORY( tests )
```

So far, so good. A test now looks like this:

```
#include <tests/Base.hh>
void testBasic()
{
// do some nice calculation; store the results in `foo` and `bar`,
// respectively
ALEPH_ASSERT_THROW( foo != bar );
ALEPH_ASSERT_EQUAL( foo, 2.0 );
ALEPH_ASSERT_EQUAL( bar, 1.0 );
}
void testAdvanced()
{
// a more advanced test
}
int main(int, char**)
{
testBasic();
testAdvanced();
}
```

That is basically the whole recipe for a simple unit test. Upon
execution, `main()`

will ensure that all larger-scale test routines,
i.e. `testSimple()`

and `testAdvanced()`

are called. Within each of
these routines, the calls to the corresponding macros—more on that
in a minute— ensure that conditions are met, or certain values are
equal to other values. Else, an error will be thrown, the test will
abort, and `CMake`

will throw an error upon test execution.

So, how do the macros look like? Here is a copy of the current version of Aleph:

```
#define ALEPH_ASSERT_THROW( condition ) \
{ \
if( !( condition ) ) \
{ \
throw std::runtime_error( std::string( __FILE__ ) \
+ std::string( ":" ) \
+ std::to_string( __LINE__ ) \
+ std::string( " in " ) \
+ std::string( __PRETTY_FUNCTION__ ) \
); \
} \
}
#define ALEPH_ASSERT_EQUAL( x, y ) \
{ \
if( ( x ) != ( y ) ) \
{ \
throw std::runtime_error( std::string( __FILE__ ) \
+ std::string( ":" ) \
+ std::to_string( __LINE__ ) \
+ std::string( " in " ) \
+ std::string( __PRETTY_FUNCTION__ ) \
+ std::string( ": " ) \
+ std::to_string( ( x ) ) \
+ std::string( " != " ) \
+ std::to_string( ( y ) ) \
); \
} \
}
```

Pretty simple, I would say. The `ALEPH_ASSERT_EQUAL`

macro actually
tries to convert the corresponding values to strings, which may not
always work. Of course, you could use more complicated string conversion
routines, as
`Boost.Test`

does.
For now, though, these macros are sufficient to make up the unit test
framework of Aleph, which at the time of me writing this, encompasses
more than 4000 lines of code.

The only remaining question is how this framework is used in practice.
By setting `ENABLE_TESTING()`

, `CMake`

actually exposes a new target
called `test`

. Hence, in order to run those tests, a simple `make test`

is sufficient in the build directory. This is what the result may look
like:

```
$ make test
Running tests...
Test project /home/bastian/Projects/Aleph/build
Start 1: barycentric_subdivision
1/36 Test #1: barycentric_subdivision ............ Passed 0.00 sec
Start 2: beta_skeleton
[...]
34/36 Test #34: union_find ......................... Passed 0.00 sec
Start 35: witness_complex
35/36 Test #35: witness_complex .................... Passed 1.82 sec
Start 36: python_integration
36/36 Test #36: python_integration ................. Passed 0.07 sec
100% tests passed, 0 tests failed out of 36
Total Test time (real) = 5.74 sec
```

In addition to being rather lean, this framework can easily be integrated into an existing Travis CI workflow by adding

```
- make test
```

as an additional step to the `script`

target in your `.travis.yml`

file.

If you are interested in using this testing framework, please take a look at the following files:

- The
`tests`

subdirectory of`Aleph`

- The basic configuration file of all unit tests
- A simple example of a unit test for a single class

That is all for now, until next time—may your unit tests always work the way you expect them to!

Software developers are already familiar with the KISS principle.
Roughly speaking, it refers to the old wisdom that simple solutions are to be
preferred, while *unnecessary* complexity should be avoided. In machine
learning, this means that one should increase the complexity iteratively,
starting from simple models and—if necessary—working upwards to more
complicated ones. I was recently reminded and humbled by the wisdom of KISS
while working on a novel topology-based kernel for machine learning.

# Prelude: A Paper on Deep Graph Kernels

A recent paper by Yanardag and Vishwanathan introduced a way of combining graph kernels with deep learning techniques. The new framework basically yields a way of using graph kernels, such as *graphlet kernels* in a regular deep learning setting.

The two researchers used some interesting data sets for their publication. Most
notably, they extracted a set of co-occurrence networks (more about that in
a minute) from Reddit, a content aggregation and
discussion site. Reddit consists of different communities, the *subreddits*.
Each subreddit deals with a different topic, ranging from archaeology to
zoology. The posting style of these subreddits varies a lot. There are several
subreddits that are based on a question–answer format, while others are
more centred around individual discussions.

Yanardag and Vishwanathan hence crawled the top submissions from the subreddits
*IamA*,
*AskReddit*, both of which are based on
questions and answers, as well as from
*TrollXChromosomes*, and
*atheism*, which are discussion-based
subreddits. From every submission, a graph was created by taking all the
commenters of a thread as nodes and connecting two nodes by an edge if one user
responds to the comment of another user. We can see that this is an extremely
simple model—it represents only a fraction of the information available in
every discussion. Nonetheless, there is some hope that qualitatively different
behaviours will emerge. More precisely, the assumption of Yanardag and
Vishwanathan is that there is some quantifiable difference between
question–answer subreddits and discussion-based subreddits. Their paper
aims to *learn* the correct classification for each thread. Hence, given
a graph, we want to teach the computer to tell us whether the graph is more
likely to arise from a question–answer subreddit or from
a discussion-based one.

The two researchers refer to this data set as *REDDIT-BINARY*. It is now
available in a repository of benchmark data sets for graph
kernels,
gracefully provided and lovingly curated by the CS department of Dortmund
University.

# Interlude: Looking at the Data

Prior to actually *working* with the data, let us first take a look at it in
order to get a feel for its inherent patterns. To this end, I used Aleph, my
library for topological data analysis
read and convert the individual graphs of every discussion to the simpler GML
format. See below
if you are also interested in the data. Next, I used Gephi
to obtain a force-directed visualization of some of the networks.

A look at a question–answer subreddit shows a central *core* structure of
the data set, from which numerous strands—each corresponding most probably
to smaller discussions—emerge:

The discussion-based subreddit graph, on the other hand, exhibits a larger
*depth*, manifesting themselves in a few long strands. Nonetheless, a similar
central core structure is observable as well.

Keep in mind that the selected examples are not necessarily representative—I merely picked two of the large graphs in the data set to obtain an idea of how the data looks.

# A Complex Classification

A straightforward way to classify those networks would be to use graph kernels,
such as the *graphlet
kernels*.
The basic idea behind these kernels is to measure a dissimilarity between two
graphs by means of, for example, the presence or absence of certain subgraphs.
At least this is the strategy pursued by the *graphlet kernel*. Other kernels
may instead opt for comparing *random walks*
on both graphs. A common theme of these kernels is that they are rather
expensive to compute. In many applications, they are the only hope of obtaining
suitable dissimilarity information without having to solve the graph isomorphism
problem, which is even more computationally expensive. Hence, graph kernels are
often the only suitable way of assessing the dissimilarity between graphs. They
are quite flexible and, as the authors show, can even be integrated into a deep
learning framework.

As for their performance, Yanardag and Vishwanathan report an accuracy of
77.34% (standard deviation 0.18) for graphlet kernels and
78.04% (standard deviation 0.39) for *deep* graphlet kernels, which is very
good considering the baseline for random guessing is 50%.

# A Simple Classification

In line with the KISS principle, I wanted to figure out alternative ways of achieving similar accuracy values with a better performance, if possible. This suggests a feature-based approach with features that can be calculated easily. There are numerous potential options for choosing these features. I figured that good choices include the average clustering coefficient of the graph, the average degree, the average shortest path length, the density, and the diameter of the graph. Among these, the average shortest path length and the diameter take longest to compute because they essentially have to enumerate all shortest paths in the graph. So I only used the remaining three features, all of which are computable in polynomial time.

I used the excellent NetworkX package for Python to do most of the heavy lifting. Reading a graph and calculating its features is as easy as it gets:

```
import networkx as nx
G = nx.read_gml(filename, label='id')
average_clustering_coefficient = nx.average_clustering(G)
average_degree = np.mean( [degree for _,degree in nx.degree(G) ] )
density = nx.density(G)
```

I collect these values in a `pandas.DataFrame`

to simplify their handling. Now
we have to choose some classifiers. I selected *decision trees*, *support vector machines*, and *logistic regression*. Next, let us train and test each of these classifiers by means of 10-fold cross-validation.

```
X = df.values
y = labels
classifiers = [ DecisionTreeClassifier(), LinearSVC(), LogisticRegression() ]
for clf in classifiers:
scores = cross_val_score(clf, X, y, cv=10)
print("Accuracy: %0.4f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

As you can see, I am using the *default values* for every classifier—no
grid search or other technique for finding better hyperparameters for now.
Instead, I want to see how these classifiers perform out of the box.

# Astonishing Results

The initial test resulted in the following accuracy values:
77.84% (standard deviation 0.16) for *decision trees*,
64.14% (standard deviation 0.45) for *support vector machines*, and
55.54% (standard deviation 0.49) for *logistic regression*. At least the
first result is highly astonishing—without any adjustments to the
classifier *whatsoever*, we obtain a performance that is en par with more
complex learning strategies! Recall that Yanardag and Vishwanathan adjusted the
hyperparameters during training, while this approach merely used the default
values.

Let us thus focus on the decision tree classifier first and figure out
*why* it performs the way it did. To this end, let us take a look at
what the selected features look like and how important they are. To do
this, I just added a single training instance, based on standard
test–train split of the decision tree classifier:

```
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.20)
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(clf.feature_importances_)
```

This results in `[ 0.18874114 0.34113588 0.47012298 ]`

, meaning that
all three features are somewhat important, with *density* accounting for
almost 50% of the purity of a node—if you are unfamiliar with
decision trees, the idea is to obtain nodes that are as
“pure” as possible, meaning that there should be as few
differences in class labels as possible. The *density* attribute appears
to be important for splitting up impure nodes correctly.

To get a better feeling of the feature space, let us briefly visualize it using principal component analysis:

```
clf = PCA(n_components=2)
X_ = clf.fit_transform(X)
for label in set(labels):
idx = y[0:,] == label
plt.scatter(X_[idx, 0], X_[idx, 1], label=label, alpha=0.25)
plt.legend()
plt.show()
```

This results in the following image:

Every dot in the image represents an individual graph, whereas distances
in the plot roughly correspond to distances between features. The large
amount of overlaps between graphs with different labels thus seems to
suggest that the features are *insufficient* to separate the individual
graphs. How does the decision tree manage to separate them, nonetheless?
As it turns out, by creating a *very* deep and wide tree. To see this,
let us export the decision tree classifier:

```
with open("/tmp/clf.txt", "w") as f:
export_graphviz(clf, f)
```

Here is a small depiction of the resulting tree:

There is also a larger variant of this
tree for you to play around
with. As you can see, the tree looks relatively complicated, so it is
very much tailored to the given problem. That is not to say that the
tree is necessarily suffering from *overfitting*—we explicitly
used cross-validation to prevent this. What it *does* show is that the
simple feature space with three features, while yielding very good
classification results, is not detecting the true underlying structure
of the problem. The rules generated by the decision tree are artificial
in the sense that they do not help us understand what makes the two
groups of graphs different from each other.

# Coda

So, where does this leave us? We saw that we were able to perform as
well as state-of-the-art methods by merely picking a simple feature
space, consisting of three features, and a simple decision tree
classifier. I would say that *any* classifier with a higher complexity
should yield significant improvements over this simple baseline, both in
terms of average accuracy and in terms of standard deviation. Hence, the
KISS principle in machine learning: start simple and only
add more complex building blocks, i.e. models, if you have established
a useful baseline.

You can find both the code and the data in my repository on topological machine learning. To run the analysis described in this blog post, just execute

```
$ ./simple_network_analysis.py GML/????.gml Labels.txt
```

in the `REDDIT-BINARY`

subfolder of the repository.

That is all for now, until next time!