Many applications require the selection of a subset of objects from a larger set. For example, suppose you are down-sampling a large data set by choosing the indices that you want to keep at random. If you are satisfied with obtaining duplicate indices (in essence, if you are sampling with replacement), this is a trivial matter in most programming language. Python certainly makes this absurdly easy:

import random

def sample(k,n):
  return [random.choice( [x for x in range(n)] ) for _ in range(k)]

But what about sampling without replacement? More precisely, what if you require the selection of k indices from an array of n values without choosing duplicate values but still want the probability of choosing a certain element to be uniform? Well, if you use Python, you are in luck again:

import random

def sample_without_replacement(k,n):
  return random.sample([x for x in range(n)], k)

But what about other programming languages that may not have similar builtin functionality, such as C++? In this case, a very simple and elegant technique is given by none other than the great Donald E. Knuth. The following algorithm originally was given in The Art of Computer Programming, Volume 2, Section 3.4.2 on p. 137. Briefly put, it works like this:

  1. Set t = 0 and m = 0. m represents the number of items selected so far, while t is the total number of items that have already been traversed.

  2. Generate a random number U that is uniformly distributed between zero and one.

  3. If (N-t)*U >= n-m, skip the current item by increasing t by 1 and going back to step 2. Else, select the current item by increasing both t and m by 1. Afterwards, either go back to step 2 or stop the algorithm (if sufficiently many items have been sampled already).

A very basic implementation of this algorithm (in C++) is given in the gist below:

I do not know about you, but my gut reaction upon seeing this algorithm for the first time was How can this possibly work? So, let us briefly prove the algorithm to be correct. Basically, we have to show that every item has the same probability of being chosen. The key insight is to realize that the probability has to change depending on the number of elements that have already been selected. More precisely, we need to determine the probability of choosing item t+1 if m items have already been selected. This requires some combinatorics. There are Equation StartBinomialOrMatrix upper N hyphen t Choose n hyphen m EndBinomialOrMatrix ways of choosing n items from N such that m items are picked as the first t items. Or, equivalently, these are the number of potential permutations of the remaining n-m elements. Out of these, we are interested in all the ones that contain item t+1—but this is easy, because we can just take that item as granted and count the remaining combinations as Equation StartBinomialOrMatrix upper N hyphen t hyphen 1 Choose n hyphen m hyphen 1 EndBinomialOrMatrix , i.e. the number of ways to choose n-m-1 items out of N-t-1 ones. The quotient of these two numbers is exactly the probability with which we must choose item t+1 if we want to have a uniform probability for choosing a certain item. This quotient turns out to be Equation StartFraction n minus m Over upper N minus t EndFraction , which looks familiar. Finally, note that since U was chosen to be uniformly distributed between zero and one, the condition (N-t)*U >= n-m in the algorithm is satisfied with the required probability. Consequently, this method samples without replacement in the required manner!

If you are interested or require a more efficient algorithm, please refer to the paper An Efficient Algorithm for Sequential Random Sampling by Jeffrey Scott Vitter in ACM Transactions on Mathematical Software, Volume 13, Issue 1, pp. 58–67, for more details. The paper should be available free-of-charge from the link provided above, but there is also a mirror available, thanks to the good folks at INRIA, who require their publications to be kept in an open archive. Time permitting, I might provide a second blog post and implementation about the more efficient algorithm.

That is all for now, until next time!

Posted late Sunday evening, October 8th, 2017 Tags:

A warning upfront: this post is sort of an advertisement for Aleph, my C++ library for topological data analysis. In this blog post I do not want to cover any of the mathematical algorithms present in Aleph—I rather want to focus on small but integral part of the project, viz. unit tests.

If you are not familiar with the concept of unit testing, the idea is (roughly) to write a small, self-sufficient test for every new piece of functionality that you write. Proponents of the methodology of test-driven development (TDD) even go so far as to require you to write the unit tests before you write your actual code. In this mindset, you first think of the results your code should achieve and which outputs you expect prior to writing any “real” code. I am putting the word real in quotation marks here because it may seem strange to focus on the tests before doing the heavy lifting.

However, this way of approaching software development may actually be quite beneficial, in particular if you are working on algorithms with a nice mathematical flavour. Here, thinking about the results you want to achieve with your code ensures that at least a few known examples are processed correctly by your code, making it more probable that the code will perform well in real-world scenarios.

When I started writing Aleph in 2016, I also wanted to add some unit tests, but I did not think that the size of the library warranted the inclusion of one of the big players, such as Google Test or Boost.Test. While arguably extremely powerful and teeming with more features than I could possibly imagine, they are also quite heavy and require non-trivial adjustments to any project.

Thus, in the best tradition of the not-invented-here-syndrome, I decided to roll my own testing framework, base on pure CMake and small dash of C++. My design decisions were rather simple:

  • Use CTest, the testing framework of CMake to run the tests. This framework is rather simple and just uses the return type of a unit test program to decide whether the test worked correctly.
  • Provide a set of routines to check the correctness of certain calculations within a unit test, throwing an error if something unexpected happened.
  • Collect unit tests for the “larger” parts of the project in a single executable program.

Yes, you read that right—my approach actually opts for throwing an error in order to crash the unit test program. Bear with me, though, for I think that this is actually a rather sane way of approaching unit tests. After all, if the tests fails, I am usually not interested in whether other parts of a test program—that may potentially depend on previous calculations—run through or not. As a consequence, adding a unit test to Aleph is as simple as adding the following lines to a CMakeLists.txt file, located in the tests subdirectory of the project:

ADD_EXECUTABLE( test_io_gml test_io_gml.cc )
ADD_TEST(            io_gml test_io_gml    )

While in the main CMakeLists.txt, I added the following lines:

ENABLE_TESTING()
ADD_SUBDIRECTORY( tests )

So far, so good. A test now looks like this:

#include <tests/Base.hh>

void testBasic()
{
  // do some nice calculation; store the results in `foo` and `bar`,
  // respectively

  ALEPH_ASSERT_THROW( foo != bar );
  ALEPH_ASSERT_EQUAL( foo, 2.0 );
  ALEPH_ASSERT_EQUAL( bar, 1.0 );
}

void testAdvanced()
{
  // a more advanced test
}

int main(int, char**)
{
  testBasic();
  testAdvanced();
}

That is basically the whole recipe for a simple unit test. Upon execution, main() will ensure that all larger-scale test routines, i.e. testSimple() and testAdvanced() are called. Within each of these routines, the calls to the corresponding macros—more on that in a minute— ensure that conditions are met, or certain values are equal to other values. Else, an error will be thrown, the test will abort, and CMake will throw an error upon test execution.

So, how do the macros look like? Here is a copy of the current version of Aleph:

#define ALEPH_ASSERT_THROW( condition )                             \
{                                                                   \
  if( !( condition ) )                                              \
  {                                                                 \
    throw std::runtime_error(   std::string( __FILE__ )             \
                              + std::string( ":" )                  \
                              + std::to_string( __LINE__ )          \
                              + std::string( " in " )               \
                              + std::string( __PRETTY_FUNCTION__ )  \
    );                                                              \
  }                                                                 \
}

#define ALEPH_ASSERT_EQUAL( x, y )                                  \
{                                                                   \
  if( ( x ) != ( y ) )                                              \
  {                                                                 \
    throw std::runtime_error(   std::string( __FILE__ )             \
                              + std::string( ":" )                  \
                              + std::to_string( __LINE__ )          \
                              + std::string( " in " )               \
                              + std::string( __PRETTY_FUNCTION__ )  \
                              + std::string( ": " )                 \
                              + std::to_string( ( x ) )             \
                              + std::string( " != " )               \
                              + std::to_string( ( y ) )             \
    );                                                              \
  }                                                                 \
}

Pretty simple, I would say. The ALEPH_ASSERT_EQUAL macro actually tries to convert the corresponding values to strings, which may not always work. Of course, you could use more complicated string conversion routines, as Boost.Test does. For now, though, these macros are sufficient to make up the unit test framework of Aleph, which at the time of me writing this, encompasses more than 4000 lines of code.

The only remaining question is how this framework is used in practice. By setting ENABLE_TESTING(), CMake actually exposes a new target called test. Hence, in order to run those tests, a simple make test is sufficient in the build directory. This is what the result may look like:

$ make test
Running tests...
Test project /home/bastian/Projects/Aleph/build
      Start  1: barycentric_subdivision
 1/36 Test  #1: barycentric_subdivision ............   Passed    0.00 sec
      Start  2: beta_skeleton

      [...]

34/36 Test #34: union_find .........................   Passed    0.00 sec
      Start 35: witness_complex
35/36 Test #35: witness_complex ....................   Passed    1.82 sec
      Start 36: python_integration
36/36 Test #36: python_integration .................   Passed    0.07 sec

100% tests passed, 0 tests failed out of 36

Total Test time (real) =   5.74 sec

In addition to being rather lean, this framework can easily be integrated into an existing Travis CI workflow by adding

- make test

as an additional step to the script target in your .travis.yml file.

If you are interested in using this testing framework, please take a look at the following files:

That is all for now, until next time—may your unit tests always work the way you expect them to!

Posted Sunday night, October 15th, 2017 Tags: