Around the holidays, I got interested in analysing the amount of speech occurring in different films. I now have some preliminary results for different feature films of 2014. See the project page for some (basic) visualizations using D3.js.

This should have become a stand-alone blog post, but Ikiwiki does not take kindly to including JavaScript files in inlined pages.

Posted at lunch time on Sunday, February 1st, 2015 Tags:

The debate about whether to use postincrement (i++) or preincrement (++i) when writing a loop in C++ has been going on for a very long time. It is probably one of the most popular programming interview questions ever.

Let's take a closer look at what modern compilers have to say about it. First, this is the test code for postincrement, which I shall refer to as postincrement.cc:

#include <iostream>

int main(int, char**)
{
  for( unsigned i = 0; i < 42; i++ )
    std::cout << i << std::endl;

  return 0;
}

Likewise, preincrement.cc refers to the following code:

#include <iostream>

int main(int, char**)
{
  for( unsigned i = 0; i < 42; ++i )
    std::cout << i << std::endl;

  return 0;
}

In the following, let us take a look at the assembly code generated by both gcc 4.9.2 and clang 3.5.1. Fortunately for me, I do not actually have to understand the assembly code. It is perfectly sufficient to look at the differences between the two outputs. If there are none, the two programs behave exactly the same.

Let's try it with gcc first (using -O0 to disable optimizations in order not to bias the results):

$ g++ -O0 -S postincrement.cc
$ g++ -O0 -S preincrement.cc

The result of diff postincrement.s preincrement.s is:

1c1
<   .file   "postincrement.cc"
---
>   .file   "preincrement.cc"

Thus, gcc generates the same code in both cases. This works without providing -O0 as well, of course. So, how does clang fare?

$ clang -O0 -S postincrement.cc
$ clang -O0 -S preincrement.cc

The result of diff postincrement.s preincrement.s is:

2c2
<   .file   "postincrement.cc"
---
>   .file   "preincrement.cc"
79,80c79,80
<   .type   _GLOBAL__sub_I_postincrement.cc,@function
< _GLOBAL__sub_I_postincrement.cc:        # @_GLOBAL__sub_I_postincrement.cc
---
>   .type   _GLOBAL__sub_I_preincrement.cc,@function
> _GLOBAL__sub_I_preincrement.cc:         # @_GLOBAL__sub_I_preincrement.cc
95c95
<   .size   _GLOBAL__sub_I_postincrement.cc, .Ltmp11-_GLOBAL__sub_I_postincrement.cc
---
>   .size   _GLOBAL__sub_I_preincrement.cc, .Ltmp11-_GLOBAL__sub_I_preincrement.cc
103c103
<   .quad   _GLOBAL__sub_I_postincrement.cc
---
>   .quad   _GLOBAL__sub_I_preincrement.cc

Remember when I said that we did not have to understand assembly code for this experiment? Well, I lied. A little. Although my assembly skills are extremely rusty, you may believe me when I say that all the differences shown above merely pertain to renamed symbols. There is nothing in there about spurious variable copies.

For integral types, there is thus apparently no difference between the two variants.

Let's spice it up by using iterators! Here's postincrement-iterator.cc:

#include <iostream>
#include <numeric>
#include <vector>

int main(int, char**)
{
  std::vector<unsigned> v( 42 );
  std::iota( v.begin(), v.end(), 0 );

  for( auto it = v.begin(); it != v.end(); it++ )
    std::cout << *it << std::endl;

  return 0;
}

And here's preincrement-iterator.cc:

#include <iostream>
#include <numeric>
#include <vector>

int main(int, char**)
{
  std::vector<unsigned> v( 42 );
  std::iota( v.begin(), v.end(), 0 );

  for( auto it = v.begin(); it != v.end(); ++it )
    std::cout << *it << std::endl;

  return 0;
}

Here, we experience a completely different behaviour. Both gcc and clang create an extra copy during the iteration; however, this extra copy does not seem to be used anywhere. clang, for example, creates the following assembly code for the postincrement operator:

.LBB1_7:                                #   in Loop: Header=BB1_3 Depth=1
        leaq    -88(%rbp), %rdi
        movl    $0, %esi
        callq   _ZN9__gnu_cxx17__normal_iteratorIPjSt6vectorIjSaIjEEEppEi
        movq    %rax, -104(%rbp)
        jmp .LBB1_3

The mangled symbol _ZN9__gnu_cxx17__normal_iteratorIPjSt6vectorIjSaIjEEEppEi refers to the postincrement operator. There is an additional instruction, movl $0, %esi, which is not required for the preincrement operator. So, the postincrement version requires more CPU cycles than the preincrement version.

Let's briefly take a look at how bad the problem is:

#include <chrono>
#include <iostream>
#include <numeric>
#include <vector>

int main(int, char**)
{
  std::vector<unsigned> v1( 1000000 );
  std::vector<unsigned> v2( 1000000 );

  std::iota( v1.begin(), v1.end(), 1 );
  std::iota( v2.begin(), v2.end(), 2 );

  std::chrono::time_point<std::chrono::high_resolution_clock> t1, t2, t3;

  t1 = std::chrono::high_resolution_clock::now();

  for( auto it = v1.begin(); it != v1.end(); it++ )
    *it *= 2;

  t2 = std::chrono::high_resolution_clock::now();

  for( auto it = v2.begin(); it != v2.end(); ++it )
    *it *= 2;

  t3 = std::chrono::high_resolution_clock::now();

  std::chrono::duration<double> d1 = t2 - t1;
  std::chrono::duration<double> d2 = t3 - t2;

  std::cout << "Postincrement: " << std::chrono::duration_cast<std::chrono::microseconds>( d1 ).count() << "us" << std::endl
            << "Preincrement:  " << std::chrono::duration_cast<std::chrono::microseconds>( d2 ).count() << "us" << std::endl;

  return 0;
}

This yields:

$ g++ -std=c++11 increment-timing.cc; ./a.out
Postincrement: 31531us
Preincrement:  20867us

For "real" measurements, this should of course be run numerous times. These are the results:

Histogram of postincrement and preincrement timings

Thus, preincrement is definitely faster than postincrement for iterators (or, to be even more precise, certain kinds of iterators).

Posted at teatime on Sunday, February 15th, 2015 Tags:

It is very common—not only in object-oriented programming—to have functions that either return a potentially complicated object (constructed from the input parameters) or modify an object that is passed in. For example, suppose we have a class Foo that stores integers and want to construct a new Foo object for storing a given number. There are two ways of writing such a function:

// 1st way
Foo f(unsigned int n)
{
  Foo f(n+1);
  return f;
}

// Usage: Foo foo = f(42);

// 2nd way
void g(Foo& foo, unsigned int n)
{
  foo.set(n+1);
}

// Usage:
//   Foo foo;
//   g(foo, 42);

Now, to be honest, I always cringe when I see the second way implemented. It just looks unnatural—why should I modify a given object when I essentially want a new one? The way g() is implemented seems rather clumsy. Ostensibly, it is faster because it does not involve creating a new Foo instance, setting its values, and copying it into a new object. Is this really true, though? Let's find out:

#include <iostream>

class Foo
{
public:
  Foo()
    : data(0)
  {
    std::cout << "\t" << __PRETTY_FUNCTION__ << std::endl;
  }

  Foo(unsigned int n)
    : data(n)
  {
    std::cout << "\t" << __PRETTY_FUNCTION__ << ": " << n << std::endl;
  }

  Foo( const Foo& other )
    : data( other.data )
  {
    std::cout << "\t" << __PRETTY_FUNCTION__ << std::endl;
  }

  void set(unsigned int n)
  {
    std::cout << "\t" << __PRETTY_FUNCTION__ << ": " << n << std::endl;
    data = n;
  }

private:
  unsigned int data;
};

Foo f(unsigned int n)
{
  Foo f(n+1);
  return f;
}

void g(Foo& foo, unsigned int n)
{
  foo.set(n+1);
}

int main(int, char**)
{
  {
    std::cout << "f:" << std::endl;
    Foo foo = f(41);
  }

  {
    std::cout << "g: " << std::endl;
    Foo foo;
    g(foo,41);
  }

  return 0;
}

Using gcc 4.9.2, this produces the following output:

f:
        Foo::Foo(unsigned int): 42
g: 
        Foo::Foo()
        void Foo::set(unsigned int): 42

Darn. The copy constructor is not called. What manner of sorcery is this? This is known as return value optimization: the compiler may, in certain situations, elide copies of class objects, and this is one such situation. The object created within f() is a temporary and is not required afterwards. A naive compiler would create the object (one constructor call), return it, and invoke the copy constructor to initialize the second object (another constructor call). However, since the object within f() is not used afterwards, a smarter compiler can skip the whole shebang and simply create one object with the appropriate constructor call.

For the curious, we can disable this behaviour by setting the gcc flag -fno-elide-constructors:

$ g++ -fno-elide-constructors rvo.cc
$ ./a.out
f:
        Foo::Foo(unsigned int): 42
        Foo::Foo(const Foo&)
        Foo::Foo(const Foo&)
g: 
        Foo::Foo()
        void Foo::set(unsigned int): 42

Yikes! This is terrible, but at least it is the expected behaviour. Does this always work? As it turns out, no. It very much depends on the compiler you are using. A standard example in which return value optimization might not work is when multiple execution paths are present, as in this function:

Foo h(unsigned int n, bool more = false)
{
  if( more )
    return Foo(n+1);
  else
    return Foo(n);
}

However, my tests with gcc 4.9.2 indicate that at least the very simple h() above does not yet fool the compiler. Thus, I can only conclude that we should give the compiler more credit instead of assuming that a certain idiom "naturally" results in better performance (as the proponents of the second way are wont to do). In the words of Abelson and Sussman:

[...] Thus, programs must be written for people to read, and only incidentally for machines to execute.

In C++11, there is an additional optimization available in the form of move constructors (and move assignment operators, which I will not cover):

Foo( Foo&& other )
  : data(other.data)
{
  std::cout << "\t" << __PRETTY_FUNCTION__ << ": " << data << std::endl;
}

The move constructor is invoked when the source object is an rvalue, such as the temporary returned by f(). When disabling return value optimization, we can see that this constructor is called instead of the usual copy constructor:

$ g++ -fno-elide-constructors -std=c++11 rvo.cc
$ ./a.out
f:
Foo::Foo(unsigned int): 42
Foo::Foo(Foo&&): 42
Foo::Foo(Foo&&): 42

Thus, the moral of the story is:

  1. Do not worry about returning objects by value.
  2. If your implementation permits it, use move constructors to achieve a better performance when the compiler is unable to use return value optimization.
  3. Measure whether a purported optimization really has an effect.

Posted Sunday evening, February 22nd, 2015 Tags: