I recently updated my vim configuration to include the fine YouCompleteMe plugin for C++ development. Since the software project I am mostly working on during my PhD project uses CMake, configuring YCM turned out to require some additional steps.

First, I added

SET( CMAKE_EXPORT_COMPILE_COMMANDS ON )

to the main CMakeLists.txt file of our project. This ensures that CMake creates the file compile_commands.json in the build directory of the project. Next, we need CMake to provision the file properly:

IF( EXISTS "${CMAKE_CURRENT_BINARY_DIR}/compile_commands.json" )
  EXECUTE_PROCESS( COMMAND ${CMAKE_COMMAND} -E copy_if_different
    ${CMAKE_CURRENT_BINARY_DIR}/compile_commands.json
    ${CMAKE_CURRENT_SOURCE_DIR}/compile_commands.json
  )
ENDIF()

This places the file in the source directory, permitting us to place a single configuration file for YCM there. I have to add that I am not very happen with this solution yet. For one, this removes the ability to have a debug build in some location and a release build somewhere else because the build that is configured last gets to copy its compile commands—which of course tend to differ with different configuration options. Another reason why this solution is not entirely smart is that the copy is usually performed prior to CMake updating the compile_commands.json file. If you change many compile options for different files, you will thus have to manually run the cmake binary twice.

Even with all these quirks, it’s still worth it—YCM is just so amazing to have. Being a previous user of the QtCreator IDE, I am blown away by the speed with which the plugin manages to perform its duties.

As a last step, we need to place a .ycm_extra_conf.py configuration file in the project folder. By default, YCM will use a pre-defined list of compilation flags to apply to all files. This is of course insufficient for most larger projects, so we need to teach it to use the compile_commands.json instead.

This is easily achieved with the following configuration file:

import os
import ycm_core

from clang_helpers import PrepareClangFlags

def DirectoryOfThisScript():
    return os.path.dirname(os.path.abspath(__file__))

# This is the single most important line in this script. Everything else is just nice to have but
# not strictly necessary.
compilation_database_folder = DirectoryOfThisScript()

# This provides a safe fall-back if no compilation commands are available. You could also add a
# includes relative to your project directory, for example.
flags = [
    '-Wall',
    '-std=c++11',
    '-stdlib=libc++',
    '-x',
    'c++',
    '-I',
    '.',
    '-isystem', '/usr/local/include',
    '-isystem', '/usr/include',
    '-I.',
]

if compilation_database_folder:
    database = ycm_core.CompilationDatabase(compilation_database_folder)
else:
    database = None

SOURCE_EXTENSIONS = [ '.cpp', '.cxx', '.cc', '.c', '.m', '.mm' ]

def MakeRelativePathsInFlagsAbsolute( flags, working_directory ):
  if not working_directory:
    return list( flags )
  new_flags = []
  make_next_absolute = False
  path_flags = [ '-isystem', '-I', '-iquote', '--sysroot=' ]
  for flag in flags:
    new_flag = flag

    if make_next_absolute:
      make_next_absolute = False
      if not flag.startswith( '/' ):
        new_flag = os.path.join( working_directory, flag )

    for path_flag in path_flags:
      if flag == path_flag:
        make_next_absolute = True
        break

      if flag.startswith( path_flag ):
        path = flag[ len( path_flag ): ]
        new_flag = path_flag + os.path.join( working_directory, path )
        break

    if new_flag:
      new_flags.append( new_flag )
  return new_flags


def IsHeaderFile( filename ):
  extension = os.path.splitext( filename )[ 1 ]
  return extension in [ '.h', '.hxx', '.hpp', '.hh' ]


def GetCompilationInfoForFile( filename ):
  # The compilation_commands.json file generated by CMake does not have entries
  # for header files. So we do our best by asking the db for flags for a
  # corresponding source file, if any. If one exists, the flags for that file
  # should be good enough.
  if IsHeaderFile( filename ):
    basename = os.path.splitext( filename )[ 0 ]
    for extension in SOURCE_EXTENSIONS:
      replacement_file = basename + extension
      if os.path.exists( replacement_file ):
        compilation_info = database.GetCompilationInfoForFile(
          replacement_file )
        if compilation_info.compiler_flags_:
          return compilation_info
    return None
  return database.GetCompilationInfoForFile( filename )


def FlagsForFile( filename, **kwargs ):
  if database:
    # Bear in mind that compilation_info.compiler_flags_ does NOT return a
    # python list, but a "list-like" StringVec object
    compilation_info = GetCompilationInfoForFile( filename )
    if not compilation_info:
      return None

    final_flags = MakeRelativePathsInFlagsAbsolute(
      compilation_info.compiler_flags_,
      compilation_info.compiler_working_dir_ )

  else:
    relative_to = DirectoryOfThisScript()
    final_flags = MakeRelativePathsInFlagsAbsolute( flags, relative_to )

  return {
    'flags': final_flags,
    'do_cache': True
  }

The single most important line of this script is where you specify the compilation commands data base, compilation_database_folder = DirectoryOfThisScript(). Apart from the that, the script is a modified version of the default YCM configuration file, which the author strongly suggests to modify.

A last word of advice: Depending on your YCM configuration, you may want to add let g:ycm_extra_conf_globlist = [ '/path/to/your/project/*' ] to your .vimrc. This ensures that YCM will load the configuration file automatically and not let you confirm anything due to security reasons.

Happy code completing!

Posted late Tuesday evening, December 15th, 2015 Tags:

Recently, I wanted to parse some C++ code to collect some information about my coding style. As always with C++, this turned out to require more effort than I had originally anticipated. Apparently, C++ is so complex that you require more or less a compiler to parse it. The great folks at the LLVM project had already done the work—a stroke of luck for me.

So I set out to use clang for parsing C++ code. The official documentation lists multiple options here. While LibTooling, a C++ interface to clang, would arguably have been the more appropriate choice, I settled for libclang, a C interface. Its main advantage is its stability. The LLVM developers are very reluctant to introduce breakages here (although they have been known to happen). The disadvantage of libclang is that we do not get full control over the abstract syntax tree (AST) of the program. I can live with that, I guess.

In this post, I only want to briefly introduce some basic concepts of libclang. I hope to show off more features in subsequent posts.

Dumping an AST

Suppose you have the following simple C++ snippet:

template <class T> bool f( T x )
{
  return x % 2;
}

Just like any compiler, clang creates an abstract syntax tree that represents the code we want to compile. The syntax of this AST resembles the written code very closely, which makes this a very interesting view on your code.

To show the AST of the previous snippet, we can clang with:

clang -Xclang -ast-dump -fsyntax-only foo.cc

This results in the following output:

TranslationUnitDecl 0x2fef1b0 <<invalid sloc>> <invalid sloc>
|-TypedefDecl 0x2fef6f0 <<invalid sloc>> <invalid sloc> implicit __int128_t '__int128'
|-TypedefDecl 0x2fef750 <<invalid sloc>> <invalid sloc> implicit __uint128_t 'unsigned __int128'
|-TypedefDecl 0x2fefb10 <<invalid sloc>> <invalid sloc> implicit __builtin_va_list '__va_list_tag [1]'
`-FunctionTemplateDecl 0x2fefdc0 <foo.cc:1:1, line:4:1> line:1:25 f
  |-TemplateTypeParmDecl 0x2fefb60 <col:11, col:17> col:17 referenced class T
  `-FunctionDecl 0x2fefd20 <col:20, line:4:1> line:1:25 f '_Bool (T)'
    |-ParmVarDecl 0x2fefc20 <col:28, col:30> col:30 referenced x 'T'
    `-CompoundStmt 0x2fefea0 <line:2:1, line:4:1>
      `-ReturnStmt 0x2fefe80 <line:3:3, col:14>
        `-BinaryOperator 0x2fefe58 <col:10, col:14> '<dependent type>' '%'
          |-DeclRefExpr 0x2fefe10 <col:10> 'T' lvalue ParmVar 0x2fefc20 'x' 'T'
          `-IntegerLiteral 0x2fefe38 <col:14> 'int' 2

The important thing to note is that we have a FunctionTemplateDecl node that contains, among others, the body of the function (starting from the CompoundStmt node).

To familiarize ourselves with libclang, let’s try to reproduce parts of this AST ourselves. I propose writing a program that walks the AST and prints out node names.

First, we need to get clang to emit the AST in a binary format so that we traverse it later:

clang++ -emit-ast foo.cc

This should result in a file foo.ast.

Reading an AST with libclang

Understanding libclang requires reading copious amounts of documentation. I cannot claim in good conscience that I have fully understood and memorized every detail, so the code that follows may actually be very stupid.

One of the core functions of libclang is the creation of an index. This data structure contains multiple translation units, e.g. source files, that make up an executable. Given an index, a new translation unit can be created. There are various ways of doing this. It it possible to parse source code directly. For now, I will only present a way of parsing a dumped AST. Last, we need a cursor, which is the libclang term for a thing that keeps track of where we are in the AST.

The following snippet will create the required things for us and permit further parsing (for the hasty: scroll to the bottom for the complete example code):

#include <clang-c/Index.h>

int main( int argc, char** argv )
{
  if( argc < 2 )
    return -1;

  CXIndex index        = clang_createIndex( 0, 1 );
  CXTranslationUnit tu = clang_createTranslationUnit( index, argv[1] );

  if( !tu )
    return -1;
  
  CXCursor rootCursor  = clang_getTranslationUnitCursor( tu );

  clang_disposeTranslationUnit( tu );
  clang_disposeIndex( index );
  return 0;
}

This program will attempt to load a translation unit from an .ast file, get a cursor to the root node, and exit. The only noteworthy thing are the two _dispose calls at the end of the function. This is due to libclang being a C library—we actually need to think about memory management here. Darn.

Visiting nodes

To make the example application marginally more useful, we need to make use of the visitor pattern. libclang requires us to specify a sort of callback function that is called whenever we reach a node in the AST.

The function needs to have the following signature:

CXChildVisitResult visitor( CXCursor cursor, CXCursor parent, CXClientData clientData )

The result of the function decides how the traversal is being continued. Possibilities include CXChildVisit_Break to stop traversal, CXChildVisit_Continue to continue traversing the siblings of the current cursor without visiting its children, or CXChildVisit_Recurse to visit the children of the current cursor first. cursor refers to the current position in the tree. parent is the parent node in the tree. Last, CXClientData is a typedef for 'void* (a, don’t you love C as much as I do?) and permits passing additional information for each traversal. We will soon figure out a way of using this to our advantage.

The main entry point for visiting nodes is the clang_visitChildren function. Assuming that we have set up a function visitor with the proper prototype, we may add the following lines after rootCursor has been assigned:

unsigned int treeLevel = 0;

clang_visitChildren( rootCursor,
                     visitor,
                     &treeLevel );

This will call our visitor function with client data that initially refers to the upper-most level in the tree. A potential implementation for visitor could look like this:

CXChildVisitResult visitor( CXCursor cursor, CXCursor /* parent */, CXClientData clientData )
{
  CXSourceLocation location = clang_getCursorLocation( cursor );
  if( clang_Location_isFromMainFile( location ) == 0 )
    return CXChildVisit_Continue;

  CXCursorKind cursorKind = clang_getCursorKind( cursor );

  unsigned int curLevel  = *( reinterpret_cast<unsigned int*>( clientData ) );
  unsigned int nextLevel = curLevel + 1;

  std::cout << std::string( curLevel, '-' ) << " " << getCursorKindName(
  cursorKind ) << " (" << getCursorSpelling( cursor ) << ")\n";

  clang_visitChildren( cursor,
                       visitor,
                       &nextLevel ); 

  return CXChildVisit_Continue;
}

The visitor uses some interesting functions. First, we use clang_getCursorLocation() to get the location of the cursor with respect to the source file. This permits mapping a cursor, i.e. a region in a tree, to a region in the source file. We use it for a far simpler purpose, though. By calling clang_Location_isFromMainFile(), we can check whether a node is part of the source file we are analysing or not. This is extremely helpful for skipping over nodes that are not interesting for us because they appear in the AST after an #include, for example. As you can see, we skip over the subtree rooted at the current cursor when such a node has been reached.

The next interesting thing is the clang_getCursorKind. This yields the type of the cursor, giving us information about whether it refers to a statement, a function declaration, or what have you. To see all types (or rather kinds) that are available, please take a look at the Index.h file.

Having obtained information about the type of the cursor, we get information about our current level, using a reinterpret_cast. We then call two auxiliary functions that give us the name and spelling of the cursor (see below for whole code). By spelling, libclang means the name of the entity the cursor refers to. When positioned at the root node of a function declaration, the cursor spelling will be the name of the function, for example. Not all entities have a spelling—for some, we need to tokenize the node first. For reasons of simplicity, I am going to cover this in a later post, though.

The last interesting thing is the continued traversal within the visitor function. We call clang_visitChildren on the current cursor, with the updated level information. Thus, all children of the current cursor will eventually be visited. Last, we return CXChildVisit_Continue. This ensures that we do not visit the children of a particular node twice.

The output of this simple program when being applied to foo.ast is:

 FunctionTemplate (f)
- TemplateTypeParameter (T)
- ParmDecl (x)
-- TypeRef (T)
- CompoundStmt ()
-- ReturnStmt ()
--- BinaryOperator ()
---- DeclRefExpr (x)
---- IntegerLiteral ()

Which is nice for just about 50 lines of code in total. As promised, not all entities have names here, but the ones that do seem to make sense.

The complete code

Here’s the complete code of this example:

#include <clang-c/Index.h>

#include <iostream>
#include <string>

std::string getCursorKindName( CXCursorKind cursorKind )
{
  CXString kindName  = clang_getCursorKindSpelling( cursorKind );
  std::string result = clang_getCString( kindName );

  clang_disposeString( kindName );
  return result;
}

std::string getCursorSpelling( CXCursor cursor )
{
  CXString cursorSpelling = clang_getCursorSpelling( cursor );
  std::string result      = clang_getCString( cursorSpelling );

  clang_disposeString( cursorSpelling );
  return result;
}

CXChildVisitResult visitor( CXCursor cursor, CXCursor /* parent */, CXClientData clientData )
{
  CXSourceLocation location = clang_getCursorLocation( cursor );
  if( clang_Location_isFromMainFile( location ) == 0 )
    return CXChildVisit_Continue;

  CXCursorKind cursorKind = clang_getCursorKind( cursor );

  unsigned int curLevel  = *( reinterpret_cast<unsigned int*>( clientData ) );
  unsigned int nextLevel = curLevel + 1;

  std::cout << std::string( curLevel, '-' ) << " " << getCursorKindName(
  cursorKind ) << " (" << getCursorSpelling( cursor ) << ")\n";

  clang_visitChildren( cursor,
                       visitor,
                       &nextLevel ); 

  return CXChildVisit_Continue;
}

int main( int argc, char** argv )
{
  if( argc < 2 )
    return -1;

  CXIndex index        = clang_createIndex( 0, 1 );
  CXTranslationUnit tu = clang_createTranslationUnit( index, argv[1] );

  if( !tu )
    return -1;

  CXCursor rootCursor  = clang_getTranslationUnitCursor( tu );

  unsigned int treeLevel = 0;

  clang_visitChildren( rootCursor, visitor, &treeLevel );

  clang_disposeTranslationUnit( tu );
  clang_disposeIndex( index );

  return 0;
}

I am releasing the code into the public domain. Don’t forget to link against libclang when compiling it (subsequent posts will provide a find module for CMake). Should you consider this code useful, it would give me enormous pleasure if you were to drop me an e-mail.

To show that this would be worth your while, I am pointing out an interesting part in the code above: In the getCursorKindName() function, we use clang_getCursorKindSpelling() to obtain the name of the type the current cursor refers to. This is a CXString, which needs to be disposed later on with clang_disposeString() so as not to cause any memory leaks. To use the string as a regular C++ string, we need to call the helper function clang_getCString, which returns a null-terminated C string (i.e. a const char*, which we may finally assign to an std::string).

Wishing you a happy tree traversal! Maybe clang hides some gifts in the AST…

Posted Wednesday evening, December 23rd, 2015 Tags:

Since getting a Kindle last Christmas, I have started reading even more. This is my personal list of noteworthy books I read in 2015. In no particular order, we have:

  1. Seveneves, by Neal Stephenson: A truly epic tale of the near-extinction and re-birth of humanity. In some ways, this book even rivals the love I feel towards Stephenson’s Cryptonomicon. Read this if you are even slightly interested in science or space.

  2. The Psychology of Invention in the Mathematical Field, by Jacques Hadamard: I was overjoyed to find this in the bargain bin of a bookshop. Hadamard studies nothing less than the way mathematicians invent (or discover) mathematics. The result is an engaging work with (unfortunately) vague answers that are not generally applicable. Still, it rekindled my interest in these topics and I look forward to reading more formalized works from this area.

  3. Effective Modern C++, by Scott Meyers: Meyers continues to amaze me with his in-depth knowledge of this most peculiar programming language (…as majestic as troops with banners…). If you want to do serious work with C++11 or C++14, read this book. No excuses.

  4. Embassytown, by China MiĆ©ville: A captivating story about humanity’s contact with a truly alien race and their unique language. If you ever wondered about what a species would live like if the (discredited) Sapir-Whorf hypothesis was fully true, you will enjoy this book as much as I did.

  5. Parable of the Sower, by Octavia E. Butler: Set in a failing world that is not much unlike our own, Butler describes the tale of a young woman and her struggle for survival. In contrast to many other books describing an apocalyptic event, this one really hit a spot because the reader can feel the world crumble while reading—and the fall is not preceded by loud bangs but rather by quiet whispers.

  6. The Opposite of Loneliness: Essays and Stories, by Marina Keegan: Published posthumously after the untimely death of this young writer, this is a collection of her work. At times, it was painful to read with the knowledge that the spark of a great writer that is easily glimpsed in the book would never mature. The stories speak of a hunger for life, of possibilities, of uncertainty, and of fear. If you read only one essay, read Even Artichokes have Doubts.

  7. Goddess of the Market, by Jennifer Burns: After reading this biography of Ayn Rand, I may finally re-read her books with more sympathy and more appreciation. Read this if you think that Atlas Shrugged is “not a book to be tossed away lightly—it should be hurled with great forced”. At the very least, you may understand the thought patterns of political conservatives and libertarians better.

  8. After the Fall, Before the Fall, During the Fall, by Nancy Kress: In my ongoing quest to read more books written by women, I had already encountered the writings of Nancy Kress. Although this was also a book about the apocalypse, I enjoyed the way Kress intertwined multiple perspectives about The Fall. Although the resolution of the novel is slightly campy, this is still a gripping tale.

  9. Muse of Fire, by Dan Simmons: Shakespeare in space—what’s not to love? After Simmons wrote himself into my heart with his extremely intelligent Hyperion Cantos and the epic Ilium/Olympos books, Muse of Fire was another formidable book in the genre of “literary science fiction”.

  10. Visualization Analysis and Design, by Tamara Munzner: A must-read in my profession. Munzner gives us a flexible framework that helps us think in a more structured manner about doing visualization.

According to goodreads, I read 63 books this year, with a total of 19131 pages. The shortest book I read had 60 pages, while the longest had 880. The average number of pages was 314. I initially wanted to read 25 books this years but got somewhat carried away.

Good reading for 2016!

Posted Thursday afternoon, December 31st, 2015 Tags: