In the previous instalment of this little series, I already explained how to walk an abstract syntax tree. Since this requires a specific call to clang beforehand, I want to extend the example to be able to parse code directly.

We will not encounter any new concepts for code parsing here but rather some additional functions of libclang. The main entry point for parsing code directly is the clang_parseTranslationUnit() function. It requires a working index (which we already encountered last time) and optionally accepts additional compiler arguments. These arguments turn out to be critical when trying to do sensible things with C++ code. Without the proper include directories, for example, clang cannot decide whether a series of tokens in the source code constitutes a type.

Where to get compile arguments

The easiest way to obtain compilation parameters is to use a compilation database. Typically, this is a file called compile_commands.json that resides in the build directory of a software project. For each source file, it contains the complete call to the compiler, including all flags and other parameters. We can easily obtain such a compilation database if we specify

SET( CMAKE_EXPORT_COMPILE_COMMANDS ON )

in the main CMakeLists.txt file of our project (see my article about the YouCompleteMe engine and cmake in this very blog). Once we have this file, libclang offers numerous functions for dealing with the database. The following snippet (again, see the bottom of this post for the complete code) will attempt to load a database from the current directory and count the number of compile commands stored for our source file:

#include <clang-c/CXCompilationDatabase.h>
#include <clang-c/Index.h>

// Somewhat later, in the main function:

CXCompilationDatabase_Error compilationDatabaseError;
CXCompilationDatabase compilationDatabase = clang_CompilationDatabase_fromDirectory( ".", &compilationDatabaseError );
CXCompileCommands compileCommands         = clang_CompilationDatabase_getCompileCommands( compilationDatabase, resolvedPath.c_str() );
unsigned int numCompileCommands           = clang_CompileCommands_getSize( compileCommands );

Let’s ignore the resolvedPath variable for the time being; it will be explained in the complete code. Our next task is to get all these parameters into the clang_parseTranslationUnit() function. Unfortunately, the interface for this function is somewhat clunky (at least I am not aware of a better solution). We have to convert each argument individually and pass all of them in the form of a two-dimensional char array:

CXCompileCommand compileCommand = clang_CompileCommands_getCommand( compileCommands, 0 );
unsigned int numArguments       = clang_CompileCommand_getNumArgs( compileCommand );
char** arguments                = new char*[ numArguments ];

for( unsigned int i = 0; i < numArguments; i++ )
{
  CXString argument       = clang_CompileCommand_getArg( compileCommand, i );
  std::string strArgument = clang_getCString( argument );
  arguments[i]            = new char[ strArgument.size() + 1 ];

  std::fill( arguments[i],
             arguments[i] + strArgument.size() + 1,
             0 );

  std::copy( strArgument.begin(), strArgument.end(),
             arguments[i] );

  clang_disposeString( argument );
}

translationUnit = clang_parseTranslationUnit( index, 0, arguments, numArguments, 0, 0, CXTranslationUnit_None );

for( unsigned int i = 0; i < numArguments; i++ )
  delete[] arguments[i];

delete[] arguments;

The salient point is the call to clang_parseTranslationUnit() in which all arguments obtained from the compilation database are used.
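If the manual new[] and delete[] bookkeeping bothers you, it can probably be avoided. The following is only a sketch and not part of the program in this post: it assumes that the arguments have been collected in a std::vector<std::string> first, and the helper name toArgumentPointers is made up for this illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of libclang): builds an array of C string
// pointers that refer to the strings stored in `arguments`. The pointers
// remain valid only as long as `arguments` is neither modified nor destroyed.
std::vector<const char*> toArgumentPointers( const std::vector<std::string>& arguments )
{
  std::vector<const char*> pointers;
  pointers.reserve( arguments.size() );

  for( const auto& argument : arguments )
    pointers.push_back( argument.c_str() );

  return pointers;
}
```

With this, pointers.data() and pointers.size() could be handed to clang_parseTranslationUnit() in place of arguments and numArguments, and the clean-up loop would no longer be necessary because both vectors release their memory on their own.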

Counting function extents

Having a valid translation unit at hand, we can proceed as in the previous article by getting a cursor into the translation unit and visiting the syntax tree.

CXCursor rootCursor = clang_getTranslationUnitCursor( translationUnit );
clang_visitChildren( rootCursor, functionVisitor, nullptr );

Here, functionVisitor is a simple visitor that only reacts to function declarations (including definitions), class methods, and function templates:

CXChildVisitResult functionVisitor( CXCursor cursor, CXCursor /* parent */, CXClientData /* clientData */ )
{
  if( clang_Location_isFromMainFile( clang_getCursorLocation( cursor ) ) == 0 )
    return CXChildVisit_Continue;

  CXCursorKind kind = clang_getCursorKind( cursor );
  auto name         = getCursorSpelling( cursor );

  if( kind == CXCursorKind::CXCursor_FunctionDecl || kind == CXCursorKind::CXCursor_CXXMethod || kind == CXCursorKind::CXCursor_FunctionTemplate )
  {
    CXSourceRange extent           = clang_getCursorExtent( cursor );
    CXSourceLocation startLocation = clang_getRangeStart( extent );
    CXSourceLocation endLocation   = clang_getRangeEnd( extent );

    unsigned int startLine = 0, startColumn = 0;
    unsigned int endLine   = 0, endColumn   = 0;

    clang_getSpellingLocation( startLocation, nullptr, &startLine, &startColumn, nullptr );
    clang_getSpellingLocation( endLocation,   nullptr, &endLine, &endColumn, nullptr );

    std::cout << "  " << name << ": " << endLine - startLine << "\n";
  }

  return CXChildVisit_Recurse;
}

This time, we always recursively visit all children of the current node because we might encounter functions nested in namespaces and suchlike. Apart from this, the visitor offers few surprises. We again use getCursorSpelling to obtain the name of the function.

If we encounter a function (which we can decide by checking the kind of the cursor using the clang_getCursorKind() function), we get its extent within the source file. To this end, we call clang_getCursorExtent(), which results in a CXSourceRange. This is a type that specifies, well, a range in the source code. The start and end location, respectively, are obtained using clang_getRangeStart() and clang_getRangeEnd(). Finally, we use clang_getSpellingLocation() to map the internal locations to external ones, in the form of a line and a column. We then print the name of the function and the number of source code lines it spans. This includes comments and everything, so it is not a good measure of code complexity; as an introductory example of the power of libclang it should suffice, though.
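One detail of the arithmetic is worth spelling out: since the first and the last line of the extent both belong to the function, endLine - startLine is one less than the number of lines the function occupies. A small sketch of the inclusive variant follows; the helper occupiedLines is hypothetical and does not appear in the program.

```cpp
// Hypothetical helper: number of source lines covered by an extent whose
// first and last lines are both part of the entity (inclusive count).
unsigned int occupiedLines( unsigned int startLine, unsigned int endLine )
{
  // Guard against malformed input so that the unsigned subtraction
  // cannot wrap around.
  if( endLine < startLine )
    return 0;

  return endLine - startLine + 1;
}
```

For a function whose extent runs from line 10 to line 17, the subtraction used in the visitor yields 7, whereas the inclusive count yields 8. Which of the two is preferable is a matter of taste, as long as it is applied consistently.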

By the by: This example also demonstrates the care the libclang developers have taken when specifying their API. Being capable of mapping entities encountered during the parse process back to actual lines of code offers a great amount of flexibility for tool developers. This is really nice!

What about default arguments?

As a fall-back, if no compile commands are available, we can also specify our own includes. This is surprisingly painless, thanks to std::extent:

constexpr const char* defaultArguments[] = {
  "-std=c++11",
  "-I/usr/include",
  "-I/usr/local/include"
};

translationUnit = clang_parseTranslationUnit( index,
                                              resolvedPath.c_str(),
                                              defaultArguments,
                                              std::extent<decltype(defaultArguments)>::value,
                                              0,
                                              0,
                                              CXTranslationUnit_None );
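For readers who have not used std::extent before, here is a self-contained sketch of what it computes. The names exampleArguments, numExampleArguments, and countExampleArguments are made up for this illustration; only std::extent itself, which lives in <type_traits>, is the real thing.

```cpp
#include <cstddef>
#include <type_traits>

// Illustrative array; its contents mirror the default arguments above.
constexpr const char* exampleArguments[] = {
  "-std=c++11",
  "-I/usr/include",
  "-I/usr/local/include"
};

// std::extent yields the number of elements along the first dimension of
// an array type. The value is known at compile time.
constexpr std::size_t numExampleArguments
  = std::extent<decltype(exampleArguments)>::value;

// Fails to compile if the array and the expected count ever disagree.
static_assert( numExampleArguments == 3, "Expected exactly three arguments" );

// Hypothetical accessor so that the value can also be queried at run time.
std::size_t countExampleArguments()
{
  return numExampleArguments;
}
```

The nice property is that the count follows the array automatically: adding or removing a flag does not require touching the call to the parsing function.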

What about the mysterious resolved path?

At this point, the resolvedPath variable has occurred multiple times and surely the suspense kept you on the edge of your seat. Let me resolve the mystery for you:

#ifdef __unix__
  #include <limits.h>
  #include <stdlib.h>
#endif

std::string resolvePath( const char* path )
{
  std::string resolvedPath;

#ifdef __unix__
  char* resolvedPathRaw = new char[ PATH_MAX ];
  char* result          = realpath( path, resolvedPathRaw );

  if( result )
    resolvedPath = resolvedPathRaw;

  delete[] resolvedPathRaw;
#else
  resolvedPath = path;
#endif

  return resolvedPath;
}

We only need this function to permit the user to specify relative paths on the command-line. For the compilation database and the translation unit parsing, however, we require absolute paths. The function above is nothing but a fancy wrapper for the realpath() function that returns the canonicalized absolute path name.
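As an aside, the fixed-size PATH_MAX buffer is not strictly required. The sketch below is an alternative I did not use in the program; it assumes a system that implements POSIX.1-2008, where realpath() allocates a sufficiently large buffer itself if the second argument is a null pointer. The function name resolvePathDynamic is hypothetical.

```cpp
#include <stdlib.h>

#include <string>

// Hypothetical variant of resolvePath() for POSIX.1-2008 systems only.
// Returns an empty string if the path cannot be resolved.
std::string resolvePathDynamic( const char* path )
{
  // With a null second argument, realpath() allocates the result buffer
  // using malloc(); the caller is responsible for releasing it.
  char* resolved = realpath( path, nullptr );

  if( !resolved )
    return std::string();

  std::string result( resolved );
  free( resolved );

  return result;
}
```

Calling resolvePathDynamic( "." ) should give the absolute path of the current working directory, while a path that does not exist results in an empty string, just like in the original function.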

The complete code

This is what you have been waiting for:

#include <clang-c/CXCompilationDatabase.h>
#include <clang-c/Index.h>

#ifdef __unix__
  #include <limits.h>
  #include <stdlib.h>
#endif

#include <iostream>
#include <string>
#include <type_traits>

std::string getCursorSpelling( CXCursor cursor )
{
  CXString cursorSpelling = clang_getCursorSpelling( cursor );
  std::string result      = clang_getCString( cursorSpelling );

  clang_disposeString( cursorSpelling );
  return result;
}

/* Auxiliary function for resolving a (relative) path into an absolute path */
std::string resolvePath( const char* path )
{
  std::string resolvedPath;

#ifdef __unix__
  char* resolvedPathRaw = new char[ PATH_MAX ];
  char* result          = realpath( path, resolvedPathRaw );

  if( result )
    resolvedPath = resolvedPathRaw;

  delete[] resolvedPathRaw;
#else
  resolvedPath = path;
#endif

  return resolvedPath;
}

CXChildVisitResult functionVisitor( CXCursor cursor, CXCursor /* parent */, CXClientData /* clientData */ )
{
  if( clang_Location_isFromMainFile( clang_getCursorLocation( cursor ) ) == 0 )
    return CXChildVisit_Continue;

  CXCursorKind kind = clang_getCursorKind( cursor );
  auto name         = getCursorSpelling( cursor );

  if( kind == CXCursorKind::CXCursor_FunctionDecl || kind == CXCursorKind::CXCursor_CXXMethod || kind == CXCursorKind::CXCursor_FunctionTemplate )
  {
    CXSourceRange extent           = clang_getCursorExtent( cursor );
    CXSourceLocation startLocation = clang_getRangeStart( extent );
    CXSourceLocation endLocation   = clang_getRangeEnd( extent );

    unsigned int startLine = 0, startColumn = 0;
    unsigned int endLine   = 0, endColumn   = 0;

    clang_getSpellingLocation( startLocation, nullptr, &startLine, &startColumn, nullptr );
    clang_getSpellingLocation( endLocation,   nullptr, &endLine, &endColumn, nullptr );

    std::cout << "  " << name << ": " << endLine - startLine << "\n";
  }

  return CXChildVisit_Recurse;
}

int main( int argc, char** argv )
{
  if( argc < 2 )
    return -1;

  auto resolvedPath = resolvePath( argv[1] );
  std::cerr << "Parsing " << resolvedPath << "...\n";

  CXCompilationDatabase_Error compilationDatabaseError;
  CXCompilationDatabase compilationDatabase = clang_CompilationDatabase_fromDirectory( ".", &compilationDatabaseError );
  CXCompileCommands compileCommands         = clang_CompilationDatabase_getCompileCommands( compilationDatabase, resolvedPath.c_str() );
  unsigned int numCompileCommands           = clang_CompileCommands_getSize( compileCommands );

  std::cerr << "Obtained " << numCompileCommands << " compile commands\n";

  CXIndex index = clang_createIndex( 0, 1 );
  CXTranslationUnit translationUnit;

  if( numCompileCommands == 0 )
  {
    constexpr const char* defaultArguments[] = {
      "-std=c++11",
      "-I/usr/include",
      "-I/usr/local/include"
    };

    translationUnit = clang_parseTranslationUnit( index,
                                                  resolvedPath.c_str(),
                                                  defaultArguments,
                                                  std::extent<decltype(defaultArguments)>::value,
                                                  0,
                                                  0,
                                                  CXTranslationUnit_None );

  }
  else
  {
    CXCompileCommand compileCommand = clang_CompileCommands_getCommand( compileCommands, 0 );
    unsigned int numArguments       = clang_CompileCommand_getNumArgs( compileCommand );
    char** arguments                = new char*[ numArguments ];

    for( unsigned int i = 0; i < numArguments; i++ )
    {
      CXString argument       = clang_CompileCommand_getArg( compileCommand, i );
      std::string strArgument = clang_getCString( argument );
      arguments[i]            = new char[ strArgument.size() + 1 ];

      std::fill( arguments[i],
                 arguments[i] + strArgument.size() + 1,
                 0 );

      std::copy( strArgument.begin(), strArgument.end(),
                 arguments[i] );

      clang_disposeString( argument );
    }

    translationUnit = clang_parseTranslationUnit( index, 0, arguments, numArguments, 0, 0, CXTranslationUnit_None );

    for( unsigned int i = 0; i < numArguments; i++ )
      delete[] arguments[i];

    delete[] arguments;
  }

  CXCursor rootCursor = clang_getTranslationUnitCursor( translationUnit );
  clang_visitChildren( rootCursor, functionVisitor, nullptr );

  clang_disposeTranslationUnit( translationUnit );
  clang_disposeIndex( index );

  clang_CompileCommands_dispose( compileCommands );
  clang_CompilationDatabase_dispose( compilationDatabase );
  return 0;
}

Let me repeat myself here: I am releasing the code into the public domain. Don’t forget to link against libclang when compiling it (one of the subsequent posts is likely to provide a find module for CMake). Should you consider this code useful, it would give me enormous pleasure if you were to drop me an e-mail.

If I apply the sample program to its own source code, I get the following results:

Parsing [FILENAME REDACTED FOR SECURITY PURPOSES -GLADOS]
Obtained 1 compile commands
[FILENAME REDACTED FOR SECURITY PURPOSES -GLADOS]
  getCursorSpelling: 7
  resolvePath: 17
  functionVisitor: 24
  main: 74

May your code in 2016 be as easy to parse for you as this example!

Posted late Friday evening, January 1st, 2016

I am a big fan of Nature-style citations: they are rather unobtrusive and make using a citation as a noun impossible. For example, the following sentence is fine:

Previously, Adams [42] showed the importance of always carrying a towel.

Now watch what happens when I try to use the citation in place of a noun:

Previously, [42] showed the importance of always carrying a towel.

Looks stupid, doesn’t it? And indeed it should because using a citation key as a noun is rather bad form in my opinion. If you share my opinion and want to use this sort of citation style when using LaTeX, let me spare you some hours of “productive procrastination” (which is the time I should spend writing or doing research, but end up doing something else that is somewhat related to my thesis) and show you how I obtained beautiful Nature-style citations with BibLaTeX.

First, let’s load BibLaTeX with the required options:

\usepackage[%
  autocite    = superscript,
  backend     = bibtex,
  sortcites   = true,
  style       = numeric,
  ]{biblatex}

This tells BibLaTeX to use superscript citations by default when using \autocite. The numeric style is required in order to ensure that superscripts are typeset. In your LaTeX file, you may now use

Previously, Adams~\autocite{Adams42} showed the importance of always
carrying a towel.

The \autocite command is my best friend when using BibLaTeX. Not only can it be easily styled by changing the setting in the preamble, it is also smart in the sense that it detects surrounding punctuation correctly and will place the actual citation properly, depending on your language settings. This is great.

However, when we typeset the example above, we get something along the lines of:

Previously, Adams 42 showed the importance of always carrying a towel.

Close, but no cigar. I want brackets to surround the citation. Furthermore, if you use postnotes like me, they will not be shown. In other words, if you like to write

Previously, Adams~\autocite[\ppno~41--42]{Adams42} showed the importance
of always carrying a towel.

in order to add additional information about page numbers to your citation, you will be sorely disappointed. The additional information, which is the postnote in LaTeX jargon, simply will not show up. To fix this, we need to redefine the superscript citation of BibLaTeX:

\DeclareCiteCommand{\supercite}[\mkbibsuperscript]{
    \iffieldundef{prenote}
    {}
    {\BibliographyWarning{Ignoring prenote argument}}%
    \iffieldundef{postnote}
    {}
    {}
  }
  {\bibopenbracket%
   \usebibmacro{citeindex}%
   \usebibmacro{cite}%
   \usebibmacro{postnote}%
   \bibclosebracket}
  {\supercitedelim}
  {}

In case you wonder, this is the original code for \supercite with some modifications for the postnote. The placement of % signs is critical, by the way. Otherwise, additional whitespace will creep into the output of the macro. If you use it like this, the citation should now look the way we want it to look:

Previously, Adams [42, pp. 41–42] showed the importance of always carrying a towel.

And now we may marvel at our LaTeX documents and care about the less important stuff, such as—in my case—actually producing some content.

By the way, if you care about typography as much as I do, you may want to check out the illustrated glossary of typographic terms that the nice folks at Canva compiled for you.

Posted Friday evening, January 29th, 2016

Again, a tale from the trenches, i.e. the course on C++ programming taught by my colleague Filip Sadlo. This time, it is about the surprises that occur with name hiding in C++. Take the following example of a simple class hierarchy. No virtual functions, no funny stuff going on:

class A
{
public:
  void a() {}
};

class B : public A
{
public:
  void a(int) {}
};

int main()
{
  B b;
  b.a();
}

This code looks very innocent—but it does not compile. The compiler complains that there is no matching function for the call. The output of g++ (version 5.3.0) is rather terse:

foo.cc: In function ‘int main()’:
foo.cc:16:7: error: no matching function for call to ‘B::a()’
   b.a();
       ^
foo.cc:10:8: note: candidate: void B::a(int)
   void a(int) {}
        ^
foo.cc:10:8: note:   candidate expects 1 argument, 0 provided

clang++ (version 3.7.0) is more helpful for beginners:

foo.cc:16:5: error: too few arguments to function call, expected 1, have 0;
      did you mean 'A::a'?
  b.a();
    ^
    A::a
foo.cc:4:8: note: 'A::a' declared here
  void a() {}

What is going on here? This is a classical case of name hiding. Since class B declares its own function named a, every function of that name in A is hidden, regardless of its signature. In § 10.2, the C++ standard meticulously tells you that the “lookup set” that is used to, well, look up names is filled by the derived class first. § 10.2.5 explicitly states that base classes are only ever visited if the lookup set is empty, which is clearly not the case here.

We can fix this in multiple ways:

  1. We could add using A::a; in the body of B. Thus, we explicitly signal the compiler that we want this name to be included.
  2. We could provide the proper scope when calling a() by writing b.A::a(); instead of b.a(). Yes, that is horrible, but it actually works.
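Both fixes can be tried out with a slightly modified version of the example. This is only a sketch: I let the functions return an int so that it becomes visible which one has been called, and the three free functions are made up for this illustration.

```cpp
class A
{
public:
  int a() { return 1; }
};

class B : public A
{
public:
  using A::a;               // Fix 1: make A::a visible in the scope of B
  int a(int) { return 2; }
};

// Resolves to A::a() because of the using declaration.
int callWithoutArguments()
{
  B b;
  return b.a();
}

// Still resolves to B::a(int).
int callWithArgument()
{
  B b;
  return b.a( 42 );
}

// Fix 2: an explicitly qualified call. This would also compile without
// the using declaration in B.
int callQualified()
{
  B b;
  return b.A::a();
}
```

Removing the using declaration makes callWithoutArguments() fail to compile again, with the same errors as shown above.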

Of course, the real question is why the designers of C++ thought that this behaviour is useful. From a technical point of view, visiting base classes to look up further names is a trivial matter. However, I would firmly argue that doing so would not make any sense. The addition of B::a(int) was a deliberate act made by the programmer. For me, this signifies that the programmer wants to change the interface of the class. If the programmer wants to keep the interface of A as well, this should warrant additional work, such as the using declaration.

Furthermore, name hiding makes sense because it prevents ambiguities in the inheritance process (which I just realized sounded a lot like something a lawyer would say!). Suppose we had a function A::a(float) and a function B::a(double). If A::a(float) was not hidden by default in B, we would call the base class function when calling b.a(0.f), even though a float can be promoted to a double.
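To see the difference, here is a sketch of that scenario. Class C is a hypothetical addition that opts out of name hiding by means of a using declaration, which mimics what would happen if hiding were not the default; the return values only serve to show which function has been selected.

```cpp
#include <string>

class A
{
public:
  std::string a( float ) { return "A::a(float)"; }
};

// Name hiding in effect: only B::a(double) is considered for a call.
class B : public A
{
public:
  std::string a( double ) { return "B::a(double)"; }
};

// Hypothetical class without hiding: both overloads take part in
// overload resolution.
class C : public A
{
public:
  using A::a;
  std::string a( double ) { return "C::a(double)"; }
};

// The float argument is promoted to double; B::a(double) is called.
std::string callHidden()
{
  B b;
  return b.a( 0.f );
}

// A::a(float) is an exact match and beats the promotion to double.
std::string callUnhidden()
{
  C c;
  return c.a( 0.f );
}
```

So in the second case, the function of the base class silently wins, although the author of the derived class provided a function that could handle the call perfectly well.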

The real fun with these ambiguities would start when a 0 is used instead of a nullptr in C++11—since a function with an integral parameter will always be a better match than a function taking a pointer parameter, this would result in agonizing, hard-to-trace bugs…
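The overload resolution rule behind this can be observed without any inheritance. The following sketch uses two free functions named describe, which are made up for this illustration:

```cpp
#include <string>

std::string describe( int )         { return "int"; }
std::string describe( const char* ) { return "pointer"; }

// The literal 0 is an int, so the first overload is an exact match; the
// pointer overload would require a null pointer conversion.
std::string describeZero()
{
  return describe( 0 );
}

// nullptr cannot be converted to int, so only the pointer overload is viable.
std::string describeNullptr()
{
  return describe( nullptr );
}
```

Now imagine the int overload living in a base class and the pointer overload in a derived class: without name hiding, passing 0 to the derived class would quietly end up in the base class.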

So, in short: Name hiding. It’s there for a reason.

Posted Sunday afternoon, January 31st, 2016