In late 2015, I started fiddling with libclang in order to analyse abstract syntax trees. After multiple queries reached me via e-mail, I decided to create a public repository of my experiments.

The repository is called libclang-experiments and resides on my GitHub account. It contains the code for the first blog post about libclang, which presented some basic routines for walking an abstract syntax tree, as well as code for the second blog post, which dealt with counting the extents of a function.

Note that since libclang is somewhat fiddly to use, the repository also provides a CMake module for actually finding the library on your computer. Please refer to the CMakeLists.txt of the repository for usage information, and do not hesitate to contact me, either via e-mail or by opening an issue on GitHub, if you have any questions.

Posted Monday afternoon, September 11th, 2017

Previously, I wrote about a brief analysis of U.S. Presidential Inauguration Speeches. I have since extended the analysis slightly using tf–idf.

For those of you who are unaware of this technique, it refers to a statistical method for assessing the importance of certain words in a corpus of documents. Without going into too much detail, the method determines which words are most relevant for a given document of the corpus, yielding a high-dimensional vector whose entries refer to the common vocabulary of the corpus.
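To make the idea more concrete, here is a minimal sketch of one common tf–idf variant in plain Python. It assumes relative term frequencies and an unsmoothed logarithmic inverse document frequency; the function names and the tiny corpus are made up for illustration and are not taken from the actual analysis script, which may weight terms differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {word: weight} dictionary per document.

    tf  = relative frequency of the word in the document
    idf = log(N / number of documents containing the word)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


def top_terms(weights, k=5):
    """Return the k highest-weighted words of a single document."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


# Illustrative toy corpus (not real speech data)
corpus = [
    "a new breeze is blowing",
    "a new century is coming",
    "a journey is not complete",
]
scores = tf_idf(corpus)
print(top_terms(scores[0], k=2))
```

Words that occur in every document, such as "a" and "is" in this toy corpus, receive a weight of zero, whereas words that are unique to one document are ranked highest for it.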

I picked the five most relevant words for every speech to obtain an extremely pithy summary of the words that are relevant for that speech only. Since there are 59 speeches in total, I first decided to do a small visualization of the last eight speeches only, starting from George H.W. Bush in 1989. Here are the results; the x-axis represents the time, while the y-axis shows the five most important words, scaled by their relative frequency in the speech:

A visualization of the relative importance of words in the last eight inauguration speeches of U.S. presidents

So, what can we see in this visualization? Here are some of my thoughts.

  • The don in the speech of George H.W. Bush is a shortened form of don’t. The algorithm picked up on his usage in the speech, which contains beautiful imagery such as this:

For the first time in this century, for the first time in perhaps all history, man does not have to invent a system by which to live. We don't have to talk late into the night about which form of government is better. We don't have to wrest justice from the kings. We only have to summon it from within ourselves.

  • It is also interesting to note that the new breeze George H.W. Bush is talking about is detected as a unique feature of his speech.

  • The speeches of Bill Clinton allude to the change that started after the end of the Cold War, as well as the promises that arise in the new century to come.

  • The second speech of George W. Bush tries to rally Americans in the War on Terror. Liberty and freedom are part of the underlying theme, these of course being American ideals that are worth fighting for.

  • With Barack Obama, a sort of rebirth takes place. He speaks to the new generation, expressing his hope that America becomes a new nation, and reminds everyone that today, not tomorrow, is the day to address these challenges. In his second speech, the great journey towards equality is presented to the Americans, making it clear that change does not stop.

  • With Donald Trump, the narrative changes. The important words are now the dreams of people, such as the hope that they will find new jobs. It is interesting to note that only the speeches at a time of crisis or abrupt change (Cold War or the War on Terror) exhibit the same occurrence of the words America and American. Maybe Donald Trump is trying to invoke a connection to these events in his speech?

These are only my random thoughts, but I think it is fascinating that tf–idf is capable of picking up on these themes in such a reliable manner. Maybe a more competent person wants to comment on these findings? I look forward to any feedback you might have. In the meantime, please enjoy a variant of the visualization above that contains all speeches of all presidents so far. You will have to scroll around a lot for this.

By the way, I have added the code to the GitHub repository. You may be particularly interested in tf_idf_analysis.py, which demonstrates the tf–idf analysis of all speeches. Moreover, I added a gnuplot script that demonstrates how to create the visualizations attached to this blog post.

Posted late Monday evening, September 11th, 2017

In previous posts, I wrote about a brief analysis of U.S. Presidential Inauguration Speeches and how to extend this analysis using tf–idf. In this post, I want to present an extended analysis based on sentiment analysis. Sentiment analysis encompasses a class of techniques for detecting whether sentences are meant negatively, neutrally, or positively.

Depicting sentiment over time

Since every inauguration speech has a beginning and an end, it forms a natural time-series. Hence, I first calculated the sentiment scores for every sentence in a speech and scaled an artificial time parameter over the speech between 0 and 1. This yields a nice sentiment curve plot, in which the abscissa denotes the time of the speech, and the ordinate denotes the sentiment of a given sentence—with values close to +1 meaning that the sentence is extremely positive, 0 meaning that the sentence is neutral, and -1 meaning that the sentence is extremely negative.
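The procedure can be sketched roughly as follows. This is an illustration under stated assumptions, not the script from the repository: the tiny word lexicon, the naive sentence splitting, and the names toy_polarity and sentiment_curve are made up so that the example runs without any dependencies. With TextBlob installed, the scorer could presumably be swapped for something like `lambda s: TextBlob(s).sentiment.polarity`, which also yields values between -1 and +1.

```python
import re

# Toy lexicon standing in for a real sentiment model (assumption)
POSITIVE = {"great", "hope", "free", "proud"}
NEGATIVE = {"fear", "crisis", "war", "loss"}


def toy_polarity(sentence):
    """Score a sentence in [-1, 1] from positive/negative word counts."""
    words = re.findall(r"[a-z']+", sentence.lower())
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)


def sentiment_curve(speech, polarity=toy_polarity):
    """Return (t, sentiment) pairs, with t running from 0 to 1."""
    # Naive sentence splitting on '.', '!' and '?' followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", speech.strip()) if s]
    n = len(sentences)
    return [
        (i / (n - 1) if n > 1 else 0.0, polarity(sentence))
        for i, sentence in enumerate(sentences)
    ]


# Made-up example text, not taken from any speech
speech = "We face a crisis. We meet today. We have great hope!"
for t, score in sentiment_curve(speech):
    print(f"{t:.2f} {score:+.2f}")
```

Each output line consists of the artificial time parameter and the sentiment of one sentence, which is the kind of two-column data that a plotting tool can turn into a curve.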

Here are some example visualizations of the last three inauguration speeches. Positive sentences are shown in green, while negative ones are shown in red. The area between the curve and the neutral line is filled with the respective colour in order to show patterns and the ‘rhythm’ of different speeches. A black line indicates the mean sentiment over the speech.

Sentiment curve for Barack Obama (2009)

Sentiment curve for Barack Obama (2013)

Sentiment curve for Donald J. Trump (2017)

It is interesting to see that Obama’s first speech appears to be more subdued and neutral than the subsequent speeches, which exhibit more peaks and thus a larger variability between extremely positive and extremely negative sentiment.

If you want to compare the different sentiment curves for your favourite presidents, you may do so—I have prepared a large visualization that combines all sentiment curves. Watching the appearance and disappearance of negative sentiments over time is quite fascinating. The 1945 speech by Roosevelt is a striking example of evoking very negative imagery for a prolonged period of time.

Comparing mean sentiments

For comparing individual presidents, the curves are well and good, but I was also interested in the global sentiment of a speech and how it evolves over time. To this end, I calculated the mean (average) sentiment over every speech and depicted it over time. This works because sentiments are always bounded within the interval [-1, 1], making their comparison very easy. Here is the average sentiment of a speech, plotted over time:
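As a small sketch of this averaging step: the per-sentence scores below are placeholder values that are not derived from any real speech, and the labels are hypothetical. Since every individual score lies in [-1, 1], the mean of a speech does as well.

```python
from statistics import mean

# Placeholder per-sentence scores for three hypothetical speeches
scores_by_speech = {
    "speech A": [0.5, -0.25, 0.5],
    "speech B": [1.0, 0.0, 0.5, 0.5],
    "speech C": [-0.5, 0.25, -0.5],
}

# One mean value per speech; these values are what gets plotted over time
mean_sentiment = {
    label: mean(scores) for label, scores in scores_by_speech.items()
}

for label, value in mean_sentiment.items():
    print(label, value)
```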

Average sentiment of a speech over time

We can see an interesting pattern: after the Second World War, speeches become more positive on average. They remain that way until the first inauguration speech of Barack Obama, which, as I noted above, is somewhat subdued again. Afterwards, they pick up steam. Donald Trump’s speech evokes more positive sentiments, on average, than most of the speeches since 1945.

All in all, I think that this is a nice tool to assess patterns in speeches. If you want, go take a look at the individual sentiment curves, which are stored on GitHub. Maybe you pick up something interesting.

Code

I used the intriguing TextBlob Python module for this analysis. All visualizations are done using gnuplot. As usual, the scripts and data files, as well as the output, are stored in the GitHub repository for the project.

Have fun!

Posted at teatime on Sunday, September 17th, 2017