I rarely do this, but it seemed worth preparing a brief methodological description of the data in my second Clegg versus Farage blog entry, which has just been published. I do this for two reasons. First, because it is good to be transparent when it comes to this kind of data. Second, because I am starting to work with some of these new techniques that allow for the analysis of bigger textual datasets. The Farage-Clegg dataset is reasonably small (approximately 30,000 words). In theory, however, the same techniques could be scaled quite effectively to much larger datasets running into millions of words. So watch this space!
Constructing the dataset
For the PSA blog post, I was interested in examining how post-debate coverage presented the performance of the two men. In order to do this, I first used Lexis Nexis to gather a sample of all British newspaper articles that made major mentions of Clegg, Farage and debate between 27th March 2014 (the day after the first debate) and 4th April 2014 (two days after the second debate). You can find the results of this search in this document. In total, it includes approximately 480 articles or 178,000 words.
Generating the tag cloud
To get a first look at the data, I entered it into the tag cloud generation site TagCrowd. I then cleaned it, removing, for example, any words that had been artificially introduced by Lexis Nexis.
This generates quite an aesthetically pleasing tag cloud, but its usefulness for this kind of exercise is actually quite limited. Why is this? Two reasons, really. First, the tag cloud chews through the whole document. It tells us how often words appear in the coverage, but tells us little about how these words relate to each other. Second, the impression given by the tag cloud can be quite artificial. One obvious point to make: the size of a word reflects not just the number of appearances it makes, but also the length of the word.
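To illustrate the first limitation, here is a minimal sketch (in Python, not part of the original workflow) of the raw word counting a tag cloud generator performs. The function name, stopword list, and sample text are all hypothetical; the point is that the output is a bag of counts with no information about which words occur together.

```python
from collections import Counter
import re

def word_frequencies(text, stopwords=frozenset()):
    """Count how often each word appears, as a tag cloud generator would."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

# Hypothetical snippet of coverage, with common words stripped out.
coverage = "Clegg and Farage clashed in the debate. The debate was heated."
freqs = word_frequencies(coverage, stopwords={"and", "in", "the", "was"})
# "debate" comes out as the most frequent word, but the count alone
# says nothing about how the words relate to each other.
```

The `Counter` is all a tag cloud has to work with, which is why it can show prominence but not context.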
Cleaning the dataset to generate Clegg and Farage specific sentences
In particular, I wanted to examine what qualities were attributed to the two men's performances by the media after the debates. So I now turned to two text analysis tools called QDA Miner and WordStat. In the first instance, I used QDA Miner to search for any sentences in the corpus that featured Clegg or Farage, and auto-coded them accordingly. I then exported these to WordStat, where I could analyse the make-up of these two datasets and, most interestingly, compare them.
It should be noted that some sentences may feature twice in the dataset, as a single sentence could have featured both Clegg's and Farage's names. There is an option to exclude these double references, but since this was quite a quick and dirty analysis, I let them appear twice.
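The auto-coding step can be sketched in a few lines of Python. This is not QDA Miner's actual implementation, just a rough stand-in that shows how a sentence naming both men lands in both subsets, producing the double counting described above. The sentence splitter and sample text are illustrative assumptions.

```python
import re

def sentences_mentioning(text, name):
    """Collect every sentence that contains the given name
    (a rough stand-in for QDA Miner's auto-coding step)."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if name.lower() in s.lower()]

# Hypothetical three-sentence article.
article = ("Clegg pressed his case on Europe. "
           "Farage hit back over immigration. "
           "Both Clegg and Farage claimed victory.")

clegg_sentences = sentences_mentioning(article, "Clegg")
farage_sentences = sentences_mentioning(article, "Farage")
# The final sentence names both men, so it appears in both subsets.
```

Excluding the shared sentences would simply mean dropping any sentence found in both lists before export.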
I used WordStat to pull out the 250 most frequently used words, measured as a percentage of all words across the whole dataset. The tables below show the calculations used to rank the words. The most distinctive Farage word is "PUTIN": it accounted for 0.30 per cent of the words in the Farage dataset but only 0.18 per cent of the words in the Clegg dataset, a difference of 0.12 per cent.
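The ranking calculation is straightforward enough to sketch. The code below (my own illustration, not WordStat's internals) computes each word's share of its subset as a percentage and ranks words by the gap between the two subsets. The toy texts are invented stand-ins, not the real Farage and Clegg sentence sets.

```python
from collections import Counter
import re

def percent_frequencies(text):
    """Frequency of each word as a percentage of all words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {w: 100 * n / total for w, n in counts.items()}

def distinctiveness(subset_a, subset_b):
    """Rank words by the gap between their percentage shares in two
    subsets -- the calculation behind the tables described above."""
    fa = percent_frequencies(subset_a)
    fb = percent_frequencies(subset_b)
    return sorted(
        ((w, fa.get(w, 0.0) - fb.get(w, 0.0)) for w in set(fa) | set(fb)),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Toy subsets standing in for the Farage and Clegg sentence datasets.
farage_text = "putin russia debate putin europe"
clegg_text = "europe debate europe coalition"
ranked = distinctiveness(farage_text, clegg_text)
# "putin" tops the list: 40% of the Farage toy text, 0% of Clegg's.
```

With the real data, the same subtraction yields the 0.30 − 0.18 = 0.12 percentage-point gap reported for "PUTIN".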