How moody are your sources?

Data Storytelling No. 1

The words we use say a lot about the way we feel. I relied on this simple wisdom in previous articles to explore the sentiment of Europeans concerning Brexit and the spread of Coronavirus fake news in the Italian peninsula. Strange as it may sound, texts are data and, like any other kind of data, they can be analysed, tested and measured. This column attempts to strip numbers of their popular aura of inaccessibility and mysticism and to make them the characters of the narrative, so to speak.

In this edition of the Data Storytelling column, newspapers are the protagonists. While reading an article, I sometimes wonder how much the writer’s opinion shapes my own view. Natural Language Processing and Text Sentiment Analysis provide methods and techniques to answer this question, at least to a certain extent. With the right model, it is possible to estimate the emotional strength and subjectivity of any piece of news.

To narrow down the research, I focused on a single topic that has made headlines across the whole European continent in recent months: the European Union’s Recovery Plan. Rumours about the EU’s response to the pandemic started circulating at the very beginning of March, but the actual Commission Proposal arrived on the 26th of May. Much has been said about the enactment of this plan as well as about its actual adoption at national level. The topic has been highly debated, and many potential instruments have been contested by some countries (see the opposition of the frugal four to the adoption of Coronabonds, as well as the Italian internal dispute over the use of the ESM), which gives us a satisfactory level of opinion and subjectivity expressed by different authors on the subject. The analysis was performed on articles published by national newspapers based in four Member States: Germany, Spain, France and Italy.

The Dataset

The sentiment and subjectivity analysis was performed on a dataset of almost 1300 news articles concerning the European COVID-19 Recovery Plan, the way each of the national governments under examination proposed to allocate this generous budget, and the way these measures were perceived in these countries.

Articles were selected from a variety of sources for all four countries in order to cover different political alignments. Several search terms in the respective languages, such as “Recovery Fund”, “Next Generation EU” or “EU response to Covid”, were used to filter the news for each country. The timeframe of the analysis ranged from the 1st of March to the 28th of October, the day on which the articles were collected. The lower end of the time window was set at the beginning of March because during that month the pandemic started spreading across the whole European continent and many states began announcing lockdown measures, leading various editors to speculate about the measures the Union might adopt. In total, roughly 1600 articles from 16 different sources were collected. Of these, approximately 300 were discarded because a topic classification found them to be unrelated to the Recovery Plan. For each article, the title, the publication date, the author(s), the complete text and the summary were scraped. When the original source did not provide a summary, one was generated automatically. Finally, the keywords of each article were extracted.

Looking at how the dataset is distributed over time gives a clearer picture. The graph below depicts the distribution of the articles over the months. After the 26th of May, the average number of articles on the Recovery Plan increased from 4.1 to 7.5. It is no surprise that the number of articles almost doubled after the official proposal. Interest in the topic then gradually decreased, however, reaching levels in August and September that were not much higher than those of April or of the days of May preceding the 26th. In October, the number of articles peaked at 200; this sudden increase may be related to the surge in cases and the potential arrival of a second wave in these countries.
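The before-and-after comparison can be reproduced with a few lines of Python. This is only a minimal sketch: it assumes the averages quoted above are articles per calendar day, and the function name `mean_daily_articles` and the toy publication dates are hypothetical, not taken from the actual dataset.

```python
from datetime import date


def mean_daily_articles(pub_dates, start, end):
    """Average number of articles per calendar day in [start, end], inclusive."""
    n_days = (end - start).days + 1
    n_articles = sum(1 for d in pub_dates if start <= d <= end)
    return n_articles / n_days


# Hypothetical toy publication dates around the Commission Proposal (26 May).
toy_dates = [
    date(2020, 5, 24),
    date(2020, 5, 25),
    date(2020, 5, 26), date(2020, 5, 26),
    date(2020, 5, 27), date(2020, 5, 27), date(2020, 5, 27),
    date(2020, 5, 28),
]

proposal_day = date(2020, 5, 26)

# Two articles over the two days before the proposal.
before = mean_daily_articles(toy_dates, date(2020, 5, 24), date(2020, 5, 25))
# Six articles over the three days from the proposal onwards.
after = mean_daily_articles(toy_dates, proposal_day, date(2020, 5, 28))

print(f"before: {before:.1f} articles/day, after: {after:.1f} articles/day")
```

On the real data, the first window would run from the 1st of March to the 25th of May and the second from the 26th of May to the 28th of October.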

Analyzing words

The sentiment analysis in this article is based on the structure of the document under examination, namely the way words are contextualised within the text and the way phrases are arranged, rather than on the mere presence of single terms. Even so, a first look at the most common keywords of the articles can be extremely insightful. The articles’ main terms have been aggregated by country to show national differences. For each country, the top 25 keywords were selected.

In the word clouds below, the size of each term represents how frequently it appears as an article keyword. Overall, the countries show a good degree of similarity. As expected, the common terms include coronavirus, crisis, plan, EU, euro, recovery and fund. However, each country gives more weight to certain topics than the others do: in Germany, topics such as the EU budget and coronabonds are much discussed, while in Spain there is more focus on the national perspective than on the European one. Although this cannot be clearly seen from the word cloud, España has a frequency of almost 73%, against 69% for UE and 24% for Europa. In Italy, ESM (MES) and GDP (PIL) are among the most important keywords, while France is the only country with Trump, China and vaccine among its main terms.
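A keyword frequency of this kind can be computed as the share of a country’s articles that list the term among their keywords. The sketch below assumes that definition (the article does not spell it out); the function name `keyword_shares` and the toy keyword lists are hypothetical.

```python
from collections import Counter


def keyword_shares(articles_keywords, top_n=25):
    """Return the top_n keywords with the share of articles listing each one.

    articles_keywords: one list of keywords per article.
    """
    n_articles = len(articles_keywords)
    counts = Counter()
    for keywords in articles_keywords:
        # Lowercase and de-duplicate so each article counts a term at most once.
        unique = list(dict.fromkeys(k.lower() for k in keywords))
        counts.update(unique)
    return [(kw, c / n_articles) for kw, c in counts.most_common(top_n)]


# Hypothetical keyword lists for four Spanish articles.
toy_spain = [
    ["España", "UE"],
    ["España", "fondo"],
    ["UE", "Europa"],
    ["España", "UE", "crisis"],
]

top_two = dict(keyword_shares(toy_spain, top_n=2))
for keyword, share in top_two.items():
    print(f"{keyword}: {share:.0%}")
```

Because a single article can list several keywords, the shares do not need to add up to 100%, which is consistent with España and UE both scoring around 70%.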

Rather than merely focusing on the frequency of the terms, we can look at how often they appear together. The following networks visualise the co-occurrence matrix of the 25 keywords just presented for each of the countries under examination.

A first observation is that important keywords such as coronavirus and Europe are fairly central in all of the networks. However, many interesting differences can be noted, especially between Germany and Italy. Germany has the most connected network of the four; this means that if two articles contained two keywords each and differed on one of them, the probability of their sharing the other would be higher in Germany than in the other countries. In Germany’s network, EU-centric terms are much more common than in the other networks. Note, for instance, that the word EU is even more central than coronavirus (corona) or crisis (coronakrise). Interestingly, the network also contains terms such as dispute (streit) and negotiation (verhandlungen), moderately correlated with the words sparsamen and vier, namely the frugal four, hinting at the disputes among the EU member states over the creation of a Recovery Fund to tackle the pandemic. At the other end of the spectrum lies Italy, with the lowest level of connectivity but the highest average correlation between words. This is due to the presence of many strong connections: an article with president Conte as one of its main terms will, with high probability, also focus on the GDP trend (pil in the graph). Here, topics such as education and work (scuola and lavoro, respectively) make their appearance.
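To make the notions of connectivity and average correlation more concrete, here is a minimal sketch of how such a network could be built from per-article keyword lists. The article does not state which measures were used, so the choices here are assumptions: correlation is computed as the phi coefficient between the presence of two keywords, connectivity as network density (the share of possible keyword pairs that are linked), and a link is kept only when the correlation is positive. The function name `cooccurrence_network` and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cooccurrence_network(articles_keywords, vocabulary, threshold=0.0):
    """Build a keyword network from per-article keyword lists.

    Returns (edges, density, mean_corr), where edges maps a keyword pair
    to its phi correlation and only pairs above `threshold` are kept.
    """
    n = len(articles_keywords)
    keyword_sets = [set(kws) for kws in articles_keywords]
    # Number of articles in which each keyword appears.
    freq = {w: sum(w in s for s in keyword_sets) for w in vocabulary}

    edges = {}
    for a, b in combinations(vocabulary, 2):
        # Number of articles in which both keywords appear together.
        both = sum(a in s and b in s for s in keyword_sets)
        denom = sqrt(freq[a] * (n - freq[a]) * freq[b] * (n - freq[b]))
        if denom == 0:
            continue  # a keyword present in all or no articles has no variance
        phi = (n * both - freq[a] * freq[b]) / denom
        if phi > threshold:
            edges[(a, b)] = phi

    possible_pairs = len(vocabulary) * (len(vocabulary) - 1) / 2
    density = len(edges) / possible_pairs if possible_pairs else 0.0
    mean_corr = sum(edges.values()) / len(edges) if edges else 0.0
    return edges, density, mean_corr


# Hypothetical toy data: four Italian articles and three keywords.
toy_italy = [
    ["conte", "pil"],
    ["conte", "pil"],
    ["scuola"],
    ["scuola"],
]
edges, density, mean_corr = cooccurrence_network(
    toy_italy, vocabulary=["conte", "pil", "scuola"]
)
print(edges, round(density, 2), mean_corr)
```

In this toy case only one of the three possible pairs is linked, but that single link is as strong as it can be: low connectivity combined with a high average correlation, the pattern described for Italy above.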