How moody are your sources?

Data Storytelling no. 1

The words we use say a lot about the way we feel. I used this simple insight in previous articles to explore the sentiment of Europeans towards Brexit and the spread of Coronavirus fake news in the Italian peninsula. Strange as it may sound, texts are data, and like any other kind of data they can be analysed, tested and measured. The aim of this column is to strip numbers of their reputation for inaccessibility and mysticism and to make them the characters of the narrative, so to speak.


In this edition of the Data Storytelling column, newspapers are the protagonists. While reading an article, I sometimes wonder how much the writer's opinion shapes my own view. Natural Language Processing and text sentiment analysis provide methods and techniques to answer this question, at least to some extent. With the right model, it is possible to estimate the emotional strength and subjectivity of any piece of news.
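To make the idea of "emotional strength and subjectivity" concrete, here is a minimal sketch of a lexicon-based scorer. It is not the model used for this analysis: the word list, its weights and the `score` function are all invented for illustration, and a real study would rely on a trained or curated sentiment model for each language.

```python
import re

# Toy lexicon, invented for illustration only:
# word -> (polarity in [-1, 1], subjectivity in [0, 1])
LEXICON = {
    "historic": (0.6, 0.7),
    "generous": (0.7, 0.8),
    "welcome": (0.5, 0.6),
    "disastrous": (-0.9, 0.9),
    "weak": (-0.5, 0.6),
    "controversial": (-0.3, 0.8),
}


def score(text):
    """Return (polarity, subjectivity), averaged over the lexicon words found."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    if not hits:
        # No opinion words found: treat the text as neutral and objective.
        return 0.0, 0.0
    polarity = sum(p for p, _ in hits) / len(hits)
    subjectivity = sum(s for _, s in hits) / len(hits)
    return polarity, subjectivity


factual = "The Commission presented its proposal in May."
opinion = "A disastrous and weak plan."

print(score(factual))
print(score(opinion))
```

The factual sentence contains no lexicon words and stays at zero on both axes, while the opinionated one scores negative on polarity and high on subjectivity. Averaging over matched words is only one possible design; it ignores negation, irony and context, which is why the results of any such method hold only to a certain extent.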


To narrow down the research, I focused on a single topic that has made headlines across the whole European continent in recent months: the European Union’s Recovery Plan. Rumours about the EU’s response to the pandemic began circulating at the very beginning of March, but the actual Commission proposal arrived on the 26th of May. Much has been said about the enactment of this plan as well as about its adoption at national level. The topic has been hotly debated, and several potential instruments have been contested by some countries (see the opposition of the Frugal Four to the adoption of Coronabonds, as well as the internal Italian dispute over the use of the ESM). This gives us a satisfactory level of opinion and subjectivity expressed by different authors on the subject. The analysis was performed on articles published by national newspapers based in four Member States: Germany, Spain, France and Italy.



The Dataset


The sentiment and subjectivity analysis was performed on a dataset of almost 1,300 news articles concerning the European COVID-19 Recovery Plan, the way each of the national governments examined proposed to allocate this generous budget, and the way these measures were perceived in each country.


Articles were selected from a variety of sources in all four countries in order to cover different political alignments. Several search terms in the respective languages, such as “Recovery Fund”, “Next Generation EU” or “EU response to Covid”, were used to filter the news for each country. The timeframe of the analysis ranged from the 1st of March to the 28th of October, the day on which the articles were collected. The lower bound was set at the beginning of March because that was the month in which the pandemic started spreading across the whole European continent and many states began announcing lockdown measures, leading various editors to speculate about the measures the Union might adopt. In total, roughly 1,600 articles from 16 different sources were collected. Of these, approximately 300 were discarded because a topic classification found them to be unrelated to the Recovery Plan. For each article, the title, the publication date, the author(s), the complete text and the summary were scraped. When the original source did not provide a summary, one was generated automatically. Finally, the keywords of each article were extracted.
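The selection steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline actually used: the `Article` record and the function names are hypothetical, the year 2020 is assumed because the text gives only days and months, and simple matching on the search terms stands in for the topic classification, whose method is not described here.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Article:
    """Hypothetical record holding the fields scraped for each article."""
    title: str
    published: date
    authors: list
    text: str
    summary: str = ""
    keywords: list = field(default_factory=list)


# Time window of the analysis; the year 2020 is an assumption.
START = date(2020, 3, 1)
END = date(2020, 10, 28)

# English versions of the search terms quoted in the text.
SEARCH_TERMS = ("recovery fund", "next generation eu", "eu response to covid")


def in_window(article):
    """True if the article was published inside the analysis window."""
    return START <= article.published <= END


def is_on_topic(article):
    """Stand-in for the topic classification: plain search-term matching."""
    haystack = f"{article.title} {article.text}".lower()
    return any(term in haystack for term in SEARCH_TERMS)


def select(articles):
    """Keep only the articles that are both in the window and on topic."""
    return [a for a in articles if in_window(a) and is_on_topic(a)]


sample = [
    Article("Recovery Fund talks resume", date(2020, 7, 17),
            ["A. Writer"], "Leaders meet in Brussels."),
    Article("Football season restarts", date(2020, 6, 1),
            ["B. Writer"], "The league is back."),
    Article("Next Generation EU explained", date(2020, 11, 5),
            ["C. Writer"], "What the plan contains."),
]

kept = select(sample)
print([a.title for a in kept])
```

Of the three invented sample articles, only the first survives: the second is off topic and the third falls outside the time window. The same two-stage filter, applied to the full collection, is what reduces roughly 1,600 scraped articles to the almost 1,300 that make up the dataset.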