Data Storytelling n.1
The words we use tell a lot about the way we feel. I used this simple wisdom in previous articles to explore the sentiment of Europeans concerning Brexit and the spread of Coronavirus fake news in the Italian peninsula. Strange as it may sound, texts are data, and like any other kind of data, they can be analysed, tested and measured. The aim of this column is to strip numbers of their popular aura of inaccessibility and mysticism and to make them the characters of the narrative, so to speak.
In this edition of the Data Storytelling column, newspapers are the main protagonists. While reading an article, I sometimes wonder how much the opinion of the writer influences the formation of my own view. Natural Language Processing and Text Sentiment Analysis provide methods and techniques to answer this question, up to a certain extent. By deploying the right model, it is possible to estimate the emotional strength and subjectivity of any piece of news.
To narrow down the research, I focused on a single topic that has made headlines across the whole European continent in recent months: the European Union’s Recovery Plan. Rumours about the EU’s response to the pandemic started at the very beginning of March, but the actual Commission Proposal arrived on the 26th of May. Much has been said about the enactment of this plan as well as about its actual adoption at the national level. The topic has been highly debated and many potential instruments have been contested by some countries (see the opposition of the frugal four to the adoption of Coronabonds, as well as the Italian internal dispute on the use of the ESM), giving us a satisfactory level of opinion and Subjectivity expressed by different authors on the subject. The analysis was performed on articles published by national newspapers based in four Member States: Germany, Spain, France and Italy.
The Sentiment and Subjectivity analysis was run on a dataset of almost 1300 pieces of news concerning the European COVID-19 Recovery Plan, the way each of the national governments under study proposed to allocate this generous budget, and the way these measures were perceived in these countries.
Articles were selected from a variety of sources in all four countries in order to cover different political alignments. Several search terms in the respective languages, such as “Recovery Fund”, “Next Generation EU” or “EU response to Covid”, were used to filter the news for each country. The timeframe of the analysis ranged from the 1st of March to the 28th of October, the day on which the articles were collected. The lower end of the time window was set at the beginning of March because during this month the pandemic started spreading across the whole European continent and many states started announcing lockdown measures, leading various editors to speculate about the measures that the Union might adopt. In total, roughly 1600 articles from 16 different sources were collected. Of these, approximately 300 were discarded because a topic classification found them to be unrelated to the Recovery Plan. For each article, the title, the publication date, the author(s), the complete text and the summary were scraped. When the original source did not provide a summary, one was generated automatically. Finally, the keywords of each article were extracted.
An analysis of the distribution of the dataset over time gives a clearer picture. The graph below depicts the distribution of the articles over the months. After the 26th of May, the average number of articles on the Recovery Plan increased from 4.1 to 7.5. While it is no surprise that the number of articles on the Recovery Plan almost doubled after its official proposal, interest in the topic gradually decreased over time, reaching, during August and September, levels not much higher than those of April or of the days of May preceding the 26th. In October, the number of articles peaked at 200; this sudden increase may be related to the surge in cases and to the potential arrival of a second wave in these countries.
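The before-and-after comparison of article volume can be sketched in a few lines of Python. This is only an illustration: it assumes the averages are computed per day (the text does not state the unit), it assumes the year is 2020, and the publication dates and the helper name `daily_average` are invented for the example.

```python
from datetime import date

def daily_average(pub_dates, start, end):
    """Average number of articles per day over the inclusive window [start, end]."""
    n_days = (end - start).days + 1
    n_articles = sum(1 for d in pub_dates if start <= d <= end)
    return n_articles / n_days

# Toy publication dates, invented for illustration only (not the real dataset).
pub_dates = [
    date(2020, 5, 24), date(2020, 5, 25),
    date(2020, 5, 27), date(2020, 5, 27),
    date(2020, 5, 28), date(2020, 5, 28),
]

# Two-day windows on either side of the proposal date (26th of May).
before = daily_average(pub_dates, date(2020, 5, 24), date(2020, 5, 25))
after = daily_average(pub_dates, date(2020, 5, 27), date(2020, 5, 28))
print(before, after)
```

On the real data, the two windows would run from the 1st of March to the 25th of May and from the 26th of May to the 28th of October.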
While the sentiment analysis in this article follows an approach based on the structure of the document under analysis (namely the way in which words are contextualised within the text and phrases are arranged) rather than on the presence of single terms, a first look at the most common keywords of each article can be extremely insightful. The articles’ main terms have been aggregated by country to show national differences. For each country, the top 25 keywords were selected.
In the word clouds below, the size of each term represents how frequently it appears as an article’s keyword. Overall, the countries show a good degree of similarity. As expected, the common terms include coronavirus, crisis, plan, EU, euro, recovery and fund. However, each country shows higher frequencies for some topics than the others do: in Germany, topics such as the EU budget and coronabonds are much discussed, while in Spain there is more focus on the national perspective than on the European one. Although this cannot be clearly seen from the word cloud, España has a frequency of almost 73%, against 69% and 24% for UE and Europa respectively. In Italy, ESM (MES) and GDP (PIL) are some of the most important keywords, while France is the only country with Trump, China and vaccine among its main terms.
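As a rough illustration of how such keyword frequencies could be computed, the sketch below counts the share of articles that list each keyword and keeps the most frequent ones. It assumes that "frequency" means the fraction of a country's articles carrying the keyword, which is one plausible reading of the percentages above. The function name `keyword_shares` and the toy data are hypothetical.

```python
from collections import Counter

def keyword_shares(articles_keywords, top_n=25):
    """Return the top_n keywords with the share of articles listing each one."""
    counts = Counter()
    for keywords in articles_keywords:
        # Count each keyword at most once per article, ignoring case.
        counts.update({k.lower() for k in keywords})
    total = len(articles_keywords)
    return [(kw, n / total) for kw, n in counts.most_common(top_n)]

# Toy keyword lists for four Spanish articles (invented for illustration).
spain = [
    ["España", "UE", "fondo"],
    ["España", "crisis"],
    ["UE", "España"],
    ["Europa", "plan"],
]

shares = dict(keyword_shares(spain))
print(shares["españa"], shares["ue"], shares["europa"])
```

Because the shares are computed per keyword rather than normalised across keywords, they do not need to sum to 100%, just like the figures quoted for España, UE and Europa.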
Rather than merely focusing on the frequency of the terms, we can look at how their appearances correlate. The following networks describe the co-occurrence matrix of the 25 keywords just presented for each of the countries under study.
A first observation is that important keywords such as coronavirus and Europe are fairly central in all of the networks. However, many interesting differences may be noted, especially concerning Germany and Italy. Germany has the most connected network of the four; this means that if two articles contained two keywords each and differed on one of them, the probability that they share the other would be higher in Germany than in the other countries. In Germany’s network, EU-centric terms are much more common than in the other networks. Note, for instance, that the word EU is even more central than coronavirus (corona) or crisis (coronakrise). Interestingly, the network also presents terms such as dispute (streit) and negotiation (verhandlungen), moderately correlated with the words sparsamen and vier, namely the frugal four, hinting at the disputes among EU member states over the creation of a Recovery Fund to tackle the pandemic. At the other end of the spectrum lies Italy, with the lowest level of connectivity but the highest average correlation between words; this is due to the presence of many strong connections, meaning that an article with president Conte as one of its main terms will, with high probability, also focus on the GDP trend (pil in the graph). Here, topics such as education and work (scuola and lavoro, respectively) make their appearance.
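A minimal sketch of the two quantities behind such networks is given below: the number of articles in which two keywords appear together, and the correlation between their appearances. The correlation used here is the Pearson correlation of the two presence indicators (the phi coefficient); this is an assumption, since the article does not specify which measure the graphs use. The function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt

def cooccurrence(articles_keywords, vocabulary):
    """Count, for each unordered keyword pair, the articles containing both."""
    counts = {pair: 0 for pair in combinations(sorted(vocabulary), 2)}
    for keywords in articles_keywords:
        present = sorted(set(keywords) & set(vocabulary))
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

def phi(articles_keywords, a, b):
    """Pearson correlation between the presence indicators of keywords a and b."""
    n = len(articles_keywords)
    n_a = sum(a in k for k in articles_keywords)
    n_b = sum(b in k for k in articles_keywords)
    n_ab = sum(a in k and b in k for k in articles_keywords)
    denom = sqrt(n_a * (n - n_a) * n_b * (n - n_b))
    # If a keyword appears in every article or in none, the correlation is undefined.
    return (n * n_ab - n_a * n_b) / denom if denom else 0.0

# Toy keyword lists for four Italian articles (invented for illustration).
italy = [
    ["conte", "pil"],
    ["conte", "pil", "mes"],
    ["mes", "ue"],
    ["ue", "coronavirus"],
]

counts = cooccurrence(italy, {"conte", "pil", "mes", "ue"})
print(counts[("conte", "pil")], phi(italy, "conte", "pil"))
```

In a network drawn from these values, an edge would connect two keywords whenever their co-occurrence count is non-zero, and the edge weight could be the correlation.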
Sentiment and Subjectivity analysis: the methodology
Having looked at the correlation between terms, we may now focus on the analysis of the sentiment. The articles are evaluated on two different metrics: Valence and Subjectivity. The former measures the intensity of the sentiment, regardless of its direction (i.e. whether it is positive or negative), while the latter measures the extent to which the statements in the article are influenced by the author’s personal viewpoint.
Sentiment analysis of articles is not that common in the literature, as news items are usually classified on other parameters (e.g. categories such as politics, sport and lifestyle) rather than on Subjectivity and Sentiment. In addition, the length of the texts, combined with the potential complexity of their structure, makes many classification models, such as content-based and word-based ones, unfeasible. Given the limited availability of labelled training sets for article sentiment, I opted for a Transfer Learning methodology, exploiting the power of Google’s pre-trained BERT model and fine-tuning it separately on the Subjectivity and Sentiment News Articles Corpora from the companion papers of Aker et al. (2019). Simply put, Transfer Learning consists in taking an existing model that has been trained on massive amounts of data and tailoring it to the specific needs of a small dataset. Since the datasets on which the model was fine-tuned were in English, the articles and their summaries were translated into English. While this operation may entail some loss in translation, it means that all articles are classified by the same model, which makes them comparable. The classification was performed on both the summaries and the complete texts. The summaries, however, are not as informative as the full texts and yield attenuated results: the estimates obtained from the summaries were all much closer to zero, potentially ignoring much of the useful information provided by the full text. I therefore proceeded by analysing only the full texts.
Sentiment and Subjectivity analysis: the results
While these measures are difficult to interpret on their own, they are useful for making comparisons among different countries or sources. In the graph below, the density functions of both the Subjectivity and the sentiment Valence of the articles regarding the EU's Recovery Plan are plotted for Germany, France, Spain and Italy. It appears that the majority of French articles present low levels of both Subjectivity and sentiment. Spain and Italy follow similar trends, characterized by a majority of articles with moderate Valence and a particularly dispersed Subjectivity distribution. The majority of German articles instead present a modest level of Subjectivity, but their Valence is fairly dispersed and long-tailed towards the highest values. Similar results emerge from the means of these metrics: Italy and Spain present the highest values of Subjectivity on average, while France and Germany are much lower in this sense. While Italy and Spain also present a fairly high average Valence compared with France, Germany is clearly far above the other countries. The results for Italy and Spain are not surprising: these countries' health systems have suffered a lot from the dramatic surges in the number of coronavirus cases, and their economies were hit by this shock much harder than those of Germany and France. Thus, it is natural to think that Spanish and Italian journalists are more invested in the matter. On the other hand, while the fact that German newspapers are among the most objective sources in this analysis would confirm the stereotype, their spike in sentiment Valence is somewhat surprising. To further investigate this number, I looked at the subset of German articles concerning Coronabonds and the frugal four. Notably, the average Valence of these articles rockets to 79%.
It appears that much of the heavy sentiment in German articles comes from these topics. This may be explained by the fact that Germany was previously considered one of the frugal Member States, losing this title in early May, when Chancellor Merkel and President Macron proposed to allocate an amount as substantial as €500 billion to the Recovery Fund.
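The per-country averages and the drill-down on a topical subset described above boil down to a filtered mean. The sketch below shows the idea on invented scores; the record layout (`country`, `keywords`, `subjectivity`, `valence`) and the helper name `mean_scores` are assumptions made for the example, not the structure of the real dataset.

```python
from statistics import mean

def mean_scores(articles, predicate=lambda a: True):
    """Average Subjectivity and Valence of the articles matching predicate."""
    selected = [a for a in articles if predicate(a)]
    return {
        "subjectivity": mean(a["subjectivity"] for a in selected),
        "valence": mean(a["valence"] for a in selected),
    }

# Toy scored articles (invented values, not results from the analysis).
articles = [
    {"country": "DE", "keywords": {"coronabonds", "eu"}, "subjectivity": 0.25, "valence": 0.75},
    {"country": "DE", "keywords": {"eu", "haushalt"}, "subjectivity": 0.25, "valence": 0.25},
    {"country": "IT", "keywords": {"mes", "pil"}, "subjectivity": 0.75, "valence": 0.5},
]

# Country-level averages, then the subset of German articles on Coronabonds.
german = mean_scores(articles, lambda a: a["country"] == "DE")
german_bonds = mean_scores(
    articles, lambda a: a["country"] == "DE" and "coronabonds" in a["keywords"]
)
print(german["valence"], german_bonds["valence"])
```

Comparing the subset mean with the country mean is what reveals whether a few topics drive the overall figure.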
We can then study how the opinion of the press changed with the official release of the proposal on the 26th of May by comparing the average levels of Subjectivity and Sentiment Valence before and after the release date. Both metrics increased in all the countries after the release; the joint increase in Valence and Subjectivity is consistent with a mild positive correlation between the two metrics, equal to almost 22%. While there is no marked difference in the Valence increase across the different states, a clear variation may be noticed in the change of Subjectivity values. In fact, the increase in Subjectivity in Italian and Spanish sources is much more substantial than that in German and French ones. This, once again, may be explained by how invested Italian and Spanish news writers are in the matter, given the critical condition in which their countries were and still are.
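Both computations in this paragraph, the change in a metric's average across the release date and the correlation between the two metrics, can be sketched as follows. The example assumes the correlation is a Pearson coefficient and that articles published on the release day itself count as "after"; neither detail is stated in the text. The scores, the year and the function names are invented for illustration.

```python
from datetime import date
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def change_after(articles, metric, cutoff):
    """Difference between the mean of metric from cutoff onwards and before it."""
    before = [a[metric] for a in articles if a["date"] < cutoff]
    after = [a[metric] for a in articles if a["date"] >= cutoff]
    return mean(after) - mean(before)

# Toy scored articles (invented values, not results from the analysis).
articles = [
    {"date": date(2020, 4, 10), "subjectivity": 0.25, "valence": 0.25},
    {"date": date(2020, 5, 20), "subjectivity": 0.25, "valence": 0.5},
    {"date": date(2020, 5, 26), "subjectivity": 0.5, "valence": 0.5},
    {"date": date(2020, 7, 1), "subjectivity": 0.75, "valence": 0.75},
]

release = date(2020, 5, 26)
delta_subjectivity = change_after(articles, "subjectivity", release)
r = pearson([a["subjectivity"] for a in articles], [a["valence"] for a in articles])
print(delta_subjectivity, round(r, 3))
```

Running `change_after` separately on each country's articles gives the per-country increases compared in the text.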
Finally, the scores may be analysed by source. In the graph below, the various outlets are plotted; their position indicates the average Subjectivity and Valence of their articles. It appears that conservative sources produce more subjective pieces of news than the average. In general, however, it is interesting to observe that most of the best-selling outlets of each nation are those with the lowest Subjectivity and Sentiment Valence.
From the analysis we just performed, the following conclusions may be drawn:
From the extracted keywords, it appears that French and German articles focused particularly on the matter at a European level, while Spain and Italy looked more at the national adoption of the funds.
Italy and Spain exhibited the highest levels of Subjectivity, suggesting strongly opinionated sources that may be emotionally invested because of the impact that the adoption of these measures has on their countries.
Germany exhibited suspiciously high levels of Valence; further inspections clarified that articles concerning Coronabonds and the dispute with the frugal four were the main contributors to this inflated figure.
Conservative sources show higher values of Subjectivity than those expressing more moderate views.
As expected, top-sellers tend to publish the most objective and dispassionate articles.