#Infodemic: assessing trustworthiness of COVID-19 news on Twitter

In February, the World Health Organization brought the term infodemic into common use in newspapers and journals. The expression describes the harmful effect of a pervasive spread of information; in this context, it refers to the excessive diffusion of content concerning the coronavirus.

It would be quite a cliché to talk about the high level of connectivity to which we are accustomed in the 21st century, and it is also becoming quite mainstream to talk about the damage caused by the surplus of information.

However, it is important to stress that the rapid circulation of news, combined with the common misconception that opinions and facts are the same, gives rise to the phenomenon of fake news and hoaxes propagating in traditional media formats.

The objective of this article isn’t to discuss this phenomenon per se; rather, it aims to describe with a few numbers the dramatic COVID-19 pandemic in Italy and the parallel infodemic that subsequently formed, allowing fake news to be spread deliberately and to pollute an already overloaded information system.

The dataset

The analysis was carried out on a dataset composed of almost 100,000 Italian tweets containing the hashtag #Coronavirus or #Covid-19, collected between the 6th and the 15th of March 2020. The main events in this time range were the announcement of the Lombardy region lockdown on the 9th of March and the enforcement of the national “I stay at home” decree on the following day. Retweets and tweets containing media (GIF, image, or video) were ignored to avoid redundancy and to ease the analysis. Approximately 10% of the tweets quoted other tweets, and roughly 8% were replies to another tweet.
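The filtering rules described above can be sketched in a few lines of Python. This is only an illustration: the record fields `text`, `is_retweet` and `has_media` are hypothetical names chosen for the example, not the schema of the data actually collected.

```python
# Hypothetical tweet records; the field names are assumptions made for this sketch.
HASHTAGS = ("#coronavirus", "#covid-19")


def keep_tweet(tweet):
    """Return True for original, text-only tweets that mention a target hashtag."""
    text = tweet["text"].lower()  # match hashtags case-insensitively
    has_hashtag = any(tag in text for tag in HASHTAGS)
    return has_hashtag and not tweet["is_retweet"] and not tweet["has_media"]


sample = [
    {"text": "Nuovo decreto #Coronavirus", "is_retweet": False, "has_media": False},
    {"text": "RT: nuovo decreto #Coronavirus", "is_retweet": True, "has_media": False},
    {"text": "Mappa dei contagi #Covid-19", "is_retweet": False, "has_media": True},
    {"text": "Buongiorno a tutti", "is_retweet": False, "has_media": False},
]

# Only the first record passes all three conditions.
filtered = [t for t in sample if keep_tweet(t)]
```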

An assessment of tweet reliability

The first objective of this project was to distinguish between reliable and unreliable tweets. The model used for this classification problem was a simple logistic regression, which achieved a good level of performance, as shown in the table. The tweets were split into reliable and unreliable ones; in this analysis, unreliable refers to tweets containing unverified or false information, clickbait, or conspiracy theories. The model was trained on approximately 1,500 tweets that were manually labelled specifically for this project. The sources used to verify the truthfulness of the news were the website of ANSA (the leading news agency in Italy) and Bufale.net (a well-known Italian fact-checking website).

Out of all the analysed tweets, approximately 11% were considered unreliable. This result, however, was not satisfactory for two reasons. First, the main aim of the analysis was to check the diffusion of fake news among people, but many tweets in this sample were authored by trustworthy newspapers and news agencies (e.g. @LaStampa or @Corriere). Second, the model made its decision on the basis of a probability (i.e. a given tweet is reliable with probability X or unreliable with probability 1-X); hence the more ambiguous tweets were classified as reliable or unreliable on the basis of a slight difference in probabilities (the most extreme example would be a post with a 51% chance of being reliable against a 49% chance of being unreliable). For these reasons, a probability threshold of 75% was set and the main editorial authors were removed. The restricted sample, which is used in the rest of the analysis, consisted of 48,882 tweets. Out of these tweets, 8% were classified as unreliable.
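One possible reading of this restriction step is sketched below in Python. It assumes that tweets whose predicted probability falls short of 75% for both classes are dropped from the sample; the text does not spell out exactly how the threshold was applied, so this is an assumption. The fields `author` and `p_reliable` are hypothetical, and the account set only contains the two examples named above.

```python
THRESHOLD = 0.75
# Only the two accounts mentioned in the text; the full list is not given.
EDITORIAL_ACCOUNTS = {"@LaStampa", "@Corriere"}


def restrict(tweets, threshold=THRESHOLD):
    """Keep confident predictions from non-editorial authors and attach a label."""
    kept = []
    for tweet in tweets:
        p = tweet["p_reliable"]  # model probability that the tweet is reliable
        if tweet["author"] in EDITORIAL_ACCOUNTS:
            continue  # drop newspapers and news agencies
        if max(p, 1 - p) < threshold:
            continue  # drop ambiguous predictions (assumed interpretation)
        label = "reliable" if p >= threshold else "unreliable"
        kept.append({**tweet, "label": label})
    return kept


scored = [
    {"author": "@user_a", "p_reliable": 0.51},    # ambiguous, dropped
    {"author": "@LaStampa", "p_reliable": 0.90},  # editorial account, dropped
    {"author": "@user_b", "p_reliable": 0.80},    # kept as reliable
    {"author": "@user_c", "p_reliable": 0.10},    # kept as unreliable
]
restricted = restrict(scored)
```

Under this reading, the 51% against 49% case from the text is excluded rather than forced into either class.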

Although these figures seem reassuring, the fact that retweets were filtered out of the collected data should not be ignored. Fake news propagates more than reliable news: the average number of retweets of unreliable tweets was almost five times higher than that of reliable ones (10.74 against 2.23). Multiplying the number of tweets in each class by its average number of retweets plus one (to count the original tweet itself) gives the total number of reliable and unreliable tweets, retweets included. These new results are not as encouraging as the previous ones: approximately 25% of the Italian tweets concerning the coronavirus posted between the 6th and the 15th of March were unreliable.
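The retweet adjustment can be reproduced as a short calculation. The sketch below uses only the rounded figures quoted in the text (48,882 tweets, 8% unreliable, 10.74 and 2.23 average retweets), so it lands close to, but not exactly on, the reported share; the variable names are chosen for illustration.

```python
sample_size = 48_882
unreliable_share = 0.08          # rounded share quoted in the text
avg_retweets_unreliable = 10.74
avg_retweets_reliable = 2.23

unreliable = sample_size * unreliable_share
reliable = sample_size - unreliable

# Each original tweet counts once, plus its average number of retweets.
unreliable_total = unreliable * (avg_retweets_unreliable + 1)
reliable_total = reliable * (avg_retweets_reliable + 1)

share = unreliable_total / (unreliable_total + reliable_total)
print(f"{share:.1%}")  # 24.0%
```

With these rounded inputs the share comes out at about 24%, in line with the approximately 25% reported above.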

A further division was made among the unreliable tweets. At this stage, they were classified into four categories depending on their content: misinformation (M), clickbait (Cl), conspiracy (Co) and other (O). The model used in this case was a Random Forest, which obtained a fair overall accuracy.

The graph below represents the distribution of unreliable tweets across the above-mentioned classes. It appears that roughly half of the tweets fell into the misinformation category: they are either misinformative (information that is unintentionally incorrect or misleading) or disinformative (false information that is spread deliberately). The most important topics discussed in the tweets of this category were alternative methods to fight and beat the virus (e.g. homeopathy, vitamin C, garlic), the sudden discovery of a miraculous cure, channels of infection that have been disproved by experts, and fake news concerning celebrities and football players.