Machine Learning for Classification of COVID-19 Vaccine Misinformation on Twitter
Alana Foreman is affiliated with the Health and Language Processing center of the University of Pennsylvania. She is currently a senior at Byram Hills High School conducting informatics research under the mentorship of Dr. Graciela Gonzalez-Hernandez. She will be attending Binghamton University in the fall to study Computer Science.
Since the outbreak of COVID-19, misinformation and conspiracy theories about the pandemic and vaccines have harmed physical and mental health, increased stigmatization, and weakened observance of public health measures, reducing their effectiveness and endangering our ability to end the pandemic. Misinformation spreads rapidly on social media, including Twitter, which has over 330 million users. Posts about vaccination are among the most active vectors for misinformation spread by both humans and bots. This study therefore developed a novel, computationally efficient, and scalable pipeline that classifies COVID-19 vaccine misinformation using Tweet- and user-level metadata, extracts significant features of misinformation, investigates the distribution of misinformation between bots and humans, and identifies hidden topics within bot- and human-authored vaccine Tweets. An optimized Random Forest model achieved the highest accuracy, 87%, indicating that the algorithm is well suited to the data. Feature selection results show that propagation-based and emotion features are significant predictors of Tweet veracity. Of the collected Tweets, 56% originated from bots; 41% of these contained misinformation, compared with 32% of human-authored Tweets. Notably, the most frequent topic among bot-authored Tweets contained misinformation keywords, suggesting that bots amplify the spread of misinformation. A prototype was then designed to apply these findings at scale on Twitter. These insights have the potential to mitigate the spread of COVID-19 vaccine misinformation on social media, which is key to building a safer Web and improving the effectiveness of public health measures.
Keywords: Social media, natural language processing, public health, misinformation, Twitter, bots, COVID-19, machine learning
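The bot and human proportions reported in the abstract can be combined with a short calculation. The sketch below is illustrative only and rests on two assumptions that the abstract does not state outright: that 41% and 32% are misinformation rates within bot-authored and human-authored Tweets respectively, and that every Tweet not attributed to a bot (44%) is human-authored. The function name `misinformation_breakdown` is hypothetical and is not part of the study's pipeline.

```python
def misinformation_breakdown(bot_share, bot_misinfo_rate, human_misinfo_rate):
    """Combine per-group misinformation rates into overall figures.

    Assumes each rate is conditional on its group (bot or human) and that
    the two groups together account for all collected Tweets.
    """
    human_share = 1.0 - bot_share
    # Fraction of all Tweets that are bot-authored AND misinformation.
    bot_misinfo = bot_share * bot_misinfo_rate
    # Fraction of all Tweets that are human-authored AND misinformation.
    human_misinfo = human_share * human_misinfo_rate
    overall = bot_misinfo + human_misinfo
    return {
        "overall_misinfo_rate": overall,
        "bot_fraction_of_misinfo": bot_misinfo / overall,
    }


# Figures taken from the abstract: 56% bots, 41% and 32% misinformation rates.
result = misinformation_breakdown(0.56, 0.41, 0.32)
print(f"Overall misinformation rate: {result['overall_misinfo_rate']:.1%}")
print(f"Share of misinformation from bots: {result['bot_fraction_of_misinfo']:.1%}")
```

Under these assumptions, roughly 37% of the collected Tweets would contain misinformation, and about 62% of that misinformation would come from bots, which is consistent with the suggestion that bots amplify its spread.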