Toward Using Twitter Data for Tracking COVID-19: A Natural Language Processing Pipeline
Ari Z. Klein, PhD, is a Research Associate in the Health Language Processing Center in the Division of Informatics.
In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and delays in test results have made it difficult to actively monitor the spread of the disease based on testing alone. We developed an automatic natural language processing pipeline that uses Twitter data to identify potential cases of COVID-19 that are not based on testing and, thus, may not have been reported to the CDC. Beginning January 23, 2020, we collected English tweets from the Twitter Streaming API that mention keywords related to COVID-19. We applied handwritten regular expressions to identify tweets indicating that the user has potentially been exposed to COVID-19. We automatically filtered out “reported speech” (e.g., quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8,976 tweets that are geo-tagged or have profile location metadata, distinguishing those that self-report potential cases from those that do not. A supervised deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision = 0.76, recall = 0.76) for detecting tweets that self-report potential cases. We deployed the pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020, identifying 13,714 tweets that self-report potential cases and have United States state-level geolocations. This publicly available data set gives future work the opportunity to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
Keywords: natural language processing; social media; data mining; COVID-19; coronavirus; pandemics; epidemiology
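The rule-based front end of the pipeline described in the abstract (regular-expression matching followed by removal of reported speech) can be illustrated with a minimal Python sketch. The patterns, the reported-speech heuristic, and the function names below are hypothetical stand-ins chosen for illustration; the abstract does not give the authors' actual regular expressions or filtering rules, and the BERT-based classifier is not shown. The sketch also checks that the reported F1-score is the harmonic mean of the reported precision and recall.

```python
import re

# Hypothetical exposure patterns (assumptions for illustration only;
# the authors' handwritten regular expressions are not given in the abstract).
EXPOSURE_PATTERNS = [
    re.compile(
        r"\bi\s+(?:think\s+i\s+|might\s+|may\s+|probably\s+)have\s+"
        r"(?:the\s+)?(?:covid(?:-?19)?|coronavirus)\b",
        re.IGNORECASE,
    ),
    re.compile(r"\bmy\s+(?:\w+\s+)?tested\s+positive\b", re.IGNORECASE),
    re.compile(r"\bi\s+was\s+exposed\s+to\b", re.IGNORECASE),
]

# Hypothetical "reported speech" heuristic: retweets, tweets that open with
# a quotation mark, or tweets containing a long quoted span.
REPORTED_SPEECH = re.compile(r'^\s*(?:RT\s+@|["“])|["“][^"”]{20,}["”]')


def is_candidate(text: str) -> bool:
    """Return True if the tweet matches an exposure pattern and does not
    look like reported speech. Candidates would then go to a classifier."""
    if not any(p.search(text) for p in EXPOSURE_PATTERNS):
        return False
    return REPORTED_SPEECH.search(text) is None


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    tweets = [
        "I think I have covid, lost my sense of smell",
        'RT @news: "I think I have coronavirus," says man',
        "Stay home and wash your hands #covid19",
    ]
    for t in tweets:
        print(is_candidate(t), "|", t)
    # Precision and recall as reported in the abstract.
    print(round(f1_score(0.76, 0.76), 2))
```

In a setup like this, the regular expressions favor recall and the reported-speech filter removes obvious non-self-reports cheaply, leaving the harder distinction between self-reports and other mentions to the supervised classifier.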