Ari Klein

Toward Using Twitter Data for Tracking COVID-19: A Natural Language Processing Pipeline



Ari Klein, Informatics

Ari Z. Klein, PhD, is a Research Associate in the Health Language Processing Center in the Division of Informatics.


A Klein¹, A Magge¹, K O'Connor¹, I Flores¹, D Weissenbacher¹, G Gonzalez-Hernandez¹

  1. Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania


In the United States, the rapidly evolving COVID-19 outbreak, the shortage of available testing, and the delay of test results have presented challenges for actively monitoring the spread of the disease based on testing alone. We developed an automatic natural language processing pipeline that uses Twitter data to identify potential cases of COVID-19 that are not based on testing and, thus, may not have been reported to the CDC. Beginning January 23, 2020, we collected English tweets from the Twitter Streaming API that mention keywords related to COVID-19. We applied hand-written regular expressions to identify tweets indicating that the user potentially has been exposed to COVID-19. We automatically filtered out “reported speech” (e.g., quotations, news headlines) from the tweets that matched the regular expressions, and two annotators annotated a random sample of 8,976 tweets that are geo-tagged or have profile location metadata, distinguishing those that self-report potential cases from those that do not. A supervised deep neural network classifier, based on a BERT model that was pretrained on tweets related to COVID-19, achieved an F1-score of 0.76 (precision=0.76, recall=0.76) for detecting tweets that self-report potential cases. We deployed the pipeline on more than 85 million unlabeled tweets that were continuously collected between March 1 and August 21, 2020, identifying 13,714 tweets that self-report potential cases and have United States state-level geolocations. This publicly available data set presents the opportunity for future work to assess the utility of Twitter data as a complementary resource for tracking the spread of COVID-19.
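
The rule-based stages described in the abstract (regular-expression matching followed by removal of reported speech) can be illustrated with a minimal sketch. The abstract does not give the actual expressions or filtering rules, so every pattern and function name below is a hypothetical stand-in, not the authors' implementation. The sketch also includes the standard F1 formula (the harmonic mean of precision and recall), which shows why equal precision and recall of 0.76 yield an F1-score of 0.76.

```python
import re

# Hypothetical exposure patterns; the hand-written expressions used in the
# actual pipeline are not listed in the abstract.
EXPOSURE_PATTERNS = [
    re.compile(
        r"\bi\s+(?:think|believe|might|may)\b.*\b(?:have|had|caught|got)\b.*\b(?:covid|corona)",
        re.IGNORECASE,
    ),
    re.compile(
        r"\bi\s+(?:was|have\s+been|got)\s+exposed\s+to\b.*\b(?:covid|corona)",
        re.IGNORECASE,
    ),
]

# Hypothetical cues for "reported speech" (retweets, long quotations, headlines).
REPORTED_SPEECH_PATTERNS = [
    re.compile(r"^\s*RT\s+@"),                                    # retweet prefix
    re.compile(r'["“][^"”]{20,}["”]'),                            # long quoted span
    re.compile(r"^\s*(?:breaking|update|watch)\s*:", re.IGNORECASE),  # headline style
]


def matches_exposure(text: str) -> bool:
    """Return True if any exposure pattern matches the tweet text."""
    return any(p.search(text) for p in EXPOSURE_PATTERNS)


def is_reported_speech(text: str) -> bool:
    """Return True if the tweet looks like a quotation, retweet, or headline."""
    return any(p.search(text) for p in REPORTED_SPEECH_PATTERNS)


def candidate_tweets(tweets: list[str]) -> list[str]:
    """Keep tweets that match an exposure pattern and are not reported speech."""
    return [t for t in tweets if matches_exposure(t) and not is_reported_speech(t)]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sample = [
        "I think I might have covid, been coughing for days",
        'BREAKING: "I think I have coronavirus," says senator',
        "RT @user: I was exposed to covid at work",
        "Stay home and wash your hands, covid is spreading",
    ]
    for tweet in candidate_tweets(sample):
        print(tweet)
    print(round(f1_score(0.76, 0.76), 2))
```

In the pipeline described above, tweets that survive these two stages would then go to the supervised classifier; that step is omitted here because the abstract does not specify the model configuration.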


natural language processing; social media; data mining; COVID-19; coronavirus; pandemics; epidemiology

