ML-Assisted Events Data Coding

I am working on a number of projects to help teams of researchers incorporate machine-learning models into their events data coding. I have had the most success using GPT models to help locate relevant news articles and to extract relevant details from those relevant news articles.

I have also produced a number of Shiny applications that help speed up human coding or verification of ML-models’ outputs and reduce human error.

Working papers

Coding with the machines: machine-assisted coding of rare event data, with Johanna Birnir, Ernesto Calvo, Keith Chew, Jordan Dewar, Roman Hlatky, Amy Liu, Henry Overos, and Ojashwi Pathak.

While machine coding of data has dramatically advanced in recent years, we do not know the relative performance of supervised and semi-supervised algorithms when coding political data – especially when compared to human coding. In this note, we juxtapose (1) a semi-supervised dictionary-based classifier with (2) an out-of-the-box GPT 4 model and (3) a fine-tuned BERT transformer – and compare all three to human coding of rare events data. We find that the transformer models outperform the bag-of-words approach, especially when fine tuned. None of the models beat human coding – but only in well-known contexts. Our results, however, suggest that the transformers do better than humans in less familiar contexts and that with additional training they can rapidly close the remaining gap to humans. We conclude by discussing the benefits and drawbacks of machine coding.

Presentations

UMD Methods Workshop (Fall 2023)

Do you research political events, including protests, mediation, voting, or sanctions? Our traditional method for collecting these events data involves reading thousands of news articles to extract the relevant information and put it into a dataset that we can use for our analysis. This process is very time-consuming and can be prone to error. Happily, advanced large language models, including GPT, can assist us. These models are easy to use in R and can be seamlessly integrated into your data collection process. I will step through how to use out-of-the-box GPT models in R to extract useful information from news articles. You can use these models to extract information from thousands of news articles in mere minutes.