Greetings,
A number of us will meet next Thursday 6/26 from 10:00-11:00 Central / 8:00-9:00 Pacific to discuss emerging statistical methods for correctly combining expensive, accurate data with cheap, less accurate data in statistical estimates. For example, an algorithmic classifier or large language model might make predictions about "content" such as text or images. "Validation data" produced by human annotators is often used to quantify the accuracy of these predictions; however, unless the predictions are perfectly accurate, measuring their accuracy alone does not ensure that prediction errors won't invalidate statistical conclusions. These methods use both forms of data to produce more precise estimates that remain consistent with the validation data.
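To make the combination concrete, here is a minimal sketch in base R of the simplest version of the idea, a "prediction-powered" estimate of a proportion. The simulated data and error rates are invented for illustration; this is the generic textbook estimator, not any particular paper's or package's implementation.

set.seed(42)

# Small, expensive human-coded validation sample; large, cheap model-coded sample
n_val <- 200
n_big <- 10000

# Simulate ground truth (true prevalence 0.3) and a classifier that is right 85% of the time
y_val    <- rbinom(n_val, 1, 0.3)
pred_val <- ifelse(runif(n_val) < 0.85, y_val, 1 - y_val)
y_big    <- rbinom(n_big, 1, 0.3)
pred_big <- ifelse(runif(n_big) < 0.85, y_big, 1 - y_big)

# Naive estimate: just average the predictions (biased toward 0.5 here)
naive <- mean(pred_big)

# Prediction-powered estimate: predictions on the big sample plus a bias
# correction ("rectifier") estimated from the validation sample
theta <- mean(pred_big) + mean(y_val - pred_val)

# The standard error combines uncertainty from both samples
se <- sqrt(var(pred_big) / n_big + var(y_val - pred_val) / n_val)
ci <- theta + c(-1, 1) * qnorm(0.975) * se

round(c(naive = naive, corrected = theta, lower = ci[1], upper = ci[2]), 3)

The corrected estimate recovers the true prevalence with an honest confidence interval, while the naive average of the predictions does not.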
I think these methods open up powerful new measurement strategies and study designs. I hope you will join our discussion :)
I think this discussion will partly be an orientation to this methodological literature and partly an occasion to brainstorm how we might use (or already are using) these techniques in our studies.
Here are some links to relevant articles. I don't expect you to read them all deeply before our meeting; most are very technical. Even a skim can help orient you to this literature and prepare you for the discussion.
https://doi.org/10.1080/19312458.2023.2293713
This is my paper, published back in 2023. I showed how to use an error modeling framework to correct the bias that classifier errors introduce into downstream estimates. If I say so myself, I think this is a pretty clear and easy-to-follow explanation of the problem. However, I think the solution I proposed isn't a great fit for "black box" models such as LLMs.
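For a taste of what an error-modeling correction looks like, here is the classic Rogan-Gladen prevalence correction in R. It is a much simpler relative of what the paper develops, not the paper's actual method, and the numbers are invented:

# Error rates estimated from human-coded validation data
sensitivity <- 0.80   # P(classifier says positive | truly positive)
specificity <- 0.90   # P(classifier says negative | truly negative)

observed <- 0.25      # share of items the classifier labels positive

# Invert the error model:
# observed = sensitivity * true + (1 - specificity) * (1 - true)
corrected <- (observed + specificity - 1) / (sensitivity + specificity - 1)
corrected  # ~0.214: the debiased prevalence estimate

Note that this only works if you trust the estimated sensitivity and specificity to generalize, which is exactly the kind of assumption about classifier performance that the next paper avoids.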
https://arxiv.org/abs/2501.18577
This is the latest in the line of "Prediction-Powered Inference" (PPI) papers. It's extremely technical, but I think it's the most generally applicable method currently available. Unlike my paper, this approach does not require any difficult assumptions about classifier performance. I tried out the R implementation just yesterday and it is fairly usable. Here's a tutorial: https://dankluger.github.io/PTDBootTutorial/Tutorial.html
https://naokiegami.com/paper/dsl_ss.pdf
This "designed based supervised learning" approach also claims to be very general and to work with "black box models". It is similar in spirit to PPI, but since it involves creating intermediate predictive models is more complex.
https://journals.sagepub.com/doi/abs/10.1177/00491241251326865
While most of the other papers have in mind something like using a model as an auxiliary coder in a content analysis, this paper suggests something a bit more radical: using language models as a proxy for human study participants.
I'm following up with a more specific plan for our discussion, per our discussion in Matrix. Let's focus on Kluger et al.'s paper (https://arxiv.org/abs/2501.18577) and the accompanying R tutorial for the method (https://dankluger.github.io/PTDBootTutorial/Tutorial.html).
The paper is very technical and written for an audience of statisticians. For the purposes of our discussion, it's totally fine to gloss over the parts you don't understand. I recommend trying to grasp the introduction (sections 1.1-1.3). The assumptions in section 2.1 are useful to understand as well. From there you can reasonably jump down to section 4, which demonstrates the method in empirical examples.
The tutorial has additional examples and will clarify what using the method entails in practice. See you next Thursday, June 26, at 8:00 PT / 10:00 CT / 11:00 ET!
--
Nate
collective-ut@communitydata.science