Greetings CDSC!
Priyanka Nanayakkara (TSB alum, current postdoc at Harvard, cc'd here)
and some colleagues are conducting a study that you might be eligible to
participate in regarding the use of privacy-noised data.
Please see below and contact Priyanka with any questions.
all the best,
Aaron
---- BEGIN FORWARDED MESSAGE ----
We are researchers at Columbia University, Cornell University,
Georgetown University, Harvard University, and Northeastern University.
We are recruiting participants for a remote study to understand how data
users engage with privacy-noised data. We are looking for participants who:
1. Are at least 18 years old
2. Have experience with quantitative data analysis
3. Are familiar with using Python and Jupyter notebooks for data analysis
In this study, you will conduct data analysis tasks, answer interview
questions, and share your perceptions about Wikimedia data that includes
privacy protections.
The session will take approximately 1 hour via a Zoom meeting, and it
will be recorded. Upon completion of the interview, you will receive a
$50 Amazon gift card as a thank-you for your time.
If you are interested, please fill out this online eligibility survey:
https://neu.co1.qualtrics.com/jfe/form/SV_b3dVIHUf3ZKxaES
We will reach out via email if you are selected to participate, typically
within 2 weeks.
Thanks so much!
Priyanka Nanayakkara, on behalf of the study team
Greetings,
A number of us will meet next Thursday 10/26 from 10:00-11:00
Central / 8:00-9:00 Pacific to discuss emerging statistical
methods for correctly combining expensive, precise data
with cheap, inaccurate data in statistical estimates. For
example, an algorithmic classifier or large language model might
make predictions about "content" such as text or images.
"Validation data" from human annotators is often used to quantify
the accuracy of these predictions; however, unless the predictions
are perfectly accurate, quantifying their accuracy alone does not
guarantee that prediction errors won't invalidate statistical
conclusions. These methods use both forms of data to create more
precise estimates that remain consistent with the validation data.
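To give a feel for the basic idea, here is a minimal Python sketch of prediction-powered estimation of a mean (the simplest estimator in the PPI line of work). The data, classifier accuracy, and sample sizes here are all simulated and purely illustrative, not taken from any of the papers below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a large corpus scored by a cheap classifier,
# plus a small subsample that human annotators have validated.
N, n = 10_000, 200
preds_unlabeled = rng.binomial(1, 0.35, size=N)   # model predictions on the full corpus
labels_validated = rng.binomial(1, 0.30, size=n)  # human labels on the validation subsample
preds_validated = np.where(rng.random(n) < 0.9,   # the model's predictions on those same
                           labels_validated,       # items, agreeing with humans ~90% of
                           1 - labels_validated)   # the time in this simulation

# Prediction-powered estimate of prevalence (the mean of Y):
# the naive model-based mean, debiased by the average prediction
# error measured on the human-validated sample.
rectifier = (labels_validated - preds_validated).mean()
theta_pp = preds_unlabeled.mean() + rectifier

# The classical estimate uses only the small validated sample, so it
# is unbiased but noisier; theta_pp stays consistent with the human
# labels while borrowing precision from the large predicted corpus.
theta_classical = labels_validated.mean()
```

The rectifier term is what keeps the estimate honest: if the classifier systematically over- or under-predicts, the validation sample measures that error and subtracts it out.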
I think these methods open up powerful new measurement strategies
and study designs. I hope you will join our discussion :)
I think this discussion will partly be an orientation to this
methodological literature and partly an occasion to brainstorm how
we might use (or already are using) these techniques in our
studies.
Here are some links to relevant articles. I don't expect you to
read them all deeply before our meeting. Most are very technical.
Looking at these articles can help orient you to this literature
and prepare you for this discussion.
https://doi.org/10.1080/19312458.2023.2293713
This is my paper, published in 2023. I showed how to use an
error-modeling framework to correct bias from classifier errors. If
I say so myself, I think it is a pretty clear and easy-to-follow
explanation of the problem. However, I think the solution I
proposed isn't a great fit for "black box" models such as LLMs.
https://arxiv.org/abs/2501.18577
This is the latest in the line of "Prediction Powered Inference"
(PPI) papers. It's extremely technical, but I think it's the most
generally applicable method currently available. Unlike my paper,
this approach does not require any difficult assumptions about
classifier performance. I tried out the R implementation just
yesterday and it is fairly usable. Here's a tutorial:
https://dankluger.github.io/PTDBootTutorial/Tutorial.html

https://naokiegami.com/paper/dsl_ss.pdf
This "design-based supervised learning" (DSL) approach also claims
to be very general and to work with "black box" models. It is
similar in spirit to PPI, but since it involves creating
intermediate predictive models, it is more complex.
https://journals.sagepub.com/doi/abs/10.1177/00491241251326865
Most other studies have in mind something like using a model as an
auxiliary coder in a content analysis. This paper suggests
something a bit more radical: using language models as a proxy for
human study participants.
--
Nathan TeBlunthuis
Assistant Professor
School of Information
University of Texas at Austin
https://teblunthuis.cc
Hey everyone,
At the global meeting yesterday, we discussed the group tasks and teams that have pretty much evaporated over the past school year. We're trying to reorganize the tasks and teams in a way that best serves the group, so please fill out this survey<https://docs.google.com/forms/d/e/1FAIpQLScPPKvJMhWlS9n7efa5yhvjurRDgYwrj3b…> to help us adjust the groups for next year.
We’ll discuss as a full group at the retreat in October.
Let me know if you have any questions!
Best,
Madison Deyo
Program Coordinator
Northwestern University
The Center for Human-Computer Interaction + Design<https://www.hci.northwestern.edu/>
Community Data Science Collective<https://wiki.communitydata.science/Main_Page>