Greetings CDSC!
Priyanka Nanayakkara (TSB alum, current postdoc at Harvard, cc'd here)
and some colleagues are conducting a study that you might be eligible to
participate in regarding the use of privacy-noised data.
Please see below and contact Priyanka with any questions.
all the best,
Aaron
---- BEGIN FORWARDED MESSAGE ----
We are researchers at Columbia University, Cornell University,
Georgetown University, Harvard University, and Northeastern University.
We are recruiting participants for a remote study to understand how data
users engage with privacy-noised data. We are looking for participants who:
1. Are at least 18 years old
2. Have experience with quantitative data analysis
3. Are familiar with using Python and Jupyter notebooks for data analysis
In this study, you will conduct data analysis tasks, answer interview
questions, and share your perceptions about Wikimedia data that includes
privacy protections.
The session will take approximately 1 hour via a Zoom meeting, and it
will be recorded. Upon completion of the interview, you will receive a
$50 Amazon gift card as a thank-you for your time.
If you are interested, please fill out this online eligibility survey:
https://neu.co1.qualtrics.com/jfe/form/SV_b3dVIHUf3ZKxaES
We will reach out via email if you are selected to participate, typically
within 2 weeks.
Thanks so much!
Priyanka Nanayakkara, on behalf of the study team
Greetings,
A number of us will meet next Thursday 10/26 from 10:00-11:00
Central / 8:00-9:00 Pacific to discuss emerging statistical
methods for correctly combining expensive, precise data
with cheap, inaccurate data in statistical estimates. For
example, an algorithmic classifier or large language model might
make predictions about "content" such as text or images.
"Validation data" from human annotators is often used to quantify
the accuracy of these predictions; however, unless the predictions
are perfectly accurate, quantifying their accuracy alone does not
guarantee that prediction errors won't invalidate statistical
conclusions. These methods use both forms of data to create more
precise estimates that remain consistent with the validation data.
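To give a feel for the basic idea, here is a minimal Python sketch of prediction-powered estimation of a mean (the simplest estimator in the PPI line of work). The data, classifier accuracy, and sample sizes here are all simulated and purely illustrative, not taken from any of the papers below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a large corpus scored by a cheap classifier,
# plus a small subsample that human annotators have validated.
N, n = 10_000, 200
preds_unlabeled = rng.binomial(1, 0.35, size=N)   # model predictions on the full corpus
labels_validated = rng.binomial(1, 0.30, size=n)  # human labels on the validation subsample
preds_validated = np.where(rng.random(n) < 0.9,   # the model's predictions on those same
                           labels_validated,       # items, agreeing with humans ~90% of
                           1 - labels_validated)   # the time in this simulation

# Prediction-powered estimate of prevalence (the mean of Y):
# the naive model-based mean, debiased by the average prediction
# error measured on the human-validated sample.
rectifier = (labels_validated - preds_validated).mean()
theta_pp = preds_unlabeled.mean() + rectifier

# The classical estimate uses only the small validated sample, so it
# is unbiased but noisier; theta_pp stays consistent with the human
# labels while borrowing precision from the large predicted corpus.
theta_classical = labels_validated.mean()
```

The rectifier term is what keeps the estimate honest: if the classifier systematically over- or under-predicts, the validation sample measures that error and subtracts it out.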
I think these methods open up powerful new measurement strategies
and study designs. I hope you will join our discussion :)
I think this discussion will partly be an orientation to this
methodological literature and partly an occasion to brainstorm how
we might use (or already are using) these techniques in our
studies.
Here are some links to relevant articles. I don't expect you to
read them all deeply before our meeting. Most are very technical.
Looking at these articles can help orient you to this literature
and prepare you for this discussion.
https://doi.org/10.1080/19312458.2023.2293713
This is my paper, published in 2023. I showed how to use an
error-modeling framework to correct bias from classifier errors. If
I say so myself, I think it is a pretty clear and easy-to-follow
explanation of the problem. However, I think the solution I
proposed isn't a great fit for "black box" models such as LLMs.
https://arxiv.org/abs/2501.18577
This is the latest in the line of "Prediction Powered Inference"
(PPI) papers. It's extremely technical, but I think it's the most
generally applicable method currently available. Unlike my paper,
this approach does not require any difficult assumptions about
classifier performance. I tried out the R implementation just
yesterday and it is fairly usable. Here's a tutorial:
https://dankluger.github.io/PTDBootTutorial/Tutorial.html

https://naokiegami.com/paper/dsl_ss.pdf
This "design-based supervised learning" (DSL) approach also claims
to be very general and to work with "black box" models. It is
similar in spirit to PPI, but since it involves creating
intermediate predictive models, it is more complex.
https://journals.sagepub.com/doi/abs/10.1177/00491241251326865
Most other studies have in mind something like using a model as an
auxiliary coder in a content analysis. This paper suggests
something a bit more radical: using language models as a proxy for
human study participants.
--
Nathan TeBlunthuis
Assistant Professor
School of Information
University of Texas at Austin
https://teblunthuis.cc
Hey everyone,
At the global meeting yesterday, we discussed the group tasks and teams that have pretty much evaporated over the past school year. We're trying to reorganize the tasks and teams in a way that best serves the group, so please fill out this survey<https://docs.google.com/forms/d/e/1FAIpQLScPPKvJMhWlS9n7efa5yhvjurRDgYwrj3b…> to help us adjust the groups for next year.
We’ll discuss as a full group at the retreat in October.
Let me know if you have any questions!
Best,
Madison Deyo
Program Coordinator
Northwestern University
The Center for Human-Computer Interaction + Design<https://www.hci.northwestern.edu/>
Community Data Science Collective<https://wiki.communitydata.science/Main_Page>