More than Good and Bad: Human Assessments of Machine Labeling Quality Have Multiple Dimensions

This project develops a novel measure of human assessments of quality in machine labeling tasks. The paper tests this measure across two studies: one using an unsupervised task (generating labels for topic models) and one using a supervised task (labeling framing in political news coverage). For each label, study participants responded to several items, each asking them to assess the label according to a different criterion.

Exploratory factor analysis of these items reveals a two-factor latent structure in participants’ assessments of label quality that is consistent across both studies. Subsequent analysis demonstrates that this multi-item, two-factor measure can surface nuances that would be missed by either a single-item measure of perceived label quality or established, directly computable performance metrics. The paper concludes by suggesting future directions for the development of human-centered approaches to evaluating NLP and ML systems more broadly.
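
For readers unfamiliar with the method, the sketch below illustrates the general shape of an exploratory factor analysis like the one described above. It is not the paper’s actual code or data: the item names, simulated ratings, and loading pattern are hypothetical stand-ins, and the third-party `factor_analyzer` package is assumed.

```python
# A minimal EFA sketch, assuming the third-party factor_analyzer package.
# All item names and ratings below are hypothetical, not the paper's data.
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer

rng = np.random.default_rng(0)
n = 300  # hypothetical number of label assessments

# Simulate two latent quality dimensions, then map them onto six
# hypothetical rating items with some noise added.
latent = rng.normal(size=(n, 2))
loadings_true = np.array([
    [0.8, 0.1],  # items 1-3 assumed to load on factor 1
    [0.7, 0.0],
    [0.9, 0.2],
    [0.1, 0.8],  # items 4-6 assumed to load on factor 2
    [0.0, 0.7],
    [0.2, 0.9],
])
items = latent @ loadings_true.T + rng.normal(scale=0.5, size=(n, 6))
df = pd.DataFrame(items, columns=[f"item_{i + 1}" for i in range(6)])

# Fit a two-factor EFA with an oblique rotation (factors may correlate).
fa = FactorAnalyzer(n_factors=2, rotation="oblimin")
fa.fit(df)
print(pd.DataFrame(fa.loadings_, index=df.columns,
                   columns=["factor_1", "factor_2"]).round(2))
```

In the studies themselves, the factors would of course be extracted from participants’ actual item responses; the simulated loadings here merely illustrate what a clean two-factor structure looks like in the output.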

This paper will be submitted soon…
