Social Foundations of Computation


2024


Decline Now: A Combinatorial Model for Algorithmic Collective Action

Sigg, D., Hardt, M., Mendler-Dünner, C.

arXiv preprint arXiv:2410.12633, October 2024 (conference) Submitted

[BibTex]


Limits to Scalable Evaluation at the Frontier: LLM as Judge Won’t Beat Twice the Data

Dorner, F. E., Nastl, V. Y., Hardt, M.

arXiv preprint arXiv:2410.13341, October 2024 (conference) Submitted

Abstract
High-quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high-quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods, in principle, can go. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half. Our result speaks to the severe limitations of the LLM-as-a-judge paradigm at the evaluation frontier where the goal is to assess newly released models that are possibly better than the judge. Through an empirical evaluation, we demonstrate that the sample size savings achievable in practice are even more modest than what our theoretical limit suggests. Along the way, our work provides new observations about debiasing methods for model evaluation and points out promising avenues for future work.

[BibTex]
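
The kind of debiasing at stake here can be illustrated with a small synthetic sketch. The example below is not the paper's estimator; it applies a simple prediction-powered-style correction, combining many judge verdicts with a small set of ground-truth labels, and every quantity (model accuracy, judge error rate, sample sizes) is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cheap, n_gold = 10_000, 200      # judge-labeled vs. ground-truth-labeled examples
true_acc = 0.80                    # latent accuracy of the evaluated model
judge_err = 0.15                   # judge disagrees with the ground truth this often

# Simulate whether the evaluated model answers each example correctly.
correct = rng.random(n_cheap + n_gold) < true_acc
# Judge verdicts: a noisy, systematically imperfect proxy for correctness.
judge = np.where(rng.random(n_cheap + n_gold) < judge_err, ~correct, correct)

cheap_judge = judge[:n_cheap]
gold_judge, gold_truth = judge[n_cheap:], correct[n_cheap:]

naive = cheap_judge.mean()                                    # biased judge-only estimate
correction = (gold_truth.astype(float) - gold_judge.astype(float)).mean()
debiased = naive + correction                                 # corrected estimate

print(f"judge-only estimate: {naive:.3f}")
print(f"debiased estimate:   {debiased:.3f}  (true accuracy {true_acc})")
```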


Training on the Test Task Confounds Evaluation and Emergence

Dominguez-Olmedo, R., Dorner, F. E., Hardt, M.

arXiv preprint arXiv:2407.07890, October 2024 (conference) In revision

Abstract
We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior largely vanish once we adjust for training on the test task. This also applies to reported instances of emergent behavior that cannot be explained by the choice of evaluation metric. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.

ArXiv [BibTex]
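
The adjustment proposed above (fine-tuning every model under comparison on the same task-relevant data before evaluating) can be sketched with toy models standing in for the language models; the data, architectures, and training budget below are illustrative assumptions, not the paper's setup.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy "task-relevant" data standing in for benchmark-style examples.
X = torch.randn(512, 16)
y = (X[:, 0] > 0).long()
X_eval = torch.randn(256, 16)
y_eval = (X_eval[:, 0] > 0).long()

def finetune(model, steps=200, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return model

def accuracy(model):
    with torch.no_grad():
        return (model(X_eval).argmax(dim=1) == y_eval).float().mean().item()

# Two "model families" that may differ in prior exposure to the task.
models = {
    "family_A": nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2)),
    "family_B": nn.Linear(16, 2),
}

for name, model in models.items():
    before = accuracy(model)
    after = accuracy(finetune(model))   # same data and budget for every model
    print(f"{name}: accuracy {before:.2f} -> {after:.2f} after task fine-tuning")
```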


Predictors from Causal Features Do Not Generalize Better to New Domains

Nastl, V. Y., Hardt, M.

arXiv preprint arXiv:2402.09891, September 2024 (conference) Accepted

Abstract
We study how well machine learning models trained on causal features generalize across domains. We consider 16 prediction tasks on tabular datasets covering applications in health, employment, education, social benefits, and politics. Each dataset comes with multiple domains, allowing us to test how well a model trained in one domain performs in another. For each prediction task, we select features that have a causal influence on the target of prediction. Our goal is to test the hypothesis that models trained on causal features generalize better across domains. Without exception, we find that predictors using all available features, regardless of causality, have better in-domain and out-of-domain accuracy than predictors using causal features. Moreover, even the absolute drop in accuracy from one domain to the other is no better for causal predictors than for models that use all features. If the goal is to generalize to new domains, practitioners might as well train the best possible model on all available features.

ArXiv link (url) [BibTex]
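
The comparison protocol can be sketched on synthetic data: train one predictor on a hand-picked causal feature subset and one on all features, then score both in and out of domain. The data-generating process and the domain shift below are illustrative assumptions; which feature set wins depends entirely on the simulated shift, whereas the finding above concerns 16 real-world tasks.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_domain(n, shift):
    causal = rng.normal(size=(n, 2))            # features that cause the label
    label = (causal.sum(axis=1) + rng.normal(scale=0.5, size=n) > 0).astype(int)
    # A non-causal feature whose association with the label changes across domains.
    spurious = label[:, None] * shift + rng.normal(size=(n, 1))
    return np.hstack([causal, spurious]), label

X_src, y_src = make_domain(5_000, shift=2.0)    # training domain
X_tgt, y_tgt = make_domain(5_000, shift=0.5)    # deployment domain, weaker association

for name, cols in [("causal features", slice(0, 2)), ("all features", slice(None))]:
    clf = LogisticRegression().fit(X_src[:, cols], y_src)
    print(f"{name:15s}  in-domain {clf.score(X_src[:, cols], y_src):.2f}"
          f"  out-of-domain {clf.score(X_tgt[:, cols], y_tgt):.2f}")
```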


Evaluating Language Models as Risk Scores

Cruz, A. F., Hardt, M., Mendler-Dünner, C.

arXiv preprint arXiv:2407.14614, September 2024 (conference) Accepted

Abstract
Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks. Conditioned on a question and answer-key, does the most likely token match the ground truth? Such benchmarks necessarily fail to evaluate language models' ability to quantify outcome uncertainty. In this work, we focus on the use of language models as risk scores for unrealizable prediction tasks. We introduce folktexts, a software package to systematically generate risk scores using large language models, and evaluate them against benchmark prediction tasks. Specifically, the package derives natural language tasks from US Census data products, inspired by popular tabular data benchmarks. A flexible API allows for any task to be constructed out of 28 census features whose values are mapped to prompt-completion pairs. We demonstrate the utility of folktexts through a sweep of empirical insights on 16 recent large language models, inspecting risk scores, calibration curves, and diverse evaluation metrics. We find that zero-shot risk scores have high predictive signal while being widely miscalibrated: base models overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and generate over-confident risk scores.

ArXiv [BibTex]
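
A package-agnostic sketch of the evaluation idea: treat the model's probability of the positive answer as a risk score and inspect discrimination and calibration. The scores below are synthetic stand-ins for LLM-derived probabilities; refer to the folktexts package itself for the actual tasks and interface.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
n = 5_000
p_true = rng.beta(2, 5, size=n)                # latent outcome probabilities
y = rng.binomial(1, p_true)                    # realized binary outcomes

# Synthetic "over-confident" risk scores, pushed toward 0 or 1.
scores = np.clip(p_true + 0.35 * np.sign(p_true - p_true.mean()), 0.01, 0.99)

frac_pos, mean_score = calibration_curve(y, scores, n_bins=10)
print("AUC:", round(roc_auc_score(y, scores), 3))            # predictive signal
print("Brier score:", round(brier_score_loss(y, scores), 3)) # accuracy + calibration
for m, f in zip(mean_score, frac_pos):
    print(f"mean score {m:.2f} -> empirical outcome rate {f:.2f}")
```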


Questioning the Survey Responses of Large Language Models

Dominguez-Olmedo, R., Hardt, M., Mendler-Dünner, C.

arXiv preprint arXiv:2306.07951, September 2024 (conference)

Abstract
As large language models increase in capability, researchers have started to conduct surveys of all kinds on these models in order to investigate the population represented by their responses. In this work, we critically examine language models' survey responses on the basis of the well-established American Community Survey by the U.S. Census Bureau and investigate whether they elicit a faithful representation of any human population. Using a de-facto standard multiple-choice prompting technique and evaluating 39 different language models using systematic experiments, we establish two dominant patterns: First, models' responses are governed by ordering and labeling biases, leading to variations across models that do not persist after adjusting for systematic biases. Second, models' responses do not contain the entropy variations and statistical signals typically found in human populations. As a result, a binary classifier can almost perfectly differentiate model-generated data from the responses of the U.S. census. At the same time, models' relative alignment with different demographic subgroups can be predicted from the subgroups' entropy, irrespective of the model's training data or training strategy. Taken together, our findings suggest caution in treating models' survey responses as equivalent to those of human populations.

ArXiv link (url) [BibTex]
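
One diagnostic mentioned above, the entropy of answer distributions, is easy to sketch. The distributions below are synthetic and only illustrate the contrast between a spread-out human answer distribution and a near-deterministic model response.

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)
n_choices = 5                                 # answer options of one ACS-style question

human = rng.dirichlet(np.ones(n_choices))     # a spread-out human answer distribution
model = np.full(n_choices, 0.02)              # a near-deterministic model response
model[2] = 1 - 0.02 * (n_choices - 1)

for name, dist in [("human population", human), ("language model", model)]:
    print(f"{name:16s} entropy {entropy(dist, base=2):.2f} bits")
```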


Algorithmic Collective Action in Recommender Systems: Promoting Songs by Reordering Playlists

Baumann, J., Mendler-Dünner, C.

arXiv preprint arXiv:2404.04269, September 2024 (conference) Accepted

Abstract
We investigate algorithmic collective action in transformer-based recommender systems. Our use case is a collective of fans aiming to promote the visibility of an artist by strategically placing one of their songs in the existing playlists they control. The success of the collective is measured by the increase in test-time recommendations of the targeted song. We introduce two easily implementable strategies towards this goal and test their efficacy on a publicly available recommender system model released by a major music streaming platform. Our findings reveal that even small collectives (controlling less than 0.01% of the training data) can achieve up to 25x amplification of recommendations by strategically choosing the position at which to insert the song. We then focus on investigating the externalities of the strategy. We find that the performance loss for the platform is negligible, and the recommendations of other songs are largely preserved, minimally impairing the user experience of participants. Moreover, the costs are evenly distributed among other artists. Taken together, our findings demonstrate how collective action strategies can be effective while not necessarily being adversarial, raising new questions around incentives, social dynamics, and equilibria in recommender systems.

ArXiv link (url) [BibTex]
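
The data-side intervention is simple to state in code. The sketch below only shows a collective inserting the target song at a chosen position in the playlists it controls before retraining; it includes neither the transformer recommender nor the paper's specific placement strategies, and all playlist contents are synthetic.

```python
import random

random.seed(0)
catalog = [f"song_{i}" for i in range(1_000)]
playlists = [random.sample(catalog, k=20) for _ in range(10_000)]

TARGET = "promoted_song"
collective = random.sample(range(len(playlists)), k=50)   # a ~0.5% slice of the data

def place_target(playlist, target, position=1):
    """Insert the target song at a strategically chosen position, so the model
    learns a strong transition from the preceding song to the target."""
    new = list(playlist)
    new.insert(position, target)
    return new

for idx in collective:
    playlists[idx] = place_target(playlists[idx], TARGET)

print(sum(TARGET in p for p in playlists), "of", len(playlists),
      "playlists now contain the target song")
```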


Allocation Requires Prediction Only if Inequality Is Low

Shirali, A., Abebe*, R., Hardt*, M.

In Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR, July 2024, *equal contribution (inproceedings)

Abstract
Algorithmic predictions are emerging as a promising solution concept for efficiently allocating societal resources. Fueling their use is an underlying assumption that such systems are necessary to identify individuals for interventions. We propose a principled framework for assessing this assumption: Using a simple mathematical model, we evaluate the efficacy of prediction-based allocations in settings where individuals belong to larger units such as hospitals, neighborhoods, or schools. We find that prediction-based allocations outperform baseline methods using aggregate unit-level statistics only when between-unit inequality is low and the intervention budget is high. Our results hold for a wide range of settings for the price of prediction, treatment effect heterogeneity, and unit-level statistics’ learnability. Combined, we highlight the potential limits to improving the efficacy of interventions through prediction.

ArXiv link (url) [BibTex]
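
A small simulation sketch of the comparison above: allocate a fixed budget either by individual risk predictions or by unit-level averages, and vary between-unit inequality. The risk model and parameter values are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def coverage(between_unit_sd, n_units=50, unit_size=200, budget=2_000):
    """Fraction of the truly highest-risk individuals reached by each policy."""
    unit_means = rng.normal(0, between_unit_sd, size=n_units)
    risk = rng.normal(unit_means[:, None], 1.0, size=(n_units, unit_size))
    top = set(np.argsort(risk.ravel())[-budget:])       # the truly neediest people

    # Prediction-based policy: treat individuals with the highest predicted risk
    # (noiseless predictions here, the best case for prediction).
    pred_policy = set(np.argsort(risk.ravel())[-budget:])

    # Aggregate policy: rank units by mean risk and treat whole units until the
    # budget is exhausted.
    agg_policy = set()
    for u in np.argsort(risk.mean(axis=1))[::-1]:
        if len(agg_policy) + unit_size > budget:
            break
        agg_policy.update(range(u * unit_size, (u + 1) * unit_size))

    return len(pred_policy & top) / budget, len(agg_policy & top) / budget

for sd in (0.1, 1.0, 3.0):
    pred, agg = coverage(sd)
    print(f"between-unit sd {sd:.1f}: prediction {pred:.2f}  unit-level aggregate {agg:.2f}")
```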


Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks

Zhang, G., Hardt, M.

In Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR, July 2024 (inproceedings)

Abstract
We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy between benchmarks and electoral systems, where models are candidates and tasks are voters. This suggests a distinction between cardinal and ordinal benchmark systems. The former aggregate numerical scores into one model ranking; the latter aggregate rankings for each task. We apply Arrow's impossibility theorem to ordinal benchmarks to highlight the inherent limitations of ordinal systems, particularly their sensitivity to the inclusion of irrelevant models. Inspired by Arrow's theorem, we empirically demonstrate a strong trade-off between diversity and sensitivity to irrelevant changes in existing multi-task benchmarks. Our result is based on new quantitative measures of diversity and sensitivity that we introduce. Sensitivity quantifies the impact that irrelevant changes to tasks have on a benchmark. Diversity captures the degree of disagreement in model rankings across tasks. We develop efficient approximation algorithms for both measures, as exact computation is computationally challenging. Through extensive experiments on seven cardinal benchmarks and eleven ordinal benchmarks, we demonstrate a clear trade-off between diversity and stability: The more diverse a multi-task benchmark, the more sensitive to trivial changes it is. Additionally, we show that the aggregated rankings of existing benchmarks are highly unstable under irrelevant changes. The code and data are available at https://socialfoundations.github.io/benchbench/.

ArXiv link (url) [BibTex]
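
The cardinal/ordinal distinction and the sensitivity to irrelevant models can be sketched with a toy score table (numbers made up). Mean-score aggregation stands in for a cardinal benchmark and a Borda count for an ordinal one; adding a clearly worse model flips the ordinal winner while leaving the cardinal ranking of the original models unchanged.

```python
import numpy as np

scores = {                              # rows: models, columns: tasks (made-up numbers)
    "model_A": [0.90, 0.50, 0.65, 0.75],
    "model_B": [0.10, 0.10, 0.70, 0.80],
    "model_C": [0.50, 0.90, 0.15, 0.20],
}

def cardinal_ranking(table):
    """Aggregate numerical scores: rank models by mean score across tasks."""
    return sorted(table, key=lambda m: -np.mean(table[m]))

def ordinal_ranking(table):
    """Aggregate per-task rankings with a Borda count."""
    models = list(table)
    n_tasks = len(next(iter(table.values())))
    points = {m: 0 for m in models}
    for t in range(n_tasks):
        for rank, m in enumerate(sorted(models, key=lambda m: -table[m][t])):
            points[m] += len(models) - 1 - rank
    return sorted(models, key=lambda m: -points[m])

print("cardinal:", cardinal_ranking(scores))
print("ordinal: ", ordinal_ranking(scores))

# Adding a clearly worse, "irrelevant" model flips the ordinal winner while the
# cardinal ranking of the original models is untouched.
scores["irrelevant"] = [0.05, 0.05, 0.67, 0.78]
print("cardinal with irrelevant model:", cardinal_ranking(scores))
print("ordinal with irrelevant model: ", ordinal_ranking(scores))
```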


Don’t Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget

Dorner, F. E., Hardt, M.

In Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR, July 2024 (inproceedings)

Abstract
We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It's common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. We prove a theorem that runs counter to conventional wisdom. If the goal is to identify the better of two classifiers, we show it's best to spend the budget on collecting a single label for more samples. Our result follows from a non-trivial application of Cramér's theorem, a staple in the theory of large deviations. We discuss the implications of our work for the design of machine learning benchmarks, where they overturn some time-honored recommendations. In addition, our results provide sample size bounds superior to what follows from Hoeffding's bound.

ArXiv link (url) [BibTex]
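
A small simulation in the spirit of the question above: with a fixed labeling budget, compare triple-labeling fewer points against single-labeling three times as many points when picking the better of two classifiers. The noise level, accuracies, and independence assumptions are illustrative, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)

budget = 3_000              # total number of noisy labels we can afford
noise = 0.3                 # each annotator flips the true label with this probability
acc_a, acc_b = 0.75, 0.70   # true accuracies of the two classifiers

def prob_correct_pick(n_points, votes_per_point, trials=4_000):
    """How often the empirical comparison picks classifier A, the truly better one."""
    wins = 0
    for _ in range(trials):
        # The (unobserved) true label is taken to be 1 on every point.
        votes = rng.random((n_points, votes_per_point)) > noise
        labels = votes.mean(axis=1) > 0.5                  # (majority-vote) label
        pred_a = rng.random(n_points) < acc_a              # True iff A is correct
        pred_b = rng.random(n_points) < acc_b
        if (pred_a == labels).mean() > (pred_b == labels).mean():
            wins += 1
    return wins / trials

print("3 votes on", budget // 3, "points:", prob_correct_pick(budget // 3, 3))
print("1 vote on", budget, "points:   ", prob_correct_pick(budget, 1))
```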


Causal Inference from Competing Treatments

Stoica, A., Nastl, V. Y., Hardt, M.

In Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR, July 2024 (inproceedings)

Abstract
Many applications of RCTs involve the presence of multiple treatment administrators -- from field experiments to online advertising -- that compete for the subjects' attention. In the face of competition, estimating a causal effect becomes difficult, as the position at which a subject sees a treatment influences their response, and thus the treatment effect. In this paper, we build a game-theoretic model of agents who wish to estimate causal effects in the presence of competition, through a bidding system and a utility function that minimizes estimation error. Our main technical result establishes an approximation with a tractable objective that maximizes the sample value obtained through strategically allocating budget on subjects. This allows us to find an equilibrium in our model: we show that the tractable objective has a pure Nash equilibrium, and that any Nash equilibrium is an approximate equilibrium for our general objective that minimizes estimation error under broad conditions. Conceptually, our work successfully combines elements from causal inference and game theory to shed light on the equilibrium behavior of experimentation under competition.

ArXiv link (url) [BibTex]


Causal Inference out of Control: Estimating Performativity without Treatment Randomization

Cheng, G., Hardt, M., Mendler-Dünner, C.

In Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR, July 2024 (inproceedings)

Abstract
Regulators and academics are increasingly interested in the causal effect that algorithmic actions of a digital platform have on user consumption. In pursuit of estimating this effect from observational data, we identify a set of assumptions that permit causal identifiability without assuming randomized platform actions. Our results are applicable to platforms that rely on machine-learning-powered predictions and leverage knowledge from historical data. The key novelty of our approach is to explicitly model the dynamics of consumption over time, exploiting the repeated interaction of digital platforms with their participants to prove our identifiability results. By viewing the platform as a controller acting on a dynamical system, we can show that exogenous variation in consumption and appropriately responsive algorithmic control actions are sufficient for identifying the causal effect of interest. We complement our claims with an analysis of ready-to-use finite sample estimators and empirical investigations. More broadly, our results deriving identifiability conditions tailored to digital platform settings illustrate a fruitful interplay of control theory and causal inference.

ArXiv link (url) [BibTex]


Fairness in Social Influence Maximization via Optimal Transport

Chowdhary, S., Pasquale, G. D., Lanzetti, N., Stoica, A., Dörfler, F.

arXiv preprint arXiv:2406.17736, June 2024 (conference) Submitted

Abstract
We study fairness in social influence maximization, whereby one seeks to select seeds that spread a given piece of information throughout a network, ensuring balanced outreach among different communities (e.g. demographic groups). In the literature, fairness is often quantified in terms of the expected outreach within individual communities. In this paper, we demonstrate that such fairness metrics can be misleading since they ignore the stochastic nature of information diffusion processes. When information diffusion occurs in a probabilistic manner, multiple outreach scenarios can occur. As such, outcomes such as "in 50% of the cases, no one in group 1 receives the information and everyone in group 2 receives it, and in the other 50%, the opposite happens", which are always largely unfair, are classified as fair by a variety of fairness metrics in the literature. We tackle this problem by designing a new fairness metric, mutual fairness, that captures variability in outreach through optimal transport theory. We propose a new seed selection algorithm that optimizes both outreach and mutual fairness, and we show its efficacy on several real datasets. We find that our algorithm increases fairness with only a minor decrease (and at times, even an increase) in efficiency.

ArXiv link (url) [BibTex]


Fairness Rising from the Ranks: HITS and PageRank on Homophilic Networks

Stoica, A., Litvak, N., Chaintreau, A.

In Proceedings of the ACM on Web Conference 2024, Association for Computing Machinery (ACM), May 2024 (inproceedings)

Abstract
In this paper, we investigate the conditions under which link analysis algorithms prevent minority groups from reaching high-ranking slots. We find that the most common link-based algorithms using centrality metrics, such as PageRank and HITS, can reproduce and even amplify bias against minority groups in networks. Yet, their behavior differs: on the one hand, we empirically show that PageRank mirrors the degree distribution for most of the ranking positions and can equalize the representation of minorities among the top-ranked nodes; on the other hand, through a novel theoretical analysis supported by empirical results, we find that HITS amplifies pre-existing bias in homophilic networks. We find the root cause of bias amplification in HITS to be the level of homophily present in the network, modeled through an evolving network model with two communities. We illustrate our theoretical analysis on both synthetic and real datasets and we present directions for future work.

ArXiv link (url) [BibTex]
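
The kind of measurement described above can be sketched on a synthetic homophilic network. The sketch below uses a simple stochastic-block-style directed graph rather than the paper's growing-network model, and simply reports the minority share among top-ranked nodes under PageRank and HITS.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
n, minority_frac, homophily = 500, 0.2, 0.9

group = rng.random(n) < minority_frac            # True marks a minority node
G = nx.DiGraph()
G.add_nodes_from(range(n))
for _ in range(10_000):
    u, v = rng.integers(n, size=2)
    same = group[u] == group[v]
    if u != v and rng.random() < (homophily if same else 1 - homophily):
        G.add_edge(u, v)

pagerank = nx.pagerank(G)
hubs, authorities = nx.hits(G, max_iter=1_000)

for name, scores in [("PageRank", pagerank), ("HITS authority", authorities)]:
    top = sorted(scores, key=scores.get, reverse=True)[:50]
    share = np.mean([group[i] for i in top])
    print(f"{name:14s} minority share in top 50: {share:.2f}"
          f" (population share {minority_frac})")
```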


An Engine Not a Camera: Measuring Performative Power of Online Search

Mendler-Dünner, C., Carovano, G., Hardt, M.

Advances in Neural Information Processing Systems 37 (NeurIPS 2024), May 2024 (conference) Accepted

Abstract
The power of digital platforms is at the center of major ongoing policy and regulatory efforts. To advance existing debates, we designed and executed an experiment to measure the power of online search providers, building on the recent definition of performative power. Instantiated in our setting, performative power quantifies the ability of a search engine to steer web traffic by rearranging results. To operationalize this definition we developed a browser extension that performs unassuming randomized experiments in the background. These randomized experiments emulate updates to the search algorithm and identify the causal effect of different content arrangements on clicks. We formally relate these causal effects to performative power. Analyzing tens of thousands of clicks, we discuss what our robust quantitative findings say about the power of online search engines. More broadly, we envision our work to serve as a blueprint for how performative power and online experiments can be integrated with future investigations into the economic power of digital platforms.

ArXiv [BibTex]


Unprocessing Seven Years of Algorithmic Fairness

Cruz, A. F., Hardt, M.

In The Twelfth International Conference on Learning Representations (ICLR), May 2024 (inproceedings)

Abstract
Seven years ago, researchers proposed a postprocessing method to equalize the error rates of a model across different demographic groups. The work launched hundreds of papers purporting to improve over the postprocessing baseline. We empirically evaluate these claims through thousands of model evaluations on several tabular datasets. We find that the fairness-accuracy Pareto frontier achieved by postprocessing contains all other methods we were feasibly able to evaluate. In doing so, we address two common methodological errors that have confounded previous observations. One relates to the comparison of methods with different unconstrained base models. The other concerns methods achieving different levels of constraint relaxation. At the heart of our study is a simple idea we call unprocessing that roughly corresponds to the inverse of postprocessing. Unprocessing allows for a direct comparison of methods using different underlying models and levels of relaxation. Interpreting our findings, we recall a widely overlooked theoretical argument, presented seven years ago, that accurately predicted what we observe.

ArXiv link (url) [BibTex]
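
Postprocessing in the sense above can be sketched with group-specific decision thresholds on top of an unconstrained score model. The sketch targets equal true-positive rates only; full equalized-odds postprocessing as studied in the paper also constrains false-positive rates and typically involves randomization. Data and parameters are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000
group = rng.integers(0, 2, size=n)                        # a binary protected attribute
x = rng.normal(loc=group[:, None] * 0.8, scale=1.0, size=(n, 3))
y = (x.sum(axis=1) + rng.normal(scale=1.5, size=n) > 1.2).astype(int)

# Step 1: train an unconstrained base model that simply predicts as well as it can.
clf = LogisticRegression(max_iter=1_000).fit(np.c_[x, group], y)
scores = clf.predict_proba(np.c_[x, group])[:, 1]

# Step 2: postprocess with group-specific thresholds chosen to hit a common
# true-positive rate.
def tpr(threshold, g):
    mask = (group == g) & (y == 1)
    return (scores[mask] >= threshold).mean()

target_tpr = 0.7
grid = np.linspace(0, 1, 1001)
for g in (0, 1):
    best = grid[np.argmin([abs(tpr(th, g) - target_tpr) for th in grid])]
    print(f"group {g}: threshold {best:.2f}, TPR {tpr(best, g):.2f}")
```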


Test-Time Training on Nearest Neighbors for Large Language Models

Hardt, M., Sun, Y.

In The Twelfth International Conference on Learning Representations (ICLR), May 2024 (inproceedings)

Abstract
Many recent efforts augment language models with retrieval, by adding retrieved data to the input context. For this approach to succeed, the retrieved data must be added at both training and test time. Moreover, as input length grows linearly with the size of retrieved data, cost in computation and memory grows quadratically for modern Transformers. To avoid these complications, we simply fine-tune the model on retrieved data at test time, using its standard training setup. We build a large-scale distributed index based on text embeddings of the Pile dataset. For each test input, our system retrieves its neighbors and fine-tunes the model on their text. Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than 20 language modeling tasks in the Pile. For example, test-time training with nearest neighbors significantly narrows the performance gap between a small GPT-2 and a GPT-Neo model more than 10 times larger. Sufficient index quality and size, however, are necessary. Our work establishes a first baseline of test-time training for language modeling.

ArXiv link (url) [BibTex]
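
The test-time training loop can be sketched with a toy regression model standing in for the language model and random vectors standing in for text embeddings. The retrieval index, model, and targets below are illustrative stand-ins, not the paper's distributed Pile index.

```python
import copy
import numpy as np
import torch
from torch import nn
from sklearn.neighbors import NearestNeighbors

torch.manual_seed(0)
rng = np.random.default_rng(0)

# "Corpus": embeddings plus a training target attached to each document.
corpus_emb = rng.normal(size=(5_000, 8)).astype(np.float32)
corpus_y = corpus_emb.sum(axis=1, keepdims=True).astype(np.float32)
index = NearestNeighbors(n_neighbors=20).fit(corpus_emb)

base_model = nn.Linear(8, 1)                       # stand-in for the language model

def test_time_train(model, query_emb, steps=1, lr=1e-2):
    """Retrieve the query's neighbors and take a few gradient steps on them
    (the paper takes one step per retrieved neighbor)."""
    model = copy.deepcopy(model)                   # never mutate the base model
    _, idx = index.kneighbors(query_emb[None, :])
    x = torch.from_numpy(corpus_emb[idx[0]])
    y = torch.from_numpy(corpus_y[idx[0]])
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    return model

query = rng.normal(size=8).astype(np.float32)
tuned = test_time_train(base_model, query)
with torch.no_grad():
    print("base model prediction:       ", base_model(torch.from_numpy(query)).item())
    print("test-time-trained prediction:", tuned(torch.from_numpy(query)).item())
```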


ImageNot: A Contrast with ImageNet Preserves Model Rankings

Salaudeen, O., Hardt, M.

arXiv preprint arXiv:2404.02112, April 2024 (conference) Submitted

Abstract
We introduce ImageNot, a dataset designed to match the scale of ImageNet while differing drastically in other aspects. We show that key model architectures developed for ImageNet over the years rank identically when trained and evaluated on ImageNot to how they rank on ImageNet. This is true when training models from scratch or fine-tuning them. Moreover, the relative improvements of each model over earlier models strongly correlate in both datasets. We further give evidence that ImageNot has a similar utility as ImageNet for transfer learning purposes. Our work demonstrates a surprising degree of external validity in the relative performance of image classification models. This stands in contrast with absolute accuracy numbers that typically drop sharply even under small changes to a dataset.

ArXiv [BibTex]
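
The ranking comparison above amounts to checking rank correlation between per-dataset accuracies. The sketch below uses made-up accuracy numbers solely to show the computation; the paper's claim concerns the real ImageNet and ImageNot results.

```python
from scipy.stats import spearmanr

architectures = ["alexnet", "vgg16", "resnet50", "densenet121", "vit_b_16"]
acc_imagenet = [0.57, 0.72, 0.76, 0.77, 0.81]   # illustrative numbers only
acc_imagenot = [0.31, 0.45, 0.52, 0.54, 0.60]   # lower absolute accuracy, same ordering

rho, _ = spearmanr(acc_imagenet, acc_imagenot)
print("Spearman rank correlation between the two datasets:", rho)  # 1.0: same ranking
```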


What Makes ImageNet Look Unlike LAION

Shirali, A., Hardt, M.

The Twelfth International Conference on Learning Representations (ICLR 2024), February 2024 (conference) Submitted

link (url) [BibTex]


Lawma: The Power of Specialization for Legal Tasks

Dominguez-Olmedo, R., Nanda, V., Abebe, R., Bechtold, S., Engel, C., Frankenreiter, J., Gummadi, K., Hardt, M., Livermore, M.

arXiv preprint arXiv:2407.16615, 2024 (conference) In revision

Abstract
Annotation and classification of legal text are central components of empirical legal research. Traditionally, these tasks are often delegated to trained research assistants. Motivated by the advances in language modeling, empirical legal scholars are increasingly turning to prompting commercial models, hoping that it will alleviate the significant cost of human annotation. Despite growing use, our understanding of how to best utilize large language models for legal tasks remains limited. We conduct a comprehensive study of 260 legal text classification tasks, nearly all new to the machine learning community. Starting from GPT-4 as a baseline, we show that it has non-trivial but highly varied zero-shot accuracy, often exhibiting performance that may be insufficient for legal work. We then demonstrate that a lightly fine-tuned Llama 3 model vastly outperforms GPT-4 on almost all tasks, typically by double-digit percentage points. We find that larger models respond better to fine-tuning than smaller models. A few tens to hundreds of examples suffice to achieve high classification accuracy. Notably, we can fine-tune a single model on all 260 tasks simultaneously at a small loss in accuracy relative to having a separate model for each task. Our work points to a viable alternative to the predominant practice of prompting commercial models. For concrete legal tasks with some available labeled data, researchers are better off using a fine-tuned open-source model.

ArXiv [BibTex]


2023


Collaborative Learning via Prediction Consensus

Fan, D., Mendler-Dünner, C., Jaggi, M.

In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), Curran Associates, Inc., December 2023 (inproceedings)

Abstract
We consider a collaborative learning setting where the goal of each agent is to improve their own model by leveraging the expertise of collaborators, in addition to their own training data. To facilitate the exchange of expertise among agents, we propose a distillation-based method leveraging shared unlabeled auxiliary data, which is pseudo-labeled by the collective. Central to our method is a trust weighting scheme that serves to adaptively weigh the influence of each collaborator on the pseudo-labels until a consensus on how to label the auxiliary data is reached. We demonstrate empirically that our collaboration scheme is able to significantly boost the performance of individual models in the target domain from which the auxiliary data is sampled. By design, our method adeptly accommodates heterogeneity in model architectures and substantially reduces communication overhead compared to typical collaborative learning methods. At the same time, it can provably mitigate the negative impact of bad models on the collective.

ArXiv link (url) [BibTex]
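
Trust-weighted consensus pseudo-labeling can be sketched as follows; the agreement-based weight update below is an illustrative choice rather than the paper's exact scheme, and the agents' predictions are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_points, n_classes = 5, 1_000, 3

true = rng.integers(n_classes, size=n_points)          # latent labels of the shared data
acc = np.array([0.9, 0.8, 0.75, 0.6, 0.4])             # one collaborator is quite weak

# Each agent's predicted labels on the shared unlabeled auxiliary data.
preds = np.where(rng.random((n_agents, n_points)) < acc[:, None],
                 true, rng.integers(n_classes, size=(n_agents, n_points)))

trust = np.ones(n_agents) / n_agents
for _ in range(10):
    # Consensus pseudo-label: trust-weighted vote per point.
    votes = np.zeros((n_points, n_classes))
    for a in range(n_agents):
        votes[np.arange(n_points), preds[a]] += trust[a]
    consensus = votes.argmax(axis=1)
    # Raise the weight of agents that agree with the consensus, then renormalize.
    agreement = (preds == consensus).mean(axis=1)
    trust = agreement / agreement.sum()

print("final trust weights:", np.round(trust, 2))
print("consensus accuracy: ", (consensus == true).mean())
```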


Is Your Model Predicting the Past?

Hardt, M., Kim, M. P.

In Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO), Association for Computing Machinery, October 2023 (inproceedings)

Abstract
When does a machine learning model predict the future of individuals and when does it recite patterns that predate the individuals? In this work, we propose a distinction between these two pathways of prediction, supported by theoretical, empirical, and normative arguments. At the center of our proposal is a family of simple and efficient statistical tests, called backward baselines, that demonstrate if, and to what extent, a model recounts the past. Our statistical theory provides guidance for interpreting backward baselines, establishing equivalences between different baselines and familiar statistical concepts. Concretely, we derive a meaningful backward baseline for auditing a prediction system as a black box, given only background variables and the system’s predictions. Empirically, we evaluate the framework on different prediction tasks derived from longitudinal panel surveys, demonstrating the ease and effectiveness of incorporating backward baselines into the practice of machine learning.

link (url) [BibTex]


Incentivizing Honesty among Competitors in Collaborative Learning and Optimization

Dorner, F. E., Konstantinov, N., Pashaliev, G., Vechev, M.

In September 2023 (inproceedings)

link (url) [BibTex]


Algorithmic Collective Action in Machine Learning

Hardt, M., Mazumdar, E., Mendler-Dünner, C., Zrnic, T.

In Proceedings of the 40th International Conference on Machine Learning (ICML), PMLR, July 2023 (inproceedings)

Abstract
We initiate a principled study of algorithmic collective action on digital platforms that deploy machine learning algorithms. We propose a simple theoretical model of a collective interacting with a firm’s learning algorithm. The collective pools the data of participating individuals and executes an algorithmic strategy by instructing participants how to modify their own data to achieve a collective goal. We investigate the consequences of this model in three fundamental learning-theoretic settings: nonparametric optimal learning, parametric risk minimization, and gradient-based optimization. In each setting, we come up with coordinated algorithmic strategies and characterize natural success criteria as a function of the collective’s size. Complementing our theory, we conduct systematic experiments on a skill classification task involving tens of thousands of resumes from a gig platform for freelancers. Through more than two thousand model training runs of a BERT-like language model, we see a striking correspondence emerge between our empirical observations and the predictions made by our theory. Taken together, our theory and experiments broadly support the conclusion that algorithmic collectives of exceedingly small fractional size can exert significant control over a platform’s learning algorithm.

link (url) [BibTex]
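
One coordinated strategy of the kind analyzed above can be sketched directly: a small collective plants a signal in its feature vectors and relabels its points to a target class, and success is the rate at which the trained model maps signal-carrying test points to that target. The data, the signal, and the logistic-regression learner below are illustrative; the paper's experiments involve resumes and a BERT-like model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, alpha = 20_000, 50, 0.01         # alpha: fraction of data the collective controls
TARGET = 1

X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(int)          # the platform's "true" labeling rule

def plant_signal(x):
    x = x.copy()
    x[:, -1] = 5.0                     # an otherwise uninformative feature value
    return x

k = int(alpha * n)
X[:k] = plant_signal(X[:k])            # the collective plants the signal...
y[:k] = TARGET                         # ...and relabels its points to the target class

model = LogisticRegression(max_iter=1_000).fit(X, y)

X_clean = rng.normal(size=(2_000, d))
X_signal = plant_signal(rng.normal(size=(2_000, d)))
print("target-class rate without signal:", (model.predict(X_clean) == TARGET).mean())
print("target-class rate with signal:   ", (model.predict(X_signal) == TARGET).mean())
```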


AI and the EU Digital Markets Act: Addressing the Risks of Bigness in Generative AI

Yasar, A. G., Chong, A., Dong, E., Gilbert, T. K., Hladikova, S., Maio, R., Mougan, C., Shen, X., Singh, S., Stoica, A., Thais, S., Zilka, M.

arXiv preprint arXiv:2308.02033, July 2023 (conference) Submitted

Abstract
As AI technology advances rapidly, concerns over the risks of bigness in digital markets are also growing. The EU's Digital Markets Act (DMA) aims to address these risks. Still, the current framework may not adequately cover generative AI systems that could become gateways for AI-based services. This paper argues for integrating certain AI software as core platform services and classifying certain developers as gatekeepers under the DMA. We also propose an assessment of gatekeeper obligations to ensure they cover generative AI services. As the EU considers generative AI-specific rules and possible DMA amendments, this paper provides insights towards diversity and openness in generative AI services.

ArXiv [BibTex]


A Theory of Dynamic Benchmarks

Shirali, A., Abebe, R., Hardt, M.

In The Eleventh International Conference on Learning Representations (ICLR), OpenReview.net, May 2023 (inproceedings)

Abstract
Dynamic benchmarks interweave model fitting and data collection in an attempt to mitigate the limitations of static benchmarks. In contrast to an extensive theoretical and empirical study of the static setting, the dynamic counterpart lags behind due to limited empirical studies and no apparent theoretical foundation to date. Responding to this deficit, we initiate a theoretical study of dynamic benchmarking. We examine two realizations, one capturing current practice and the other modeling more complex settings. In the first model, where data collection and model fitting alternate sequentially, we prove that model performance improves initially but can stall after only three rounds. Label noise arising from, for instance, annotator disagreement leads to even stronger negative results. Our second model generalizes the first to the case where data collection and model fitting have a hierarchical dependency structure. We show that this design guarantees strictly more progress than the first, albeit at a significant increase in complexity. We support our theoretical analysis by simulating dynamic benchmarks on two popular datasets. These results illuminate the benefits and practical limitations of dynamic benchmarking, providing both a theoretical foundation and a causal explanation for observed bottlenecks in empirical work.

link (url) [BibTex]


Human-Guided Fair Classification for Natural Language Processing

Dorner, F. E., Peychev, M., Konstantinov, N., Goel, N., Ash, E., Vechev, M.

In The Eleventh International Conference on Learning Representations (ICLR 2023), February 2023 (inproceedings)

Abstract
Text classifiers have promising applications in high-stake tasks such as resume screening and content moderation. These classifiers must be fair and avoid discriminatory decisions by being invariant to perturbations of sensitive attributes such as gender or ethnicity. However, there is a gap between human intuition about these perturbations and the formal similarity specifications capturing them. While existing research has started to address this gap, current methods are based on hardcoded word replacements, resulting in specifications with limited expressivity or ones that fail to fully align with human intuition (e.g., in cases of asymmetric counterfactuals). This work proposes novel methods for bridging this gap by discovering expressive and intuitive individual fairness specifications. We show how to leverage unsupervised style transfer and GPT-3's zero-shot capabilities to automatically generate expressive candidate pairs of semantically similar sentences that differ along sensitive attributes. We then validate the generated pairs via an extensive crowdsourcing study, which confirms that a lot of these pairs align with human intuition about fairness in the context of toxicity classification. Finally, we show how limited amounts of human feedback can be leveraged to learn a similarity specification that can be used to train downstream fairness-aware models.

link (url) [BibTex]


Performative Prediction: Past and Future

Hardt, M., Mendler-Dünner, C.

arXiv preprint arXiv:2310.16608, 2023 (conference) In revision

Abstract
Predictions in the social world generally influence the target of prediction, a phenomenon known as performativity. Self-fulfilling and self-negating predictions are examples of performativity. Of fundamental importance to economics, finance, and the social sciences, the notion has been absent from the development of machine learning. In machine learning applications, performativity often surfaces as distribution shift. A predictive model deployed on a digital platform, for example, influences consumption and thereby changes the data-generating distribution. We survey the recently founded area of performative prediction that provides a definition and conceptual framework to study performativity in machine learning. A consequence of performative prediction is a natural equilibrium notion that gives rise to new optimization challenges. Another consequence is a distinction between learning and steering, two mechanisms at play in performative prediction. The notion of steering is in turn intimately related to questions of power in digital markets. We review the notion of performative power that gives an answer to the question how much a platform can steer participants through its predictions. We end on a discussion of future directions, such as the role that performativity plays in contesting algorithmic systems.

ArXiv [BibTex]


2022


Anticipating Performativity by Predicting from Predictions

Mendler-Dünner, C., Ding, F., Wang, Y.

In Advances in Neural Information Processing Systems 35 (NeurIPS), Curran Associates, Inc., November 2022 (inproceedings)

Abstract
Predictions about people, such as their expected educational achievement or their credit risk, can be performative and shape the outcome that they are designed to predict. Understanding the causal effect of predictions on the eventual outcomes is crucial for foreseeing the implications of future predictive models and selecting which models to deploy. However, this causal estimation task poses unique challenges: model predictions are usually deterministic functions of input features and highly correlated with outcomes, which can make the causal effects of predictions on outcomes impossible to disentangle from the direct effect of the covariates. We study this problem through the lens of causal identifiability. Despite the hardness of this problem in full generality, we highlight three natural scenarios where the causal effect of predictions can be identified from observational data: randomization in predictions, overparameterization of the predictive model deployed during data collection, and discrete prediction outputs. Empirically we show that given our identifiability conditions hold, standard variants of supervised learning that predict from predictions by treating the prediction as an input feature can find transferable functional relationships that allow for conclusions about newly deployed predictive models. These positive results fundamentally rely on model predictions being recorded during data collection, bringing forward the importance of rethinking standard data collection practices to enable progress towards a better understanding of social outcomes and performative feedback loops.

ArXiv link (url) [BibTex]


Regret Minimization with Performative Feedback

Jagadeesan, M., Zrnic, T., Mendler-Dünner, C.

In Proceedings of the 39th International Conference on Machine Learning (ICML), pages 9760-9785, PMLR, July 2022 (inproceedings)

Abstract
In performative prediction, the deployment of a predictive model triggers a shift in the data distribution. As these shifts are typically unknown ahead of time, the learner needs to deploy a model to get feedback about the distribution it induces. We study the problem of finding near-optimal models under performativity while maintaining low regret. On the surface, this problem might seem equivalent to a bandit problem. However, it exhibits a fundamentally richer feedback structure that we refer to as performative feedback: after every deployment, the learner receives samples from the shifted distribution rather than bandit feedback about the reward. Our main contribution is regret bounds that scale only with the complexity of the distribution shifts and not that of the reward function. The key algorithmic idea is careful exploration of the distribution shifts that informs a novel construction of confidence bounds on the risk of unexplored models. The construction only relies on smoothness of the shifts and does not assume convexity. More broadly, our work establishes a conceptual approach for leveraging tools from the bandits literature for the purpose of regret minimization with performative feedback.

ArXiv link (url) [BibTex]


Causal Inference Struggles with Agency on Online Platforms

Milli, S., Belli, L., Hardt, M.

In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT), ACM, June 2022 (inproceedings)

Abstract
Online platforms regularly conduct randomized experiments to understand how changes to the platform causally affect various outcomes of interest. However, experimentation on online platforms has been criticized for having, among other issues, a lack of meaningful oversight and user consent. As platforms give users greater agency, it becomes possible to conduct observational studies in which users self-select into the treatment of interest as an alternative to experiments in which the platform controls whether the user receives treatment or not. In this paper, we conduct four large-scale within-study comparisons on Twitter aimed at assessing the effectiveness of observational studies derived from user self-selection on online platforms. In a within-study comparison, treatment effects from an observational study are assessed based on how effectively they replicate results from a randomized experiment with the same target population. We test the naive difference in group means estimator, exact matching, regression adjustment, and inverse probability of treatment weighting while controlling for plausible confounding variables. In all cases, all observational estimates perform poorly at recovering the ground-truth estimate from the analogous randomized experiments. In all cases except one, the observational estimates have the opposite sign of the randomized estimate. Our results suggest that observational studies derived from user self-selection are a poor alternative to randomized experimentation on online platforms. In discussing our results, we postulate a “Catch-22” that suggests that the success of causal inference in these settings may be at odds with the original motivations for providing users with greater agency.

link (url) [BibTex]


Adversarial Scrutiny of Evidentiary Statistical Software

Abebe, R., Hardt, M., Jin, A., Miller, J., Schmidt, L., Wexler, R.

In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (FAccT), ACM, June 2022 (inproceedings)

Abstract
The U.S. criminal legal system increasingly relies on software output to convict and incarcerate people. In a large number of cases each year, the government makes these consequential decisions based on evidence from statistical software—such as probabilistic genotyping, environmental audio detection and toolmark analysis tools—that the defense counsel cannot fully cross-examine or scrutinize. This undermines the commitments of the adversarial criminal legal system, which relies on the defense’s ability to probe and test the prosecution’s case to safeguard individual rights. Responding to this need to adversarially scrutinize output from such software, we propose robust adversarial testing as a framework to examine the validity of evidentiary statistical software. We define and operationalize this notion of robust adversarial testing for defense use by drawing on a large body of recent work in robust machine learning and algorithmic fairness. We demonstrate how this framework both standardizes the process for scrutinizing such tools and empowers defense lawyers to examine their validity for instances most relevant to the case at hand. We further discuss existing structural and institutional challenges within the U.S. criminal legal system which may create barriers for implementing this framework and close with a discussion on policy changes that could help address these concerns.

link (url) [BibTex]


Performative Power

Hardt, M., Jagadeesan, M., Mendler-Dünner, C.

In Advances in Neural Information Processing Systems 35 (NeurIPS), Curran Associates Inc., March 2022 (inproceedings)

Abstract
We introduce the notion of performative power, which measures the ability of a firm operating an algorithmic system, such as a digital content recommendation platform, to cause change in a population of participants. We relate performative power to the economic study of competition in digital economies. Traditional economic concepts struggle with identifying anti-competitive patterns in digital platforms not least due to the complexity of market definition. In contrast, performative power is a causal notion that is identifiable with minimal knowledge of the market, its internals, participants, products, or prices. We study the role of performative power in prediction and show that low performative power implies that a firm can do no better than to optimize their objective on current data. In contrast, firms of high performative power stand to benefit from steering the population towards more profitable behavior. We confirm in a simple theoretical model that monopolies maximize performative power. A firm's ability to personalize increases performative power, while competition and outside options decrease performative power. On the empirical side, we propose an observational causal design to identify performative power from discontinuities in how digital platforms display content. This allows us to repurpose causal effects from various studies about digital platforms as lower bounds on performative power. Finally, we speculate about the role that performative power might play in competition policy and antitrust enforcement in digital marketplaces.

link (url) [BibTex]
