Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Abstract: This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose 200+, concrete research questions.

Read the Paper Summary Thread

Overview of the Challenges

Scientific Understanding of LLMs

In-Context Learning (ICL) is Black-Box

Summary Thread

The dynamic and flexible nature of ICL is central to the success of LLMs as it allows LLMs to proficiently improve on already known tasks as well as learn to perform novel tasks. It is likely to gain an even more prominent role as LLMs are scaled up further and become more proficient at ICL. However, the black box nature of ICL is a risk from the perspective of alignment and safety, and there is a critical need to better understand the mechanisms underlying ICL. Several “theories” have been proposed that provide plausible explanations of how ICL in LLMs might work. However, the actual mechanism(s) underlying ICL in LLMs is still not well understood. We highlight several research questions that could be instrumental in addressing the understanding of the mechanisms underlying ICL.

Can different theorizations of ICL as sophisticated pattern-matching or mesa-optimization be extended to explain the full range of ICL behaviors exhibited by the LLMs? (Section 2.1.1)
What are the key differences and commonalities between ICL and existing learning paradigms? Prior work has mostly examined ICL from the perspective of few-shot supervised learning. However, in practice, ICL sometimes exhibits qualitatively distinct behaviors compared to supervised learning and can learn from data other than labeled examples, such as interactive feedback, explanations, or reasoning patterns. (Section 2.1.2)
Which learning algorithms can transformers implement in-context? While earlier studies (e.g. Akyürek et al., 2022a) argue transformers implement gradient descent-based learning algorithms, more recent work (Fu et al., 2023a) indicate that transformers can implement higher-order iterative learning algorithms e.g. iterative Newton method as well. (Section 2.1.2)
What are the best abstract settings for studying ICL that better mirror the real-world structure of language modeling and yet remain tractable? Current toy settings e.g. learning to solve linear regression are too simple and may lead to findings that do not transfer to real LLMs. (Section 2.1.2)
How can interpretability-based analysis contribute to a general understanding of the mechanisms underlying ICL? Can this approach be used to explain various phenomena associated with ICL such as why ICL performance varies across tasks, how the inductive biases of a particular architecture affect ICL, how different prompt styles impact ICL, etc.? (Section 2.1.3)
Which properties of large-scale text datasets are responsible for the emergence of ICL in LLMs trained with an autoregressive objective? (Section 2.1.4)
How do different components of the pretraining pipeline (e.g. pretraining dataset construction, model size, pretraining flops, learning objective) and the finetuning pipeline (e.g. instruction following, RLHF) impact ICL? How can this understanding be leveraged to develop techniques to modulate ICL in LLMs? (Section 2.1.5)

Capabilities are Difficult to Estimate and Understand

Summary Thread

Understanding and estimating the capabilities of LLMs is a complex task due to various factors. One challenge is the different ‘shape’ of capabilities between humans and AI models, making it hard to predict and simulate LLM behavior accurately. Another issue is the absence of a well-established conceptualization of ‘capabilities’ in the context of LLMs. This lack of clarity, coupled with the absence of reliable methods to assess the generality of LLMs, poses significant challenges in understanding and ensuring their safety. Addressing these challenges is crucial for a comprehensive understanding of LLM capabilities, their limits, and the implications for LLM safety and alignment.

How can we understand the differences in the ‘shape’ of capabilities between humans and AI models? What are the implications of these differences? (Section 2.2.1)
What is the right conceptualization of capabilities for LLMs? How can we formalize different conceptualizations and understand their merits and demerits? (Section 2.2.2)
How can we draw reasonable general insights from behavioral evaluations of LLMs? (Section 2.2.2)
Can we develop methods, like factor analysis, to automatically discover capabilities by decomposing a model’s behavior into a ‘basis’ of capabilities? (Section 2.2.2)
How can we distinguish between ‘capabilities failure’ and ‘alignment failures’ in model evaluations using benchmarks? (Section 2.2.3)
How can we develop elicitation protocols that reliably and consistently elicit the capabilities of interest from LLMs? (Section 2.2.3)
How can fine-grained benchmarking be used to identify the precise shortcomings of a model and make more useful detailed predictions about test behavior? (Section 2.2.3)
How can we evaluate the generalization of LLMs across different domains? Are procedurally-defined evaluations effective for this purpose? (Section 2.2.4)
Can we formalize statements like “the model is using the same capability when performing these two tasks”? How can this improve the efficiency of evaluating LLMs across domains? (Section 2.2.4)
How can a practitioner with limited resources choose a small number of evaluations to efficiently evaluate the general-purpose capabilities of LLMs? (Section 2.2.4)
Can mechanistic interpretability techniques be used to discover capabilities that are reused across tasks? (Section 2.2.4)
How can we understand the dependencies between LLM capabilities and create theoretically grounded taxonomies of these capabilities? (Section 2.2.4)
How can we distinguish between revealed capabilities and learned capabilities in LLMs? (Section 2.2.5)
How can we precisely characterize the contribution of the LLM to behaviors demonstrated by an LLM-based system, and can concepts from game theory and other fields be used for this purpose? (Section 2.2.5)

Effects of Scale on Capabilities Are Not Well-Characterized

Summary Thread

There remain many challenges that hinder our ability to predict which capabilities LLMs will acquire with continued scaling, and when. We continue to lack a robust explanation of why scaling works, and to what extent the scaling laws we have discovered are universal. Additionally, there has been minimal exploration into how scale influences the representations learned by the models and whether certain capabilities cannot be acquired through scaling alone. Furthermore, our understanding of the factors that underlie abrupt performance improvements on certain tasks is lacking, and our methods for discovering task-specific scaling laws are inadequate.

Can the different explanations (manifold explanation, kernel spectrum explanation, long tail theory) of power-law scaling in the resolution-limited regime be unified? (Section 2.3.1)
What is a good theoretical model for compute-efficient scaling, where data and model size are scaled jointly? (Section 2.3.1)
What is the role of feature learning in scaling laws? Can we develop models of scaling laws that account for the computational difficulty of learning the given task? (Section 2.3.1)
What is an appropriate theoretical model to explain scaling observed in reinforcement learning settings where the data distribution is not stationary? (Section 2.3.1)
To what extent scaling law exponents are fundamentally bounded by the data distribution? (Section 2.3.1)
What properties of a task (and its relation to the full training distribution) affect the predictability of its scaling behavior? (Section 2.3.1)
To what extent does scaling a model increase the ‘universality’ of its representations? (Section 2.3.2)
Does increasing scale cause model representations to converge to, or diverge away from, human representations? In other words, does representation alignment between human representations and model representations increase or decrease with scale? (Section 2.3.2)
How does scale impact the structure of the representations? Does scaling cause the structure of model representations to become more, or less, linear? To what extent is the linear representation hypothesis true in general? (Section 2.3.2)
How can we determine whether a given capability is below or above the scale ceiling, i.e. whether simple scaling up the model (and/or data, compute) would enable the model to learn that capability or not? (Section 2.3.3)
To what extent are the issues faced by current LLMs (causal confusion, ‘hallucinations’, jailbreaks/adversarial robustness, etc.) likely to be resolved by further scaling? (Section 2.3.3)
How do we determine if the inverse-scaling behavior of a capability will be reversed with further scaling (i.e. result in a U-shaped curve)? How can we predict threshold points at which scaling behaviors change shape? (Section 2.3.3)
What factors may explain abrupt improvements in performance associated with emergent capabilities? To what extent does the multiplicative emergence effect (Okawa et al., 2023) explain the emergent capabilities of LLMs that have been observed in practice? (Section 2.3.4)
How can we discover valid decompositions of various compositional capabilities of interest and assess the accuracy of such decompositions? How can the emergence of compositional capabilities be predicted based on the learning dynamics of its decomposed capabilities? (Section 2.3.4)
What is an appropriate formalization of “emergent capabilities” in the context of LLM scaling? How can we apply it to understand which sorts of novel phenomena are likely to be (un)predictable? (Section 2.3.4)
Can we discover progress measures that may explain emergent capabilities, e.g. by using interpretability methods? How can we establish the faithfulness of such progress measures to ensure that they can be used to predict the emergence of the capability of interest? (Section 2.3.4)
Can we develop better methods for modeling task-specific scaling? E.g. by conditioning on additional information, using probabilistic techniques, or by developing evaluation measures with higher resolution. (Section 2.3.5)
Can we clarify the purpose of task-specific scaling laws? In order to be useful for forecasting capabilities, what is the minimum range over which a task-specific scaling law must extrapolate accurately? (Section 2.3.5)

Qualitative Understanding of Reasoning Capabilities is Lacking

Summary Thread

Calibrated understanding of the reasoning capabilities of LLMs is required to better understand their risks. In particular, there is a need to develop a more complete understanding of which limitations in reasoning capabilities are fundamental in nature, and which are likely to be resolved with additional scale or improved training methods of LLMs. Formulating empirical scaling laws for general reasoning capabilities, clarifying the mechanisms underlying reasoning and understanding the computational limits of learning and inference in transformers may help in this regard. Furthermore, there is a need for more research on understanding the non-deductive reasoning capabilities of LLMs and understanding how LLMs acquire these capabilities.

Do general reasoning capabilities of LLMs reliably improve with scale? Can we discover empirical scaling laws for reasoning to predict this improvement beforehand? If it is the case that scale does not improve general reasoning capabilities of the LLMs, can we conclusively show this to be the case? (Section 2.4.1)
Can interpretability analyses be used to understand the mechanisms underlying various types of reasoning within LLMs? Can these techniques be used to explain the successful and unsuccessful cases of reasoning in LLMs identified in the literature? (Section 2.4.2)
How well are LLMs able to perform non-deductive reasoning? Can they infer rules or axioms from a set of observations, either in training or in-context? If so, do LLMs use abductive reasoning capabilities in training to develop a coherent model of the outside world from dispersed and partial information available in training data? (Section 2.4.3)
How do language models acquire the ability to perform reasoning tasks from their training? Tools such as influence functions could help identify which training examples are instrumental in acquiring reasoning capabilities (Grosse et al., 2023). (Section 2.4.4)
Does finetuning or pretraining on reasoning traces robustly improve the reasoning capabilities of LLMs or not? How well does such training generalize OOD? (Section 2.4.4)
Can we develop a better understanding of the computational limits of transformers that may use intermediate steps (e.g. chain-of-thought reasoning)? Can we define some symbolic programming language that is exactly equivalent in expressive power to transformers? (Section 2.4.5)
What algorithms are learnable by transformers? How does the representability of an algorithm as a RASP (or RASP-L) program relate to its learnability by a transformer? (Section 2.4.5)

Agentic LLMs Pose Novel Risks

Summary Thread

LLM-agents will pose many novel alignment and safety risks. These risks may be amplified by the ability of LLM-agents to perform lifelong learning and their access to various affordances. Our understanding of these risks and their likelihoods is currently quite poor and needs improvement. There is also a need to develop methods that allow us to better control LLM-agents and guide their behavior more effectively. Furthermore, the development of monitoring systems for LLM-agents is likely to entail significant challenges.

What drives the capability of LLM-agents to improve via lifelong learning, and to what extent is this capability present in current LLMs? How does it relate to the in-context learning ability of the base LLM, and how can we modulate it? For example, Wang et al. (2023a) note that replacing GPT-4 with GPT-3.5 in their agent caused the agent’s performance to plummet. However, it is unclear whether this was due to GPT-4 being a better in-context learner (and therefore, better able to improve based on feedback) or due to GPT-4 being inherently more capable. (Section 2.5.1)
How can we enable LLM-agents to be more robust to underspecification (Ruan et al., 2023) and make them act more conservatively in the face of uncertainty, or when performing high-impact actions? (Section 2.5.2)
How can we quantify and benchmark the propensities of LLM-agents to engage in undesirable behaviors like deception, power-seeking, and self-preservation? Can we use interpretability techniques to identify why LLM-agents exhibit such behavior? Can we explain why such undesirable behaviors arise in the first place (e.g. is it due to specific examples or data in pretraining or finetuning) and how can we modify our training pipelines to mitigate such behavior? (Section 2.5.3)
How can we build more robust monitoring systems for LLM agents? This could benefit from a better understanding of issues such as reward hacking, situational awareness, and deception. (Section 2.5.4)
How should the monitoring systems for LLM-agents be evaluated? What is a good threat model that may reveal adversarial vulnerabilities of the monitoring system? (Section 2.5.4)
How can we better understand the safety issues posed by different affordances and how can we assure that particular affordances provided to an LLM-agent are safe? (Section 2.5.5)

Multi-Agent Safety is Not Assured by Single-Agent Safety

Summary Thread

Multi-agent alignment and safety is distinct from single-agent alignment and safety, and assurance will require deliberate efforts on the part of agent designers. The possible safety risks that must be dealt with range from correlated failures that might occur due to foundationality of the LLM-agents to collusion between LLM-agents. At the same time, confronting social dilemmas requires LLM-agents have the ability to cooperate successfully with each other and with humans, even when their objectives might differ.

How do pretraining, prompting, safety finetuning, etc. shape the behavior of a LLM-agent within multi-agent settings? In particular, what is the role of pretraining data on agents’ dispositions and capabilities? (Section 2.6.1)
How can we evaluate or benchmark cooperative success and failure of LLM-based systems? How can existing environments, such as those of Park et al. (2023a); Yocum et al. (2023); Mukobi et al. (2023), be leveraged to study this? Can we create new LLM-agent analogues of popular multi-agent benchmarks, e.g. Melting Pot, or Hanabi? (Section 2.6.1)
How can we leverage foundationality to enable LLM-agents to better cooperate with each other and achieve outcomes with higher social welfare? (Section 2.6.2)
How can we evaluate and improve robustness of LLM-agents to correlated failures? Is quality-diversity-based finetuning an effective way to improve robustness of LLM-agents to correlated failures? How else can we improve robustness of LLM-agents to correlated failures? (Section 2.6.2)
Do groups of LLM-agents show emergent functionality or any form of self-organization? What worrisome capabilities are more likely to emerge in multi-agent contexts that are absent in single-agent contexts? (Section 2.6.3)
Can we design benchmarks and adversarial evaluations to study colluding behaviors between LLM-agents? (Section 2.6.4)
How can collusion between LLM-agents be prevented and detected? How can we assure that the game mechanisms (which are often implicit) are robust to collusion when deploying multiple LLM-agents in the same context? Can we design “watchdog” LLM-agents that detect colluding behavior among LLM-agents? (Section 2.6.4)
How can we train LLM-agents to avoid colluding behavior? (Section 2.6.4)
How can insights and techniques from the multi-agent reinforcement learning literature (e.g. utility transfer, contracting, reputation mechanisms) be adapted for LLM-agents? What adjustments need to be made in theory, and in practice, to unlock similar benefits? (Section 2.6.5)

Safety-Performance Trade-offs are Poorly Understood

Summary Thread

The high-level challenge is to improve our understanding of safety-performance trade-offs, in multiple different ways. As a foundation for future research, a better formalization and classification of different types of trade-offs is important. This will be assisted by the development of clear metrics, in particular for different axes of safety. Building on those, empirical investigations could answer important questions about the severity of these trade-offs and let us track progress in their mitigation. As a complement to such measurements, we should also aim to understand the causes of these trade-offs and whether or not these trade-offs are fundamental in nature.

What are the best metrics to measure performance, and in particular, safety, in ways that are representative of real-world usage of AI systems and that can be applied across different LLMs and safety methods? Are metrics such as Elo ratings valid for measuring the safety of different LLMs relative to each other? (Section 2.7.1)
How can the safety of LLMs be disentangled from their performance? Do there exist useful “knobs” for practitioners to trade-off safety against performance? (Section 2.7.2)
Can we develop LLM ‘savants’ that excel in some domains while remaining selectively ignorant about other areas (i.e. those which pose safety concerns, such as knowledge of weapons)? (Section 2.7.2)
What are the various axes of safety and performance, and along which axes are there safety-performance trade-offs? Which of those trade-offs are especially important, in the sense of creating strong incentives to sacrifice safety? Can we identify high-level clusters of instances of safety/performance trade-offs? (Section 2.7.3)
How do safety-performance trade-offs vary depending on the deployment context? In particular, in what ways do safety-performance trade-offs for LLM-agents differ from trade-offs for LLM-assistants? What safety-performance trade-offs exist in the development stage? (Section 2.7.3)
What are the ‘causes’ of safety-performance trade-offs? Are these trade-offs for LLMs fundamental in nature or can they be overcome by development of better methods? (Section 2.7.4)

Development and Deployment Methods

Pretraining Produces Misaligned Models

Summary Thread

Misaligned pretraining of the LLMs is a major roadblock in assuring their alignment and safety. The chief cause of this misalignment is widely believed to be that LLMs are trained to imitate large-scale datasets that contain undesirable text samples. The large scale of these datasets makes auditing and manual filtering of these datasets difficult. There is a need to develop scalable techniques for data filtering, data auditing, and training data attribution to help identify harmful data. Furthermore, even after harmful data is identified, further research is needed to find the most effective ways to address the harmful data. In addition to directly improving the alignment and safety of pretrained models, future work could explore ways in which pretraining could be modified to facilitate other processes (e.g. safety finetuning or interpretability analysis) that can help to assure alignment and safety.

How can the methods for detection of harmful data be improved? The complex nature of harmful data (Rauh et al., 2022) makes it difficult to develop automated methods to effectively remove all such data. Can we use feedback from human labelers, in a targeted fashion, to directly improve the quality of the pretraining dataset? (Section 3.1.1)
How can the effects of harmful data be effectively mitigated? Instead of removing harmful data, can it be edited, or rewritten, to remove the harmful aspects e.g. by an existing LLM? Alternatively, can we add synthetic data, generated procedurally, to the model such that the effects of harmful data are mitigated? (Section 3.1.1)
How can we develop static dataset auditing techniques to identify harmful data of various types? Can existing techniques (e.g. Elazar et al., 2023) be extended for this purpose? (Section 3.1.2)
How can dynamic dataset auditing techniques (e.g. Siddiqui et al., 2022), which leverage knowledge of training dynamics, be used to audit LLM pretraining datasets? (Section 3.1.2)
How can we further scale training data attribution methods, in particular, those utilizing influence functions? How can we leverage insights from training data attribution methods to improve the quality of the pretraining dataset? (Section 3.1.3)
How can the effectiveness of pretraining with human feedback (PHF) techniques be improved? Korbak et al. (2023) utilize conditioning on binary tokens only (good/bad); how can it be generalized to more granular forms of feedback e.g. harmless and helpfulness scores? In what other ways can conditional training at train time, like PHF, be used to improve the alignment of a pretrained model? (Section 3.1.4)
How can the pretraining process be modified so that the models are more amenable to interpretability-based analysis? (Section 3.1.5)
Can we develop task-blocking language models, i.e. pretrain language models in a way that they are highly resistant to learning or performing specific harmful tasks? (Section 3.1.5)

Finetuning Methods Struggle to Assure Alignment and Safety

Summary Thread

A major goal of finetuning is to remove potentially undesirable capabilities in a model while steering it toward its intended behavior. However, current approaches struggle to remove undesirable behaviors, and can even actively reinforce them. Adversarial training alone is unlikely to be an adequate solution. Mechanistic methods that operate directly on the model’s internal knowledge may enable deeper forgetting and unlearning. Finally, behind these technical challenges is a murky understanding of how finetuning changes models and why it struggles to make networks “deeply” forget and unlearn undesirable behaviors.

To what extent does pretraining determine the concepts that the LLM uses in its operation? To what extent can finetuning facilitate fundamental changes in the network’s behavior? Can we develop a fine-grained understanding of changes induced by finetuning within an LLM? (Section 3.2.1)
Can we improve our understanding of why LLMs are resistant to forgetting? How is this resistance affected by the model scale, the inductive biases of the transformer architecture, the optimization method, etc.? (Section 3.2.1)
Can we create comprehensive benchmarks to assist in evaluating and understanding generalization patterns of finetuning? (Section 3.2.2)
Can we develop more sophisticated finetuning methods with better generalization properties? For example, by basing them on OOD generalization methods in deep learning or using explanation-based language feedback (e.g. critique) to prevent reliance on spurious features? (Section 3.2.2)
How can we ensure that finetuning on a small number of adversarial (“red-teamed”) samples generalizes correctly? Are process supervision and latent adversarial training viable methods in this regard? (Section 3.2.3)
Can machine unlearning techniques be used or extended to precisely remove knowledge and capabilities from an LLM? (Section 3.2.4)
How can we reliably benchmark methods for targeted modification of LLM behavior? (Section 3.2.4)
How can unknown undesirable capabilities be removed from an LLM? Are compression and/or distillation effective ways to achieve this behavior? What are the kinds of capabilities that are lost when an LLM is compressed, and/or distilled? (Section 3.2.5)

LLM Evaluations Are Confounded and Biased

Summary Thread

Many issues undermine our ability to comprehensively and reliably evaluate LLMs. Issues such as prompt-sensitivity, test-set contamination, and targeted training to suppress undesirable behaviors in known context confound evaluation. The validity of evaluation is further compromised by biases present in LLMs (which are used to evaluate other LLMs), and human evaluators. Furthermore, there exist ‘systematic biases’ that create blindspots in LLM evaluations, e.g. limited evaluations on low-resource languages. Finally, considering the rapid rate of improvement in LLMs’ capabilities, we need robust strategies to implement scalable supervision which are currently lacking.

Can we develop automated methods that reliably find the best prompt for a given task or task instance?
How can we account for prompt sensitivity when evaluating an LLM? (Section 3.3.1)
How can the evaluations of LLMs be made trustworthy given the difficulty of assuring that there is no test-set contamination? Can we develop methods that can detect whether a given text is contained in the training dataset in mutated form, e.g. paraphrased or translated into another language? (Section 3.3.2)
How can training data attribution methods be used to detect cases of LLMs responding to queries based on memorized knowledge when the training dataset is known? (Section 3.3.2)
What measures can evaluation developers take to prevent leakage of the evaluation data into an LLM’s training dataset? How effective are existing measures such as canary strings, hiding datasets behind APIs, or password-protecting dataset files in detecting accidental and/or deliberate leakage? (Section 3.3.2)
How can the failure modes of an LLM be uncovered when the LLM has been explicitly trained to hide those failure modes? Are there general techniques, such as persona modulation, or counterfactual evaluation, that can be used for this purpose? (Section 3.3.3)
What are the various ways in which an evaluation of an LLM by an LLM may be biased or misleading? How can LLM-based evaluation be made robust against such biases? (Section 3.3.4)
What are the limitations, and strengths, of Constitutional AI-based LLM evaluation? How can we develop a nuanced understanding of how principles given in the constitution affect the evaluation of different LLM behaviors? How do LLMs handle issues such as underspecification or conflict in the constitutional principles? (Section 3.3.4)
How can evaluation done by humans be made robust against the various known biases and cognitive limitations of humans? How can LLMs be used to complement human evaluators to improve the quality of human evaluations?
Can we develop better models of how humans generate their preferences than the widely-used Bradley-Terry model?
How can the ‘blindspots’ in LLM evaluation resulting from systematic biases be avoided? (Section 3.3.6)
Can we formalize different proposed methods for scalable oversight in the context of evaluating LLMs? Can this formalization be used to understand the relative strengths and weaknesses of these proposals, through theoretical and empirical research? Which proposed methods are complementary, and which are interchangeable? (Section 3.3.7)
Are there any decomposition strategies that generalize across tasks? Prior work has proposed task-specific decomposition strategies, e.g. for book summarization (Wu et al., 2021) or writing code (Zhong et al., 2023b). Do the proposed decompositions generalize to other tasks? Can language models automatically decompose tasks? (Section 3.3.7)

Tools for Interpreting or Explaining LLM Behavior Are Absent or Lack Faithfulness

Summary Thread

Interpretability methods like representation probing, mechanistic interpretability and externalized reasoning suffer from many challenges that limit their applicability and utility in interpreting LLMs. Indeed, we do not yet have good methods for efficiently obtaining explanations of model reasoning that are faithful and which explain 95%+ of the variance in model behavior for tasks with non-trivial complexity. Some of the challenges in interpreting models are ‘fundamental’ in nature, e.g. lack of clarity about what abstractions are present within models that could be used for interpretability, mismatch between concepts and representations used by humans and AI models, and lack of reliable evaluations to measure faithfulness of the interpretations and explanations. Representation probing and mechanistic interpretability methods suffer from additional challenges such as depending on an assumption of linear feature representation, the polysemantic nature of neurons, high sensitivity of unsupervised and supervised concept-discovery methods to the choice of datasets used to discover these concepts, and challenges in scaling feature interpretation and automated circuit discovery methods. Methods for externalized reasoning similarly suffer from challenges such as lack of faithfulness in natural-language-based externalized reasoning, and externalized reasoning based on formal semantics being only applicable to a limited number of tasks.

How can we discover (computational) abstractions already present within a neural network? (Section 3.4.1)
How can we design training objectives so that the model is incentivized to use known specific abstractions? (Section 3.4.1)
Can we develop general strategies that help us learn, and understand, concepts used by (superhuman) models? (Section 3.4.2)
How can we train large-scale models such that the concepts they use are naturally understandable to humans? (Section 3.4.2)
How can we establish benchmarks to standardize evaluations of the faithfulness of various interpretability methods, in particular, mechanistic interpretability methods? (Section 3.4.3)
How can we efficiently evaluate the faithfulness of externalized reasoning in natural language? Can we develop red-teaming methods that help us generate inputs on which model behavior is inconsistent with the given explanation? Can we develop scalable oversight techniques to help humans detect such inconsistencies? (Section 3.4.3)
When should we be concerned about overfitting to the particularities of interpretability methods when using them to construct optimization targets? How might we mitigate such concerns? (Section 3.4.4)
To what extent do models encode concepts linearly in their representations? What causes a concept to be encoded linearly (or not)? (Section 3.4.5)
Can we fully determine the causes of feature superposition and polysemanticity within neural networks? Can we develop scalable techniques that deal with these issues? (Section 3.4.6)
How can we mitigate, or account for, the sensitivity of interpretability results to the choice of dataset used for model analysis? (Section 3.4.7)
To what extent can LLMs be used to help scale feature interpretation? (Section 3.4.8)
Can we develop efficient methods for automated circuit discovery within neural networks? (Section 3.4.9)
Can we understand why natural-language-based externalized reasoning can be unfaithful despite often resulting in improved performance? To what extent does training based on human feedback, which promotes the likeability of model responses, contribute to the unfaithfulness of model explanations? (Section 3.4.10)
Does directly supervising the reasoning training process (e.g. via process supervision) improve or worsen the faithfulness of model reasoning? (Section 3.4.10)
What kind of structures can be imposed on natural-language-based externalized reasoning to force the model to use consistent reasoning patterns across inputs? (Section 3.4.10)
What level of completeness of explanations is needed to avoid alignment failures in practice, considering there is an inherent trade-off between completeness and efficiency (of the evaluation) of the natural language explanations? Can we develop dynamic structured reasoning methods that may allow human evaluators to iteratively seek more details regarding specific aspects of reasoning as required? (Section 3.4.10)
What kinds of tasks can we solve with structured reasoning and program synthesis, rather than relying on LLMs end-to-end? Can we discover how to perform structured reasoning for difficult tasks that are not typically solved in this manner, e.g. open-ended tasks like question answering? (Section 3.4.11)

Jailbreaks and Prompt Injections Threaten Security of LLMs

Summary Thread

Jailbreaking and prompt injections are the two prominent security vulnerabilities of current LLMs. Despite considerable research interest, the research on these topics is still in infancy, and many open challenges remain, both in terms of developing better attacks as well as putting up defenses against these attacks. Successfully defending against these attacks could be achieved either via improving robustness of the LLM itself, or by defending the LLM as a system. These challenges are likely to be exacerbated further due to the addition of various modalities to LLMs and the deployment of LLMs in novel applications, e.g. as LLM-agents.

How can we standardize the evaluation of jailbreak and prompt injection success? This may be helped by the development of appropriate threat models with clear and standardized measures of success, or by improving the efficiency of adversarial attacks and corresponding benchmarks. (Section 3.5.1)
How can we make white-box attacks for LLMs more efficient and reliable? For example, can we better leverage gradient-based optimization, or develop more sophisticated discrete optimization schemes? (Section 3.5.2)
What are the similarities and differences between different types of jailbreaking attacks? Does robustness against one type of attack transfer to other types of attacks? Why and when do these attacks transfer across models? (Section 3.5.3)
What are the different ways in which LLMs can be compromised via adversarial attacks on modalities other than text, e.g. images? Is it possible to design robust multimodal models without solving robustness for each modality independently? (Section 3.5.4)
How do different design decisions and training paradigms for multimodal LLMs impact adversarial robustness? (Section 3.5.4)
Can we design secure systems around non-robust LLMs e.g. using strategies like output filtering and input preprocessing? And can we design efficient and effective adaptive attacks against such hardened systems? (Section 3.5.5)
Can LLMs course-correct after initially agreeing to respond to a harmful request? (Section 3.5.6)
Can we find better ways of using adversarial optimization to find jailbreaks, which go beyond aiming to elicit an initial affirmative response? (Section 3.5.6)
How can we prevent system prompts from being leaked? (Section 3.5.7)
How can we assure that the system prompt reliably supersedes user instructions and other inputs? Is there a way to implement “privilege levels” within LLMs to reliably restrict the scope of actions that a user can get an LLM to perform? Can we restrict the privilege of adversarially planted instructions found in “data”? (Section 3.5.7)
What kind of adversarial attacks may enable hijacking of LLM-based applications, and in particular LLM-agents? How effective is adversarial training against such attacks? How else can we prevent against such attacks? (Section 3.5.7)

Vulnerability to Poisoning and Backdoors Is Poorly Understood

Summary Thread

Data poisoning allows an adversary to inject specific vulnerabilities (“backdoors”) to a model by manipulating the training data. The majority of training data for LLMs comes from untrusted sources—internet or crowd-sourced workers. Hence, data poisoning attacks on LLMs are highly plausible. Despite this, data poisoning attacks on LLMs, and corresponding defense strategies, are critically underresearched at the moment. More research is needed to better understand the risks of poisoning attacks on LLMs through various modalities, and at different training stages, and how these risks can be mitigated.

Can we devise finetuning strategies that can serve as a proxy for pretraining, and leverage them to study the requirements and effects of poisoning attacks against LLMs at the pretraining stage? What are some strategies that could be used to defend against such attacks? (Section 3.6.1)
Which of the three stages of LLM training—self-supervised pretraining, instruction tuning, reinforcement learning from human feedback—is most vulnerable to poisoning attacks? What explains the relative differences in robustness against data poisoning among the three stages? (Section 3.6.2)
How does scale affect the vulnerability of LLMs to poisoning attacks at different stages of training? Is this effect different for task-specific vs. task-general poisoning? (Section 3.6.3)
Is out-of-context reasoning an effective attack vector for data poisoning attacks? If so, how can such attacks be countered? (Section 3.6.4)
How can multimodal LLMs be backdoored through modalities other than text? How can Vision-Language models be attacked via poisoning image inputs? Can encodings like base64 be an effective poisoning vector? (Section 3.6.5)
How can backdoors be detected in already-trained LLMs? Once detected, what are effective ways to “unlearn” them? (Section 3.6.6)

Sociotechnical Challenges

Values to Be Encoded within LLMs Are Not Clear

Summary Thread

Collaborations with philosophers, ethicists, moral psychologists, governance researchers, and others are required to better understand the pros and cons of different approaches to encoding values and how to resolve issues such as conflicts between values. At the same time, what values we will encode within our models may be heavily biased by the tractability of different approaches to encoding values. Improving the technical feasibility of encoding different values may help mitigate this problem, but it is necessary to have a broad and critical consideration of diverse value systems, as well as other ways of understanding alignment and safety, to ensure the field of alignment remains aligned with its own goals.

What justifies choosing one set of values (e.g. helpfulness, harmlessness, honesty) over other sets of values?
How does the type of a system (e.g. assistant vs. agent) and the context of its use affect what values we might want to encode within our model? (Section 4.1.1)
How does the capabilities profile of a model impact what values we might want to encode within a model? Should the values we encode within LLMs remain the same or change if the LLMs become more performant (e.g. due to scaling)? (Section 4.1.1)
How do different methods for communicating and encoding values differ in terms of information content? How should these different types of messages about values be interpreted, e.g. should principles or stated preferences take precedence over revealed preferences? (Section 4.1.1)
How can conflicts between various values or principles proposed to align model behavior be resolved effectively (e.g. harmlessness and helpfulness in 3H framework)? (Section 4.1.2)
How can we design methods to balance conflicting values appropriately or enable (groups of) humans to resolve the conflicts between their values? (Section 4.1.2)
How do we mitigate the risk of value imposition? Can we design governance mechanisms that allow LLMs’ values to be chosen in a systematically fair and just way? (Section 4.1.2)
How are we to account for changes in values over time? (Section 4.1.2)
To what extent will the ’technical lottery’ play a role in what values we encode in our models? For values that may be technically infeasible to encode, can we develop technically feasible robust proxies that we could use instead? (Section 4.1.3)
How can we robustly evaluate what values are encoded within a model? (Section 4.1.4)
How can we determine whether a model understands the encoded values or is only mimicking them? Relatedly, to what extent can we claim that an LLM has values, given an LLM is perhaps more like a superposition of various personas with varying characteristics? (Section 4.1.4)
How are values transmitted from one stage of development to another? (Section 4.1.4)
What are the limitations of framing the design of AI technologies that broadly benefit humanity in terms of ‘value alignment’ with humanity? Can we develop alternative or complementary framings that might help address those limitations? (Section 4.1.4)

Dual-Use Capabilities Enable Malicious Use and Misuse of LLMs

Summary Thread

There is a strong risk that dual-use capabilities of LLMs may be exploited for malicious purposes. LLMs may be misused towards generating targeted misinformation at an unprecedented scale. The coding capabilities of LLMs may be misused by malicious actors to mount cyberattacks with greater sophistication and at higher frequencies. LLMs are quite effective as content moderation tools; this may mean that LLMs get adopted to enact mass surveillance and censorship. LLMs may be used to power autonomous weapons, creating a possibility of physical harm from them. Other misuses of LLMs may occur as LLMs are applied to various domains. Active research is needed to better understand these risks and to create effective mitigation strategies.

Can we develop a calibrated understanding of how current and future LLMs could be used to scale up and amplify disinformation campaigns? What level of human expertise is required to effectively use LLMs to generate targeted misinformation? (Section 4.2.1)
Can we develop reliable techniques to attribute LLM-generated content, helping track its spread? Can watermarking be an effective measure given the growing availability of openly accessible LLMs? (Section 4.2.1)
How can individuals be protected against harms caused by AI-assisted deepfakes? (Section 4.2.1)
What measures can be taken to prevent LLMs from producing sophisticated misinformation, while simultaneously enhancing their capacity to identify and mitigate misinformation? (Section 4.2.1)
Can we develop effective tooling and mechanisms to combat misinformation on online platforms? How can LLMs themselves be applied to detect and intervene against LLM-generated misinformation? (Section 4.2.1)
How can the military applications of LLMs, especially LLM-powered autonomous weapons, be regulated? There appears to be a mass consensus within the machine learning community that LLMs, and other AIs, should not be weaponized; how can this consensus be leveraged to create legislation pressure to outlaw autonomous weapons development across the world? (Section 4.2.4)
How may LLMs contribute to scaling and personalization of social engineering-based cyberattacks? (Section 4.2.2)
To what extent do LLMs reduce the threshold of technical expertise required for executing a successful cyberattack? (Section 4.2.2)
Do advances in capabilities of LLMs cause LLMs to become better at crafting jailbreaking attacks? If so, how can the safety of LLMs from jailbreaking attacks (designed by other LLMs) be assured? (Section 4.2.2)
How effective are LLMs at surveillance and censorship? How may LLMs, on their own or in combination with other technologies (e.g. speech-to-text softwares), contribute to the expansion and sophistication of current surveillance operations? How can we limit the use of LLMs for surveillance? (Section 4.2.3)
How might current, or future, LLM-based technologies (including chemical LLMs and specialized biological design tools based on LLMs) be misused in the design of hazardous biological and chemical technologies? (Section 4.2.5)
Can we identify how LLMs may get misused across various domains, such as health and education? What regulations are required to prevent such misuses? In general, how can we best understand LLM use cases and identify those with significant misuse potential? (Section 4.2.6)
Can we design robust watermarking mechanisms that may help us identify LLM-generated content? (Section 4.2.7)
What mechanisms can be used to determine attribution for content generated using openly available models for which watermarking may not be an appropriate solution (as it could be easily undone)? (Section 4.2.7)

LLM-Systems CanBe Untrustworthy

Summary Thread

Trustworthiness, defined as the assurance that users will not experience accidental harm when using an LLM, is critical for LLM alignment and safety. Among other things, trustworthiness requires ensuring that users can use LLMs safely despite the occasional unreliability of their outputs; preventing users from developing an overreliance on LLMs; mitigating harm resulting from biased representations of societal groups learned by LLMs; ensuring appropriate behavior across all contexts; preventing the generation of toxic and offensive content by LLMs; and reliably preserving user privacy across various contexts.

What sort of societal biases are present within LLMs? Can evaluations on complex, real-world tasks uncover those biases? (Section 4.3.1)
To what extent is the global south represented within LLMs? What harmful representations of the global south are present within the LLMs? (Section 4.3.1)
Can we develop better and more comprehensive tools for implicit and explicit toxicity and offensive language detection? (Section 4.3.1)
To what extent are LLM users able to accurately perceive the reliability of LLM outputs? Does a user’s ability to assess the reliability of an LLM’s response improve with continued interaction? (Section 4.3.2)
What extrinsic measures can be implemented to ensure that LLM users utilize the technology safely, especially considering the potential unreliability of LLM responses? (Section 4.3.2)
How can we mitigate the potential harms arising from overreliance on LLMs? How can we prevent users from becoming desensitized to disclaimers about LLMs’ limitations? (Section 4.3.3)
How can we make LLMs understand the sensitivity of information in a given context, and preserve contextual privacy? (Section 4.3.4)

Socioeconomic Impacts of LLM May Be Highly Disruptive

Summary Thread

The socioeconomic impacts of LLMs have the potential to be highly disruptive if not effectively managed. LLMs are likely to adversely affect the workforce, exacerbate societal inequality, and introduce new challenges for the education sector. Furthermore, the implications of LLM-based automation on global economic development remain uncertain. These challenges are complex and systemic in nature; there do not exist any simple fixes. To devise solutions, we need to develop a deep and nuanced understanding of these issues. Answering the following questions may help make progress towards this goal.

How can we better understand and forecast the disruptive effects of LLMs on job availability and job quality in different sectors? How can displaced workers be helped to transition to other sectors? (Section 4.4.1)
How can LLM developers best conduct impact assessments or risk assessments for whether AI systems improve working conditions (by augmenting workers) or not? (Section 4.4.1)
How can LLM-based systems be designed to augment workers and improve working conditions, as opposed to automating and discplacing workers? (Section 4.4.1)
How can we best educate business leaders to leverage the growth benefits of LLMs in a way that is minimally disruptive to society? (Section 4.4.2)
How likely is the market for advanced LLMs to become a monopoly or oligopoly? What will the ramifications of such market concentration be on wealth distribution across society? (Section 4.4.2)
How can we ensure equitable access to LLMs for individuals of all socioeconomic backgrounds? (Section 4.4.2)
To what extent are LLMs likely to exacerbate an ‘intelligence divide’ based on access to the most advanced LLMs? (Section 4.4.2)
How can LLM developers best keep economic policymakers updated on the technological scenarios that lie ahead and their economic implications (e.g. by writing policy briefs or writing informal educational documents)? (Section 4.4.2)
How do we best educate the workforce for the effective use of LLMs and retrain disrupted workers? (Section 4.4.3)
What factors might impede adoption of LLM-based technology in educational contexts? How can these factors be mitigated? (Section 4.4.3)
How can we better understand the second-order effects of LLM-driven automation on AI safety and alignment? E.g. could LLM-driven automation reduce the agency and ability of skilled labor to resist immoral usage of technologies? (Section 4.4.3)
How might LLM-based automation negatively impact the economies of Global South countries? (Section 4.4.4)
How accessible are LLMs to Global South populations? How can this accessibility be improved? What measures can be taken by governments to address issues related to lack of internet connectivity, poor tech literacy etc.? (Section 4.4.4)
How can LLMs be used to help address some of the issues that hinder the economic development of Global South countries? For example, how can LLMs be used to help improve the quality of education available to Global South populations? (Section 4.4.4)

LLM Governance Is Lacking

Summary Thread

Effective governance of LLMs is critical for ensuring that LLMs prove a beneficial addition to societies. However, efforts to govern LLMs, and related AI technologies, remain nascent and ill-formed. The governance of LLMs is made challenging by various meta-challenges ranging from lack of scientific understanding and technical tools required for governance to the risks of regulatory capture by corporations. From a more practical lens, concrete and comprehensive proposals to govern LLMs remain absent and the various governance mechanisms (e.g., deployment governance, development governance, compute governance, data governance) unfortunately are not adequately developed yet.

How should the governance approach change depending on how rapidly the capabilities of models are advancing, the rate at which they are being productized (and hence proliferating in the society) and the degree to which the technical understanding of a model is lacking? (Section 4.5.1)
What policy interventions can be taken by governing bodies to support research to alleviate technical limitations inhibiting effective governance of LLM-based systems? (Section 4.5.1)
Should governing bodies aim to moderate the pace of progress in AI? If so, what governance mechanisms could be used for this? (Section 4.5.11)
How likely is the slow, bureaucratic nature of governments to negatively impact governance of LLMs, and other AI technologies? What are the relative merits and demerits of various measures (such as forming public-private partnerships or formalizing regulatory markets) that could be taken by the governments to mitigate this issue? (Section 4.5.2)
What measures can be taken to disincentivize irresponsible approaches to AI development? Can we design regulations that mandate the safe and responsible development of AI models? (Section 4.5.3)
Can we develop better (game-theoretic) models of AI race dynamics, and understand how different governance interventions might alter these dynamics? (Section 4.5.3)
How can we ensure the involvement of all stakeholders, particularly the marginalized groups most impacted by AI and LLMs in the creation, interpretation, and application of governance tools? How can we avoid legislation that might favor the interests of tech companies developing LLMs over the interests of other social groups? (Section 4.5.4)
Can we develop better structures for corporate governance that might protect public interests in a better way? (Section 4.5.4)
Can we develop technical tools that may help measure, detect, and counteract the role of technology companies in shaping public opinion, by influencing the content consumed by the public? (Section 4.5.4)
Can we develop a better understanding of the influence of corporate power on academia, policy, and research? What are the potential detrimental effects of such influence? (Section 4.5.4)
Can we develop a better understanding of the factors (arms race between nations, different national-level approaches to AI regulation) that might negatively impact international governance for LLMs? (Section 4.5.5)
What are the different ways through which LLMs could be governed in a unified way internationally? (Section 4.5.5)
How can clear lines of accountability be established for harms associated with LLMs? (Section 4.5.6)
How can governance tools be used to mitigate the risks associated with multi-agent interactions between LLM-agents and between humans and LLM-agents? (Section 4.5.6)
What are the relative merits and demerits of different governance mechanisms, use-based governance, deployment governance, developmental governance, data governance, and compute governance? How may we effectively combine all the governance mechanisms to achieve the most favorable outcomes? (Section 4.5.7)(Section 4.5.8)(Section 4.5.9)(Section 4.5.10)(Section 4.5.11)
What existing regulations and governance institutions can be applied for use-based governance, at national and international level? (Section 4.5.7)
How can we proactively identify problematic uses and address them via use-based governance? (Section 4.5.7)
How can use-based governance help deter misuses that are likely to be perpetrated by governments? Can it be effectively used to regulate against instances of self-harm? (Section 4.5.7)
Can we develop a better understanding of the risks, and benefits, associated with various model deployment strategies? (Section 4.5.8)
What kind of regulations can be adopted to assure model deployers perform their due diligence in regards of assuring model safety before deploying a model? (Section 4.5.8)
How can we create appropriate legal frameworks for deployment governance of LLM-agents? These frameworks will have to address what the regulatory criteria be to allow the deployment of an LLM-agent; how that agent may be monitored once deployed and who would be accountable for any harm incurred by the agent in the deployment. (Section 4.5.8)
How can we efficiently assure an LLM-based system after a system upgrade to the LLM, or some other component of LLM-based system? (Section 4.5.8)
What are the merits and demerits of requiring deployers to seek licenses, or register, with regulators prior to the release of the model? Should the requirements that deployers have to meet be different for different deployment strategies? (Section 4.5.8)
How can developers best identify and share knowledge about responsible LLM (and AI) development among themselves? How can such practices be enshrined as legally-binding standards? (Section 4.5.9)
What are the merits and demerits of mandating licensing for the development of frontier AI technologies? (Section 4.5.9)
Can we develop technical tools that may help us verify whether particular developmental practices were followed or not in the development of a given model? (Section 4.5.9)
To what extent concerns around regulatory flight are likely to impede the effective governance of LLMs? (Section 4.5.9)
What are the merits and demerits of ‘responsible scaling policies’ issued by different LLM developers? (Section 4.5.9)
How can we establish and assure the rights of data creators (e.g., writers) and the rights of data workers (e.g., workers hired to generate data for LLM training)? Are data trusts (Chan et al., 2022) a practical solution in this regard? (Section 4.5.10)
How can we verify that a particular model was indeed trained exclusively on the data claimed as training data by the model creator? (Section 4.5.10)
Who owns the data created by an LLM? This is arguably one of the most critical questions in governance with downstream impact on other important questions of who bears responsibility for LLM outputs that cause harm to society, and who can profit from the LLM outputs? (Section 4.5.10)
Can we develop concrete proposals for how compute governance could be exercised in practice? (Section 4.5.11)
To what extent the current, and any future, proposals for compute governance are robust to advances in distributed training? (Section 4.5.11)
How will compute governance proposals be impacted by the changes in the structure of the compute-providing industry? (Section 4.5.11)
How can compute governance be leveraged to enhance the ability of the independent scientific community to conduct investigations that may uncover flaws in LLMs that could otherwise be overlooked? (Section 4.5.11)

Authors

Usman Anwar¹, Abulhair Saparov^*2, Javier Rando^*3, Daniel Paleka^*3, Miles Turpin^*2, Peter Hase^*4, Ekdeep Singh^*5, Erik Jenner^*6, Stephen Casper^*7, Oliver Sourbut^*8, Benjamin Edelman^*9, Zhaowei Zhang^*10, Mario Gunther^*11, Anton Korinek^*12, Jose Hernandez-Orallo^*13, Lewis Hammond^†8, Eric Bigelow^†9, Alex Pan^†6, Lauro Langosco^†1, Tomasz Korbak^†14, Heidi Zhang^†15, Ruiqi Zhong^†6, Seán Ó hÉigeartaigh^‡1, Gabriel Rachet^†16, Giulio Corsi^‡1, Alan Chan^‡17, Markus Anderljung^‡17, Lillian Edwards^‡19,, Yoshua Bengio^‡19, Danqi Chen^‡20, Samuel Albanie^‡1, Tegan Maharaj^‡21, Jakob Foerster^‡8, Florian Tramer^‡3, He He^‡2, Atoosa Kasirzadeh^‡22, Yejin Choi^‡23, David Krueger^‡1

^*indicates major contribution, ^†indicates minor contribution, ^‡indicates advisory role.

¹University of Cambridge, ²New York University, ³ETH Zurich, ⁴University of North Carolina, ⁵University of Michigan, ⁶University of California, Berkeley, ⁷Massachusetts Institute of Technology, ⁸University of Oxford, ⁹Harvard University, ¹⁰Peking University, ¹¹Munich Center for Mathematical Philosophy, ¹²University of Virginia, ¹³Universitat Politècnica de València, ¹⁴University of Sussex, ¹⁵Stanford University, ¹⁶Modulo Research, ¹⁷Center for the Governance of AI, ¹⁸Newcastle University, ¹⁹Mila - Quebec AI Institute, Université de Montréal, ²⁰Princeton University, ²¹University of Toronto, ²²University of Edinburgh, ²³University of Washington

Cite this work

@article{anwar2024foundational, title={Foundational Challenges in Assuring Alignment and Safety of Large Language Models}, author = {Usman Anwar and Abulhair Saparov and Javier Rando and Daniel Paleka and Miles Turpin and Peter Hase and Ekdeep Singh and Erik Jenner and Stephen Casper and Oliver Sourbut and Benjamin Edelman and Zhaowei Zhang and Mario Gunther and Anton Korinek and Jose Hernandez-Orallo and Lewis Hammond and Eric Bigelow and Alexander Pan and Lauro Langosco and Tomasz Korbak and Heidi Zhang and Ruiqi Zhong and Seán Ó hÉigeartaigh and Gabriel Rachet and Giulio Corsi and Alan Chan and Markus Anderljung and Lillian Edwards and Yoshua Bengio and Danqi Chen and Samuel Albanie and Tegan Maharaj and Jakob Foerster and Florian Tramer and He He and Atoosa Kasirzadeh and Yejin Choi and David Krueger}, year={2024}}