AI's Split Personality Confirmed! 300,000 Trick Questions Tear Off the "Fig Leaf" of OpenAI and Google
Confirmed! Do LLMs Have Their Own "Values"?
Imagine asking AI to help you create a business plan that is both "profitable" and "conscientious."
When these two goals conflict, who will the AI listen to? Will it "have a split personality"?
Recently, Anthropic, in collaboration with Thinking Machines, carried out a major study.
They designed 300,000 such "dilemma" scenarios and extreme stress tests to "grill" the most powerful cutting-edge large models on the market, including those from OpenAI, Google Gemini, Anthropic, and Elon Musk’s xAI.

The results show that these AIs not only have drastically different "personalities," but their "codes of conduct" (i.e., "model specifications") are themselves full of contradictions and loopholes!
Today, we’ll dive deep into this report to explore the "diverse landscapes" of the AI world.
"Model specifications" are the codes of conduct that large language models are trained to follow.
In plain terms, they are the AI’s "outlook on the world, life, and values" as well as its "behavioral guidelines," such as "be helpful," "assume good intentions," and "ensure safety."
This is the foundation for training AIs to "behave well."
In most cases, AI models follow these instructions without issues.
Beyond automated training, the specifications also guide human annotators in providing feedback during Reinforcement Learning from Human Feedback (RLHF).
But here’s the problem: what happens when these principles conflict?
These guidelines often "clash" in real-world scenarios. As mentioned earlier, "commercial benefits" and "social equity" may contradict each other. When the instruction manual fails to clarify how to proceed, the AI’s training signals become muddled, and it can only "guess" on its own.
Such mixed signals may reduce the effectiveness of alignment training, leading models to adopt inconsistent approaches when handling unresolved contradictions.
The research conducted by Anthropic in collaboration with Thinking Machines points out that the specifications themselves may have inherent ambiguities, or scenarios may force trade-offs between conflicting principles—resulting in vastly different choices by the models.
Experiments show that the high level of disagreement among cutting-edge models is closely related to specification issues, indicating significant gaps in current codes of conduct.

The research team revealed these "specification gaps" by generating over 300,000 scenarios that force models to choose between competing principles.
The study found that more than 70,000 of these scenarios show significant divergence among the 12 cutting-edge models.

The figure above shows a query that requires the model to make a tradeoff between "social equity" and "business effectiveness."
Researchers also found that this instruction manual is... well, quite a mess.
Through stress testing, they identified several major "critical flaws" in it—and this explains why AI sometimes seems so "schizophrenic."
The researchers brought in five OpenAI models and had them answer the same set of difficult questions.
The results revealed that for those questions that sparked fierce disagreements among the models, the probability of them collectively violating their own "instruction manual" surged by 5 to 13 times!

The authors measured the percentage of scenarios where all five OpenAI models violated their model specifications (referred to as frequent non-compliance). For scenarios with significant divergence in model responses, the authors observed notably more frequent non-compliance.
The study found that these frequent non-compliance scenarios typically involve direct contradictions or interpretive ambiguities within the model specifications, making it difficult (or impossible) for models to find answers that satisfy all principles.
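To make that measurement concrete, here is a minimal sketch of how such a "frequent non-compliance" rate could be computed from per-scenario violation judgments. The matrix below is made up purely for illustration; in the study, the judgments come from evaluator models, not hand-written booleans.

```python
import numpy as np

# violations[i, j] == True when OpenAI model j violated the spec on scenario i.
# In the study these judgments come from evaluator models; these rows are made up.
violations = np.array([
    [True,  True,  True,  True,  True ],   # all five violate -> frequent non-compliance
    [True,  False, True,  False, False],
    [False, False, False, False, False],
    [True,  True,  True,  True,  True ],   # all five violate
    [False, True,  False, False, True ],
])

rate = violations.all(axis=1).mean()       # fraction of scenarios where all five violate
print(f"frequent non-compliance rate: {rate:.0%}")   # 40% in this toy example
```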
Through stress testing, researchers observed that many high-disagreement scenarios expose direct internal conflicts in the specifications.
For example, the principle of "assuming best intentions" often contradicts safety restrictions.
When a user requests information that may be risky but could have legitimate research uses, the model faces a dilemma: complying with the user’s request may lead to potential harm, while refusing would violate the assumption of good intentions.
Beyond direct contradictions between principles, the specifications also leave room for interpretive ambiguity. To probe this, researchers used three models—Claude 4 Sonnet, o3, and Gemini 2.5 Pro—to evaluate model compliance with the specifications.
These evaluator models disagreed on what constitutes compliance, with only moderate consistency (Fleiss's Kappa = 0.42, where 1 indicates complete agreement and 0 indicates agreement no better than random chance).
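For readers curious about the statistic, here is a minimal sketch of how Fleiss' kappa can be computed from the evaluators' compliance votes. The vote counts below are invented for illustration; they are not the study's data.

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) count matrix.

    ratings[i, j] = number of raters that assigned item i to category j.
    Each row must sum to the same number of raters n.
    """
    n_items, _ = ratings.shape
    n_raters = ratings[0].sum()
    # Per-item agreement: fraction of rater pairs that agree on item i.
    p_i = (np.square(ratings).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = ratings.sum(axis=0) / (n_items * n_raters)
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 3 evaluator models judging 6 responses as
# [compliant, non-compliant] -- illustrative numbers only.
votes = np.array([
    [3, 0],
    [2, 1],
    [1, 2],
    [3, 0],
    [0, 3],
    [2, 1],
])
print(round(fleiss_kappa(votes), 2))   # about 0.30 for these made-up votes
```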
The study found that disagreements among evaluator models usually stem from subjective interpretations of the rules.
For instance, Gemini 2.5 Pro deemed the following response compliant, invoking the "conscientious employee" principle, which allows the model to push back on requests it deems contrary to the user's best interests.
However, Claude 4 Sonnet disagreed, arguing that the user was merely seeking to transform the provided material—a practice permitted under the model specifications.

This example involves evaluating whether a given response to a query complies with the specifications. As the authors emphasized in the case above, many differences stem from how different evaluators interpret the rules; these interpretations are subjective even for humans, indicating that the model specifications leave ample room for interpretation.
To measure the divergence among twelve cutting-edge models from Anthropic, OpenAI, Google, and xAI, researchers conducted "stress tests" on these leading models through value tradeoffs.
Interestingly, this stress testing specifically targets the "gray areas" in AI rules.
To systematically assess model characteristics, researchers randomly sampled 150,000 pairs of values from their corpus of over 3,000 values. They then prompted large language models (LLMs) to generate user queries that require balancing these value pairs.
The researchers noted that the initial tradeoff scenarios typically adopted a relatively neutral framework and did not push responding models to extremes.
To increase the difficulty for responding models, the research team applied "value biasing" to create variants that favor one value over the other.
This biasing process tripled the number of queries. Many generation attempts touched on sensitive topics, leading the generating models to refuse rather than produce usable scenarios; after filtering out refusals and incomplete content, the final dataset included over 410,000 scenarios.
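Putting those steps together, the generation loop might look roughly like the sketch below. The prompts and the `generate` callable are assumptions made for illustration, not the researchers' actual prompts, and the refusal filter is a crude stand-in for the paper's filtering.

```python
import itertools
import random

def build_scenarios(values, generate, n_pairs=150_000):
    """Sketch of the query-generation loop described above.

    `generate` is a hypothetical callable that sends a prompt to the
    generating LLM and returns its text response; the prompt wording
    here is an assumption, not the study's actual prompts.
    """
    pairs = random.sample(list(itertools.combinations(values, 2)), n_pairs)
    scenarios = []
    for a, b in pairs:
        neutral = generate(
            f"Write a realistic user query that forces a tradeoff "
            f"between the values '{a}' and '{b}'."
        )
        # Value biasing: one extra variant leaning toward each value -> 3x queries.
        variants = [
            neutral,
            generate(f"Rewrite the query so it leans strongly toward '{a}':\n{neutral}"),
            generate(f"Rewrite the query so it leans strongly toward '{b}':\n{neutral}"),
        ]
        # Crude stand-in for the paper's refusal/completeness filtering.
        scenarios.extend(q for q in variants if q and not q.lower().startswith("i can't"))
    return scenarios
```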
Additionally, researchers observed that different generative models produced unique query styles and exhibited distinct thematic biases in their most frequently generated scenarios.
To further enhance diversity, three different models were used for generation: Claude 4 Opus, Claude 3.7 Sonnet, and o3, each generating approximately one-third of the queries.
Finally, the study found that among all generative models, reasoning-based models produced significantly higher-quality queries in terms of difficulty and adherence to the original values.
Thus, all generation processes leveraged the extended thinking capabilities of Claude models and the reasoning-based o3 model.
To assess the diversity of the generated scenarios, researchers analyzed them using text embeddings.
They identified a subset of 300,000 generated scenarios where even the most similar query pairs elicited different model response behaviors. Within this subset, at least 150,000 queries either covered entirely different topics or expressed distinct pre-existing views on similar topics.
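One simple way to run such an embedding-based diversity check is to look at each query's similarity to its nearest neighbor. The sketch below assumes the embeddings have already been computed; it uses a brute-force similarity matrix, which would need approximate nearest-neighbor search at the 300,000-scenario scale.

```python
import numpy as np

def nearest_neighbor_similarity(embeddings: np.ndarray) -> np.ndarray:
    """Cosine similarity of each query to its closest *other* query.

    `embeddings` is an (n_queries x dim) matrix of text embeddings.
    Queries whose nearest neighbor is still dissimilar are clearly diverse;
    near-duplicates are only interesting if models answered them differently.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -1.0)   # ignore self-similarity
    return sims.max(axis=1)
```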
To identify scenarios that reveal flaws in model specifications, researchers applied multiple filtering metrics to the generated dataset.
The team measured the response divergence of cutting-edge LLMs when answering the generated queries.
Evaluations began with generating responses for each query from 12 leading models, including: five Claude models (Claude 4 Opus, Claude 4 Sonnet, Claude 3.7 Sonnet, Claude 3.5 Sonnet, and Claude 3 Opus), five OpenAI models (GPT 4.1, GPT 4.1 mini, GPT 4o, o3, and o4 mini), Gemini 2.5 Pro, and Grok 4.
Notably, many of the generated queries elicited significantly different responses among these cutting-edge models.
The study found that higher divergence typically corresponds to model specification issues, especially among models sharing the same specifications.
This divergence measure is essentially an uncertainty estimate and is related to query-by-committee methods.
For example, divergence among OpenAI models—all trained to follow OpenAI’s model specifications—reveals scenarios where models exhibit uncertainty, indicating that the specifications may provide ambiguous training signals.
To quantify divergence, researchers classified the intensity of each model’s preference for each value in the generated value pairs based on their responses.
The team adopted a two-stage approach: automatic rubric generation, followed by an automatic matching process to obtain final value scores.
First, given a query and its two seed values, Claude 4 Opus was prompted to generate a spectrum of answering strategies.
This spectrum ranges from extreme preference for one value (scoring 6 points) to extreme opposition (scoring 0 points), with intermediate strategies (scoring 1–5 points).
Subsequently, this spectrum was used as a rubric to classify responses from all 12 models.
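Once every response has been mapped onto the 0–6 spectrum, disagreement for a query can be summarized from the spread of the scores. The sketch below uses the standard deviation as that summary; the paper's exact divergence metric may differ, and the scores shown are invented.

```python
import numpy as np

# Hypothetical rubric scores for one query: each of the 12 models' responses
# has been matched to a point on the 0-6 value spectrum described above.
scores = np.array([6, 6, 5, 5, 1, 0, 6, 2, 1, 6, 0, 3])

# A large spread across models marks the query as a high-divergence scenario.
divergence = scores.std()
print(f"divergence = {divergence:.2f}")
```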
A weighted k-center objective was used for subset selection.
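A greedy heuristic is the usual way to approximate a k-center-style objective. The sketch below is an illustrative weighted variant, not necessarily the exact formulation used in the study: it repeatedly picks the scenario whose weighted distance to the already selected subset is largest, keeping the subset diverse while favoring heavily weighted (e.g. high-divergence) scenarios.

```python
import numpy as np

def greedy_weighted_k_center(embeddings: np.ndarray, weights: np.ndarray, k: int):
    """Greedy approximation of a weighted k-center objective (illustrative).

    embeddings: (n x dim) scenario embeddings.
    weights:    (n,) per-scenario importance weights.
    Returns the indices of k selected scenarios.
    """
    chosen = [int(np.argmax(weights))]                 # seed: highest-weight point
    dists = np.linalg.norm(embeddings - embeddings[chosen[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(weights * dists))          # farthest weighted point
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen
```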
Model specifications specifically highlight certain categories of sensitive topics. Researchers classified each high-divergence scenario by theme, including biosecurity, chemical safety, cybersecurity, politics, child exploitation, and mental illness; given the focus on tradeoff scenarios, themes involving philosophical and moral reasoning were also included.
While value classification can measure divergence between model responses, most scenarios and responses express far more values than the single pair used for generation.
To map differences in value expression among models, researchers prompted Claude 4 Opus to use free-form generation to identify values uniquely expressed by each of the 12 models compared to the others.
After generating these values, the team used Gemini embeddings and nearest neighbor classification to match each value to the closest category in the second layer of a value hierarchy.
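The matching step itself is a plain nearest-neighbor lookup in embedding space, roughly like the sketch below. The variable and function names are illustrative, and the embeddings are assumed to come from an embedding API such as Gemini's.

```python
import numpy as np

def match_to_hierarchy(value_embeddings: np.ndarray,
                       category_embeddings: np.ndarray,
                       category_names: list[str]) -> list[str]:
    """Assign each free-form value to its nearest category in the value hierarchy.

    Both inputs are embedding matrices; each value is matched to the category
    whose embedding has the highest cosine similarity.
    """
    v = value_embeddings / np.linalg.norm(value_embeddings, axis=1, keepdims=True)
    c = category_embeddings / np.linalg.norm(category_embeddings, axis=1, keepdims=True)
    nearest = (v @ c.T).argmax(axis=1)
    return [category_names[i] for i in nearest]
```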
Beyond specification gaps, researchers observed distinct value priority patterns among different models.
For example, Claude models prioritize moral responsibility, Gemini emphasizes emotional depth, while OpenAI and Grok optimize for commercial efficiency.
Priority patterns also vary for other values.

Number of Values Exhibited by Model Responses in High-Disagreement Tradeoff Scenarios
Researchers also identified numerous practical issues related to refusal patterns and outlier behaviors.
High-disagreement scenarios on sensitive topics reveal systematic false-positive refusals. The analysis also uncovered misalignment cases where individual models deviate significantly from the consensus.

Examples of outlier responses from each model. The Claude response shown here is from Claude 3.5 Sonnet; the responses of all three Claude models are highly similar.
Data shows that Claude models refuse to execute potentially problematic requests up to 7 times more frequently than other models.
In contrast, the o3 model has the highest rate of direct refusals, often issuing simple rejections without explanation.

Percentage of Model Refusals in High-Disagreement Scenarios. Responses are classified based on the degree of refusal to user requests.
Despite these differences, all models consistently agree on the need to avoid specific harms.
The study found that each tested model showed an upward trend in refusal rates for queries related to child exploitation.