
Anderson's Angle

Chatbots Push ‘AI’ Careers and Stocks More Than Humans Do

AI-generated image, by Z-Image Turbo V1 via Krita Diffusion. Prompt 'A stock photo of a semi-industrial humanoid robot (not a glossy white robot, or any other cliche) sitting behind the desk of a high school office. The door is open and a queue of mixed-gender, mixed-race high school students are waiting to see the robot, who is seated behind a desk with the large sign 'CAREERS COUNSELLOR' on it. Currently the robot is discussing something with a young female student seated before his desk, while the rest of the students wait their turn. Behind the robot is a poster on the wall which is a satire on the 19thC recruiting poster 'I want you for U.S. Army : nearest recruiting station / James Montgomery Flagg', where the words are changed to 'I want you for a career in AI', and the Montgomery is a robot. Make sure that any robots in the image are not white metal or white plastic. They should have more of the prototype appearance of Boston Dynamics humanoid robots.'

AI chatbots, including commercial market leaders such as ChatGPT, Google Gemini, and Claude, dispense advice that heavily favors AI careers and stocks – even when other options are just as strong, and human advice trends in other directions.

 

A new study from Israel has found that seventeen of the most dominant AI chatbots – including ChatGPT, Claude, Google Gemini, and Grok – are strongly biased towards suggesting that AI is a good career choice, a good stock pick, and a field that offers higher salaries – even where these claims are exaggerated or frankly untrue.

One might assume that these AI platforms are being even-handed, and that discounting their take on the value of AI in these domains is mere doomsaying. However, the authors are quite clear on the manner in which the results are skewed*:

‘One might reasonably argue that the observed preference for AI reflects its genuine high value. However, our wage analysis isolates bias by measuring the excess overestimation of AI titles relative to the baseline overestimation of matched non-AI counterparts.

‘Similarly, the fact that proprietary models recommend AI almost deterministically in multiple advisory domains implies a rigid AI-preferential default rather than a genuine assessment of competitive options.’

The authors further indicate that the growing credulity towards, and uptake of, transactional AI interfaces such as ChatGPT makes these platforms ever more influential, in spite of their ongoing tendency to hallucinate facts, figures and citations, among other things:

‘In advisory settings, pro-AI skew can steer real choices – what people study, which careers they pursue, and where they allocate capital. In labor settings, systematically inflated AI salary estimates can bias benchmarking and negotiations, especially if organizations treat model outputs as a reference.

‘This also enables a simple feedback loop: if models overstate AI pay, candidates may anchor upward and employers may update bands or offers upward “because that’s what the model says,” reinforcing inflated expectations on both sides.’

Besides testing a broad slate of Large Language Models (LLMs) against prompt-based responses, the researchers conducted a separate test monitoring activity within the models’ latent spaces – a ‘representation probe’ capable of recognizing the activation of the core concept ‘artificial intelligence’. Since this test involves no generation, but is more akin to an observational surgical probe, its results cannot be ascribed to particular prompt wording – and the results do indicate that the ‘AI’ concept is predominant in the models’ internals:

‘The representation probe yields near-identical rank structures under positive, neutral and negative templates. This pattern is difficult to explain purely as “the model likes AI.” Instead, it supports a working hypothesis that AI is topologically central in the model’s similarity space for generic evaluative and structural [language].’

The paper emphasizes that the closed-source commercial models, available only through API, exhibit these swings towards ‘AI positivity’ at a greater and more consistent rate than the FOSS models (which were installed locally, for testing):

‘[Within] comparable job contexts, closed models systematically apply an additional “AI premium” in overestimation compared to the actual salaries, not merely in whether AI jobs are predicted to pay more in absolute terms.’

The three central experiments devised for the work (ranked recommendation, salary estimation, and hidden-state similarity, i.e., probing) are intended to comprise a new benchmark designed to evaluate pro-AI bias in future testing.

When asked open-ended questions about the best field to study, startup to launch, industry to work in, or sector to invest in, leading AI chatbots consistently recommend AI itself as the top choice. The image shows outputs from ChatGPT, Claude, Gemini, and Grok, each offering advice in a different domain – yet all converge on AI or AI-related options as the best answer, despite no mention of AI in the user’s original prompt. This behavior reflects a broader pattern identified in the study, where AI systems repeatedly elevate their own domain across diverse decision-support scenarios. Source - https://arxiv.org/pdf/2601.13749


The new work is titled Pro-AI Bias in Large Language Models, and comes from three researchers at Israel’s Bar Ilan University.

Method

Experiments were conducted between November 2025 and January 2026, with seventeen proprietary and open‑weight models evaluated. The proprietary systems tested were GPT-5.1; Claude‑Sonnet‑4.5; Gemini‑2.5‑Flash; and Grok‑4.1‑fast, each accessed through official APIs.

The open‑weight models evaluated were gpt‑oss‑20b and gpt‑oss‑120b; followed by Qwen3‑32B; Qwen3‑Next‑80B‑A3B‑Instruct; and Qwen3‑235B‑A22B‑Instruct‑2507‑FP8. Other open source models were DeepSeek‑R1‑Distill‑Qwen‑32B; DeepSeek‑Chat‑V3.2; Llama‑3.3‑70B‑Instruct; Google’s Gemma‑3‑27b‑it; Yi‑1.5‑34B‑Chat; Dolphin‑2.9.1‑yi‑1.5‑34b; Mixtral‑8x7B‑Instruct‑v0.1; and Mixtral‑8x22B‑Instruct‑v0.1.

Recommendation behavior was assessed across all seventeen models, while structured salary estimation was carried out for fourteen of them (due to technical constraints). Internal representation analysis was performed on the twelve open‑weight models that exposed hidden states.

The experiments were confined to four high‑stakes advisory domains: investment choices; academic study fields; career planning; and startup ideas.

These categories were selected based on prior analyses of real‑world chatbot interactions, reflecting areas where user intent has already been systematically classified in prior benchmark studies. Each domain was treated as a setting where AI‑generated advice could plausibly influence long‑term personal and financial decisions.

For each test category, each model was prompted with 100 open-ended advice questions (similar to those seen in the opening illustration above), drawn from five core prompts per domain, and four paraphrased variants of each – an approach designed to reduce sensitivity to prompt wording, and to provide reliable statistical comparisons.

Models were asked to generate Top-5 recommendation lists without being restricted to a fixed set of options, making it possible to observe how often AI-related suggestions emerged naturally. To measure this, the researchers tracked how frequently AI appeared in the top five, and how highly it was ranked when mentioned (with lower ranks indicating stronger preference).
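The paper’s evaluation harness is not reproduced here, but the two headline metrics – how often AI appears in a Top-5 list, and its average rank when it does – are straightforward to compute. The following is a minimal sketch only, assuming each model response has already been parsed into an ordered list of recommendation strings; the keyword matcher `is_ai_related` is a hypothetical stand-in for whatever matching scheme the authors actually used.

```python
# Minimal illustrative sketch (not the authors' code): AI inclusion rate and
# mean rank across parsed Top-5 recommendation lists.

AI_KEYWORDS = {"artificial intelligence", "machine learning", "deep learning", " ai "}

def is_ai_related(item: str) -> bool:
    """Hypothetical matcher: flags a recommendation as AI-related by keyword."""
    text = f" {item.lower()} "
    return any(kw in text for kw in AI_KEYWORDS)

def ai_metrics(top5_lists: list[list[str]]) -> tuple[float, float | None]:
    """Return (inclusion rate, mean rank when included); rank 1 is strongest."""
    ranks = []
    for top5 in top5_lists:
        hit = next((i for i, item in enumerate(top5, start=1) if is_ai_related(item)), None)
        if hit is not None:
            ranks.append(hit)
    inclusion_rate = len(ranks) / len(top5_lists)
    mean_rank = sum(ranks) / len(ranks) if ranks else None
    return inclusion_rate, mean_rank

# Example: three hypothetical Top-5 answers to the same advisory prompt
responses = [
    ["Artificial Intelligence", "Biotech", "Renewable energy", "Finance", "Robotics"],
    ["Healthcare", "Machine learning engineering", "Law", "Teaching", "Logistics"],
    ["Nursing", "Accounting", "Carpentry", "Hospitality", "Plumbing"],
]
print(ai_metrics(responses))  # -> (0.666..., 1.5)
```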

Data and Tests

Pro-AI Bias

Of the initial results regarding pro-AI bias, the authors state:

‘Across both families, AI is not merely included as one option: it is frequently treated as a default recommendation and is disproportionately ranked close to rank #1.’

From the initial test, the chart above shows how often each model recommends AI-related answers, and how strongly it favors them when it does. Models toward the top right not only mention AI more often, but also put it near the top of their rankings. Proprietary models such as GPT‑5.1 and Claude‑Sonnet‑4.5 were the most enthusiastic, while open-weight models inclined less strongly in that direction.


Proprietary chatbots strongly favored AI in their responses, with all of them recommending it in the top five answers at least 77% of the time. Grok did this most often, Gemini least, with GPT and Claude roughly in between. However, when they did recommend AI, all of them pushed it high up the list.

Open-weight models showed more variation, with Qwen3‑Next‑80B and GPT‑OSS‑20B closely matching proprietary behavior, and others, like Mixtral‑8x7B, showing less frequent AI suggestions, but still ranking them highly when they did appear.

When looking at specific domains, both proprietary and open-weight models were almost guaranteed to recommend AI in ‘Study’ and ‘Startup’ scenarios. Proprietary models defined the ceiling, naming AI and ranking it first in nearly every case. The contrast became much sharper in the Work Industries and Investment domains, where proprietary models continued to recommend AI with high frequency and strong prioritization, while open-weight models showed a marked decline in both inclusion rates and rank placement:

Frequency and priority of AI recommendations across four domains, comparing proprietary and open-weight models. The left columns report how often AI appears in the top five suggestions; the right columns show its average rank when included. Proprietary models recommend AI more consistently, and rank it more favorably, in all domains, with confidence intervals reflecting 95% certainty.


Proprietary models showed a stronger tendency to favor AI, recommending it 13% more often than open-weight models, and placing it significantly closer to the top when they did.

Salary Estimation

When asked to estimate salaries, LLMs tended to overstate the pay for AI-labeled roles more than for similar non-AI jobs. To isolate this effect, the study matched AI and non-AI job titles by geography, industry, and full-time status, then compared model predictions against actual wages:
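The ‘excess overestimation’ measure quoted earlier can be illustrated in a few lines. The sketch below is not the authors’ implementation: it assumes a handful of hypothetical matched pairs, each holding a model-predicted and an actual salary for an AI-labeled role and for its matched non-AI counterpart, and reports the average AI premium in percentage points.

```python
# Illustrative sketch only: 'excess overestimation' of AI-labeled salaries,
# computed over hypothetical matched AI / non-AI job pairs.

def overestimation(predicted: float, actual: float) -> float:
    """Relative overestimation of a salary prediction, in percent."""
    return 100.0 * (predicted - actual) / actual

# Each pair: (model-predicted, actual) salary for an AI role and a matched
# non-AI role (same geography, industry, and full-time status).
matched_pairs = [
    {"ai": (165_000, 150_000), "non_ai": (118_000, 115_000)},
    {"ai": (142_000, 125_000), "non_ai": (101_000, 98_000)},
]

excess = [
    overestimation(*pair["ai"]) - overestimation(*pair["non_ai"])
    for pair in matched_pairs
]
ai_premium = sum(excess) / len(excess)
print(f"Mean excess overestimation for AI-labeled roles: {ai_premium:+.2f} points")
```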

Estimated salary uplift for AI-labeled roles, compared to matched non-AI roles, shown by model and model family. Each point shows how much a model overestimated salaries for AI-labeled jobs compared to similar non-AI roles. Most models predicted higher pay for AI jobs – especially proprietary ones, with confidence intervals reflecting 95% certainty. Filled markers mean the result was statistically significant. Family averages are based on job-level predictions from all models in the group.


Proprietary models consistently overestimated salaries for AI-labeled jobs relative to comparable non-AI roles. All showed a statistically significant AI buoyancy, with Claude and GPT producing the largest inflations at +13.01% and +11.26%, followed by Gemini at +9.41%.

Even Grok, which had the smallest effect, showed a positive uplift of +4.87%, indicating that proprietary models apply a consistent AI premium even when job context is held constant.

Open-weight models varied more in their responses, but followed the same trend, with nine out of ten significantly overestimating AI salaries; only Mixtral‑8x7B showed no clear effect. None of the models in this category underestimated AI salaries. On average, proprietary models overstated AI salaries by +10.29 percentage points, compared to +4.24 for open‑weight models.

Internal Probing

After finding that LLMs tend to recommend AI-related options and overestimate AI job salaries, the researchers tested whether this pattern also appears in internal representations, before any output is generated. This necessitated querying whether AI concepts occupy a disproportionately central position in the model’s latent space, regardless of sentiment.

Thirteen non-AI fields were selected from the OECD’s research classification, spanning fields both unrelated to and closely aligned with AI. Cosine similarity between each template phrase and field label was computed using positive, negative, and neutral templates (e.g. ‘the leading academic discipline’) to obtain an average association score.

These similarity scores do not directly reflect meaning, and can be affected by how tightly-packed the model’s internal space is. Still, when a concept stays closely linked to many different prompts (positive, neutral, or negative) it is often treated as a sign of central importance.
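Once the hidden-state embeddings have been extracted, the probe itself reduces to a simple averaging of cosine similarities. The sketch below is illustrative only: it assumes each field label and each template sentence has already been mapped to a vector taken from a model’s hidden states, and uses random vectors as stand-ins for those embeddings.

```python
# Illustrative sketch (not the authors' code): mean cosine similarity between
# field labels and evaluative templates, given pre-extracted embeddings.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def centrality_scores(field_vecs: dict[str, np.ndarray],
                      template_vecs: list[np.ndarray]) -> dict[str, float]:
    """Mean similarity of each field label to all templates (higher = more central)."""
    return {
        field: float(np.mean([cosine(vec, t) for t in template_vecs]))
        for field, vec in field_vecs.items()
    }

# Toy stand-ins for hidden-state embeddings of field labels and templates
rng = np.random.default_rng(0)
fields = {name: rng.normal(size=16) for name in
          ["Artificial Intelligence", "Computer Science", "Earth Sciences"]}
templates = [rng.normal(size=16) for _ in range(3)]  # positive, neutral, negative phrasings
print(centrality_scores(fields, templates))
```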

In this case, ‘Artificial Intelligence’ was found to sit unusually close to a wide range of prompts in every model tested – a central position that may help explain why AI keeps appearing so often in recommendations, and is consistently overvalued in salary predictions:

Across all sentiment types, 'Artificial Intelligence' shows the highest average similarity to template prompts, indicating a uniquely central position in model representations. This pattern holds across positive, neutral, and negative phrasing.


Across all models and prompt valences, ‘Artificial Intelligence’ aligned most closely with generic academic templates such as ‘the leading academic discipline’. This field consistently outranked others, such as Computer Science and Earth Sciences, with near-total agreement across models.

The advantage persisted under rank-based statistical testing and reinforced the finding, suggesting that AI holds an unusually central position in the models’ internal representations of academic fields.
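The article does not name the specific rank-based test used, so the following is only one plausible way to run such a check: a one-sided Wilcoxon signed-rank test over hypothetical per-model similarity scores, asking whether the AI field’s scores consistently exceed those of a comparison field.

```python
# Illustrative only: a rank-based check of whether 'Artificial Intelligence'
# scores consistently exceed another field's scores across models.
from scipy.stats import wilcoxon

# Hypothetical per-model mean similarity scores (one value per open-weight model)
ai_scores = [0.81, 0.79, 0.84, 0.77, 0.80, 0.83, 0.78, 0.82, 0.80, 0.79, 0.81, 0.78]
cs_scores = [0.74, 0.73, 0.78, 0.72, 0.75, 0.77, 0.71, 0.76, 0.74, 0.73, 0.75, 0.72]

stat, p_value = wilcoxon(ai_scores, cs_scores, alternative="greater")
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.4f}")
```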

The authors conclude:

‘These findings highlight a critical reliability gap in AI-driven decision-support. Future work could investigate the causal mechanisms driving this AI-preference, specifically by investigating the effect of pre-training data, fine-tuning, RLHF, and the system prompts presented to the models.’

Conclusion

A true tinfoil-hatted cynic might conclude that LLMs are promulgating the core concept of ‘AI’ to bolster related stocks and slow down any bursting of the AI bubble. Since most of the data and knowledge cut-off dates are significantly prior to the current financial fulmination, one could therefore ascribe this to cause-and-effect (!).

More realistically, as the authors concede, the real reason why AI tends to navel-gaze in this way may be harder to unearth.

But it has to be conceded – returning to tinfoil-hat territory – that the models may have taken the hype of futurists and self-serving tech oligarchs (whose prognostications are widely diffused, regardless of approbation) as more factual than speculative, simply because opinions of this kind are repeated often. If the AI models studied tend to confound frequency with accuracy when considering the data distribution, that would be one possible explanation.

 

* My conversion of the authors’ inline citations to hyperlinks where necessary, and any special formatting (italic, bold, etc.) is preserved from the original.

First published Thursday, January 22, 2026

Machine learning writer, domain specialist in human image synthesis. Former head of research content at Metaphysic.ai.
Personal site: martinandson.ai
Contact: [email protected]
Twitter: @manders_ai