
Inside the Coding Personalities of Leading LLMs – Insights from Sonar State of Code Report


In August 2025, Sonar released its latest State of Code study, The Coding Personalities of Leading LLMs – A State of Code Report. This research goes beyond accuracy scores, examining how large language models actually write code and revealing a unique “coding personality” for each model.

The study assessed Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 90B, and OpenCoder-8B across more than 4,400 Java assignments using Sonar’s own static-analysis engine—technology refined over 16 years through its flagship SonarQube Enterprise platform.

Shared Strengths

All five models demonstrated strong syntactic reliability, meaning their generated code compiled and ran successfully in most cases. This was reflected in their scores on HumanEval, a benchmark in which models are asked to solve coding problems and their solutions are automatically checked for correctness. Claude Sonnet 4 topped the list with a 95.57% HumanEval score and a weighted Pass@1 rate of 77.04%, meaning its first attempt was correct in over three-quarters of cases. On the same Pass@1 measure, Claude 3.7 Sonnet scored 72.46%, GPT-4o 69.67%, Llama 3.2 61.47%, and OpenCoder-8B 60.43%.

This performance held up across different programming languages, showing that these models are reasoning through problems rather than relying solely on memorized syntax.

Common Weaknesses

The most alarming shared flaw was poor security hygiene. Sonar measured blocker-level vulnerabilities, which are the most severe category of flaws—security issues that can lead directly to major breaches or system compromise if exploited. Examples include code that allows arbitrary file access, SQL or command injection, hardcoded passwords, misconfigured encryption, or accepting untrusted certificates. These were far too common: Claude Sonnet 4 had 59.57% of its vulnerabilities at this severity, GPT-4o had 62.5%, and Llama 3.2 a worrying 70.73%.
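To make the injection risk concrete, here is a minimal illustrative Java sketch (the query and method names are hypothetical, not drawn from the report) showing how concatenating untrusted input into SQL lets an attacker rewrite the query's meaning; the standard fix is a parameterized query via JDBC's PreparedStatement, which keeps user input as data rather than executable SQL:

```java
public class InjectionDemo {
    // Vulnerable: attacker-controlled input is pasted directly into the SQL text.
    // A safe version would use PreparedStatement with a "?" placeholder instead.
    static String unsafeQuery(String username) {
        return "SELECT * FROM users WHERE name = '" + username + "'";
    }

    public static void main(String[] args) {
        // A malicious "username" changes the query from a lookup into a match-all.
        System.out.println(unsafeQuery("alice' OR '1'='1"));
        // prints: SELECT * FROM users WHERE name = 'alice' OR '1'='1'
    }
}
```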

The report also noted repeated resource leaks, a type of bug where the code opens a resource—such as a file handle, network socket, or database connection—but fails to properly close it. Over time, these leaks can exhaust available system resources, leading to performance issues or crashes. Claude Sonnet 4 had 54 such violations, Llama 3.2 had 50, and GPT-4o 25.
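A resource leak of this kind can be sketched in a few lines of Java (this example is illustrative, not taken from the report). The leaky version closes the reader only on the happy path; the safe version uses try-with-resources, which guarantees the close on every path, including exceptions:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class ResourceLeakDemo {
    // Leaky: if readLine() throws, close() is never reached and the reader leaks.
    static String firstLineLeaky(String text) throws IOException {
        BufferedReader reader = new BufferedReader(new StringReader(text));
        String line = reader.readLine();
        reader.close(); // skipped entirely if an exception occurs above
        return line;
    }

    // Safe: try-with-resources closes the reader on every exit path.
    static String firstLineSafe(String text) throws IOException {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(firstLineSafe("hello\nworld")); // prints "hello"
    }
}
```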

On maintainability, the majority of issues were code smells—patterns that don’t break the program immediately but make it harder to maintain and more prone to bugs in the future. More than 90% of all identified issues fell into this category, often involving unused code, poor naming, excessive complexity, or violations of design best practices.
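As a hypothetical example of this kind of smell (not one of the report's findings), deeply nested conditionals raise cognitive complexity without changing behavior; guard clauses express the same rules in a form that is easier to follow:

```java
public class CodeSmellDemo {
    // Smelly: three levels of nesting force the reader to track every branch.
    static String classifyNested(int score) {
        if (score >= 0) {
            if (score <= 100) {
                if (score >= 60) {
                    return "pass";
                } else {
                    return "fail";
                }
            } else {
                return "invalid";
            }
        } else {
            return "invalid";
        }
    }

    // Flatter: guard clauses encode the same rules with far less nesting.
    static String classifyFlat(int score) {
        if (score < 0 || score > 100) return "invalid";
        return score >= 60 ? "pass" : "fail";
    }

    public static void main(String[] args) {
        System.out.println(classifyFlat(72)); // prints "pass"
    }
}
```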

Distinct Personalities

From this mix of strengths and flaws, Sonar identified clear “personality” profiles.

Claude Sonnet 4 earned the title “The Senior Architect.” It writes the most verbose code—370,816 lines across the test set—with high cognitive complexity, meaning its logic paths are harder to follow. It performs well but is prone to sophisticated bugs like resource leaks and concurrency errors, which can occur when multiple threads or processes interact in unintended ways.
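A concurrency error of the sort described above can be illustrated with a shared counter (a generic sketch, not an example from the study): a plain counter++ is a read-modify-write sequence that can lose updates when two threads interleave, while AtomicInteger makes the increment a single atomic step:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrencyDemo {
    // Atomic increments are safe under contention; replacing the AtomicInteger
    // with a plain "int counter" and "counter++" would be a data race that can
    // silently lose updates when threads interleave.
    static int countWithThreads(int threads, int perThread) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.incrementAndGet(); // atomic read-modify-write
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join(); // wait for all workers to finish
        return counter.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countWithThreads(4, 100_000)); // prints 400000
    }
}
```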

OpenCoder-8B was “The Rapid Prototyper,” producing short, focused code—120,288 lines total—but with the highest issue density. Its speed and brevity make it well suited for proofs of concept, but dangerous for production without careful review.

Llama 3.2 90B was “The Unfulfilled Promise.” It delivered moderate results but had the worst security posture, with more than 70% of vulnerabilities classified as blocker-level.

GPT-4o was “The Efficient Generalist,” balancing functionality and complexity but often tripping over control-flow errors—mistakes in the logical sequence of operations that can lead to incorrect results or skipped code.
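A control-flow error of this kind can be as small as a misplaced return (again, an illustrative sketch rather than a bug from the report): the early return below exits the loop after the first match, so the rest of the input is silently skipped:

```java
public class ControlFlowDemo {
    // Buggy: the early return exits the loop after the first even number,
    // so later matches are never counted.
    static int countEvensBuggy(int[] xs) {
        int count = 0;
        for (int x : xs) {
            if (x % 2 == 0) {
                count++;
                return count; // control-flow bug: should not return mid-loop
            }
        }
        return count;
    }

    // Fixed: the loop runs to completion before the result is returned.
    static int countEvens(int[] xs) {
        int count = 0;
        for (int x : xs) {
            if (x % 2 == 0) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int[] xs = {2, 3, 4, 6};
        System.out.println(countEvens(xs));      // prints 3
        System.out.println(countEvensBuggy(xs)); // prints 1
    }
}
```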

Claude 3.7 Sonnet was “The Balanced Predecessor,” producing less verbose code than its successor but with the highest comment density at 16.4%, meaning it explained its logic more than any other model. While better at documentation, it still carried significant high-severity vulnerabilities.

One of the most striking findings came from comparing Claude Sonnet 4 with Claude 3.7 Sonnet. Although Sonnet 4 improved its pass rate by 6.3%, the share of its bugs rated as blocker nearly doubled, from 7.10% to 13.71%, and blocker-level vulnerabilities also rose, from 56.03% to 59.57%. The lesson: performance improvements can come at the cost of safety.

Conclusion

Sonar’s The Coding Personalities of Leading LLMs – A State of Code Report makes it clear that benchmark accuracy tells only part of the story. Understanding security risks, maintainability, and coding style is just as important as knowing how often a model “gets it right.”

Each personality—whether architect, prototyper, generalist, or balanced predecessor—has strengths and trade-offs. The takeaway for developers and organizations is to “trust but verify,” pairing AI coding assistance with human oversight, thorough code review, and rigorous security checks to ensure that speed and convenience do not compromise safety or long-term stability.

Antoine is a visionary leader and founding partner of Unite.AI, driven by an unwavering passion for shaping and promoting the future of AI and robotics. A serial entrepreneur, he believes that AI will be as disruptive to society as electricity, and is often caught raving about the potential of disruptive technologies and AGI.

As a futurist, he is dedicated to exploring how these innovations will shape our world. In addition, he is the founder of Securities.io, a platform focused on investing in cutting-edge technologies that are redefining the future and reshaping entire sectors.