Large Language Models (LLMs) are increasingly being integrated into software development workflows, promising to transform how code is written, tested, and maintained. This research analyzes the current capabilities of LLMs in software engineering through the lens of leading benchmarks, assessing their strengths, limitations, and potential across different phases of the software development lifecycle.
Methodology
While existing benchmarks provide insight into LLM strengths and weaknesses, they tend to offer only a fragmented perspective. To develop a more comprehensive understanding, we systematically mapped LLM capabilities to each phase of the software development lifecycle (SDLC)—from planning and design to development, testing, deployment, and ongoing maintenance. This approach allowed us to contextualize model performance within real-world development workflows and identify where LLMs are most and least effective.
The following benchmarks were analyzed to evaluate LLM coding capabilities: SWE-bench-Verified, SWE-Lancer, LiveCodeBench, and HackerRank-ASTRA.
LLM performance across the SDLC
Overview by SDLC phase:
SDLC phase | Capability | Top performers |
Requirements analysis & planning | Low | GPT-4/4o |
Design | Low to moderate | GPT-4 |
Implementation/coding | Moderate to high (for specific tasks) | OpenAI o3 (high effort) |
Testing | Moderate | Claude 3.7 Sonnet [R] |
Deployment | Low | – |
Maintenance | Moderate | Claude 3.7 Sonnet |
Phase-by-phase analysis:
Requirements analysis & planning
Capability: Low
Top performers: GPT-4/4o
Limitations:
- Lacks human creativity, critical thinking, and intuition for nuanced tasks
- Struggles with deep contextual understanding and is prone to hallucination
- Reproducibility and controllability issues due to its stochastic outputs
- Potential for bias
- Highly dependent on prompt quality
Rationale:
- Still the most cited model in the literature on this phase
- Demonstrated superior alignment and completeness vs. human experts in one study
Design
Capability: Low to moderate
Top performers: GPT-4
Limitations:
- Prone to precision errors and hallucinations in generated designs
- Difficulty grasping full complexity, context, and non-functional requirements
- Poor explainability (especially for visual diagrams)
- Lacks robust verification capabilities
- Struggles with novel or optimized architectures
- Fundamental trust issues
Rationale:
- The model most commonly used in research that specifically targets architectural tasks
- Strong general reasoning cited in literature reviews (newer models are likely more capable but lack design-specific benchmark data and studies)
Implementation / coding
Capability: Moderate to high (for specific tasks)
Top performers: OpenAI o3 (high effort)
Limitations:
- Difficulty scaling from benchmark tasks to full real-world complexity
- Struggles with very complex logic, long context, and multi-file dependencies
- Prone to common agent failures (context loss, incorrect logic, poor integration)
- Generated code modifications may lack quality (efficiency, maintainability, security)
- Consistency issues
- Risk of introducing regressions
Rationale:
- Achieved a near-top score (69.1%) on SWE-bench-Verified, demonstrating leading agentic coding and issue-resolution capability on this realistic benchmark (just behind Claude 3.7 Sonnet and ahead of Gemini 2.5 Pro on the Vellum leaderboard)
- The o-series also leads on LiveCodeBench and EvalPlus
Testing
Capability: Moderate
Top performers: Claude 3.7 Sonnet [R]
Limitations:
- Prone to common agent failures (context loss, incorrect logic, dependency oversight, poor integration)
- Struggles with deep reasoning for complex bugs and with handling long context and dependencies
- Susceptible to the Test Oracle Problem, i.e., may validate incorrect code (illustrated in the sketch below)
- Poor performance on visual/multimodal inputs
- Risk of introducing regressions
Rationale:
- Achieved the top score (70.3%) on SWE-bench-Verified, per the Vellum leaderboard, narrowly surpassing OpenAI o3 (69.1%) and Gemini 2.5 Pro (63.8%) on this benchmark focused on realistic bug fixing within agentic frameworks
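To make the Test Oracle Problem noted in the Testing limitations concrete: when tests are generated from the implementation rather than from the specification, a defect can be baked into the expected values. The function and test below are purely illustrative and are not taken from any benchmark.

```python
# Buggy implementation: the intent is "free shipping for orders of $100 or more",
# but the comparison excludes exactly $100.
def free_shipping(order_total_cents: int) -> bool:
    return order_total_cents > 100_00   # bug: should be >= 100_00

# A test derived from the buggy code's observed behavior becomes a faulty oracle:
# it asserts the wrong output for the boundary case, passes, and thereby
# "validates" the incorrect implementation.
def test_free_shipping():
    assert free_shipping(150_00) is True
    assert free_shipping(100_00) is False   # the spec actually requires True here

test_free_shipping()
print("Tests passed, yet the boundary behavior contradicts the specification.")
```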
Deployment
Capability: Low
Top performers: –
Limitations:
- AI models struggle to operate under strict resource constraints (edge deployments)
- Sensitive to data/concept drift in dynamic environments
- Challenges in adapting reliably across diverse hardware and platforms
- Inherent security and robustness vulnerabilities
- Lacks real-time adaptability without retraining
- Difficulty ensuring long-term stability and explainability
Rationale:
- Highly dependent on the target environment: edge (hardware, OS, framework) vs. cloud (provider, MLOps platform)
Maintenance
Capability: Moderate
Top performers: Claude 3.7 Sonnet
Limitations:
- Fundamental trust barrier for autonomous code modification
- Difficulty understanding complex or legacy code, dependencies, and implicit domain knowledge
- Risk of introducing subtle regressions or breaking changes
- Struggles to ensure long-term architectural consistency
- Difficulty inferring true program intent and specifications
- Generates patches that may lack quality and maintainability
Rationale:
- Similar to testing/bug fixing: the main benchmarks focus on Python, and performance drops significantly for other languages, suggesting maintenance is harder for AI in diverse ecosystems
Benchmark analysis: evaluating LLM coding capabilities
SWE-bench, SWE-bench+ & SWE-bench-Verified
SWE-bench was created to systematically evaluate LLMs’ capabilities in resolving software issues, comprising 2,294 real-world GitHub issues and their corresponding pull requests from 12 widely used Python repositories. However, a systematic evaluation revealed significant quality issues with the dataset:
- 32.67% of successful patches involved “solution leakage,” where solutions were directly provided in issue reports or comments
- 31.08% of passed patches were deemed suspicious due to inadequate test cases
- Over 94% of issues were created before LLMs’ knowledge cutoff dates, raising potential data contamination concerns.
When these problematic issues were filtered out, the resolution rate of SWE-Agent+GPT-4 dropped dramatically, from 12.47% to just 3.97%. The steep decline in performance underscores the need for rigorous benchmarking methodologies that control for data leakage and test adequacy—otherwise, we risk painting a misleading picture of LLM capabilities.
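To make the arithmetic of that drop concrete, the sketch below recomputes a resolution rate after discarding instances flagged for solution leakage or weak tests. The field names and flags are hypothetical stand-ins for the kind of metadata the SWE-bench+ analysis reports, not its actual schema.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    instance_id: str
    resolved: bool          # did the agent's patch pass the repo's test suite?
    solution_leaked: bool   # fix was visible in the issue report or comments
    weak_tests: bool        # "passed" only because test coverage was inadequate

def resolution_rate(instances: list[Instance]) -> float:
    """Share of instances the agent resolved."""
    return sum(i.resolved for i in instances) / len(instances)

def without_suspect_instances(instances: list[Instance]) -> list[Instance]:
    """Drop instances whose success could be attributed to leakage or weak tests."""
    return [i for i in instances if not (i.solution_leaked or i.weak_tests)]

# Usage (instances would come from a benchmark run's logs):
# raw_rate = resolution_rate(all_instances)
# clean_rate = resolution_rate(without_suspect_instances(all_instances))
# A large gap between raw_rate and clean_rate signals inflated benchmark scores.
```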
Subsequently, SWE-bench+ was developed to address these limitations. It collects GitHub issues created after LLMs’ training cutoff dates to prevent data leakage.
SWE-bench-Verified is another offshoot of SWE-bench, built by OpenAI, consisting of 500 human-validated samples drawn from real GitHub issues. It addresses critical flaws in the original SWE-bench dataset:
- Solution Leakage Mitigation: Eliminates 32.67% of cases where solutions appeared verbatim in issue discussions
- Test Adequacy Verification: Removes 31.08% of previously accepted solutions that passed due to insufficient test coverage
- Temporal Contamination Prevention: Only includes issues created after LLM training cutoff dates (post-2023)
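For readers who want to inspect the benchmark directly, the 500 verified instances can be pulled with the Hugging Face datasets library. The dataset id and field names below reflect the commonly published release and should be treated as assumptions rather than official documentation.

```python
# pip install datasets
from datasets import load_dataset

# Assumed Hugging Face dataset id for the human-validated subset.
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(f"{len(verified)} instances")  # expected: 500

sample = verified[0]
# Field names assumed from the SWE-bench schema; adjust if the release differs.
for field in ("repo", "instance_id", "problem_statement"):
    if field in sample:
        print(f"{field}: {str(sample[field])[:80]}")
```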
Model | Resolution rate | Sources |
Claude 3.7 Sonnet | 62.3% | 1 |
GPT-4.1 | 54.6% | 2 |
Claude 3.5 Sonnet | 49.0% | 3, 4, 5 |
GPT-4o | 33.2% | 6, 7 |
SWE-Lancer
OpenAI’s SWE-Lancer benchmark represents a significant advancement in evaluating LLMs’ capabilities on practical, real-world coding tasks. This benchmark includes over 1,400 tasks sourced from Upwork with a combined value of $1 million, spanning independent coding activities and managerial decision-making tasks.
SWE-Lancer emphasizes economic value assessment, tying model performance directly to monetary worth in the freelance software engineering ecosystem. The benchmark employs end-to-end testing verified by professional engineers to ensure practical validity.
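Because scoring is tied to task payouts, the headline SWE-Lancer number is effectively a payout-weighted pass rate: a model "earns" a task's dollar value only if its submission passes the end-to-end tests. Below is a minimal sketch with hypothetical task records; the real harness and grading pipeline are far more involved.

```python
from dataclasses import dataclass

@dataclass
class Task:
    title: str
    payout_usd: float   # price the task commanded on Upwork
    passed_e2e: bool    # did the model's submission pass the end-to-end tests?

def revenue_earned(tasks: list[Task]) -> float:
    """Total dollar value of tasks whose end-to-end tests passed."""
    return sum(t.payout_usd for t in tasks if t.passed_e2e)

# Hypothetical examples, for illustration only:
tasks = [
    Task("Fix crash when uploading an attachment", 1_000.0, True),
    Task("Choose the best proposal for payment routing", 5_000.0, False),
    Task("Add server-side rate limiting", 2_500.0, True),
]
total = sum(t.payout_usd for t in tasks)
print(f"Earned ${revenue_earned(tasks):,.0f} of ${total:,.0f} available")
```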
Despite recent advancements in AI language models, initial findings revealed significant limitations:
- The best-performing model, Claude 3.5 Sonnet, achieved only 26.2% success on independent coding tasks
- Most models struggle with tasks requiring deep contextual understanding or the ability to evaluate multiple proposals
The benchmark includes diverse tasks such as application logic development, UI/UX design, and server-side logic implementations, ensuring comprehensive capability assessment.
Model | Revenue generated (out of $1M) | Additional scores | Sources |
Claude 3.7 Sonnet | Not yet publicly reported | – | – |
GPT-4.1 | Not yet publicly reported | – | – |
Claude 3.5 Sonnet | $400,000 | (≈24% App Logic, >40% Server Logic) | 1, 2, 3 |
GPT-4o | $304,000 | (≈8% App Logic, <25% Server Logic) | 4, 5, 6 |
LiveCodeBench
LiveCodeBench offers a holistic, contamination-free evaluation approach that continuously collects new problems. Currently hosting over 300 high-quality coding problems published between May 2023 and February 2024, this benchmark focuses on broader code-related capabilities beyond mere generation.
Key features of LiveCodeBench include:
- Annotation of problems with release dates, allowing evaluation on specific periods to measure generalization on unseen problems
- Assessment of various code-related scenarios: code generation, self-repair, test output prediction, and code execution
- Comparative analysis of model performance across different scenarios
The benchmark has revealed interesting performance patterns:
- DeepSeek models showed significant performance drops on LeetCode problems released after September 2023 (DeepSeek's release date), indicating potential contamination in earlier problems
- GPT models maintained relatively stable performance across different periods
- Model performance correlations across different scenarios varied significantly – Claude-3-Opus outperformed GPT-4-turbo in test output prediction but not code generation.
- Mistral-Large demonstrated superior performance on natural language reasoning tasks like test output prediction and code execution.
Overall, closed API-access models consistently outperformed open models on LiveCodeBench evaluations.
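One way to picture the contamination check described above: group problems by release date and compare pass rates on either side of a model's training cutoff. Below is a minimal sketch, assuming a hypothetical per-problem result record built on the release-date annotation LiveCodeBench provides.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ProblemResult:
    problem_id: str
    released: date    # LiveCodeBench annotates every problem with its release date
    passed: bool      # e.g., pass@1 outcome for the model under evaluation

def pass_rate(results: list[ProblemResult]) -> float:
    return sum(r.passed for r in results) / len(results) if results else 0.0

def rates_around_cutoff(results: list[ProblemResult], cutoff: date) -> tuple[float, float]:
    """Pass rates on problems released before vs. on/after the model's cutoff."""
    before = [r for r in results if r.released < cutoff]
    after = [r for r in results if r.released >= cutoff]
    return pass_rate(before), pass_rate(after)

# Usage: a sharp drop on post-cutoff problems suggests pre-cutoff scores were
# inflated by training-data contamination.
# pre, post = rates_around_cutoff(model_results, cutoff=date(2023, 9, 1))
```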
LiveCodeBench leaderboard (as of April 2025)
Rank | Model | Pass@1 |
1 | o4-mini | 74.6% |
2 | Kimi-k1.6-IOI-high | 73.8% |
3 | o1 | 71.0% |
4 | Grok 3 Mini | 69.9% |
5 | o3-mini (High) | 69.5% |
6 | Gemini 2.5 Pro | 69.2% |
7 | o3-mini (Med) | 67.4% |
HackerRank-ASTRA
The HackerRank-ASTRA benchmark fills a critical gap in LLM evaluation. Unlike many benchmarks that focus on standalone coding problems, it tests models on multi-file, cross-domain projects that reflect real-world development workflows.
A distinguishing feature of this benchmark is its rigorous assessment of model consistency through 32 runs (k=32) and median standard deviation analysis. Initial evaluations on 65 problems yielded notable findings:
- GPT-4.1, DeepSeek-R1, and o3-mini currently top the leaderboard, each averaging above 80%, with GPT-4.1 posting the highest average score (81.96%) and pass@1 rate (71.72%).
- DeepSeek-V3, Claude 3.7 Sonnet, and GPT-4.5-preview remain strong performers, each scoring close to 77.5%, with Claude 3.7 showing remarkable consistency (0.10).
- GPT-4o ranks mid-pack with an average score of 69.52%, showing lower pass@1 (50.91%) and higher variability (consistency = 0.20) compared to top performers.
- Llama 3.3-70B trails the board with a 61.65% average score and 46.54% pass@1, highlighting how far the weakest entrants still lag behind the leaders.
This benchmark’s focus on consistency highlights an essential dimension of model evaluation beyond raw performance, emphasizing reliability for real-world software development tasks.
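As a rough sketch of how two of the table's columns can be derived from raw runs: with k=32 scored attempts per problem, the average score is the mean of per-problem means, and the consistency figure is the median of per-problem standard deviations (lower is steadier). Whether ASTRA uses the sample or population standard deviation is not stated here, so the population form is assumed.

```python
from statistics import mean, median, pstdev

# scores_per_problem maps a problem id to its k=32 per-run scores in [0, 1].
def average_score(scores_per_problem: dict[str, list[float]]) -> float:
    """Mean of per-problem mean scores."""
    return mean(mean(runs) for runs in scores_per_problem.values())

def consistency(scores_per_problem: dict[str, list[float]]) -> float:
    """Median across problems of the per-problem standard deviation over runs."""
    return median(pstdev(runs) for runs in scores_per_problem.values())  # assumption: population std

# Usage with placeholder values (real runs would have 32 scores per problem):
# scores = {"order-tracking-ui": [0.9, 1.0, 0.85], "rest-api-pagination": [0.6, 0.7, 0.65]}
# print(average_score(scores), consistency(scores))
```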
Rank | Model | Avg Score | Avg Pass@1 | Consistency |
1 | GPT-4.1 | 81.96% | 71.72% | 0.14 |
2 | DeepSeek R1 | 81.49% | 69.09% | 0.11 |
3 | o3-mini | 80.75% | 71.28% | 0.12 |
4 | DeepSeek V3 | 77.89% | 64.11% | 0.16 |
5 | Claude 3.7 Sonnet | 77.82% | 69.54% | 0.10 |
6 | GPT-4.5 preview | 77.46% | 64.91% | 0.13 |
7 | o1 | 75.80% | 63.92% | 0.11 |
8 | o1-preview | 75.55% | 60.89% | 0.17 |
9 | Llama 4 Maverick | 75.44% | 63.00% | 0.12 |
10 | Claude 3.5 Sonnet | 75.07% | 62.74% | 0.05 |
11 | Gemini 1.5 Pro | 71.17% | 58.15% | 0.13 |
12 | GPT-4o | 69.52% | 50.91% | 0.20 |
13 | Gemini 2.5 Pro (exp-03-25) | 67.43% | 58.02% | 0.23 |
14 | Llama 3.3-70B | 61.65% | 46.54% | 0.09 |
Conclusion
The current generation of frontier LLMs demonstrates promising capabilities in specific software development tasks, particularly in code generation for well-defined problems and particular aspects of test prediction. However, significant limitations remain in handling complex, multi-file projects and tasks requiring deep contextual understanding or architectural decision-making.
While LLMs can currently augment developer productivity in specific contexts, particularly for code generation, snippet completion, and documentation tasks, they are not yet capable of replacing human developers across the full spectrum of software development activities. Their most effective application remains as collaborative tools that enhance human capabilities rather than autonomous systems that drive the development process.
As these technologies continue to evolve and benchmarks become increasingly rigorous, we can expect steady improvements in LLM capabilities across the software development lifecycle. This will transform how software is built, tested, and maintained in the coming years.