Large Language Models (LLMs) are increasingly being integrated into software development workflows, promising to transform how code is written, tested, and maintained. This research analyzes the current capabilities of LLMs in software engineering through the lens of leading benchmarks, assessing their strengths, limitations, and potential across different phases of the software development lifecycle.
Methodology
While existing benchmarks provide insight into LLM strengths and weaknesses, they tend to offer only a fragmented perspective. To develop a more comprehensive understanding, we systematically mapped LLM capabilities to each phase of the software development lifecycle (SDLC)—from planning and design to development, testing, deployment, and ongoing maintenance. This approach allowed us to contextualize model performance within real-world development workflows and identify where LLMs are most and least effective.
The following benchmarks were analyzed to evaluate LLM coding capabilities: SWE-bench-Verified, SWE-Lancer, LiveCodeBench, and HackerRank-ASTRA.
LLM performance across the SDLC
Overview by SDLC phase:
SDLC phase | Capability | Top performers |
Requirements analysis & planning | Low | GPT-4/4o |
Design | Low to moderate | GPT-4 |
Implementation/coding | Moderate to high (for specific tasks) | OpenAI o3 (high effort) |
Testing | Moderate | Claude 3.7 Sonnet [R] |
Deployment | Low | – |
Maintenance | Moderate | Claude 3.7 Sonnet |
Phase-by-phase analysis:
Requirements analysis & planning
Capability: Low
Top performers: GPT-4/4o
Limitations:
- Lacks human creativity, critical thinking, and intuition for nuanced tasks
- Struggles with deep contextual understanding and is prone to hallucination
- Reproducibility and controllability issues due to its stochastic outputs
- Potential for bias
- Highly dependent on prompt quality
Rationale:
- Still the most cited model in the literature on this phase
- Demonstrated superior alignment and completeness vs. human experts in one study
Design
Capability: Low to moderate
Top performers: GPT-4
Limitations:
- Prone to precision errors and hallucinations in generated designs
- Difficulty grasping full complexity, context, and non-functional requirements
- Poor explainability (especially for visual diagrams)
- Lacks robust verification capabilities
- Struggles with novel or optimized architectures
- Fundamental trust issues
Rationale:
- The model most commonly used in research that specifically targets architectural tasks
- Strong general reasoning cited in literature reviews (newer models are likely more capable but lack design-specific benchmark data and studies)
Implementation / coding
Capability: Moderate to high (for specific tasks)
Top performers: OpenAI o3 (high effort)
Limitations:
- Difficulty scaling from benchmark tasks to full real-world complexity
- Struggles with very complex logic, long context, and multi-file dependencies
- Prone to common agent failures (context loss, incorrect logic, poor integration)
- Generated code modifications may lack quality (efficiency, maintainability, security)
- Consistency issues
- Risk of introducing regressions
Rationale:
- Achieved a near-top score (69.1%) on SWE-bench-Verified, demonstrating leading agentic coding and issue-resolution capability on this realistic benchmark (just behind Claude 3.7 Sonnet and ahead of Gemini 2.5 Pro on the Vellum leaderboard)
- The o-series also leads on LiveCodeBench and EvalPlus
Testing
Capability: Moderate
Top performers: Claude 3.7 Sonnet [R]
Limitations:
- Prone to common agent failures (context loss, incorrect logic, dependency oversight, poor integration)
- Struggles with deep reasoning for complex bugs and with handling long context and dependencies
- Susceptible to the Test Oracle Problem, i.e., may validate incorrect code (illustrated in the sketch below)
- Poor performance on visual/multimodal inputs
- Risk of introducing regressions
Rationale:
- Achieved the top score (70.3%) on SWE-bench-Verified, per the Vellum leaderboard, narrowly surpassing OpenAI o3 (69.1%) and Gemini 2.5 Pro (63.8%) on this benchmark focused on realistic bug fixing within agentic frameworks
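To make the Test Oracle Problem noted in the Testing limitations concrete: when tests are generated from the implementation rather than from the specification, a defect can be baked into the expected values. The function and test below are purely illustrative and are not taken from any benchmark.

```python
# Buggy implementation: the intent is "free shipping for orders of $100 or more",
# but the comparison excludes exactly $100.
def free_shipping(order_total_cents: int) -> bool:
    return order_total_cents > 100_00   # bug: should be >= 100_00

# A test derived from the buggy code's observed behavior becomes a faulty oracle:
# it asserts the wrong output for the boundary case, passes, and thereby
# "validates" the incorrect implementation.
def test_free_shipping():
    assert free_shipping(150_00) is True
    assert free_shipping(100_00) is False   # the spec actually requires True here

test_free_shipping()
print("Tests passed, yet the boundary behavior contradicts the specification.")
```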
Deployment
Capability: Low
Top performers: –
Limitations:
- AI models struggle to operate under strict resource constraints (edge deployments)
- Sensitive to data/concept drift in dynamic environments
- Challenges in adapting reliably across diverse hardware and platforms
- Inherent security and robustness vulnerabilities
- Lacks real-time adaptability without retraining
- Difficulty ensuring long-term stability and explainability
Rationale:
- Highly dependent on the target environment: edge (hardware, OS, framework) vs. cloud (provider, MLOps platform)
Maintenance
Capability: Moderate
Top performers: Claude 3.7 Sonnet
Limitations:
- Fundamental trust barrier for autonomous code modification
- Difficulty understanding complex or legacy code, dependencies, and implicit domain knowledge
- Risk of introducing subtle regressions or breaking changes
- Struggles to ensure long-term architectural consistency
- Difficulty inferring true program intent and specifications
- Generates patches that may lack quality and maintainability
Rationale:
- Similar to testing/bug fixing: the main benchmarks focus on Python, and performance drops significantly for other languages, suggesting maintenance is harder for AI in diverse ecosystems
Benchmark analysis: evaluating LLM coding capabilities
SWE-bench, SWE-bench+ & SWE-bench-Verified
SWE-bench was created to systematically evaluate LLMs’ capabilities in resolving software issues, comprising 2,294 real-world GitHub issues and their corresponding pull requests from 12 widely used Python repositories. However, a systematic evaluation revealed significant quality issues with the dataset:
- 32.67% of successful patches involved “solution leakage,” where solutions were directly provided in issue reports or comments
- 31.08% of passed patches were deemed suspicious due to inadequate test cases
- Over 94% of issues were created before LLMs’ knowledge cutoff dates, raising potential data contamination concerns.
When these problematic issues were filtered out, the resolution rate of SWE-Agent+GPT-4 dropped dramatically, from 12.47% to just 3.97%. The steep decline in performance underscores the need for rigorous benchmarking methodologies that control for data leakage and test adequacy—otherwise, we risk painting a misleading picture of LLM capabilities.
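To make the arithmetic of that drop concrete, the sketch below recomputes a resolution rate after discarding instances flagged for solution leakage or weak tests. The field names and flags are hypothetical stand-ins for the kind of metadata the SWE-bench+ analysis reports, not its actual schema.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    instance_id: str
    resolved: bool          # did the agent's patch pass the repo's test suite?
    solution_leaked: bool   # fix was visible in the issue report or comments
    weak_tests: bool        # "passed" only because test coverage was inadequate

def resolution_rate(instances: list[Instance]) -> float:
    """Share of instances the agent resolved."""
    return sum(i.resolved for i in instances) / len(instances)

def without_suspect_instances(instances: list[Instance]) -> list[Instance]:
    """Drop instances whose success could be attributed to leakage or weak tests."""
    return [i for i in instances if not (i.solution_leaked or i.weak_tests)]

# Usage (instances would come from a benchmark run's logs):
# raw_rate = resolution_rate(all_instances)
# clean_rate = resolution_rate(without_suspect_instances(all_instances))
# A large gap between raw_rate and clean_rate signals inflated benchmark scores.
```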
Subsequently, SWE-bench+ was developed to address these limitations. It collects GitHub issues created after LLMs’ training cutoff dates to prevent data leakage.
SWE-bench-Verified is another offshoot of SWE-bench, built by OpenAI, consisting of 500 human-validated samples drawn from real GitHub issues. It addresses critical flaws in the original SWE-bench dataset:
- Solution Leakage Mitigation: Eliminates 32.67% of cases where solutions appeared verbatim in issue discussions
- Test Adequacy Verification: Removes 31.08% of previously accepted solutions that passed due to insufficient test coverage
- Temporal Contamination Prevention: Only includes issues created after LLM training cutoff dates (post-2023)
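For readers who want to inspect the benchmark directly, the 500 verified instances can be pulled with the Hugging Face datasets library. The dataset id and field names below reflect the commonly published release and should be treated as assumptions rather than official documentation.

```python
# pip install datasets
from datasets import load_dataset

# Assumed Hugging Face dataset id for the human-validated subset.
verified = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

print(f"{len(verified)} instances")  # expected: 500

sample = verified[0]
# Field names assumed from the SWE-bench schema; adjust if the release differs.
for field in ("repo", "instance_id", "problem_statement"):
    if field in sample:
        print(f"{field}: {str(sample[field])[:80]}")
```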
Model | Resolution rate | Sources |
Claude 3.7 Sonnet | 62.3% | 1 |
GPT-4.1 | 54.6% | 2 |
Claude 3.5 Sonnet | 49.0% | 3, 4, 5 |
GPT-4o | 33.2% | 6, 7 |
SWE-Lancer
OpenAI’s SWE-Lancer benchmark represents a significant advancement in evaluating LLMs’ capabilities on practical, real-world coding tasks. This benchmark includes over 1,400 tasks sourced from Upwork with a combined value of $1 million, spanning independent coding activities and managerial decision-making tasks.
SWE-Lancer emphasizes economic value assessment, tying model performance directly to monetary worth in the freelance software engineering ecosystem. The benchmark employs end-to-end testing verified by professional engineers to ensure practical validity.
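Because scoring is tied to task payouts, the headline SWE-Lancer number is effectively a payout-weighted pass rate: a model "earns" a task's dollar value only if its submission passes the end-to-end tests. Below is a minimal sketch with hypothetical task records; the real harness and grading pipeline are far more involved.

```python
from dataclasses import dataclass

@dataclass
class Task:
    title: str
    payout_usd: float   # price the task commanded on Upwork
    passed_e2e: bool    # did the model's submission pass the end-to-end tests?

def revenue_earned(tasks: list[Task]) -> float:
    """Total dollar value of tasks whose end-to-end tests passed."""
    return sum(t.payout_usd for t in tasks if t.passed_e2e)

# Hypothetical examples, for illustration only:
tasks = [
    Task("Fix crash when uploading an attachment", 1_000.0, True),
    Task("Choose the best proposal for payment routing", 5_000.0, False),
    Task("Add server-side rate limiting", 2_500.0, True),
]
total = sum(t.payout_usd for t in tasks)
print(f"Earned ${revenue_earned(tasks):,.0f} of ${total:,.0f} available")
```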
Despite recent advancements in AI language models, initial findings revealed significant limitations:
- The best-performing model, Claude 3.5 Sonnet, achieved only 26.2% success on independent coding tasks
- Most models struggle with tasks requiring deep contextual understanding or the ability to evaluate multiple proposals
The benchmark includes diverse tasks such as application logic development, UI/UX design, and server-side logic implementations, ensuring comprehensive capability assessment.
Model | Revenue generated (out of $1M) | Additional scores | Sources |
Claude 3.7 Sonnet | Not yet publicly reported | – | – |
GPT-4.1 | Not yet publicly reported | – | – |
Claude 3.5 Sonnet | $400,000 | (≈24% App Logic, >40% Server Logic) | 1, 2, 3 |
GPT-4o | $304,000 | (≈8% App Logic, <25% Server Logic) | 4, 5, 6 |
LiveCodeBench
LiveCodeBench offers a holistic, contamination-free evaluation approach that continuously collects new problems. Currently hosting over 300 high-quality coding problems published between May 2023 and February 2024, this benchmark focuses on broader code-related capabilities beyond mere generation.
Key features of LiveCodeBench include:
- Annotation of problems with release dates, allowing evaluation on specific periods to measure generalization on unseen problems
- Assessment of various code-related scenarios: code generation, self-repair, test output prediction, and code execution
- Comparative analysis of model performance across different scenarios
The benchmark has revealed interesting performance patterns:
- DeepSeek models showed significant performance drops on LeetCode problems released after September 2023 (DeepSeek's release date), indicating potential contamination in earlier problems
- GPT models maintained relatively stable performance across different periods
- Model performance correlations across different scenarios varied significantly – Claude-3-Opus outperformed GPT-4-turbo in test output prediction but not code generation.
- Mistral-Large demonstrated superior performance on natural language reasoning tasks like test output prediction and code execution.
Overall, closed API-access models consistently outperformed open models on LiveCodeBench evaluations.
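One way to picture the contamination check described above: group problems by release date and compare pass rates on either side of a model's training cutoff. Below is a minimal sketch, assuming a hypothetical per-problem result record built on the release-date annotation LiveCodeBench provides.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ProblemResult:
    problem_id: str
    released: date    # LiveCodeBench annotates every problem with its release date
    passed: bool      # e.g., pass@1 outcome for the model under evaluation

def pass_rate(results: list[ProblemResult]) -> float:
    return sum(r.passed for r in results) / len(results) if results else 0.0

def rates_around_cutoff(results: list[ProblemResult], cutoff: date) -> tuple[float, float]:
    """Pass rates on problems released before vs. on/after the model's cutoff."""
    before = [r for r in results if r.released < cutoff]
    after = [r for r in results if r.released >= cutoff]
    return pass_rate(before), pass_rate(after)

# Usage: a sharp drop on post-cutoff problems suggests pre-cutoff scores were
# inflated by training-data contamination.
# pre, post = rates_around_cutoff(model_results, cutoff=date(2023, 9, 1))
```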
LiveCodeBench leaderboard (as of April 2025)
Rank | Model | Pass@1 |
1 | o4-mini | 74.6% |
2 | Kimi-k1.6-IOI-high | 73.8% |
3 | o1 | 71.0% |
4 | Grok 3 Mini | 69.9% |
5 | o3-mini (High) | 69.5% |
6 | Gemini 2.5 Pro | 69.2% |
7 | o3-mini (Med) | 67.4% |
HackerRank-ASTRA
The HackerRank-ASTRA benchmark fills a critical gap in LLM evaluation. Unlike many benchmarks that focus on standalone coding problems, it tests models on multi-file, cross-domain projects that reflect real-world development workflows.
A distinguishing feature of this benchmark is its rigorous assessment of model consistency through 32 runs (k=32) and median standard deviation analysis. Initial evaluations on 65 problems yielded notable findings:
- GPT-4.1, DeepSeek-R1, and o3-mini currently top the leaderboard, each averaging above 80%, with GPT-4.1 posting the highest average score (81.96%) and pass@1 rate (71.72%).
- DeepSeek-V3, Claude 3.7 Sonnet, and GPT-4.5-preview remain strong performers, each scoring close to 77.5%, with Claude 3.7 showing remarkable consistency (0.10).
- GPT-4o ranks mid-pack with an average score of 69.52%, showing lower pass@1 (50.91%) and higher variability (consistency = 0.20) compared to top performers.
- Llama 3.3-70B trails the board with a 61.65% average score and 46.54% pass@1, highlighting how far the weakest entrants still lag behind the leaders.
This benchmark’s focus on consistency highlights an essential dimension of model evaluation beyond raw performance, emphasizing reliability for real-world software development tasks.
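As a rough sketch of how two of the table's columns can be derived from raw runs: with k=32 scored attempts per problem, the average score is the mean of per-problem means, and the consistency figure is the median of per-problem standard deviations (lower is steadier). Whether ASTRA uses the sample or population standard deviation is not stated here, so the population form is assumed.

```python
from statistics import mean, median, pstdev

# scores_per_problem maps a problem id to its k=32 per-run scores in [0, 1].
def average_score(scores_per_problem: dict[str, list[float]]) -> float:
    """Mean of per-problem mean scores."""
    return mean(mean(runs) for runs in scores_per_problem.values())

def consistency(scores_per_problem: dict[str, list[float]]) -> float:
    """Median across problems of the per-problem standard deviation over runs."""
    return median(pstdev(runs) for runs in scores_per_problem.values())  # assumption: population std

# Usage with placeholder values (real runs would have 32 scores per problem):
# scores = {"order-tracking-ui": [0.9, 1.0, 0.85], "rest-api-pagination": [0.6, 0.7, 0.65]}
# print(average_score(scores), consistency(scores))
```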
Rank | Model | Avg Score | Avg Pass@1 | Consistency |
1 | GPT-4.1 | 81.96% | 71.72% | 0.14 |
2 | DeepSeek R1 | 81.49% | 69.09% | 0.11 |
3 | o3-mini | 80.75% | 71.28% | 0.12 |
4 | DeepSeek V3 | 77.89% | 64.11% | 0.16 |
5 | Claude 3.7 Sonnet | 77.82% | 69.54% | 0.10 |
6 | GPT-4.5 preview | 77.46% | 64.91% | 0.13 |
7 | o1 | 75.80% | 63.92% | 0.11 |
8 | o1-preview | 75.55% | 60.89% | 0.17 |
9 | Llama 4 Maverick | 75.44% | 63.00% | 0.12 |
10 | Claude 3.5 Sonnet | 75.07% | 62.74% | 0.05 |
11 | Gemini 1.5 Pro | 71.17% | 58.15% | 0.13 |
12 | GPT-4o | 69.52% | 50.91% | 0.20 |
13 | Gemini 2.5 Pro (exp-03-25) | 67.43% | 58.02% | 0.23 |
14 | Llama 3.3-70B | 61.65% | 46.54% | 0.09 |
Conclusion
The current generation of frontier LLMs demonstrates promising capabilities in specific software development tasks, particularly in code generation for well-defined problems and particular aspects of test prediction. However, significant limitations remain in handling complex, multi-file projects and tasks requiring deep contextual understanding or architectural decision-making.
While LLMs can currently augment developer productivity in specific contexts, particularly for code generation, snippet completion, and documentation tasks, they are not yet capable of replacing human developers across the full spectrum of software development activities. Their most effective application remains as collaborative tools that enhance human capabilities rather than autonomous systems that drive the development process.
As these technologies continue to evolve and benchmarks become increasingly rigorous, we can expect steady improvements in LLM capabilities across the software development lifecycle. This will transform how software is built, tested, and maintained in the coming years.