
The state of frontier models across the SDLC

Written By Vivek Ravisankar | April 21, 2025

Large Language Models (LLMs) are increasingly being integrated into software development workflows, promising to transform how code is written, tested, and maintained. This research analyzes the current capabilities of LLMs in software engineering through the lens of leading benchmarks, assessing their strengths, limitations, and potential across different phases of the software development lifecycle.


Methodology

While existing benchmarks provide insight into LLM strengths and weaknesses, they tend to offer only a fragmented perspective. To develop a more comprehensive understanding, we systematically mapped LLM capabilities to each phase of the software development lifecycle (SDLC)—from planning and design to development, testing, deployment, and ongoing maintenance. This approach allowed us to contextualize model performance within real-world development workflows and identify where LLMs are most and least effective.

The following benchmarks were analyzed to evaluate LLM coding capabilities: SWE-bench (along with SWE-bench+ and SWE-bench-Verified), SWE-Lancer, LiveCodeBench, and HackerRank-ASTRA.


LLM performance across the SDLC 

Overview by SDLC phase:

| SDLC phase | Capability | Top performers |
|---|---|---|
| Requirements analysis & planning | Low | GPT-4/4o |
| Design | Low to moderate | GPT-4 |
| Implementation/coding | Moderate to high (for specific tasks) | OpenAI o3 (high effort) |
| Testing | Moderate | Claude 3.7 Sonnet [R] |
| Deployment | Low | — |
| Maintenance | Moderate | Claude 3.7 Sonnet |

Phase-by-phase analysis:

Requirements analysis & planning
Capability: Low
Top performers: GPT-4/4o
Limitations:
  • Lacks human creativity, critical thinking, and intuition for nuanced tasks
  • Struggles with deep context understanding and is prone to hallucination
  • Reproducibility and controllability issues due to stochastic nature
  • Potential for bias
  • Highly dependent on prompt quality
Rationale:
  • Still the most cited model in the literature on this phase
  • Demonstrated superior alignment and completeness vs. human experts in one study

Design
Capability: Low to moderate
Top performers: GPT-4
Limitations:
  • Prone to precision errors and hallucinations in generated designs
  • Difficulty grasping full complexity, context, and non-functional requirements
  • Poor explainability (especially for visual diagrams)
  • Lacks robust verification capabilities
  • Struggles with novel/optimized architectures
  • Fundamental trust issues
Rationale:
  • The model predominantly used in research that specifically targets architectural tasks
  • Strong general reasoning cited in literature reviews (newer models are likely more capable, but design-specific benchmark data and studies are lacking)

Implementation / coding
Capability: Moderate to high (for specific tasks)
Top performers: OpenAI o3 (high effort)
Limitations:
  • Difficulty scaling from benchmark tasks to full real-world complexity
  • Struggles with very complex logic, long context, and multi-file dependencies
  • Prone to common agent failures (context loss, incorrect logic, poor integration)
  • Generated code modifications may lack quality (efficiency, maintainability, security)
  • Consistency issues
  • Risk of introducing regressions
Rationale:
  • Achieved a near-top score (69.1%) on SWE-bench-Verified, demonstrating leading agentic coding and issue-resolution capability on this realistic benchmark (ahead of Gemini 2.5 Pro at 63.8% and just behind Claude 3.7 Sonnet at 70.3% on the Vellum leaderboard)
  • The o-series also leads on LiveCodeBench and EvalPlus

Testing
Capability: Moderate
Top performers: Claude 3.7 Sonnet [R]
Limitations:
  • Prone to common agent failures (context loss, incorrect logic, dependency oversight, poor integration)
  • Struggles with deep reasoning for complex bugs and handling long context/dependencies
  • Susceptible to the Test Oracle Problem (may validate incorrect code)
  • Poor performance on visual/multimodal inputs
  • Risk of introducing regressions
Rationale:
  • Achieved the top score (70.3%) on SWE-bench-Verified, per the Vellum leaderboard, narrowly surpassing OpenAI o3 (69.1%) and Gemini 2.5 Pro (63.8%) on this benchmark focused on realistic bug fixing within agentic frameworks

Deployment
Capability: Low
Top performers: —
Limitations:
  • AI models struggle to operate under strict resource constraints (edge)
  • Sensitive to data/concept drift in dynamic environments
  • Challenges in adapting reliably across diverse hardware/platforms
  • Inherent security/robustness vulnerabilities
  • Lacks real-time adaptability without retraining
  • Difficulty ensuring long-term stability and explainability
Rationale:
  • Highly dependent on target environment: edge (hardware, OS, framework) vs. cloud (provider, MLOps platform)

Maintenance
Capability: Moderate
Top performers: Claude 3.7 Sonnet
Limitations:
  • Fundamental trust barrier for autonomous code modification
  • Difficulty understanding complex/legacy code, dependencies, and implicit domain knowledge
  • Risk of introducing subtle regressions or breaking changes
  • Struggles ensuring long-term architectural consistency
  • Difficulty inferring true program intent/specifications
  • Generates patches that may lack quality/maintainability
Rationale:
  • Similar to testing/bug fixing: the main benchmarks focus on Python, and performance drops significantly for other languages, suggesting maintenance is harder for AI in diverse ecosystems


Benchmark analysis: evaluating LLM coding capabilities

SWE-bench, SWE-bench+ & SWE-bench-Verified

SWE-bench was created to systematically evaluate LLMs’ capabilities in resolving software issues, comprising 2,294 real-world GitHub issues and their corresponding pull requests from 12 widely used Python repositories. However, a systematic evaluation revealed significant quality issues with the dataset:

  • 32.67% of successful patches involved “solution leakage,” where solutions were directly provided in issue reports or comments
  • 31.08% of passed patches were deemed suspicious due to inadequate test cases
  • Over 94% of issues were created before LLMs’ knowledge cutoff dates, raising potential data contamination concerns.

When these problematic issues were filtered out, the resolution rate of SWE-Agent+GPT-4 dropped dramatically, from 12.47% to just 3.97%. The steep decline in performance underscores the need for rigorous benchmarking methodologies that control for data leakage and test adequacy—otherwise, we risk painting a misleading picture of LLM capabilities.
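
To make the effect of this filtering concrete, here is a minimal sketch of how a resolution rate is recomputed once leaked or weakly tested instances are excluded. It is an illustration only, not the SWE-bench harness; the issue records and field names (`resolved`, `solution_leaked`, `weak_tests`) are hypothetical.

```python
# Minimal sketch (not the SWE-bench evaluation harness): recompute the
# resolution rate after excluding problematic instances. Field names are
# hypothetical placeholders for the kinds of flags described above.

def resolution_rate(results):
    """Fraction of issues whose generated patch passed the tests."""
    return sum(r["resolved"] for r in results) / len(results)

def filter_problematic(results):
    """Drop instances with solution leakage or inadequate test coverage."""
    return [
        r for r in results
        if not r["solution_leaked"] and not r["weak_tests"]
    ]

results = [
    {"resolved": True,  "solution_leaked": True,  "weak_tests": False},
    {"resolved": True,  "solution_leaked": False, "weak_tests": True},
    {"resolved": False, "solution_leaked": False, "weak_tests": False},
    {"resolved": True,  "solution_leaked": False, "weak_tests": False},
]

print(f"raw:      {resolution_rate(results):.2%}")                       # 75.00%
print(f"filtered: {resolution_rate(filter_problematic(results)):.2%}")   # 50.00%
```

The same mechanism, applied at the scale of the full dataset, is what drives the drop from 12.47% to 3.97% reported above.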

Subsequently, SWE-bench+ was developed to address these limitations. It collects GitHub issues created after LLMs’ training cutoff dates to prevent data leakage.

SWE-bench-Verified is another offshoot of SWE-bench built by OpenAI consisting of 500 human-validated samples from real GitHub issues. Unlike its predecessor, this benchmark addresses critical flaws in the original SWE-bench dataset:

  1. Solution Leakage Mitigation: Eliminates 32.67% of cases where solutions appeared verbatim in issue discussions
  2. Test Adequacy Verification: Removes 31.08% of previously accepted solutions that passed due to insufficient test coverage
  3. Temporal Contamination Prevention: Only includes issues created after LLM training cutoff dates (post-2023)
| Model | Resolution rate | Sources |
|---|---|---|
| Claude 3.7 Sonnet | 62.3% | 1 |
| GPT-4.1 | 54.6% | 2 |
| Claude 3.5 Sonnet | 49.0% | 3, 4, 5 |
| GPT-4o | 33.2% | 6, 7 |

SWE-Lancer

OpenAI’s SWE-Lancer benchmark represents a significant advancement in evaluating LLMs’ capabilities on practical, real-world coding tasks. This benchmark includes over 1,400 tasks sourced from Upwork with a combined value of $1 million, spanning independent coding activities and managerial decision-making tasks.

SWE-Lancer emphasizes economic value assessment, tying model performance directly to monetary worth in the freelance software engineering ecosystem. The benchmark employs end-to-end testing verified by professional engineers to ensure practical validity.
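
Because SWE-Lancer scores models by the dollar value of the tasks they complete rather than by a plain pass rate, the headline metric amounts to a payout-weighted sum. The sketch below is illustrative only: the task records and payouts are made up, and the `passed` flag stands in for the benchmark's engineer-verified end-to-end tests.

```python
# Illustrative sketch of SWE-Lancer-style scoring: each task carries the
# freelance payout it was listed for, and a model "earns" that payout only
# if its submission passes the end-to-end tests. Task data are hypothetical.

tasks = [
    {"id": "upwork-001", "payout_usd": 250,   "passed": True},
    {"id": "upwork-002", "payout_usd": 1_000, "passed": False},
    {"id": "upwork-003", "payout_usd": 4_000, "passed": True},
]

total_value = sum(t["payout_usd"] for t in tasks)
earned = sum(t["payout_usd"] for t in tasks if t["passed"])

print(f"earned ${earned:,} of ${total_value:,} "
      f"({earned / total_value:.1%} of available value)")
# earned $4,250 of $5,250 (81.0% of available value)
```

Weighting by payout means that failing one high-value task costs far more than failing several small ones, which is part of what makes the benchmark's results look harsher than conventional pass rates.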

Despite recent advancements in AI language models, initial findings revealed significant limitations:

  • The best-performing model, Claude 3.5 Sonnet, achieved only 26.2% success on independent coding tasks
  • Most models struggle with tasks requiring deep contextual understanding or the ability to evaluate multiple proposals

The benchmark includes diverse tasks such as application logic development, UI/UX design, and server-side logic implementations, ensuring comprehensive capability assessment.

| Model | Revenue generated (out of $1M) | Additional scores | Sources |
|---|---|---|---|
| Claude 3.7 Sonnet | Not yet publicly reported | — | — |
| GPT-4.1 | Not yet publicly reported | — | — |
| Claude 3.5 Sonnet | $400,000 | ≈24% App Logic, >40% Server Logic | 1, 2, 3 |
| GPT-4o | $304,000 | ≈8% App Logic, <25% Server Logic | 4, 5, 6 |


LiveCodeBench

LiveCodeBench offers a holistic, contamination-free evaluation approach that continuously collects new problems. Currently hosting over 300 high-quality coding problems published between May 2023 and February 2024, this benchmark focuses on broader code-related capabilities beyond mere generation.

Key features of LiveCodeBench include:

  • Annotation of problems with release dates, allowing evaluation on specific periods to measure generalization on unseen problems
  • Assessment of various code-related scenarios: code generation, self-repair, test output prediction, and code execution
  • Comparative analysis of model performance across different scenarios

The benchmark has revealed interesting performance patterns:

  • DeepSeek models showed significant performance drops on LeetCode problems released after September 2023 (DeepSeek's release date), indicating potential contamination in earlier problems
  • GPT models maintained relatively stable performance across different periods
  • Model performance correlations across scenarios varied significantly; for example, Claude-3-Opus outperformed GPT-4-turbo in test output prediction but not in code generation
  • Mistral-Large demonstrated superior performance on natural-language reasoning tasks like test output prediction and code execution

Overall, closed API-access models consistently outperformed open models on LiveCodeBench evaluations.
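
One way to picture how LiveCodeBench's release-date annotations are used is as a simple time-window comparison: measure a model's pass rate on problems published before and after its training cutoff, and treat a sharp drop on the newer window as a contamination signal, much like the DeepSeek pattern noted above. The sketch below is a simplified illustration with invented problem records, not LiveCodeBench's actual evaluation code.

```python
# Simplified illustration of a contamination check using release-date
# annotations: compare pass rates on problems published before vs. after
# a model's (assumed) training cutoff. Problem records are hypothetical.
from datetime import date

problems = [
    {"released": date(2023, 6, 1),  "passed": True},
    {"released": date(2023, 8, 15), "passed": True},
    {"released": date(2023, 11, 3), "passed": False},
    {"released": date(2024, 1, 20), "passed": False},
]

cutoff = date(2023, 9, 1)  # assumed training cutoff for the model under test

def pass_rate(subset):
    return sum(p["passed"] for p in subset) / len(subset) if subset else float("nan")

before = [p for p in problems if p["released"] < cutoff]
after = [p for p in problems if p["released"] >= cutoff]

print(f"pass rate before cutoff: {pass_rate(before):.0%}")  # 100%
print(f"pass rate after cutoff:  {pass_rate(after):.0%}")   # 0%
# A gap this large is the pattern LiveCodeBench flags as likely contamination.
```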

LiveCodeBench leaderboard (as of April 2025)

| Rank | Model | Resolution rate |
|---|---|---|
| 1 | o4-Mini | 74.6% |
| 2 | Kimi-k1.6-IOI-high | 73.8% |
| 3 | o1 | 71.0% |
| 4 | Grok 3 Mini | 69.9% |
| 5 | o3-Mini (High) | 69.5% |
| 6 | Gemini 2.5 Pro | 69.2% |
| 7 | o3-Mini (Med) | 67.4% |


HackerRank-ASTRA

The HackerRank-ASTRA benchmark fills a critical gap in LLM evaluation. Unlike many benchmarks that focus on standalone coding problems, it tests models on multi-file, cross-domain projects that reflect real-world development workflows.

A distinguishing feature of this benchmark is its rigorous assessment of model consistency through 32 runs (k=32) and median standard deviation analysis. Initial evaluations on 65 problems yielded notable findings:

  • GPT-4.1, DeepSeek-R1, and o3-mini currently top the leaderboard, each with an average score above 80%; GPT-4.1 posts the highest average score (81.96%) and pass@1 rate (71.72%).
  • DeepSeek-V3, Claude 3.7 Sonnet, and GPT-4.5-preview remain strong performers, each near 77.5%, with Claude 3.7 showing remarkable consistency (0.10).
  • GPT-4o ranks mid-pack with an average score of 69.52%, showing a lower pass@1 (50.91%) and higher variability (consistency = 0.20) than the top performers.
  • Llama 3.3-70B trails the board with a 61.65% average score and 46.54% pass@1, highlighting how far smaller open-weight models still lag the frontier.

This benchmark’s focus on consistency highlights an essential dimension of model evaluation beyond raw performance, emphasizing reliability for real-world software development tasks. 
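
For readers who want to see how a consistency column like ASTRA's can be derived, here is a rough sketch of k-run scoring: each problem is attempted k times, average score and pass@1 are computed per problem and then averaged across problems, and consistency is reported as the median of the per-problem standard deviations (lower means steadier). The scores below are invented, and the aggregation is simplified relative to the published methodology.

```python
# Rough sketch of ASTRA-style aggregation (invented scores, simplified
# relative to the published methodology): per-problem scores over k runs,
# averaged for the headline metrics, with consistency reported as the
# median per-problem standard deviation.
from statistics import mean, median, pstdev

# scores[problem_id] = list of k per-run scores in [0, 1]
scores = {
    "p1": [1.0, 1.0, 0.9, 1.0],
    "p2": [0.5, 0.7, 0.4, 0.6],
    "p3": [0.0, 0.2, 0.1, 0.0],
}

avg_score = mean(mean(runs) for runs in scores.values())
avg_pass_1 = mean(                       # fraction of runs that fully solve
    mean(s == 1.0 for s in runs)         # the problem, averaged over problems
    for runs in scores.values()
)
consistency = median(pstdev(runs) for runs in scores.values())  # lower = steadier

print(f"avg score:   {avg_score:.2%}")
print(f"avg pass@1:  {avg_pass_1:.2%}")
print(f"consistency: {consistency:.3f}")
```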

| Rank | Model | Avg Score | Avg Pass@1 | Consistency |
|---|---|---|---|---|
| 1 | GPT-4.1 | 81.96% | 71.72% | 0.14 |
| 2 | DeepSeek R1 | 81.49% | 69.09% | 0.11 |
| 3 | o3-mini | 80.75% | 71.28% | 0.12 |
| 4 | DeepSeek V3 | 77.89% | 64.11% | 0.16 |
| 5 | Claude 3.7 Sonnet | 77.82% | 69.54% | 0.10 |
| 6 | GPT-4.5 preview | 77.46% | 64.91% | 0.13 |
| 7 | o1 | 75.80% | 63.92% | 0.11 |
| 8 | o1-preview | 75.55% | 60.89% | 0.17 |
| 9 | Llama 4 Maverick | 75.44% | 63.00% | 0.12 |
| 10 | Claude 3.5 Sonnet | 75.07% | 62.74% | 0.05 |
| 11 | Gemini 1.5 Pro | 71.17% | 58.15% | 0.13 |
| 12 | GPT-4o | 69.52% | 50.91% | 0.20 |
| 13 | Gemini 2.5 Pro (exp-03-25) | 67.43% | 58.02% | 0.23 |
| 14 | Llama 3.3-70B | 61.65% | 46.54% | 0.09 |

Conclusion

The current generation of frontier LLMs demonstrates promising capabilities in specific software development tasks, particularly in code generation for well-defined problems and particular aspects of test prediction. However, significant limitations remain in handling complex, multi-file projects and tasks requiring deep contextual understanding or architectural decision-making.

While LLMs can currently augment developer productivity in specific contexts, particularly for code generation, snippet completion, and documentation tasks, they are not yet capable of replacing human developers across the full spectrum of software development activities. Their most effective application remains as collaborative tools that enhance human capabilities rather than autonomous systems that drive the development process.

As these technologies continue to evolve and benchmarks become increasingly rigorous, we can expect steady improvements in LLM capabilities across the software development lifecycle. This will transform how software is built, tested, and maintained in the coming years.