What OpenAI Codex Missed in a Legacy .NET Codebase
AI code review tools are often marketed as near–senior-engineer replacements: point them at a repository and you can expect deep architectural insight. I wanted to see how true that is, so I tested OpenAI Codex’s web-based code review on a real-world legacy .NET C# application.
The result? Useful—but shallow in the ways that matter most.
The Setup
I pointed Codex directly at a GitHub-hosted legacy .NET solution using the web interface—no IDE plugins, no hand-holding. This is a non-trivial codebase that’s been evaluated by other AI tools before, making it a good benchmark.
The goal wasn’t to test syntax knowledge. It was to see whether Codex could reason about a system.
What Codex Gets Right
Out of the box, Codex quickly identified:
- Unused variables and redundant methods
- Overly static implementations
- Minor data access and code-structure issues
For code-level feedback, it’s fast and competent. This kind of review can absolutely save time during refactoring or cleanup.
Where It Falls Apart: Architecture
The problems started when I asked for a holistic review.
Even after I provided a detailed scorecard covering architecture, testing, and maintainability, Codex produced:
- A vague, overly positive executive summary
- Inflated scores for architecture and testing
- No serious discussion of systemic design flaws
The application looks layered, but in reality the domain is tightly coupled to Entity Framework and the database sits at the center of the system. This is a well-known architectural anti-pattern—and Codex largely missed it.
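To make the coupling concrete, here is a minimal, hypothetical sketch of the pattern (the `Order`/`OrderService` names are mine, not from the solution): the "domain" types are plain EF entities, and the business logic takes a direct dependency on the `DbContext`.

```csharp
// Hypothetical illustration of the pattern, not code from the reviewed solution.
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// The "domain" entity is shaped entirely by persistence: a database identity,
// a status column, and an EF navigation property, with no behavior of its own.
public class Order
{
    public int OrderId { get; set; }
    public string Status { get; set; } = "New";
    public List<OrderLine> Lines { get; set; } = new();
}

public class OrderLine
{
    public int OrderLineId { get; set; }
    public decimal Amount { get; set; }
}

public class AppDbContext : DbContext
{
    public DbSet<Order> Orders => Set<Order>();
}

// The "domain service" depends on the DbContext directly, so no business rule
// can be exercised without a database behind it.
public class OrderService
{
    private readonly AppDbContext _db;
    public OrderService(AppDbContext db) => _db = db;

    public async Task ApproveAsync(int orderId)
    {
        var order = await _db.Orders.Include(o => o.Lines)
                                    .SingleAsync(o => o.OrderId == orderId);
        order.Status = "Approved";   // rule and persistence interleaved
        await _db.SaveChangesAsync();
    }
}
```

Every piece of this compiles and "works", which is why a purely local review finds nothing to object to.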
Similarly, the solution contains only end-to-end tests, with no meaningful unit or domain testing. Yet the AI implied reasonable test discipline.
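For contrast, this is a sketch of the kind of in-memory domain test that was missing. It assumes a domain model with real behavior (an imaginary `Order.Approve` rule), which is exactly what a database-centric design makes hard to write:

```csharp
// Hypothetical sketch: the reviewed solution has no domain model like this.
// It shows the kind of in-memory, database-free test that was absent.
using Xunit;

// A domain type that owns its own rule, so the rule can be tested directly.
public class Order
{
    public string Status { get; private set; } = "New";
    public int LineCount { get; private set; }

    public void AddLine() => LineCount++;

    public bool Approve()
    {
        if (LineCount == 0) return false;   // the business rule lives here
        Status = "Approved";
        return true;
    }
}

public class OrderApprovalTests
{
    [Fact]
    public void Approving_an_empty_order_is_rejected()
    {
        var order = new Order();            // no DbContext, no test server

        Assert.False(order.Approve());
        Assert.Equal("New", order.Status);
    }

    [Fact]
    public void Approving_an_order_with_lines_succeeds()
    {
        var order = new Order();
        order.AddLine();

        Assert.True(order.Approve());
        Assert.Equal("Approved", order.Status);
    }
}
```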
The Real Limitation
Codex still reasons locally, not systemically.
It evaluates classes and methods well, but struggles to:
- Trace dependency flow across projects
- Identify architectural coupling
- Penalize designs that appear structured but are fundamentally flawed (see the sketch below)
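As an example of what "appears structured but is fundamentally flawed" can look like, here is a hypothetical, simplified layout, not the reviewed solution: two tidy "layers", but the dependency points from the business layer down to the data layer, so the domain never exists apart from the database.

```csharp
// Hypothetical layout, not the reviewed solution. Each class is unremarkable
// on its own, which is exactly what local, per-file review rewards.

// --- MyApp.Data: entities and persistence sit at the bottom of the stack ---
namespace MyApp.Data
{
    public class OrderEntity
    {
        public int Id { get; set; }
        public string Status { get; set; } = "New";
    }

    public class OrderRepository
    {
        // EF-backed in the real system; stubbed here to keep the sketch small.
        public OrderEntity Find(int id) => new OrderEntity { Id = id };
        public void Save(OrderEntity order) { /* SaveChanges() in the real thing */ }
    }
}

// --- MyApp.Business: the "domain" layer ---
namespace MyApp.Business
{
    using MyApp.Data;   // the business layer depends on the data layer, so every
                        // rule is expressed over persistence types

    public class OrderApproval
    {
        private readonly OrderRepository _repository = new OrderRepository();

        public void Approve(int id)
        {
            var order = _repository.Find(id);
            order.Status = "Approved";      // the "rule" is just a column update
            _repository.Save(order);
        }
    }
}
```

File by file there is nothing to flag; the problem only shows up when you follow the references between projects, which is exactly the step a local reviewer skips.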
This is especially dangerous in legacy systems, where bad design is often repeated consistently across the codebase—making it harder for AI to recognize as a problem.
Final Thoughts
AI code review isn’t useless—but it’s not a replacement for architectural judgment.
Use it for:
- Code-level feedback
- Cleanup and refactoring suggestions
Do not rely on it for:
- Architectural evaluation
- Assessing overall system health
- Numeric quality scores, which should not be taken at face value
AI is improving quickly, but for now, architecture still belongs to humans.