🌐
OpenAI
openai.com › index › introducing-gpt-5-2-codex
Introducing GPT-5.2-Codex | OpenAI
2 weeks ago - GPT‑5.2-Codex achieves ... tasks in realistic terminal environments. It is also much more effective and reliable at agentic coding in native Windows environments, building on capabilities introduced in GPT‑5.1-C...
🌐
OpenRouter
openrouter.ai › compare › openai › gpt-5.1-codex › openai › gpt-5.2
GPT-5.1-Codex vs GPT-5.2 - AI Model Comparison | OpenRouter
3 weeks ago - Compare GPT-5.1-Codex from OpenAI and GPT-5.2 from OpenAI on key metrics including price, context length, and other model features.
Discussions

Thank you (again) for GPT 5.2!
It was only a few weeks ago when ... it for a spin, this is immediately and noticeably far superior to Codex 5.1 (or Gemini 3 Pro, Opus 4.5 and so on). So far, nothing seems to be as good as GPT 5.2 Extra High.... More on github.com
🌐 github.com
3 weeks ago
GPT 5 & 5.1 (and 5.2) Codex quality degrading over last month or so
Maybe it’s me….but I find the quality of GPT 5 and 5.1 Codex to be progressively awful. I have Gemini and Anthropic direct and also Cursor to compare to and I’ve pretty much stopped using GPT Codex. Same question in the other solutions gives great answers in More on community.openai.com
🌐 community.openai.com
November 18, 2025
OpenAI's GPT-5.1, GPT-5.1-Codex and GPT-5.1-Codex-Mini are now in public preview for GitHub Copilot - GitHub Changelog
GPT-5.1-Codex-Mini is 0.33x. Seems like the era of 0x is coming to an end. More on reddit.com
🌐 r/GithubCopilot
November 13, 2025
gpt-5.1-codex-max Day 1 vs gpt-5.1-codex
I had the same findings. I have a big project almost entirely vibe coded with gpt-codex; codex-max breaks everything and accomplishes nothing on the same code. More on reddit.com
🌐 r/ChatGPTCoding
November 20, 2025
People also ask

What is the main difference between GPT-5.1 and GPT-5.1-Codex?
GPT-5.1 is a general reasoning model optimized for fast, parallel operations, while GPT-5.1-Codex is specialized for long-running, iterative coding workflows that mimic developer loops.
🌐
codeant.ai
codeant.ai › blogs › gpt-5-1-vs-gpt-5-1-codex
GPT-5.1 vs GPT-5.1-Codex: Which Model Wins in Code Review Performance?
Why did GPT-5.1 outperform GPT-5.1-Codex in bug-finding tasks?
Because bug analysis favors parallel exploration and fast reasoning, GPT-5.1’s architecture handles simultaneous tool calls more efficiently than Codex’s sequential approach.
🌐
codeant.ai
codeant.ai › blogs › gpt-5-1-vs-gpt-5-1-codex
GPT-5.1 vs GPT-5.1-Codex: Which Model Wins in Code Review Performance?
Can GPT-5.1-Codex performance improve with an optimized harness?
Yes. Codex is designed for “Codex-like” harnesses; optimizing tool configuration and prompts closer to its training setup could yield better results.
🌐
codeant.ai
codeant.ai › blogs › gpt-5-1-vs-gpt-5-1-codex
GPT-5.1 vs GPT-5.1-Codex: Which Model Wins in Code Review Performance?
🌐
LLM Stats
llm-stats.com › models › compare › gpt-5.1-codex-vs-gpt-5.2-2025-12-11
GPT-5.1 Codex vs GPT-5.2
3 weeks ago - In-depth GPT-5.1 Codex vs GPT-5.2 comparison: Latest benchmarks, pricing, context window, performance metrics, and technical specifications in 2025.
🌐
Artificial Analysis
artificialanalysis.ai › models › comparisons › gpt-5-2-vs-gpt-5-1-codex-mini
GPT-5.2 (xhigh) vs GPT-5.1 Codex mini (high): Model Comparison
Comparison between GPT-5.2 (xhigh) and GPT-5.1 Codex mini (high) across intelligence, price, speed, context window and more.
🌐
Medium
medium.com › @leucopsis › how-gpt-5-2-compares-to-gpt-5-1-54e580307ecb
How GPT-5.2 compares to GPT-5.1. Try free GPT-5.2, no login, no… | by Barnacle Goose | Dec, 2025 | Medium
2 weeks ago - GPT-5.1 focused on listening better: it introduced Instant and Thinking modes, made instruction following less brittle, toned down the corporate voice, and added smarter adaptive reasoning so the model could decide when to think harder versus ...
🌐
GitHub
github.com › openai › codex › issues › 7946
Thank you (again) for GPT 5.2! · Issue #7946 · openai/codex
3 weeks ago - After taking it for a spin, this is immediately and noticeably far superior to Codex 5.1 (or Gemini 3 Pro, Opus 4.5 and so on). So far, nothing seems to be as good as GPT 5.2 Extra High.
Published Dec 12, 2025
🌐
Codeant
codeant.ai › blogs › gpt-5-1-vs-gpt-5-1-codex
GPT-5.1 vs GPT-5.1-Codex: Which Model Wins in Code Review Performance?
November 20, 2025 - A deep technical comparison between GPT-5.1 and GPT-5.1-Codex in real bug-finding tests. Discover why GPT-5.1 outperformed Codex by up to 80% in speed and token efficiency.
🌐
OpenAI Developer Community
community.openai.com › api › feedback
GPT 5 & 5.1 (and 5.2) Codex quality degrading over last month or so - Feedback - OpenAI Developer Community
November 18, 2025 - Maybe it’s me….but I find the quality of GPT 5 and 5.1 Codex to be progressively awful. I have Gemini and Anthropic direct and also Cursor to compare to and I’ve pretty much stopped using GPT Codex. Same question in t…
🌐
Data Studios
datastudios.org › post › chatgpt-5-1-vs-gpt-5-1-codex-how-the-models-differ-how-they-behave-with-tools-and-when-to-use-eac
ChatGPT 5.1 vs GPT-5.1 Codex: How the models differ, how they behave with tools, and when to use each
November 16, 2025 - GPT-5.1 Codex, in contrast, is tuned specifically for long-running workflows inside real repositories, where the model must read, plan, patch, execute commands, interpret errors, and apply precise fixes step by step.
🌐
GitHub
github.blog › home › changelogs › openai’s gpt-5.1, gpt-5.1-codex and gpt-5.1-codex-mini are now in public preview for github copilot
OpenAI's GPT-5.1, GPT-5.1-Codex and GPT-5.1-Codex-Mini are now in public preview for GitHub Copilot - GitHub Changelog
November 13, 2025 - GPT-5.1, GPT-5.1-Codex, and GPT-5.1-Codex-Mini—the full suite of OpenAI’s latest 5.1-series models—are now rolling out in public preview in GitHub Copilot. Availability in GitHub Copilot OpenAI GPT-5.1, GPT-5.1-Codex, and GPT-5.1-Codex-Mini will…
🌐
Artificial Analysis
artificialanalysis.ai › models › comparisons › gpt-5-1-vs-gpt-5-codex
GPT-5.1 (high) vs GPT-5 Codex (high): Model Comparison
Comparison between GPT-5.1 (high) and GPT-5 Codex (high) across intelligence, price, speed, context window and more.
🌐
OpenAI
openai.com › index › gpt-5-1-codex-max
Building more with GPT-5.1-Codex-Max | OpenAI
November 19, 2025 - We expect the token efficiency ... For example, GPT‑5.1-Codex-Max is able to produce high quality frontend designs with similar functionality and aesthetics, but at much lower cost than GPT‑5.1-Codex....
🌐
Reddit
reddit.com › r/githubcopilot › openai's gpt-5.1, gpt-5.1-codex and gpt-5.1-codex-mini are now in public preview for github copilot - github changelog
r/GithubCopilot on Reddit: OpenAI's GPT-5.1, GPT-5.1-Codex and GPT-5.1-Codex-Mini are now in public preview for GitHub Copilot - GitHub Changelog
November 13, 2025 - Try the official launch of GPT-5.1 here: /openai/gpt-5.1 · All prompts and completions for this model are logged by the provider and may be used to improve the model." ... I notice they don't mention Visual Studio and what's available there. ... There is GPT-5-Codex mini that didn't land on Github Copilot.
🌐
Windsurf
windsurf.com › blog › gpt-5-1
GPT 5.1, GPT 5.1-Codex, and GPT-5.1-Codex Mini are now available in Windsurf
November 13, 2025 - GPT 5.1, GPT 5.1-Codex, and GPT-5.1-Codex Mini deliver a solid upgrade for agentic coding with variable thinking and improved steerability
🌐
Medium
medium.com › @leucopsis › how-gpt-5-1-compares-to-gpt-5-402d19bfae85
How GPT-5.1 compares to GPT-5. Updated: November 22, 2025 | by Barnacle Goose | Nov, 2025 | Medium
November 22, 2025 - Independent benchmarks (by Vals) ... tasks). GPT-5.1 is better at LiveCodeBench, however. No significant difference in performance between GPT-5.1-Codex vs GPT-5-Codex....
🌐
Reddit
reddit.com › r/chatgptcoding › gpt-5.1-codex-max day 1 vs gpt-5.1-codex
r/ChatGPTCoding on Reddit: gpt-5.1-codex-max Day 1 vs gpt-5.1-codex
November 20, 2025 -

I work in Codex CLI and generally update when I see a new stable version come out. That meant that yesterday, I agreed to the prompt to try gpt-5.1-codex-max. I stuck with it for an entire day, but by the end it caused so many problems that I switched back to the plain gpt-5.1-codex model (bonus for the confusing naming here). codex-max was far too aggressive in making changes and did not explore bugs as deeply as I wished. When I went back to the old model and undid the damage, it was a big relief.

That said, I suspect many vibe coders in this sub might like it. I think OpenAI heard the complaints that their agent was "lazy" and decided to compensate by making it go all out. That did not work for me, though. I'm refactoring an enterprise codebase and I need an agent that follows directions, producing code for me to review in reasonable chunks. Maybe the future is agents that follow our individual needs? In the meantime I'm sticking with regular codex, but may re-evaluate in the future.

EDIT: Since people have asked, I ran both models at High. I did not try the Extended Thinking mode that codex-max has. In the past I've had good experiences with regular Codex medium as well, but I have Pro now so generally leave it on high.

🌐
OpenAI
openai.com › index › gpt-5-1
GPT-5.1: A smarter, more conversational ChatGPT | OpenAI
November 12, 2025 - GPT‑5.1 Thinking varies its thinking time more dynamically than GPT‑5 Thinking. On a representative distribution of ChatGPT tasks, GPT‑5.1 Thinking is roughly twice as fast on the fastest tasks and twice as slow on the slowest tasks.
🌐
Reddit
reddit.com › r/codex › real world comparison - gpt-5.1 high vs gpt-5.1-codex-max high/extra high
r/codex on Reddit: Real World Comparison - GPT-5.1 High vs GPT-5.1-Codex-Max High/Extra High
November 21, 2025 -

TLDR; After extensive real world architecting, strategizing, planning, coding, reviewing, and debugging comparison sessions between the GPT-5.1 High and GPT-5.1-Codex Max High/Extra High models, I'll be sticking with the "GPT-5.1 High" model for everything.

I’ve been using the new GPT‑5.1 models inside a real project: a reasonably complex web app with separate backend, frontend, and a pretty heavy docs folder (architecture notes, AI handoffs, test plans, etc.).

My priority is correctness over speed. I wanted to see, in a realistic setting, how:

  • GPT‑5.1 High compares to

  • GPT‑5.1‑Codex‑Max High and

  • GPT‑5.1‑Codex‑Max Extra High (reasoning)

for tasks that mix code comprehension, documentation reading, planning, and task suggestion.

This wasn’t a synthetic benchmark; I ran them against my actual repos.

Example prompt / task

Over the last couple of weeks (roughly Nov 7–21), I’d made a bunch of changes:

  • Backend: new management/ops command, test suite cleanup around a previously broken feature, new “modern” tests that avoid touching a live DB.

  • Frontend: refactors in some auth/consent/legal UI, tightening typing and logging, moving experimental components into a “playground” area so they don’t pollute builds, plus some routing cleanup.

  • Docs: AI handoff docs updated with latest state, a detailed test‑status doc, a routing standardization plan, and TODOs for codebase mapping.

My prompt to each model (paraphrased):

“Without changing any code, analyze the changes we’ve made in the last couple of weeks and the docs we’ve written about them. Tell me:

  • what we’ve accomplished,

  • what we should have done differently, and

  • a list of potential next tasks (near‑term and/or medium‑term).”

Each model had access to the same stuff: repo guardrails, AI handoff doc, recent git history for backend/frontend/docs, and the relevant status/plan docs.

What GPT‑5.1 High did

GPT‑5.1 High gave a longer, more narrative answer, but it was well‑grounded:

  • It correctly identified the main work:

    • New read‑only management command for operations.

    • Cleaning up a previously broken feature in the backend by:

      • Marking old/legacy tests as skipped (but kept for reference).

      • Making manual “debug scripts” opt out of pytest collection so they can’t accidentally hit a real DB.

      • Adding a new “modern” test module that calls the current routes in a safe, dependency‑mocked way (a rough sketch of this setup follows this section).

    • Frontend cleanup:

      • Removing/relocating demo/test routes.

      • Moving experiments into a non‑production playground folder and excluding it from builds/tests.

      • Aligning a contact form with backend enums.

      • Tightening types and gating noisy logging behind dev‑only checks.

    • Routing work:

      • Collapsing an old detail page into a redirect to the new canonical page.

      • Adjusting some calls to use the “canonical” route shape instead of the older alias.

    • Docs:

      • AI handoff updated with concrete “Story is resolved, new tests are here, here’s how to rerun mapping scripts, here’s how to use feature flags,” etc.

      • Backend test status doc explaining test changes and future test‑infra ideas.

      • A route standardization plan that distinguishes “pipeline” routes vs “resource” routes.

  • It also talked about what could be better in a more architectural/process sense:

    • Reduce doc sprawl: mark older “current status” docs as historical and have a single always‑current status doc + AI handoff as the truth.

    • Treat code + tests + a short status doc as a single atomic unit when making changes to critical systems, instead of having a lag where the code is fixed but tests/docs still describe the broken behavior.

    • Schedule the routing cleanup as a real refactor project (with phases, tests, rollout plan) instead of a slow, ad‑hoc drift.

    • Build a safer testing infrastructure: test‑only DB configuration and test‑only auth helpers so future tests never accidentally talk to production DB/auth.

  • The task list it produced was more of a roadmap than a pure “do this tomorrow” list:

    • Finish the remaining route work in a principled way.

    • Execute codebase mapping TODOs (type consolidation, invalidation coverage, mapping heuristics).

    • Undertake a test‑infra project (test DB, test auth, limiter bypasses).

    • Continue tightening the integration around the editor and a story‑generation component.

    • Improve operational tooling and doc hygiene.

It was not the shortest answer, but it felt like a thorough retrospective from a senior dev who cares about long‑term maintainability, not just immediate tasks.
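
To make the backend test cleanup above a bit more concrete, here's roughly what that setup can look like. This is a sketch only, not actual code from the repo: the debug_scripts/ folder, the DATABASE_URL variable, and the app.routes.stories import are all invented for illustration; only the pytest mechanics (collect_ignore_glob, fixtures, monkeypatch) are standard.

```python
# conftest.py -- sketch only; folder and variable names are made up.
import pytest

# Keep manual debug scripts out of normal pytest collection so they can
# never be run against a real database by accident. collect_ignore_glob
# is a standard conftest.py variable that pytest honours at collection time.
collect_ignore_glob = ["debug_scripts/*.py"]


@pytest.fixture
def test_db(monkeypatch, tmp_path):
    # Test-only DB configuration: point the app at a throwaway SQLite file
    # via an assumed DATABASE_URL env var, so a test can never reach the
    # production database even if something goes wrong.
    monkeypatch.setenv("DATABASE_URL", f"sqlite:///{tmp_path / 'test.db'}")
    yield


# test_stories_modern.py -- a "modern", dependency-mocked test that calls the
# current route handler directly instead of spinning up a server or a real DB.
from unittest.mock import Mock

from app.routes.stories import rename_story  # hypothetical import path


def test_rename_story_with_mocked_db(test_db):
    fake_db = Mock()
    fake_db.update_title.return_value = {"id": 42, "title": "New title"}

    result = rename_story(story_id=42, new_title="New title", db=fake_db)

    assert result["title"] == "New title"
    fake_db.update_title.assert_called_once_with(42, "New title")
```

The nice part of the collect_ignore_glob approach is that the debug scripts stay in the repo and stay runnable by hand; they just never get picked up as tests.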

What GPT‑5.1‑Codex‑Max High did

Max High’s answer was noticeably more concise and execution‑oriented:

  • It summarized recent changes in a few bullets and then gave a very crisp, prioritized task list, including:

    • Finish flipping a specific endpoint from an “old route” to a “new canonical route”.

    • Add a small redirect regression test (sketched after this list).

    • Run type-check + a narrow set of frontend tests and record the results in the AI handoff doc.

    • Add a simple test at the HTTP layer for the newly “modern” backend routes (as a complement to the direct‑call tests).

    • Improve docs and codebase mapping, and make the new management command more discoverable for devs.

  • It also suggested risk levels (low/medium/high) for tasks, which is actually pretty handy for planning.
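
For what it's worth, the redirect regression test and the HTTP-layer tests it proposed are both tiny to write. A rough sketch: the backend framework is never named above, so a FastAPI/Starlette-style app with a recent (httpx-based) TestClient is assumed purely for illustration, and the import path and URLs are placeholders.

```python
# test_routing_regressions.py -- sketch only; framework, import path and URLs
# are assumptions, not the real repo.
from fastapi.testclient import TestClient

from app.main import app  # hypothetical application entry point

client = TestClient(app)


def test_legacy_detail_page_redirects_to_canonical_route():
    # follow_redirects=False so we assert on the redirect itself rather than
    # on whatever the canonical page happens to return.
    response = client.get("/stories/detail/42", follow_redirects=False)

    assert response.status_code in (301, 302, 307, 308)
    assert response.headers["location"] == "/stories/42"


def test_canonical_route_responds_over_http():
    # The same TestClient doubles as the "HTTP-layer" complement to the
    # direct-call tests: it exercises the modern route through the full stack.
    response = client.get("/stories/42")

    assert response.status_code == 200
```

The value of tests like these isn't coverage numbers; it's making the old-vs-new route state explicit so "is this flipped yet?" gets answered by CI instead of by memory.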

However, there was a key mistake:

  • It claimed that one particular frontend page was still calling the old route for a “rename” action, and proposed “flip this from old → new route” as a next task.

  • I re‑checked the repo with a search tool and the git history:

    • That change had already been made a few commits ago.

    • The legacy page had been updated and then turned into a redirect; the “real” page already used the new route.

  • GPT‑5.1 High had correctly described this; Max High was out of date on that detail.

To its credit, when I pointed this out, Max High acknowledged the mistake, explicitly dropped that task, and kept the rest of its list. But the point stands: the very concise task list had at least one item that was already done, stated confidently as a TODO.

What GPT‑5.1‑Codex‑Max Extra High did

The Extra High reasoning model produced something in between:

  • Good structure: accomplishments, “could be better”, prioritized tasks with risk hints.

  • It again argued that route alignment was “halfway” and suggested moving several operations from the old route prefix to the new one.

The nuance here is that in my codebase, some of those routes are intentionally left on the “old” prefix because they’re conceptually part of a pipeline, not the core resource, and a plan document explicitly says: “leave these as‑is for now.” So Extra High’s suggestion was not strictly wrong, but it was somewhat at odds with the current design decision documented in my routing plan.

In other words: the bullets are useful ideas, but not all of them are “just do this now” items - you still have to cross‑reference the design docs.
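
To spell out the "pipeline vs resource" distinction from the routing plan: the core resource gets canonical routes under the new prefix, while pipeline-style operations deliberately stay on the older prefix for now. A minimal sketch, with the framework and every name invented (this is not the actual repo layout):

```python
# routes_layout.py -- illustrative only; FastAPI is assumed and all names
# are made up to mirror the plan doc's "pipeline vs resource" split.
from fastapi import APIRouter, FastAPI

app = FastAPI()

# Core resource: canonical CRUD-style routes live under the new prefix.
stories = APIRouter(prefix="/stories", tags=["stories"])

# Pipeline steps: intentionally left on the older prefix for now, per the
# plan doc's "leave these as-is for now" note.
pipeline = APIRouter(prefix="/generate", tags=["pipeline"])


@stories.get("/{story_id}")
def get_story(story_id: int):
    return {"id": story_id}


@pipeline.post("/{story_id}/outline")
def build_outline(story_id: int):
    return {"story_id": story_id, "status": "queued"}


# Routers are attached after their routes are declared.
app.include_router(stories)
app.include_router(pipeline)
```

So a suggestion to move the pipeline operations under the resource prefix is really proposing to collapse a split the plan doc currently treats as intentional - useful input, but not a "just do it" task.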

What I learned about these models (for my use case)

  1. Succinctness is great, but correctness comes first.

    • Max/Extra High produce very tight, actionable lists. That’s great for turning into tickets.

    • But I still had to verify each suggestion against the repo/docs. In at least one case (the route that was already fixed), the suggested task was unnecessary.

  2. GPT‑5.1 High was more conservative and nuanced.

    • It took more tokens and gave a more narrative answer, but it:

      • Got the tricky route detail right.

      • Spent time on structural/process issues: doc truth sources, test infra, when to retire legacy code.

    • It felt like having a thoughtful tech lead write a retro + roadmap.

  3. “High for plan, Max for code” isn’t free.

    • I considered: use GPT‑5.1 High for planning/architecture and Max for fast coding implementation.

    • The problem: if I don’t fully trust Max to keep to the plan or to read the latest code/docs correctly, I still need to review its diffs carefully. At that point, I’m not really saving mental effort - just shuffling it.

  4. Cross‑model checking is expensive.

    • If I used Max/Extra High as my “doer” and then asked GPT‑5.1 High to sanity‑check everything, I’d be spending more tokens and time than just using GPT‑5.1 High end‑to‑end for important work.

How I’m going to use them going forward

Given my priorities (correctness > speed):

  • I’ll default to GPT‑5.1 High for:

    • Architecture and planning.

    • Code changes in anything important (backend logic, routing, auth, DB, compliance‑ish flows).

    • Retrospectives and roadmap tasks like this one.

  • I’ll use Codex‑Max / Extra High selectively for:

    • Quick brainstorming (“give me 10 alternative UX ideas”, “different ways to structure this module”).

    • Low‑stakes boilerplate (e.g., generating test scaffolding I’ll immediately review).

    • Asking for a second opinion on direction, not as a source of truth about the current code.

  • For anything that touches production behavior, I’ll trust:

    • The repo, tests, and docs first.

    • Then GPT‑5.1 High’s reading of them.

    • And treat other models as helpful but fallible assistants whose suggestions need verification.

If anyone else is running similar “real project” comparisons between GPT‑5.1 flavors (instead of synthetic benchmarks), I’d be curious how this lines up with your experience - especially if you’ve found a workflow where mixing models actually reduces your cognitive load instead of increasing it.

🌐
Reddit
reddit.com › r/openaicodex › what are the differences between the models "codex-max" (5.1) and just "codex" (5.2)?
r/OpenaiCodex on Reddit: What are the differences between the models "Codex-Max" (5.1) and just "Codex" (5.2)?
1 week ago - i’ve found that gpt-5.1-codex-max spends more time on reasoning, while gpt-5.2-codex is overall much faster. i mostly use codex for code review, so because i’m looking for deeper thinking and reasoning, i actually prefer 5.1 max. i also ...