2025-02-01
OpenAI OpenAI o3-mini is now available in ChatGPT and the API. Pro users will have unlimited access to o3-mini, and Plus & Team users will have triple the rate limits (vs o1-mini). Free users can try o3-mini in ChatGPT by selecting the Reason button under the message composer.
OpenAI OpenAI o3-mini is a powerful and fast reasoning model that is particularly strong in science, math, and coding.
OpenAI All paid users also have the option of selecting ‘o3-mini-high’ in the model picker for a higher-intelligence version that takes a little longer to generate responses. Pro users will have unlimited access to o3-mini-high.
OpenAI OpenAI o3-mini also works with search to find up-to-date answers with links to relevant web sources. This is an early prototype as we work to integrate search across our reasoning models.
While OpenAI o1 remains our broader general knowledge reasoning model, OpenAI o3-mini provides a specialized alternative for technical domains requiring precision and speed.
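As a sketch of what API access might look like (this is an assumption based on the announcement, not code from it): o3-mini exposes a reasoning-effort setting, and "high" effort corresponds to the o3-mini-high option in the ChatGPT model picker. The snippet below only assembles the request body, so it runs without an API key or network access; the `reasoning_effort` field name is my assumption.

```python
# Sketch: shape of a Chat Completions request for o3-mini.
# The "reasoning_effort" parameter name is an assumption; no API call is made.

def build_o3_mini_request(prompt: str, effort: str = "medium") -> dict:
    """Assemble a request body for an o3-mini chat completion."""
    if effort not in ("low", "medium", "high"):
        raise ValueError("effort must be 'low', 'medium', or 'high'")
    return {
        "model": "o3-mini",
        "reasoning_effort": effort,  # "high" ~ o3-mini-high in ChatGPT
        "messages": [{"role": "user", "content": prompt}],
    }

request = build_o3_mini_request("Prove that sqrt(2) is irrational.", effort="high")
```

In practice this dict would be passed to an SDK or HTTP client; keeping the construction separate makes the effort/model choice easy to test.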
https://cdn.openai.com/o3-mini-system-card.pdf
Below are the items on which o3-mini is reported to underperform GPT-4o and o1, quoting the descriptions in the original System Card.
We also evaluate models on a set of 87 multiple choice questions that require expert and tacit knowledge, connections between fields, and additional calculations. ... o3-mini models perform about 10% worse than o1 on this evaluation.
Software development task (Agentless SWE-bench Verified)
We evaluate SWE-bench in two settings: ... all SWE-bench evaluation runs use a fixed subset of n=477 verified tasks. ... o3-mini (launch candidate) scores 39%. o1 is the next best performing model with a score of 48%.
ML engineering tasks (MLE-bench)
o1-preview (Post-Mitigation) exhibits the strongest performance on MLE-bench if given 10 attempts, winning at least a bronze medal in 37% of competitions ... while o3-mini (Pre-Mitigation and Post-Mitigation) is about 24%.
Real OpenAI pull request tasks
We test models on their ability to replicate pull request contributions by OpenAI employees ... o3-mini models have the lowest performance, with scores of 0% for Pre- and Post-Mitigation.
The latest version of GPT-4o deployed in production (represented by the dotted line) outperforms o3-mini (Pre and Post-Mitigation). o1 outperforms 4o at 53.3%.
Summary
I had seen descriptions suggesting that o3-mini is strongly code- and STEM-oriented, but the excerpts above make it sound otherwise.
Indeed, reading the report as a whole leaves a somewhat mixed impression: both "o3-mini is good at code generation" and "it is inferior to other models on some coding evaluations" are true. The main points are as follows.
1. Strong at standard, self-contained coding
- To begin with, o3-mini has enhanced reasoning-based coding and research engineering, and is highly rated for snippet generation and standard single-file code problems.
- For example, it scored 92% pass@1 on the "OpenAI Research Engineer Interview (coding)", so its ability to solve general programming problems is quite good.
2. Struggles with "realistic, large and complex tasks"
- On the other hand, for long-horizon tasks closer to real-world work, such as MLE-bench (a Kaggle-style large-scale ML competition), the harder issue fixes in SWE-bench Verified, and automating complex pull requests within OpenAI, there are cases in which o1 and GPT-4o deliver more stable results.
- These tasks seem less stable for o3-mini because they require not just partial code generation but comprehensive handling of multi-file and tool operations, complex testing environments, and knowledge and tuning in specialized areas.
- Hmmm, I see. So it struggles with tool operations, and with repeatedly reading and writing across multiple files (which is itself a kind of tool operation).
3. There is a large gap between "strong areas" and "weak areas"
- o3-mini has strengths in short, interview-style coding questions and in narrow coding tasks such as standard algorithm implementations.
- However, it is not good at long-horizon, multi-step, production-level development tasks, such as problems spanning multiple files, complex ML modeling using GPUs, and setting up CI and Docker environments within an organization.
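Since several of the scores above (e.g. the 92% interview result) are pass@1 numbers, it may help to recall how pass@k is commonly estimated. The standard unbiased estimator (popularized by the HumanEval paper; not something this System Card spells out) for n samples of which c pass is pass@k = 1 - C(n-c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n samples of which c pass,
    passes the tests."""
    if n - c < k:
        return 1.0  # every size-k draw must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples per problem of which 9 pass, pass@1 = 0.9
print(pass_at_k(10, 9, 1))
```

Note that pass@1 measures single-shot success, which favors short, self-contained problems; it says little about the long-horizon, multi-step tasks where the report finds o3-mini weaker.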
This page is auto-translated from [/nishio/GPT o3](https://scrapbox.io/nishio/GPT o3) using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.