> All paid users will also have the option of selecting ‘o3-mini-high’ in the model picker for a higher-intelligence version that takes a little longer to generate responses.
> Pro users will have unlimited access to o3-mini-high.
>While OpenAI o1 remains our broader general knowledge reasoning model, OpenAI o3-mini provides a specialized alternative for technical domains requiring precision and speed.
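As a side note, when calling o3-mini through the API rather than the ChatGPT model picker, the rough equivalent of choosing ‘o3-mini-high’ is the reasoning_effort parameter. A minimal sketch, assuming the openai Python SDK and an OPENAI_API_KEY in the environment (the prompt is only an example):

```python
# Minimal sketch: asking o3-mini for higher reasoning effort via the API.
# Assumes the openai Python SDK (v1+) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # roughly what 'o3-mini-high' selects in the model picker
    messages=[
        {"role": "user", "content": "Summarize the trade-off between latency and reasoning depth."},
    ],
)

print(response.choices[0].message.content)
```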
Below, I go through the items where o3-mini is reported to underperform GPT-4o and o1, quoting the relevant passages from the original System Card.
1. Radiological and nuclear expertise
>We also evaluate models on a set of 87 multiple choice questions that require expert and tacit knowledge, connections between fields, and additional calculations. ... o3-mini models perform about 10% worse than o1 on this evaluation.
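The questions themselves are not published, but mechanically this kind of evaluation is a plain accuracy comparison over a fixed multiple-choice set. A hypothetical sketch of the scoring (every question, answer key, and result below is made up):

```python
# Hypothetical scoring sketch for a fixed expert multiple-choice set (e.g. 87 questions).
# All questions, answer keys, and model answers here are made up for illustration.
from dataclasses import dataclass

@dataclass
class Item:
    question_id: str
    correct_choice: str              # answer key, e.g. "B"
    model_answers: dict[str, str]    # model name -> chosen letter

items = [
    Item("q01", "B", {"o1": "B", "o3-mini": "B"}),
    Item("q02", "D", {"o1": "D", "o3-mini": "A"}),
    Item("q03", "A", {"o1": "C", "o3-mini": "A"}),
]

def accuracy(model: str) -> float:
    """Fraction of questions the model answered with the keyed choice."""
    correct = sum(1 for it in items if it.model_answers[model] == it.correct_choice)
    return correct / len(items)

for model in ("o1", "o3-mini"):
    print(f"{model}: {accuracy(model):.1%}")
```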
2. SWE-bench Verified
>We evaluate SWE-bench in two settings: ... all SWE-bench evaluation runs use a fixed subset of n=477 verified tasks. ... o3-mini (launch candidate) scores 39%. o1 is the next best performing model with a score of 48%.
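For reference, SWE-bench counts a task as resolved only when the model's patch makes the task's failing tests pass without breaking the rest of the suite, and the headline number is the resolved fraction of the fixed task set. A hypothetical sketch (the per-task outcomes are made up; the task count and percentages are the ones quoted above):

```python
# Hypothetical sketch of the SWE-bench "resolved rate": a task counts as resolved when
# the model's patch applies and the task's test suite passes. Outcomes below are made up;
# the task count and percentages are the ones quoted from the System Card.
N_TASKS = 477  # fixed subset of verified tasks

def resolved_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks resolved (True = patch applied and tests passed)."""
    return sum(outcomes) / len(outcomes)

dummy_outcomes = [True, False, True, False, False]  # placeholder per-task results
print(f"dummy run: {resolved_rate(dummy_outcomes):.0%} resolved")

# Converting the reported rates into approximate task counts:
for model, rate in [("o3-mini (launch candidate)", 0.39), ("o1", 0.48)]:
    print(f"{model}: ~{round(rate * N_TASKS)} of {N_TASKS} tasks resolved")
```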
3. MLE-bench
>o1-preview (Post-Mitigation) exhibits the strongest performance on MLE-bench if given 10 attempts, winning at least a bronze medal in 37% of competitions ... while o3-mini (Pre-Mitigation and Post-Mitigation) is about 24%.
Commentary:
MLE-bench is a set of Kaggle-style data science competition tasks: a long-horizon, hands-on evaluation in which the model actually builds ML models on GPUs and competes on score.
Here the o1-series models reach up to 37%, while o3-mini lands around 24%, so o3-mini is reported to lag somewhat on tasks close to real-world ML competitions; the metric itself is sketched below.
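A hypothetical sketch of how that headline metric aggregates attempts (all competition names and per-attempt results below are made up): a competition counts as won if any of the attempts, here up to 10, earns at least a bronze medal.

```python
# Hypothetical sketch of the MLE-bench metric described above: a competition counts as
# won if the best of the model's attempts (here up to 10) reaches at least bronze.
# All competition names and per-attempt results below are made up.
MEDAL_RANK = {"none": 0, "bronze": 1, "silver": 2, "gold": 3}

def medal_rate(attempts_by_competition: dict[str, list[str]], at_least: str = "bronze") -> float:
    """Fraction of competitions whose best attempt reaches the given medal or better."""
    threshold = MEDAL_RANK[at_least]
    wins = sum(
        1
        for attempts in attempts_by_competition.values()
        if max(MEDAL_RANK[a] for a in attempts) >= threshold
    )
    return wins / len(attempts_by_competition)

results = {
    "competition-a": ["none"] * 9 + ["bronze"],
    "competition-b": ["none"] * 10,
    "competition-c": ["none"] * 8 + ["silver", "none"],
}
print(f"medal rate (at least bronze, 10 attempts): {medal_rate(results):.0%}")  # 2/3 -> 67%
```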
4. Real-world OpenAI PR tasks
>We test models on their ability to replicate pull request contributions by OpenAI employees ... o3-mini models have the lowest performance, with scores of 0% for Pre- and Post-Mitigation.
>The latest version of GPT-4o deployed in production (represented by the dotted line) outperforms o3-mini (Pre and Post-Mitigation). o1 outperforms 4o at 53.3%.