Did the model perform poorly and OpenAI decided to not publish arc agi 3 scores? This is honestly the best benchmark right now to measure true intelligence.