  1. Five hours of expert level autonomy: METR’s Claude ... - Digit

    1 day ago · A new result from the AI evaluation nonprofit METR has pushed the conversation around autonomous AI systems into new territory. According to METR’s latest reporting, Claude Opus 4.5 …

  2. Anthropic's Claude Opus 4.5 can tackle some tasks lasting ...

    2 days ago · The AI research organization METR has published new test results for Claude Opus 4.5. Anthropic's model achieves a so-called 50 percent time horizon of around 4 hours and 49 minutes.

  3. Claude Opus 4.5 Dominates with 4+ Hour Task Performance on ...

    Claude Opus 4.5 delivers a 21 percentage point accuracy boost on the WeirdML benchmark while slashing costs by two-thirds. The upgrade represents the biggest performance leap in the Opus …

  4. Techmeme: METR: Claude Opus 4.5 has a 50% task completion ...

    2 days ago · METR: Claude Opus 4.5 has a 50% task completion time horizon of about 4 hours and 49 minutes, more than double that of Claude Opus 4 released earlier this year — We estimate that, on …

  5. METR

    METR does not accept monetary compensation from model developers for this work, but companies including OpenAI and Anthropic have provided access and free compute credits to support our …

  6. We estimate that, on our tasks, Anthropic's Claude Opus 4.5 ...

    We estimate that, on our tasks, Anthropic's Claude Opus 4.5 has a 50%-time horizon of around 4 hrs 49 mins (95% confidence interval of 1 hr 49 mins to 20 hrs 25 mins). While we're still working ...

  7. METR long-horizon agent evals 7× in 2025 – Opus hits 4h49m

    3 days ago · Cross‑account focus on METR’s long‑horizon coding evals: Opus 4.5 hits near 5‑hour 50% horizon but only ~27 min at 80%. Today adds acceleration charts, reliability caveats, and predictions …