One useful metric for measuring AI progress is the task-completion time horizon.

It shows how well an AI model handles tasks that need time and skill to finish.

You can see the latest update here:
https://metr.org/time-horizons/

What “hard” tasks mean

Hard tasks are not simple questions.
They need thinking, planning, testing, or debugging.

Examples of tasks that take a skilled person around 30 minutes:

These tasks are clear and short for an expert.

How we measure this metric

We give the same tasks to an AI model.

Then we measure how many tasks the model completes correctly.

Example:

Next, we test longer tasks.

We increase task length step by step and measure the success rate at each level.
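The measuring step above can be sketched in a few lines of Python. This is only an illustration: the function name `success_rate_by_length` and the sample results are hypothetical, and it assumes each test result is recorded as a pair of (minutes a skilled human needs, whether the model solved the task).

```python
from collections import defaultdict


def success_rate_by_length(results):
    """Group results by human task length and return the success rate per level.

    `results` is an iterable of (human_minutes, solved) pairs.
    This record layout is an assumption made for the sketch.
    """
    totals = defaultdict(int)
    solved_counts = defaultdict(int)
    for human_minutes, solved in results:
        totals[human_minutes] += 1
        if solved:
            solved_counts[human_minutes] += 1
    # One success rate per task length, ordered from shortest to longest.
    return {m: solved_counts[m] / totals[m] for m in sorted(totals)}


# Made-up results, for illustration only.
sample_results = [
    (30, True), (30, True), (30, True), (30, False),
    (60, True), (60, False),
]
rates = success_rate_by_length(sample_results)
print(rates)
```

Each key is a task length in minutes and each value is the share of tasks at that length the model solved.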

What the metric tells us

We fix a success rate, such as 80%.

Then we check the longest task duration where the AI still reaches that success level.

Example:

If the 80% time horizon is 1 hour, it means:

The AI succeeds about 80% of the time on tasks that take a skilled human 1 hour.

A higher time horizon means the model can handle longer and harder work.
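As a rough sketch of this check, the Python function below takes a success rate per task length and returns the longest length that still meets the fixed success rate. The name `time_horizon` and the example rates are hypothetical. It works on discrete task lengths, as described above, and it assumes success drops as tasks get longer. It is not claimed to be the exact procedure behind the linked charts.

```python
def time_horizon(rates, target):
    """Return the longest task length (in minutes) that meets `target`.

    `rates` maps human task length in minutes to the model's success rate.
    The scan stops at the first length that misses the target, which
    assumes success falls as tasks get longer (an assumption of this sketch).
    Returns None if even the shortest length misses the target.
    """
    horizon = None
    for minutes in sorted(rates):
        if rates[minutes] >= target:
            horizon = minutes
        else:
            break
    return horizon


# Made-up success rates, for illustration only.
example_rates = {15: 0.95, 30: 0.9, 60: 0.8, 120: 0.55}
print(time_horizon(example_rates, 0.8))
```

With these made-up rates, the 80% time horizon is 60 minutes, because the 120-minute level falls below 80%.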

Recent trend

Recent charts show a clear jump in capability.

Models like Claude Opus moved from around 15 minutes to about 1 hour in 2025.

That means these models became much better at solving longer tasks with high accuracy.
