The Metric That Shows If AI Can Handle Real Work

One useful metric for measuring AI progress is the task-completion time horizon.

It estimates how long a task an AI model can complete at a given success rate, where task length is the time a skilled human needs to finish it.

You can see the latest update here:
https://metr.org/time-horizons/

Hard tasks are not simple questions.
They need thinking, planning, testing, or debugging.

Take tasks that need around 30 minutes from a skilled person.

These tasks are clear and short for an expert.

How we measure this metric

We give an AI model the same tasks that skilled humans were timed on.

Then we measure what share of those tasks the model completes correctly.

Example:

If the AI solves 80% of the tasks that take a human 30 minutes, we record an 80% success rate at the 30-minute level.
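Here is a minimal Python sketch of that bookkeeping. The function name `success_rate` and the ten outcomes are made up for illustration; they are not real benchmark data.

```python
def success_rate(outcomes):
    """Return the fraction of tasks solved; outcomes is a list of booleans."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return sum(outcomes) / len(outcomes)


# Hypothetical results for ten 30-minute tasks (True = solved correctly).
outcomes_30_min = [True, True, True, False, True, True, True, True, False, True]

rate = success_rate(outcomes_30_min)
print(f"Success rate on 30-minute tasks: {rate:.0%}")
```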

Next, we test longer tasks.

We increase task length step by step and measure the success rate at each level.
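The step-by-step measurement can be sketched in Python as a simple grouping by task length. The helper `success_by_duration` and the sample results are hypothetical and only show the idea.

```python
from collections import defaultdict


def success_by_duration(results):
    """Map each human task duration (in minutes) to the model's success rate.

    results: iterable of (human_minutes, solved) pairs.
    """
    solved = defaultdict(int)
    total = defaultdict(int)
    for minutes, ok in results:
        total[minutes] += 1
        solved[minutes] += int(ok)
    # Sort the levels so they read from the shortest to the longest tasks.
    return {m: solved[m] / total[m] for m in sorted(total)}


# Made-up outcomes: four tasks at each of three duration levels.
results = [
    (15, True), (15, True), (15, True), (15, True),
    (30, True), (30, True), (30, True), (30, False),
    (60, True), (60, False), (60, True), (60, False),
]

rates = success_by_duration(results)
for minutes, rate in rates.items():
    print(f"{minutes:>3} min tasks: {rate:.0%} solved")
```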

We fix a target success rate, such as 80%.

Then we check the longest task duration where the AI still reaches that success level.
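As a rough sketch, this check can be written in Python as below. It assumes one success rate per duration level and walks up from the shortest level until the rate drops below the target. The name `time_horizon` and the sample rates are invented for illustration, and a real evaluation may fit a smooth curve to the data rather than use this stepwise check.

```python
def time_horizon(rates_by_minutes, threshold=0.8):
    """Return the longest duration (minutes) still meeting the threshold.

    Walks the duration levels from shortest to longest and stops at the
    first level whose success rate falls below the threshold.
    Returns None if even the shortest level misses the threshold.
    """
    horizon = None
    for minutes in sorted(rates_by_minutes):
        if rates_by_minutes[minutes] < threshold:
            break
        horizon = minutes
    return horizon


# Made-up success rates per human task duration (minutes).
sample_rates = {15: 0.95, 30: 0.90, 60: 0.80, 120: 0.55, 240: 0.30}

print(time_horizon(sample_rates, threshold=0.8))
print(time_horizon(sample_rates, threshold=0.5))
```

With these sample numbers, a stricter target gives a shorter horizon: the same model has a 60-minute horizon at 80% and a 120-minute horizon at 50%.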

Example:

If the 80% time horizon is 1 hour, it means:

The AI solves tasks that take a human about 1 hour roughly 80% of the time.

A higher time horizon means the model can handle longer and harder work.

Recent charts on the METR page show a clear jump in capability.

Models like Claude Opus moved from a time horizon of around 15 minutes to about 1 hour in 2025.

That means the models became much better at reliably solving longer tasks.