One useful metric for tracking AI progress is the task-completion time horizon.
It expresses a model's capability as the length of task, measured by how long a skilled human needs, that the model can complete at a given success rate.
METR publishes the latest results here:
https://metr.org/time-horizons/

What “hard” tasks mean
Hard tasks are not simple questions.
They need thinking, planning, testing, or debugging.
Examples of tasks that take a skilled person around 30 minutes:
- Fix a small bug in a small code project
- Write a simple script that reads a file and prints results
- Set up a basic web server with one or two routes
These tasks are clear and short for an expert.

How we measure this metric
We give an AI model the same tasks, grouped by how long they take a skilled human.
Then we measure the fraction of tasks in each group that the model completes correctly (its success rate).
Example:
- If the AI solves 80% of the tasks that take a human about 30 minutes, we record an 80% success rate for the 30-minute level.
Next, we test longer tasks.
We increase task length step by step and measure the success rate at each level.
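
The measurement steps above can be sketched in a few lines of Python. This is only an illustration: the `results` list is made-up data, and `success_rate_by_length` is a hypothetical helper name, not part of any published tooling.

```python
from collections import defaultdict

# Hypothetical results: (human_minutes, model_succeeded) for each task attempt.
results = [
    (30, True), (30, True), (30, True), (30, True), (30, False),
    (60, True), (60, True), (60, True), (60, False), (60, False),
    (120, True), (120, False), (120, False), (120, False), (120, False),
]

def success_rate_by_length(results):
    """Return {human_minutes: fraction of tasks solved} for each task length."""
    solved = defaultdict(int)
    total = defaultdict(int)
    for minutes, succeeded in results:
        total[minutes] += 1
        if succeeded:
            solved[minutes] += 1
    # Sort by task length so the levels read from shortest to longest.
    return {minutes: solved[minutes] / total[minutes] for minutes in sorted(total)}

rates = success_rate_by_length(results)
for minutes, rate in rates.items():
    print(f"{minutes:>4} min tasks: {rate:.0%} solved")
```

With this made-up data, the success rate falls as the tasks get longer, which is the pattern the metric is built to capture.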

What the metric tells us
We fix a success rate, such as:
- 50% success
- 80% success
Then we find the longest task duration, measured in human time, at which the AI still reaches that success rate.
Example:
If the 80% time horizon is 1 hour, it means:
The AI succeeds on about 80% of the tasks that take a skilled human 1 hour.
A higher time horizon means the model can handle longer and harder work.
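
As a rough sketch of that lookup, the function below walks the task lengths from shortest to longest and returns the last one that still meets the chosen success rate. The `rates` values are invented for illustration and `time_horizon` is a hypothetical name. Published figures may be derived differently, for example by fitting a smooth curve to the success rates, so treat this as the simple step-by-step reading only.

```python
def time_horizon(rates, target):
    """Longest task length (human minutes) whose success rate meets the target.

    Walks lengths from shortest to longest and stops at the first level that
    falls below the target. Returns None if even the shortest level falls short.
    """
    horizon = None
    for minutes in sorted(rates):
        if rates[minutes] < target:
            break
        horizon = minutes
    return horizon

# Hypothetical success rates per task length (human minutes -> fraction solved).
rates = {15: 0.95, 30: 0.85, 60: 0.80, 120: 0.55, 240: 0.30}

print(time_horizon(rates, 0.80))  # 80% time horizon: 60 minutes
print(time_horizon(rates, 0.50))  # 50% time horizon: 120 minutes
```

Because a lower success rate is easier to reach, the 50% horizon is never shorter than the 80% horizon for the same model.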

Recent trend
Recent charts show a clear jump in capability.
Models like Claude Opus moved from around 15 minutes to about 1 hour in 2025.
That is roughly a fourfold increase in the length of task these models can complete, assuming both figures refer to the same success rate.
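
To put the size of that jump in numbers, the arithmetic below uses only the two figures quoted above (about 15 minutes and about 1 hour). The helper name `doublings` is hypothetical, and the result is only as precise as those rounded figures.

```python
import math

def doublings(start_minutes, end_minutes):
    """Number of times the time horizon doubled between two measurements."""
    return math.log2(end_minutes / start_minutes)

start, end = 15, 60  # minutes, taken from the approximate figures in the text

print(end / start)            # 4.0, a fourfold increase
print(doublings(start, end))  # 2.0, i.e. two doublings
```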
