Task time vs AI Success – a new Moore’s Law?

A few folks have sent me this paper by Kwa et al., and commentary by Toby Ord: Is there a Half-Life for the Success Rates of AI Agents? — Toby Ord.

First the paper. Kwa and co-authors note,


However, existing benchmarks face several key limitations. First, they often consist of artificial rather than economically valuable tasks. Second, benchmarks are often adversarially selected for tasks that current models struggle with compared to humans, biasing the comparison to human performance. Most critically, individual benchmarks saturate increasingly quickly, and we lack a more general, intuitive, and quantitative way to compare between different benchmarks, which prevents meaningful comparison between models of vastly different capabilities (e.g., GPT-2 versus o1).

They propose and test a new measure of AI progress, the task completion time horizon: the length of time humans typically take to complete tasks that a model can finish at a given success rate (e.g. 50%). This approach is based on the psychometric literature and item-response theory.

For example, take a task like reading complex clinical cases and answering multiple-choice questions. The metric combines two things: whether the AI answers correctly at some success rate (e.g. 50%), and how long expert clinicians take to answer the same questions.

The authors provide a nice graph showing a Moore’s Law-like doubling of this time horizon roughly every 7 months.

Toby Ord puts it this way:


The idea of measuring improvement in AI capabilities over time via time horizons at a chosen success rate is novel and interesting. AI forecasting is often hamstrung by the lack of a good measure for the y-axis of performance over time. We can track progress within a particular benchmark, but these are often solved in a couple of years, and we lack a good measure of underlying capability that can span multiple benchmarks. METR’s measure allows comparisons between very different kinds of tasks in a common currency (time it takes a human) and shows a strikingly clear trend line — suggesting it is measuring something real.

He also notes that the decline in success rate with task length is not linear but fits a constant hazard rate model, i.e. exponential decay.


If AI agent success-rates drop off with task length in this manner, then the 50% success rate time-horizon for each agent from Kwa et al. is precisely the half-life of that agent. As with the half-life of a radioisotope, this isn’t just the median lifespan, it is the median remaining lifespan starting at any time — something that is only possible for an exponential survival curve. Unlike for particles, this AI agent half-life would be measured not in clock time, but in how long it takes a human to complete the task.


This constant hazard rate model would predict that the time horizon for an 80% success rate is about ⅓ of the time horizon for a 50% success rate. This is because the chance of surviving three periods with an 80% success rate = (0.8)^3 = 0.512 ≈ 50%. More precisely, the time horizon for a success probability of p would be ln(p)/ln(q) times as long as one with success probability q. So an 80% time-horizon would be ln(0.8)/ln(0.5) = 0.322 times as long as the 50% time-horizon.


One rationale for this constant hazard rate model for AI agents is that tasks require getting past a series of steps each of which could end your attempt, with the longer the duration of the task, the more such steps. More precisely, if tasks could be broken down into a long sequence of equal-length subtasks with a constant (and independent) chance of failure, such that to succeed in the whole task, the agent needs to succeed in all subtasks, then that would create an exponential survival curve. I.e. when Pr(Task) = Pr(Subtask_1 & Subtask_2 & … & Subtask_N).
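To make Ord's arithmetic concrete, here is a minimal Python sketch of the constant hazard rate model as described in the quote. The function names and the 60-minute half-life are my own illustrative assumptions, not anything from the paper or from Ord's post.

```python
import math

def success_probability(task_minutes, half_life_minutes):
    """Exponential survival curve: success probability halves every half-life."""
    return 0.5 ** (task_minutes / half_life_minutes)

def horizon_ratio(p, q=0.5):
    """Length of the p-success time horizon relative to the q-success one: ln(p)/ln(q)."""
    return math.log(p) / math.log(q)

def chained_success(per_step_success, n_steps):
    """Chance of clearing n independent subtasks that each succeed with the same probability."""
    return per_step_success ** n_steps

half_life = 60.0  # hypothetical 50% time horizon, in minutes of human task time

ratio_80 = horizon_ratio(0.8)  # 80% horizon as a fraction of the 50% horizon
print(round(ratio_80, 3))
print(round(success_probability(ratio_80 * half_life, half_life), 3))
print(round(chained_success(0.8, 3), 3))
print(round(horizon_ratio(0.99) * half_life, 2))  # 99% horizon in minutes
```

Under these assumptions, an agent with a one-hour half-life would only be reliable at the 99% level on tasks that take a human less than a minute, which shows how quickly the usable task length shrinks as the required success rate rises.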

Why is this useful? It allows predictions of the task length (as a proxy for complexity) that a given model can complete at high accuracy (e.g. a 99% success rate); it provides a meaningful way to assess the complexity of tasks that models (and agentic systems) can complete; and if the doubling time of 7 months holds up, it offers a way to forecast future model performance.
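As a rough illustration of the forecasting idea, the sketch below extrapolates a time horizon under a fixed 7-month doubling time. The starting horizon of one hour and the function names are hypothetical, and the calculation simply assumes the trend continues unchanged.

```python
import math

def projected_horizon(current_hours, months_ahead, doubling_months=7.0):
    """Time horizon after months_ahead, assuming it doubles every doubling_months."""
    return current_hours * 2 ** (months_ahead / doubling_months)

def months_until(target_hours, current_hours, doubling_months=7.0):
    """Months needed for the horizon to grow from current_hours to target_hours."""
    return doubling_months * math.log2(target_hours / current_hours)

current = 1.0  # hypothetical 50% time horizon today, in human-hours

print(projected_horizon(current, 14))  # two doublings from now
print(months_until(8.0, current))      # time to reach a full working day
```

The same extrapolation can be combined with the half-life ratio: a forecast of the 50% horizon implies a forecast of the 80% or 99% horizon, provided the constant hazard rate assumption also holds.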

This needs to continue to be tested; but if it holds up it may represent a kind of Moore’s Law for AI systems.
