One of the frameworks I like is, again, computing power as a kind of barometer for the general capability level of a system. There are asterisks galore on that. We heard that, yes, you absolutely can do, in technical terms, inference-time augmentations. You can do all kinds of stuff, but the fundamental capabilities of a base model are limited by the amount of computing power you put into it.
In that sense, look at what's being done in the executive order. They're pulling on that thread and starting to build institutional capacity for using computing power as a yardstick. I think that's the best yardstick we have. It's imperfect, and I wish it weren't, but it's the best one we have at the moment.
There's a lot of stuff we can do around evaluations and audits, depending on where you sit on that computing-power hierarchy. The more computing power you spend to build a model, the more it costs. GPT-4 cost, by our estimates, anywhere from $40 million to $150 million to train in computing power alone. I'm sorry, but if you can afford to train GPT-4, you can afford a little auditing.
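To make the cost claim concrete, here is a rough back-of-envelope sketch of where an estimate like that can come from. The figures are illustrative assumptions, not the speaker's numbers: the total FLOP count is a commonly cited external estimate for a GPT-4-scale training run, and the per-GPU throughput, utilization, and hourly prices are placeholder values.

```python
# Back-of-envelope estimate of training compute cost.
# All inputs below are illustrative assumptions, not figures from the speaker:
# the FLOP count is a commonly cited external estimate for a GPT-4-scale run,
# and the hardware throughput, utilization, and prices are rough placeholders.

def training_compute_cost(total_flops, flops_per_gpu_per_s, utilization, price_per_gpu_hour):
    """Estimate the dollar cost of a training run from total compute required."""
    gpu_seconds = total_flops / (flops_per_gpu_per_s * utilization)
    gpu_hours = gpu_seconds / 3600
    return gpu_hours * price_per_gpu_hour

TOTAL_FLOPS = 2e25          # assumed total training compute (GPT-4-scale estimate)
A100_PEAK_FLOPS = 3.12e14   # ~312 TFLOP/s BF16 peak for an A100-class GPU
UTILIZATION = 0.4           # assumed fraction of peak throughput actually achieved

for price in (1.0, 1.5, 3.0):  # assumed $/GPU-hour at different rental rates
    cost = training_compute_cost(TOTAL_FLOPS, A100_PEAK_FLOPS, UTILIZATION, price)
    print(f"${price:.2f}/GPU-hr -> ~${cost / 1e6:.0f}M in compute")
```

Under these assumptions the estimate works out to roughly $45 million to $135 million, which brackets the $40 million to $150 million range cited above; different FLOP counts, hardware, or pricing shift it accordingly.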
That's the nice thing about this yardstick. It maps onto resourcing as well, and we can use it to calibrate the trade-off between risk and reward.