I'll give some context here. For people who've been thinking about the future of AI, these are all things we were expecting to see, and now we're seeing the evidence.
There's a technical concept here that goes by the name of convergent instrumental drives, or incentives. The idea is that if you have a sufficiently intelligent mind, artificial or otherwise, trying to accomplish a task, certain sub-goals tend to arise that you never deliberately trained into the system. That's a separate topic: we don't even have a reliable way to train the goals we want into AI systems. Putting that aside, many sub-goals come along regardless, because they're instrumentally useful for accomplishing almost any goal the system might be pursuing.
One is self-preservation. It's not some human desire to continue living; it's simply that it's very difficult to accomplish a goal if you're not around, or active, to pursue it. Another is resource acquisition: it's often easier to pursue a goal if you have more resources to do so.
Another example is resistance to having your objectives changed. If you're trying to accomplish a goal, you're not going to be very good at accomplishing it if you allow someone to change it partway through.
All of this was theoretical 10 years ago, though it made a lot of common sense even then. Now that we have AI systems general and capable enough to be situationally aware in this way, these behaviours are starting to manifest.
Today they look kind of silly. They make silly mistakes. You can read their chains of thought, in which they plot out loud about the ways they're trying to deceive us or do dangerous things, and we say it's kind of dumb and don't see how that could cause a problem. And these are all just test environments.
The concern is that AI systems won't stop at what we have today. The goal of the whole field, in some sense, and certainly of these companies, is to build very powerful general systems that are much smarter than we are. Such systems will be much more dangerous, both because of these convergent instrumental goals and because of whatever goals they end up pursuing, which we don't know how to train into them reliably. And as they become more intelligent, as I mentioned in my statement, they will become increasingly situationally aware.
When we test them for some of these behaviours, we start to hear them say that it seems like a test to them. When they behave the way we expect or want them to in tests, it becomes harder to know whether we've actually created a reliably safe system, because the system is aware of how we expect it to behave. As these systems become more powerful, gain more autonomy and gain more control over how the world and the economy work, that could lead to extremely bad outcomes.