As AI systems become more powerful and more widely used, the dangers and risks they pose increase as well. Some dangers come directly from AI systems failing to work as intended. For example, if a language model like ChatGPT outputs made-up information or a self-driving car crashes, that can be quite dangerous! Adversarial interactions and misuse present another set of dangers: even “safe” AI systems can be induced to behave dangerously, and AI systems can be used by bad actors for surveillance, scams, or disinformation. There is also another, less predictable set of dangers, arising from AI systems acting autonomously, thinking about themselves, and building on their own results. The terms “AI safety” and “AI risk” are often overloaded, meaning different things depending on the situation. These risks can be categorized into three main types:
The first problem is roughly equivalent to the challenges faced in safety engineering. The challenge lies in eliminating accidents and ensuring AI systems can be used safely. In traditional engineering, this would be something like keeping automobiles or airplanes safe under normal operating conditions: quite difficult, but something we can largely do. For AI systems such as chatbots or image generators, it involves ensuring they don't produce biased or inaccurate answers, hallucinations, or violent and disturbing responses. It also includes preventing them from leaking sensitive data they have learned. For AI systems integrated with the physical world, like self-driving cars or robots, it could mean preventing them from accidentally hurting people in their environments. We are already seeing issues along these lines: self-driving cars crash and occasionally kill people, and chatbots sometimes threaten users over mundane actions. Like many other engineering problems, this is hard. We do have technical solutions to some of the safety problems. For example, reinforcement learning from human feedback (RLHF), or from AI feedback (constitutional AI), can make AI systems less likely to produce undesired outputs by having an overseer (human or AI) rate some of their outputs as desired or not, and having the systems learn from those examples. These techniques work fairly well when AI systems stay close to the examples they learned from, such as not producing threatening behavior when you disagree with them. Furthermore, given that we have dealt with similar problems in other technologies before, existing infrastructure, such as insurance, court cases, and licensing, can help address these issues. Additionally, we can recall and fix malfunctioning systems, as Microsoft did when it patched Bing Chat.
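To make the overseer-rating idea concrete, here is a minimal toy sketch of preference learning, the core mechanism behind RLHF-style fine-tuning. The feature vectors, data, and linear reward model below are illustrative stand-ins under simplifying assumptions, not any lab's actual pipeline.

```python
# A minimal sketch of preference learning: an overseer labels which of two
# responses is preferred, and we fit a reward model so that preferred
# responses score higher. Everything here is a toy stand-in.
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Toy "responses" represented as feature vectors; in practice these would come
# from a language model's representation of (prompt, response) pairs.
preferred = rng.normal(loc=0.5, scale=1.0, size=(200, dim))   # overseer liked these
rejected = rng.normal(loc=-0.5, scale=1.0, size=(200, dim))   # overseer did not

w = np.zeros(dim)  # weights of a linear reward model r(x) = w . x
lr = 0.1

for _ in range(500):
    # Bradley-Terry style objective: maximize log sigmoid(r(preferred) - r(rejected))
    margin = preferred @ w - rejected @ w
    p = 1.0 / (1.0 + np.exp(-margin))        # P(preferred beats rejected)
    grad = ((1.0 - p)[:, None] * (preferred - rejected)).mean(axis=0)
    w += lr * grad                           # gradient ascent on the log-likelihood

acc = ((preferred @ w) > (rejected @ w)).mean()
print(f"preference accuracy on training pairs: {acc:.2f}")
```

In real systems the scorer is a large neural network over prompt-response pairs, and the chatbot itself is then fine-tuned to produce outputs that this learned reward model rates highly.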
The second set of risks is somewhat equivalent to traditional security problems. In this category, failures are not mere accidents but intentional exploits and hacks. An adversarial environment or entity is trying to make the system work in ways it was not intended to, and the safety challenge is to make these systems not only unlikely to fail accidentally but also resistant to external attacks. This security challenge is harder than safety engineering. On the technical side, it is something we cannot currently guarantee. There are already many open-source AI models, from language models to image generation systems, and we do not know how to make these systems robust to misuse after further training. Users can therefore train these models to do anything they want. Keeping models behind APIs with only inference access, as is done with many language models, voice cloners, and image generation systems, does not help by default either, as people can simply upload voices to be cloned or abuse the language model API. To be effective, APIs need to be paired with other changes, such as shifting liability for voice fakes onto providers that offer voice cloning without first obtaining consent. More generally, even for models kept behind gated access, no one knows how to stop people from jailbreaking chatbots into producing arbitrary output. Moreover, no one knows how to edit existing content out of models in a controlled fashion.
This is a hard technical problem we currently cannot solve. Traditionally, companies running systems like banks or mobile networks harden them against external attack (and are even liable if they don't); this is something we cannot currently do with AI systems. We can patch issues as they arise, but we haven't come close to patching all jailbreaks. On the social side, because open-source models already exist, they would be extremely hard to recall. We could potentially prevent the deployment of powerful new models and the further open-sourcing of models that can be used for deepfakes or voice cloning, or harden social networks and build trust mechanisms to counter them, but we would actually have to do that. This type of security work is something we have done before with new technologies such as mobile banking, but we still need to build these solutions. So while it is possible, we haven't yet done so on either the technical or the societal axis.
And the damage caused by technical exploits and misuse is potentially large. AI is strengthening many traditional vectors of scams and propaganda: from Photoshopped images and handwritten emails to automatically generated images and video, and AI-crafted messages with realistic voices created from very small audio samples. AI is already being used to create illegal child and revenge pornography, and could be used to hack insecure computer systems. It also introduces new attack vectors: AI systems with increasing capabilities will make creating dangerous biological and chemical weapons far easier.
The third type of risk comes from autonomous systems taking unpredictable actions. In this category, failures can be anything, up to and including human extinction. Once we have autonomous AI that can control other AI systems, build on its own results in autonomous loops, and reflect on its actions, such a system could end up anywhere. Just as one person, given five seconds to answer a question, is quite limited, so too are the chatbots we have today. But allow that person to work with others, think about their response, or explore the world and build upon their answers, and they become far more powerful: we go from mediocre question-answering to reaching the moon. The same holds for AI systems: what happens when we have intelligent AI systems that can autonomously explore the world, think about and reflect on their actions, and build and accumulate knowledge and infrastructure around themselves? We don't know, and we have no way of knowing. Yet we are already moving in this direction, with efforts to build autonomously acting agents. OpenAI is trying to raise 100 billion dollars explicitly to build recursively improving AI. AI systems can already recursively improve their skills and capabilities in virtual environments such as Minecraft, used as testing grounds before making them do the same in the real world.
While we are trying to build autonomous, self-improving systems, we do not know how to understand or control them. Because this kind of risk does not exist in other existing technologies, we lack the scientific and social machinery to deal with it. We cannot predict the outcome of an arbitrary prompt to a chatbot, but we can put bounds on its output (say, the response will be shorter than the maximum allowed length). But once such a system starts looping and leveraging other AI systems, those bounds break down, and we have no principled way of predicting or controlling what happens. And we aren't particularly close to discovering one either.
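As a concrete illustration of the kind of bound we can place on a single model call, here is a minimal sketch using the Hugging Face transformers library; the library, model, and prompt are illustrative choices, not anything this essay commits to. We can cap how long a response is, but not what it says, and a looping system built on top of such calls is not constrained by the cap at all.

```python
# A minimal sketch of a hard, syntactic bound on a single model call:
# we can guarantee the reply is at most 50 tokens long, but we cannot
# predict or bound what those tokens will say.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("What should I do next?", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,                      # the length of the reply is bounded here
    pad_token_id=tokenizer.eos_token_id,
)

# Once this output is fed into other tools or models in an autonomous loop,
# even this kind of bound no longer constrains the behavior of the overall system.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```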
The first type of AI risk comes from failures in safety engineering. These lead to accidents in day-to-day use. This is a hard problem, but we have some ideas about how to solve it, both technically and socially. The problems caused by these accidents are bad, but limited in scope.
The second type of risk comes from failures in security. These lead to adversarial hacks and misuse of AI models. It is a much harder problem to solve than the previous one, and one we do not currently know how to solve. Given time, we could probably work out solutions, as we have with previous difficult security issues, through both technical and societal patches. Security failures, such as large-scale surveillance or bioterror attacks, are potentially more damaging than accidents, but they are generally the kind of threat we have faced before, even if in different forms.
However, the third type of risk, from autonomous AI systems, is completely different. If AI systems can build upon and recursively improve themselves, they can potentially do anything physically possible, as we do not know how to limit their output. This is completely different from any existing scientific or societal problem, and as such we have no idea how to solve it, and no solutions in sight. The only known solution to problems this strange and difficult is to avoid them by not building autonomous, self-improving AI systems in the first place.