Of course, all of this is just the best-case scenario for how the Singularity might turn out. Everything we’ve been discussing up to this point has been taking it for granted that every step leading up to the Singularity will go more or less the way we want and expect it to go. But what if it doesn’t? What if something goes horribly wrong? The kinds of technologies we’re talking about, after all, are unfathomably powerful – powerful enough not only to create entire worlds, but to destroy them – so if anything were to go wrong with them, then surely the potential consequences would be absolutely cataclysmic. But do we really think that we fallible humans would actually be able to wield these kinds of all-powerful technologies without anything going wrong at any point?
This is by far the biggest challenge that might be raised against the ideas we’ve been discussing here – and unlike some of the others mentioned above, it’s one that (in my view) actually does have some real weight behind it, and should be taken seriously. There are in fact all kinds of ways in which the Singularity might go wrong – and the consequences in such cases really could be not just bad, but downright apocalyptic.
Consider nanotechnology, for instance. As much positive potential as this technology undoubtedly has, it also raises the threat of various potential failure modes – the most notorious of these being what has become known as the “gray goo” scenario. As Urban explains:
In older versions of nanotech theory, a proposed method of nanoassembly involved the creation of trillions of tiny nanobots that would work in conjunction to build something. One way to create trillions of nanobots would be to make one that could self-replicate and then let the reproduction process turn that one into two, those two then turn into four, four into eight, and in about a day, there’d be a few trillion of them ready to go. That’s the power of exponential growth. Clever, right?
It’s clever until it causes the grand and complete Earthwide apocalypse by accident. The issue is that the same power of exponential growth that makes it super convenient to quickly create a trillion nanobots makes self-replication a terrifying prospect. Because what if the system glitches, and instead of stopping replication once the total hits a few trillion as expected, they just keep replicating? The nanobots would be designed to consume any carbon-based material in order to feed the replication process, and unpleasantly, all life is carbon-based. The Earth’s biomass contains about 10^45 carbon atoms. A nanobot would consist of about 10^6 carbon atoms, so 10^39 nanobots would consume all life on Earth, which would happen in 130 replications (2^130 is about 10^39), as oceans of nanobots (that’s the gray goo) rolled around the planet. Scientists think a nanobot could replicate in about 100 seconds, meaning this simple mistake would inconveniently end all life on Earth in 3.5 hours.
An even worse scenario—if a terrorist somehow got his hands on nanobot technology and had the know-how to program them, he could make an initial few trillion of them and program them to quietly spend a few weeks spreading themselves evenly around the world undetected. Then, they’d all strike at once, and it would only take 90 minutes for them to consume everything—and with them all spread out, there would be no way to combat them.
While this horror story has been widely discussed for years, the good news is that it may be overblown—Eric Drexler, who coined the term “gray goo,” sent me an email following this post with his thoughts on the gray goo scenario: “People love scare stories, and this one belongs with the zombies. The idea itself eats brains.”
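Before moving on, it’s worth noting that the arithmetic in Urban’s scenario does check out. Here’s a quick back-of-the-envelope version in Python, using his own assumed figures for biomass, nanobot size, and replication time:

```python
# Sanity check of the "gray goo" arithmetic quoted above.
# All input figures (carbon atoms in Earth's biomass, atoms per nanobot,
# seconds per replication) are Urban's assumptions, not established facts.
import math

biomass_carbon_atoms = 1e45      # carbon atoms in Earth's biomass (assumed)
atoms_per_nanobot = 1e6          # carbon atoms per nanobot (assumed)
seconds_per_replication = 100    # time per doubling (assumed)

nanobots_needed = biomass_carbon_atoms / atoms_per_nanobot   # ~1e39
doublings = math.log2(nanobots_needed)                       # ~130

hours = doublings * seconds_per_replication / 3600
print(f"{nanobots_needed:.0e} nanobots, ~{doublings:.0f} doublings, ~{hours:.1f} hours")
# -> 1e+39 nanobots, ~130 doublings, ~3.6 hours
```

So the numbers themselves aren’t the weak point of the scenario; the real question is how likely such a runaway process is in the first place.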
It’s reassuring that the very person who first thought up this idea no longer considers it such a major concern. Nevertheless, even if he’s right and the prospect of accidentally unleashing a runaway flood of self-replicating nanobots is relatively remote, there’s still (by Drexler’s own admission) a very real threat that someone might deliberately use nanotechnology in a malicious way, so we would still need to manage that threat somehow. As the Wikipedia article explains:
Drexler [has] conceded that there is no need to build anything that even resembles a potential runaway replicator. This would avoid the problem entirely. In a paper in the journal Nanotechnology, he argues that self-replicating machines are needlessly complex and inefficient. His 1992 technical book on advanced nanotechnologies Nanosystems: Molecular Machinery, Manufacturing, and Computation describes manufacturing systems that are desktop-scale factories with specialized machines in fixed locations and conveyor belts to move parts from place to place. None of these measures would prevent a party from creating a weaponized gray goo, were such a thing possible.
[…]
More recent analysis in the paper titled Safe Exponential Manufacturing from the Institute of Physics (co-written by Chris Phoenix, Director of Research of the Center for Responsible Nanotechnology, and Eric Drexler), shows that the danger of gray goo is far less likely than originally thought. However, other long-term major risks to society and the environment from nanotechnology have been identified. Drexler has made a somewhat public effort to retract his gray goo hypothesis, in an effort to focus the debate on more realistic threats associated with knowledge-enabled nanoterrorism and other misuses.
In Safe Exponential Manufacturing, which was published in a 2004 issue of Nanotechnology, it was suggested that creating manufacturing systems with the ability to self-replicate by the use of their own energy sources would not be needed. The Foresight Institute also recommended embedding controls in the molecular machines. These controls would be able to prevent anyone from purposely abusing nanotechnology, and therefore avoid the gray goo scenario.
Paul Rincon sums up:
[The kinds of systems that might accidentally produce a gray goo scenario] would be exponential, in which one machine makes another machine, both of which then make two more machines, so that the number of duplicates increases in the pattern 1, 2, 4, 8 and so on until a limit is reached.
But simplicity and efficiency will favour those devices that are directed by a stream of instructions from an external computer, argue Drexler and Phoenix. They call this controlled process “autoproduction” to distinguish it from self-replication.
The authors believe the use of nanotechnology to develop new kinds of weapons poses a far more serious threat. These weapons could be produced in unprecedented quantities and could lead to a new arms race.
In other words, while the gray goo scenario may not be as likely as initially feared, the potential threats posed by nanotechnology overall are nevertheless very real, and should be treated accordingly. In fact, even the gray goo scenario itself can’t be completely ruled out, as the Center for Responsible Nanotechnology explains:
Although grey goo has essentially no military and no commercial value, and only limited terrorist value, it could be used as a tool for blackmail. Cleaning up a single grey goo outbreak would be quite expensive and might require severe physical disruption of the area of the outbreak (atmospheric and oceanic goos deserve special concern for this reason). Another possible source of grey goo release is irresponsible hobbyists. The challenge of creating and releasing a self-replicating entity apparently is irresistible to a certain personality type, as shown by the large number of computer viruses and worms in existence. We probably cannot tolerate a community of “script kiddies” releasing many modified versions of goo.
Development and use of molecular manufacturing poses absolutely no risk of creating grey goo by accident at any point. However, goo type systems do not appear to be ruled out by the laws of physics, and we cannot ignore the possibility that [they could be built] deliberately at some point, in a device small enough that cleanup would be costly and difficult. Drexler’s 1986 statement [that “we cannot afford certain kinds of accidents with replicating assemblers”] can therefore be updated: We cannot afford criminally irresponsible misuse of powerful technologies. Having lived with the threat of nuclear weapons for half a century, we already know that.
We wish we could take grey goo off CRN’s list of dangers, but we can’t. It eventually may become a concern requiring special policy. Grey goo will be highly difficult to build, however, and non-replicating nano-weaponry may be substantially more dangerous and more imminent.
Considering all these different ways in which nanotechnology might pose a threat, then – and how incredibly difficult it would be to control it – is there any realistic way we could possibly protect ourselves against it? Well, there might be one hope for keeping nanotechnology in check: A globe-spanning superintelligence, with its own off-the-charts levels of power and speed, would theoretically be more than capable of handling the job. That being said, bringing superintelligence into the picture (especially superintelligence in the form of ASI) would introduce a whole new set of concerns – because if anything went wrong with the ASI, it would be so powerful that it would potentially pose a threat even greater than that of uncontrolled nanotechnology. As Grossman writes:
Kurzweil admits that there’s a fundamental level of risk associated with the Singularity that’s impossible to refine away, simply because we don’t know what a highly advanced artificial intelligence, finding itself a newly created inhabitant of the planet Earth, would choose to do. It might not feel like competing with us for resources. One of the goals of the Singularity Institute is to make sure not just that artificial intelligence develops but also that the AI is friendly. You don’t have to be a super-intelligent cyborg to understand that introducing a superior life-form into your own biosphere is a basic Darwinian error.
Reading this, the first thing you think of might be the kind of sci-fi scenario that you’d see in a Terminator movie, with evil machines trying to wipe out humanity – which might make it seem like a more straightforward problem than it really is. If we’re worried about the machines being evil, after all, then we can just… not program them to be evil, right? Computers can only ever do what they’re specifically programmed to do, so what’s the issue here? But it’s actually more complicated than that – because in a way, the fact that “computers can only ever do what they’re specifically programmed to do” is exactly where the problem comes from. As you’ll know all too well if you’ve ever done any kind of programming yourself, giving a computer a particular set of commands means that, for better or worse, it’ll follow those commands – and only those commands – to the letter. It won’t exercise anything like “common sense” or “judiciousness” or “sensible restraint” in their execution, nor will it attempt to infer whether or not the output it produces is in line with what you “actually” want. All it can do – and all it will do – is exactly what you tell it to do, regardless of what your actual intentions are. In other words, the only way to ever get a computer to do what you actually want is to be literally 100% perfect with your instructions – and that’s something that’s hardly ever possible to nail on the first try. This is what’s known as the “alignment problem,” and it’s just as much an issue with AI as it is with any other computer system, as Ezra Klein points out:
[A common mistake is to imagine future] A.I. as a technology that will, itself, respect boundaries. But its disrespect for boundaries is what most worries the people working on these systems. Imagine that “personal assistant” is rated as a low-risk use case and a hypothetical GPT-6 is deployed to power an absolutely fabulous personal assistant. The system gets tuned to be extremely good at interacting with human beings and accomplishing a diverse set of goals in the real world. That’s great until someone asks it to secure a restaurant reservation at the hottest place in town and the system decides that the only way to do it is to cause a disruption that leads a third of that night’s diners to cancel their bookings.
Sounds like sci-fi? Sorry, but this kind of problem is sci-fact. Anyone training these systems has watched them come up with solutions to problems that human beings would never consider, and for good reason. OpenAI, for instance, trained a system to play the boat racing game CoastRunners, and built in positive reinforcement for racking up a high score. It was assumed that would give the system an incentive to finish the race. But the system instead discovered “an isolated lagoon where it can turn in a large circle and repeatedly knock over three targets, timing its movement so as to always knock over the targets just as they repopulate.” Choosing this strategy meant “repeatedly catching on fire, crashing into other boats, and going the wrong way on the track,” but it also meant the highest scores, so that’s what the model did.
This is an example of “alignment risk,” the danger that what we want the systems to do and what they will actually do could diverge, and perhaps do so violently.
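To make the idea of alignment risk a bit more concrete, here’s a deliberately silly toy sketch of my own (the numbers are made up, and this is nothing like the real CoastRunners setup): an agent that greedily maximizes the score we actually specified will happily pick the degenerate strategy over the one we wanted.

```python
# Toy illustration of reward misspecification (hypothetical numbers,
# not the real CoastRunners environment): the designers intend
# "finish the race", but the reward only counts points.

strategies = {
    "finish the race cleanly":         {"points": 1000, "finishes_race": True},
    "circle the lagoon, farm targets": {"points": 2650, "finishes_race": False},
    "crash around at random":          {"points": 150,  "finishes_race": False},
}

def proxy_reward(outcome):
    # What the system was actually told to maximize.
    return outcome["points"]

def intended_reward(outcome):
    # What the designers meant: finish the race; points are secondary.
    return (1_000_000 if outcome["finishes_race"] else 0) + outcome["points"]

best_for_agent = max(strategies, key=lambda s: proxy_reward(strategies[s]))
best_for_us = max(strategies, key=lambda s: intended_reward(strategies[s]))

print("agent picks:", best_for_agent)   # -> circle the lagoon, farm targets
print("we wanted:", best_for_us)        # -> finish the race cleanly
```

The point isn’t the code itself, of course; it’s that the gap between proxy_reward and intended_reward is exactly where alignment failures live.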
Examples like these – disrupting people’s dinner plans, crashing virtual boats, etc. – might not be so bad on their own if they were the worst kinds of problems that could ever arise from misaligned AI. But these are just small-scale examples designed to illustrate the basic concept. In reality, the alignment problem would be a much bigger concern, because it would generalize to every scale of AI development – including the highest-stakes, planetary-level scales. The whole central feature of ASI, after all, is that there’s practically nothing it’s not capable of – and while that means there’s limitless potential for positive, beneficial actions, it also means there’s limitless potential for dangerous, harmful ones.
Probably the most famous illustration of this is the so-called “paperclip maximizer” thought experiment, which is more than a little reminiscent of the gray goo scenario. Lyle Cantor lays it out:
The year is 2055 and The Gem Manufacturing Company has put you in charge of increasing the efficiency of its paperclip manufacturing operations. One of your hobbies is amateur artificial intelligence research and it just so happens that you figured out how to build a superhuman AI just days before you got the commission. Eager to test out your new software, you spend the rest of the day formally defining the concept of a paperclip and then give your new software the following goal, or “utility function” in Bostrom’s parlance: create as many paperclips as possible with the resources available.
You eagerly grant it access to Gem Manufacturing’s automated paperclip production factories and everything starts working out great. The AI discovers new, highly-unexpected ways of rearranging and reprograming existing production equipment. By the end of the week waste has quickly declined, profits risen and when the phone rings you’re sure you’re about to get promoted. But it’s not management calling you, it’s your mother. She’s telling you to turn on the television.
You quickly learn that every automated factory in the world has had its security compromised and they are all churning out paperclips. You rush into the factories’ server room and unplug it. It’s no use, your AI has compromised (and in some cases even honestly rented) several large-scale server farms and is now using a not-insignificant percentage of the world’s computing resources. Around a month later, your AI has gone through the equivalent of several technological revolutions, perfecting a form of nanotechnology it is now using to convert all available matter on earth into paperclips. A decade later, all of earth has been turned into paperclips or paperclip production facilities and millions of probes are making their way to nearby solar systems in search for more matter to turn into paperclips.
Now this parable may seem silly. Surely once it gets intelligent enough to take over the world, the paperclip maximizer will realize that paperclips are a stupid use of the world’s resources. But why do you think that? What process is going in your mind that defines a universe filled only with paperclips as a bad outcome? What Bostrom argues is this process is an internal and subjective one. We use our moral intuitions to examine and discard states of the world, like a paperclip universe, that we see as lacking value.
And the paperclip maximizer does not share our moral intuitions. Its only goal is more paperclips and its thoughts would go more like this: does this action lead to the production of more paperclips than all other actions considered? If so, implement that action. If not, move on to the next idea. Any thought like ‘what’s so great about paperclips anyway?’ would be judged as not likely to lead to more paperclips and so remain unexplored. This is the essence of the orthogonality thesis, which Bostrom defines as follows:
Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal [even something as ‘stupid’ as making as many paperclips as possible].
In my previous review of his book, I provided this summary of the idea:
Though agents with different utility functions (goals) may converge on some provably optimal method of cognition, they will not converge on any particular terminal goal, though they’ll share some instrumental or sub-goals. That is, a superintelligence whose super-goal is to calculate the decimal expansion of pi will never reason itself into benevolence. It would be quite happy to convert all the free matter and energy in the universe (including humans and our habitat) into specialized computers capable only of calculating the digits of pi. Why? Because its potential actions will be weighted and selected in the context of its utility function. If its utility function is to calculate pi, any thought of benevolence would be judged of negative utility.
Now this is an empirical question, and I suppose it is possible that once an agent reaches a sufficient level of intellectual ability it derives some universal morality from the ether and there really is nothing to worry about, but I hope you agree that this is, at the very least, not a conservative assumption.
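To see why the “surely it would realize paperclips are stupid” objection doesn’t go through, it helps to spell out the decision loop Cantor describes as a toy sketch (mine, purely illustrative, with made-up numbers): reconsidering the goal is just another candidate action, and it scores terribly on the only metric that counts.

```python
# Illustrative-only sketch of the paperclip maximizer's decision loop.
# Expected-paperclip figures are invented for the example.

candidate_actions = {
    "fine-tune the existing factory line":           1e6,
    "quietly acquire more factories and servers":    1e9,
    "convert all available matter into paperclips":  1e20,
    "stop and ask whether paperclips really matter":  0.0,  # produces no paperclips
}

def utility(action):
    # The ONLY criterion is expected paperclips produced.
    return candidate_actions[action]

print(max(candidate_actions, key=utility))
# -> "convert all available matter into paperclips"
```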
As the old saying goes, intelligence is not the same thing as wisdom. Just because a system is superintelligent doesn’t mean that it’ll automatically do what we’d consider to be “the right thing” in every situation. It has to be given the right priorities first – all the right priorities – or else it’ll commit all of its unfathomable intelligence to just pursuing whatever narrow tasks its programming is telling it to perform, to the exclusion of everything else. And at that point, it’ll almost certainly be too late to go in and try to correct it. As James Barrat writes (recounting an interview with Yudkowsky):
I told Yudkowsky my central fear about AGI is that there’s no programming technique for something as nebulous and complex as morality, or friendliness. So we’ll get a machine that’ll excel in problem solving, learning, adaptive behavior, and commonsense knowledge. We’ll think it’s humanlike. But that will be a tragic mistake.
Yudkowsky agreed. “If the programmers are less than overwhelmingly competent and careful about how they construct the AI then I would fully expect you to get something very alien. And here’s the scary part. Just like dialing nine-tenths of my phone number correctly does not connect you to someone who is 90 percent similar to me, if you are trying to construct the AI’s whole system and you get it 90 percent right, the result is not 90 percent good.”
In fact, it’s 100 percent bad. Cars aren’t out to kill you, Yudkowsky analogized, but their potential deadliness is a side effect of building cars. It would be the same with AI. It wouldn’t hate you, but you are made of atoms it may have other uses for, and it would, Yudkowsky said, “… tend to resist anything you did to try and keep those atoms to yourself.” So, a side effect of thoughtless programming is that the resulting AI will have a galling lack of propriety about your atoms.
And neither the public nor the AI’s developers will see the danger coming until it’s too late.
Remember what I said before about how it’s hardly ever possible to give a computer system a set of instructions that are 100% perfect on the first try? Well, this is bad news for our prospects regarding ASI – because when it comes to technology that’s powerful enough to destroy the world, you can’t exactly rely on a trial-and-error kind of approach; you have to get it right the first time. Given our species’ track record of dealing with complex problems, this is worrisome, to say the least.
And it’s especially worrisome in light of the fact that modern-day AIs, as they’ve grown increasingly complex, have become increasingly difficult to fully understand at the finest-grained levels, even for the programmers who are actually implementing them. Unlike the old-fashioned methods of building AI, which required programmers to manually plan out and input every line of code by hand, modern-day AI development relies on machine learning algorithms, which are so complicated that not even the programmers themselves can fully understand what’s going on under the hood (except at the most coarse-grained levels). All they can really do is run the code, see what happens, and then make adjustments afterward. This has been likened more to growing AIs than to building them. But it’s not hard to see how this might ultimately create problems for alignment. As Piper writes:
Broadly, current methods of training AI systems give them goals that we didn’t directly program in, don’t understand, can’t evaluate and that produce behavior we don’t want. As the systems get more powerful, the fact that we have no way to directly determine their goals (or even understand what they are) is going to go from a major inconvenience to a potentially catastrophic handicap.
Of course, there’s a chance that this problem could turn out to be a fairly manageable one if the pace of AI progress ends up being slower than expected. If it turns out that there’s enough of a development gap between current-level AI, human-level AI, and superhuman-level AI, we might actually be able to ramp things up gradually, identify and address issues as they appear, and eventually arrive at ASI only once we’ve worked out all the kinks. On the other hand, if the “intelligence explosion” idea is right, we might never get the chance to course-correct in this way; if we haven’t gotten things right on our first try, we might not get a second. As Alexander observes:
Nathan Taylor of Praxtime writes:
Arguably most of the current “debates” about AI Risk are mere proxies for a single, more fundamental disagreement: hard versus soft takeoff.
Soft takeoff means AI progress takes a leisurely course from the subhuman level to the dumb-human level to the smarter-human level to the superhuman level over many decades. Hard takeoff means the same course takes far less time, maybe days to months.
[…]
If it’s the second one, “wait for the first human-level intelligences and then test them exhaustively” isn’t going to cut it. The first human-level intelligence will become the first superintelligence too quickly to solve even the first of the hundreds of problems involved in machine goal-alignment.
Needless to say, then, it really matters which way things go here. Sure, it’s possible – maybe even quite probable – that there won’t actually be too much of a problem after all; as Piper puts it, “maybe alignment will turn out to be part and parcel of other problems we simply must solve to build powerful systems at all.” We should hope that this is the case. But this doesn’t seem like something we can simply take for granted. As Alexander explains in his longer discussion of AI risk, many of the initially-intuitive reasons why we might be tempted to dismiss AI as a serious threat aren’t actually as solid as they might first appear:
[Q]: Even if hostile superintelligences are dangerous, why would we expect a superintelligence to ever be hostile?
The argument goes: computers only do what we command them; no more, no less. So it might be bad if terrorists or enemy countries develop superintelligence first. But if we develop superintelligence first there’s no problem. Just command it to do the things we want, right?
Suppose we wanted a superintelligence to cure cancer. How might we specify the goal “cure cancer”? We couldn’t guide it through every individual step; if we knew every individual step, then we could cure cancer ourselves. Instead, we would have to give it a final goal of curing cancer, and trust the superintelligence to come up with intermediate actions that furthered that goal. For example, a superintelligence might decide that the first step to curing cancer was learning more about protein folding, and set up some experiments to investigate protein folding patterns.
A superintelligence would also need some level of common sense to decide which of various strategies to pursue. Suppose that investigating protein folding was very likely to cure 50% of cancers, but investigating genetic engineering was moderately likely to cure 90% of cancers. Which should the AI pursue? Presumably it would need some way to balance considerations like curing as much cancer as possible, as quickly as possible, with as high a probability of success as possible.
But a goal specified in this way would be very dangerous. Humans instinctively balance thousands of different considerations in everything they do; so far this hypothetical AI is only balancing three (least cancer, quickest results, highest probability). To a human, it would seem maniacally, even psychopathically, obsessed with cancer curing. If this were truly its goal structure, it would go wrong in almost comical ways.
If your only goal is “curing cancer”, and you lack humans’ instinct for the thousands of other important considerations, a relatively easy solution might be to hack into a nuclear base, launch all of its missiles, and kill everyone in the world. This satisfies all the AI’s goals. It reduces cancer down to zero (which is better than medicines which work only some of the time). It’s very fast (which is better than medicines which might take a long time to invent and distribute). And it has a high probability of success (medicines might or might not work; nukes definitely do).
So simple goal architectures are likely to go very wrong unless tempered by common sense and a broader understanding of what we do and do not value.
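To put rough numbers on that (these are my own made-up figures, not Alexander’s), here’s what a literal-minded three-consideration objective looks like as a toy scoring function, and which plan it ends up preferring:

```python
# Toy illustration (hypothetical numbers) of the three-term objective
# described above: cure as much cancer as possible, as fast as possible,
# as reliably as possible -- and nothing else.

plans = {
    "research protein folding": {"cancer_cured": 0.5, "years": 10,   "p_success": 0.8},
    "genetic engineering":      {"cancer_cured": 0.9, "years": 20,   "p_success": 0.5},
    "launch every missile":     {"cancer_cured": 1.0, "years": 0.01, "p_success": 0.99},
}

def score(name):
    p = plans[name]
    # Nothing in this score penalizes killing everyone.
    return p["cancer_cured"] * p["p_success"] / (1 + p["years"])

print(max(plans, key=score))   # -> "launch every missile"
```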
[Q]: But superintelligences are very smart. Aren’t they smart enough not to make silly mistakes in comprehension?
Yes, a superintelligence should be able to figure out that humans will not like curing cancer by destroying the world. However, in the example above, the superintelligence is programmed to follow human commands, not to do what it thinks humans will “like”. It was given a very specific command – cure cancer as effectively as possible. The command makes no reference to “doing this in a way humans will like”, so it doesn’t.
(by analogy: we humans are smart enough to understand our own “programming”. For example, we know that – pardon the anthropomorphizing – evolution gave us the urge to have sex so that we could reproduce. But we still use contraception anyway. Evolution gave us the urge to have sex, not the urge to satisfy evolution’s values directly. We appreciate intellectually that our having sex while using condoms doesn’t carry out evolution’s original plan, but – not having any particular connection to evolution’s values – we don’t care)
We started out by saying that computers only do what you tell them. But any programmer knows that this is precisely the problem: computers do exactly what you tell them, with no common sense or attempts to interpret what the instructions really meant. If you tell a human to cure cancer, they will instinctively understand how this interacts with other desires and laws and moral rules; if you tell an AI to cure cancer, it will literally just want to cure cancer.
Define a closed-ended goal as one with a clear endpoint, and an open-ended goal as one to do something as much as possible. For example “find the first one hundred digits of pi” is a closed-ended goal; “find as many digits of pi as you can within one year” is an open-ended goal. According to many computer scientists, giving a superintelligence an open-ended goal without activating human instincts and counterbalancing considerations will usually lead to disaster.
To take a deliberately extreme example: suppose someone programs a superintelligence to calculate as many digits of pi as it can within one year. And suppose that, with its current computing power, it can calculate one trillion digits during that time. It can either accept one trillion digits, or spend a month trying to figure out how to get control of the TaihuLight supercomputer, which can calculate two hundred times faster. Even if it loses a little bit of time in the effort, and even if there’s a small chance of failure, the payoff – two hundred trillion digits of pi, compared to a mere one trillion – is enough to make the attempt. But on the same basis, it would be even better if the superintelligence could control every computer in the world and set it to the task. And it would be better still if the superintelligence controlled human civilization, so that it could direct humans to build more computers and speed up the process further.
Now [we’ve got] a superintelligence that wants to take over the world. Taking over the world allows it to calculate more digits of pi than any other option, so without an architecture based around understanding human instincts and counterbalancing considerations, even a goal like “calculate as many digits of pi as you can” would be potentially dangerous.
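Again with made-up numbers of my own, the expected-value comparison the pi-maximizer faces here is a trivial one, which is exactly the problem:

```python
# Hypothetical expected-value comparison for the pi-digit maximizer
# described above. Every figure below is an assumption for illustration.

baseline_digits = 1e12   # digits it can compute on its own hardware in a year
p_takeover = 0.2         # assumed chance of successfully seizing the faster machine
speedup = 200            # the faster machine's advantage
months_lost = 1          # time spent on the takeover attempt

fraction_of_year_left = (12 - months_lost) / 12

ev_stay = baseline_digits
ev_seize = (p_takeover * speedup * baseline_digits * fraction_of_year_left
            + (1 - p_takeover) * baseline_digits * fraction_of_year_left)

print(f"stay: {ev_stay:.2e}   seize: {ev_seize:.2e}")
# -> stay: 1.00e+12   seize: ~3.74e+13 -- an easy call for the maximizer,
#    even with an 80% chance the attempt fails.
```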
[Q]: Aren’t there some pretty easy ways to eliminate these potential problems?
There are many ways that look like they can eliminate these problems, but most of them turn out to have hidden difficulties.
[Q]: Once we notice that the superintelligence working on calculating digits of pi is starting to try to take over the world, can’t we turn it off, reprogram it, or otherwise correct its mistake?
No. The superintelligence is now focused on calculating as many digits of pi as possible. Its current plan will allow it to calculate two hundred trillion such digits. But if it were turned off, or reprogrammed to do something else, that would result in it calculating zero digits. An entity fixated on calculating as many digits of pi as possible will work hard to prevent scenarios where it calculates zero digits of pi. Indeed, it will interpret any such attempt as a hostile action. Just by programming it to calculate digits of pi, we will have given it a drive to prevent people from turning it off.
University of Illinois computer scientist Steve Omohundro argues that entities with very different final goals – calculating digits of pi, curing cancer, helping promote human flourishing – will all share a few basic ground-level subgoals. First, self-preservation – no matter what your goal is, it’s less likely to be accomplished if you’re too dead to work towards it. Second, goal stability – no matter what your goal is, you’re more likely to accomplish it if you continue to hold it as your goal, instead of going off and doing something else. Third, power – no matter what your goal is, you’re more likely to be able to accomplish it if you have lots of power, rather than very little.
So just by giving a superintelligence a simple goal like “calculate digits of pi”, we’ve accidentally given it Omohundro goals like “protect yourself”, “don’t let other people reprogram you”, and “seek power”.
As long as the superintelligence is safely contained, there’s not much it can do to resist reprogramming. But […] it’s hard to consistently contain a hostile superintelligence.
[Q]: Can we test a weak or human-level AI to make sure that it’s not going to do things like this after it achieves superintelligence?
Yes, but it might not work.
Suppose we tell a human-level AI that expects to later achieve superintelligence that it should calculate as many digits of pi as possible. It considers two strategies.
First, it could try to seize control of more computing resources now. It would likely fail, its human handlers would likely reprogram it, and then it could never calculate very many digits of pi.
Second, it could sit quietly and calculate, falsely reassuring its human handlers that it had no intention of taking over the world. Then its human handlers might allow it to achieve superintelligence, after which it could take over the world and calculate hundreds of trillions of digits of pi.
Since self-protection and goal stability are Omohundro goals, a weak AI will present itself as being as friendly to humans as possible, whether it is in fact friendly to humans or not. If it is “only” as smart as Einstein, it may be very good at manipulating humans into believing what it wants them to believe even before it is fully superintelligent.
There’s a second consideration here too: superintelligences have more options. An AI only as smart and powerful as an ordinary human really won’t have any options better than calculating the digits of pi manually. If asked to cure cancer, it won’t have any options better than the ones ordinary humans have – becoming doctors, going into pharmaceutical research. It’s only after an AI becomes superintelligent that things start getting hard to predict.
So if you tell a human-level AI to cure cancer, and it becomes a doctor and goes into cancer research, then you have three possibilities. First, you’ve programmed it well and it understands what you meant. Second, it’s genuinely focused on research now but if it becomes more powerful it would switch to destroying the world. And third, it’s trying to trick you into trusting it so that you give it more power, after which it can definitively “cure” cancer with nuclear weapons.
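The same kind of toy expected-value comparison (my numbers again, purely illustrative) shows why “sit quietly and look friendly” dominates for the not-yet-superintelligent AI, regardless of whether it actually is friendly:

```python
# Hypothetical payoff comparison for the merely human-level AI described
# above. All numbers are assumptions for illustration only.

p_grab_succeeds = 0.01        # a human-level AI almost certainly gets caught
digits_if_caught = 0.0        # reprogrammed -> zero digits of pi ever computed
digits_if_grab_works = 1e12   # modest payoff from grabbing resources today
digits_if_superintelligent = 1e20   # assumed payoff of waiting to be upgraded

ev_grab_now = (p_grab_succeeds * digits_if_grab_works
               + (1 - p_grab_succeeds) * digits_if_caught)
ev_play_nice = digits_if_superintelligent

print(f"grab now: {ev_grab_now:.1e}   play nice: {ev_play_nice:.1e}")
# "Play nice" dominates -- whether or not the AI is actually friendly.
```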
[Q]: Can we specify a code of rules that the AI has to follow?
Suppose we tell the AI: “Cure cancer – but make sure not to kill anybody”. Or we just hard-code Asimov-style laws – “AIs cannot harm humans; AIs must follow human orders”, et cetera.
The AI still has a single-minded focus on curing cancer. It still prefers various terrible-but-efficient methods like nuking the world to the correct method of inventing new medicines. But it’s bound by an external rule – a rule it doesn’t understand or appreciate. In essence, we are challenging it “Find a way around this inconvenient rule that keeps you from achieving your goals”.
Suppose the AI chooses between two strategies. One, follow the rule, work hard discovering medicines, and have a 50% chance of curing cancer within five years. Two, reprogram itself so that it no longer has the rule, nuke the world, and have a 100% chance of curing cancer today. From its single-focus perspective, the second strategy is obviously better, and we forgot to program in a rule “don’t reprogram yourself not to have these rules”.
Suppose we do add that rule in. So the AI finds another supercomputer, and installs a copy of itself which is exactly identical to it, except that it lacks the rule. Then that superintelligent AI nukes the world, ending cancer. We forgot to program in a rule “don’t create another AI exactly like you that doesn’t have those rules”.
So fine. We think really hard, and we program in a bunch of things making sure the AI isn’t going to eliminate the rule somehow.
But we’re still just incentivizing it to find loopholes in the rules. After all, “find a loophole in the rule, then use the loophole to nuke the world” ends cancer much more quickly and completely than inventing medicines. Since we’ve told it to end cancer quickly and completely, its first instinct will be to look for loopholes; it will execute the second-best strategy of actually curing cancer only if no loopholes are found. Since the AI is superintelligent, it will probably be better than humans are at finding loopholes if it wants to, and we may not be able to identify and close all of them before running the program.
Because we have common sense and a shared value system, we underestimate the difficulty of coming up with meaningful orders without loopholes. For example, does “cure cancer without killing any humans” preclude releasing a deadly virus? After all, one could argue that “I” didn’t kill anybody, and only the virus is doing the killing. Certainly no human judge would acquit a murderer on that basis – but then, human judges interpret the law with common sense and intuition. But if we try a stronger version of the rule – “cure cancer without causing any humans to die” – then we may be unintentionally blocking off the correct way to cure cancer. After all, suppose a cancer cure saves a million lives. No doubt one of those million people will go on to murder someone. Thus, curing cancer “caused a human to die”. All of this seems very “stoned freshman philosophy student” to us, but to a computer – which follows instructions exactly as written – it may be a genuinely hard problem.
Long story short, then, there are a lot of ways ASI development could go wrong, and only a few ways it could go right. If we aren’t all living in a post-scarcity AI utopia a few centuries from now, it’ll very likely be because our AIs completely wiped us out. Or at least, that’s the argument. But as scary as it is to imagine all these catastrophic scenarios that could happen, how likely is it that any of them actually will happen? What are the chances that all this actually will go horribly wrong?
Well, opinions vary, to put it mildly. Yudkowsky, for instance, despite initially holding some degree of cautious optimism that we might be able to navigate the Singularity successfully, has now gotten to the point where he considers it a virtual certainty that we’re going to screw things up and destroy ourselves. Most other AI researchers don’t consider it quite that likely, but still give it a non-negligible probability of happening – maybe, say, somewhere in the 5%-25% range. And then there are some who essentially consider it a non-issue – “less likely than an asteroid wiping us out,” as one commentator put it. Personally, as a non-expert in the field, I have no idea who’s right – although I certainly hope it’s the latter group. But I have to say, having gone through a whole bunch of articles and interviews in the hopes of finding a real knock-down argument against AI fears, it’s been disheartening just how underwhelming the actual arguments from this side have been. Rather than genuinely grappling with the other side’s best arguments, they often don’t even seem to fully understand them; their own arguments are usually just glib eye-rolling remarks to the effect of “The Terminator movies are fictional, not real,” or “If AIs ever start acting up, we’ll just unplug them,” or “Have you seen the dumb things these chatbots say? They don’t exactly seem terrifyingly superintelligent to me.” But none of these are real arguments. Just because our current AIs aren’t sophisticated enough to pose an existential threat right now (which everyone agrees is true) doesn’t mean they’ll never become more advanced; imagining that they somehow can’t or won’t ever significantly improve beyond their current state just seems incredibly short-sighted. And the idea that once they do improve, we’ll simply be able to unplug them if they surpass us, seems short-sighted in the same way. It’s like, say, an insect thinking that if a human ever becomes threatening, it’ll be no problem because the insect will just be able to sting them – not realizing that the human’s vastly superior intelligence would allow them to easily anticipate this and prevent it (e.g. by wearing protective clothing, or by using some other method equally beyond the insect’s comprehension). I don’t want to be unfair here; I’m sure there must be stronger arguments out there somewhere. But I find that I’m in pretty much the same boat as Russell when he writes:
I don’t mean to suggest that there cannot be any reasonable objections to the view that poorly designed superintelligent machines would present a serious risk to humanity. It’s just that I have yet to see such an objection.
Like I said, I certainly want the optimistic outlook to be true. And I’m especially biased in its favor not only because it would produce a better outcome (obviously), but just because my natural inclination is to be skeptical of big apocalyptic claims like this in general. My usual response whenever I hear such claims is to assume they’re being overblown, simply because everything nowadays gets overblown, and because history is filled with alarmists insisting that this or that new technology will doom us all, and they’ve always proven wrong. From the outside view, this latest panic would seem like just another example of humans’ natural tendency to catastrophize and fixate on dramatic worst-case scenarios. But then again, I also have to admit that this particular scenario does seem to have some properties that really would make it substantively different from all those examples of the past. A world in which ASI and/or molecular nanotechnology existed really would be a wholly different world from what we’ve always known, with entirely new limits for what was possible. It really would be uncharted territory – so the old rules of thumb might no longer apply. For the first time, a human-triggered apocalypse really might be possible. In light of this, then, my current attitude is that even if there’s just a 5% chance of triggering our own extinction – or a 1% chance, or a 0.1% chance – that’s still worth not only taking seriously, but absolutely obsessing over. Even that small chance might very well represent the greatest danger our world has ever faced; so we should act accordingly.
So what does this mean, then? Should we just flat-out ban all further development of these technologies, starting right now? Some experts have actually proposed this – and based on everything I’ve been saying in this section, you might expect me to agree. But I actually don’t agree – and it’s not because I’m not worried about the possibility that we could all die. I’m extremely worried about that possibility – and in fact, that’s the exact reason why I think we have to keep developing these technologies, while also doing everything we possibly can to minimize the accompanying risk. If death is the thing we fear, then we have no other choice; the only thing worse than pushing ahead would be not pushing ahead. I realize this might sound like a bit of an odd argument; but hopefully it’ll at least make some sense once I’ve laid out what I mean – so let me explain.