If there’s one thing both sides of the AI risk debate can agree on, it’s that all of us dying would be a bad outcome. This isn’t to say that everyone agrees that death is bad per se – you’ll recall some of the pro-mortality arguments I mentioned earlier (about how death is a natural part of life, how it gives life meaning, etc.). But hopefully you’ll also recall some of the counterarguments I gave that no, really, death is in fact bad. And now (for reasons that will become clear shortly) I just want to add one more counterargument to the list – specifically this one from Alexander, in which he discusses recent efforts by biologists like David Sinclair to reverse the process of aging and “natural” death in humans:
Is stopping aging desirable?
Sinclair thinks self-evidently yes. He tells the story of his grandmother – a Hungarian Jew who fled to Australia to escape communist oppression. She was adventurous, “young at heart”, and “she did her damnedest to live life with the spirit and awe of a child”. Sinclair remembers her as a happy person and free spirit who was always there for him and his family during their childhood in the Australian outback.
And her death was a drawn-out torture:
By her mid-80s, Vera was a shell of her former self, and the final decade of her life was hard to watch…Toward the end, she gave up hope. ‘This is just the way it goes’, she told me. She died at the age of 92…but the more I have thought about it, the more I have come to believe that the person she truly was had been dead many years at that point.
Sinclair’s mother didn’t have an easy time either:
It was a quick death, thankfully, caused by a buildup of liquid in her remaining lung. We had just been laughing together about the eulogy I’d written on the trip from the United States to Australia, and then suddenly she was writhing on the bed, sucking for air that couldn’t satisfy her body’s demand for oxygen, staring at us with desperation in her eyes.
I leaned in and whispered into her ear that she was the best mom I could have wished for. Within a few minutes, her neurons were dying, erasing not just the memory of my final words to her but all of her memories. I know some people die peacefully. But that’s not what happened to my mother. In those moments she was transformed from the person who had raised me into a twitching, choking mass of cells, all fighting over the last residues of energy being created at the atomic level of her being.
All I could think was “No one ever tells you what it is like to die. Why doesn’t anyone tell you?”
It would be facile to say “and that’s what made him become an anti-aging researcher”. He was already an anti-aging researcher at that point. And more important, everyone has this experience. If seeing your loved ones fade into shells of their former selves and then die painfully reliably turned you into an anti-aging researcher, who would be left to do anything else?
So his first argument is something like “maybe the thing where we’re all forced to watch helplessly as the people we love the most all die painfully is bad, and we should figure out some solution”. It’s a pretty compelling argument, one which has inspired generations of alchemists, mystics, and spiritual seekers.
[…]
But his second argument is: we put a lot of time and money into researching cures for cancer, heart disease, stroke, Alzheimer’s, et cetera. Progress in these areas is bought dearly: all the low-hanging fruit has been picked, and what’s remaining is a grab bag of different complicated things – lung cancer is different from colon cancer is different from bone cancer.
The easiest way to cure cancer, Sinclair says, is to cure aging. Cancer risk per year in your 20s is only 1% what it is in your 80s. Keep everyone’s cells as healthy as they are in a 20-year-old, and you’ll cut cancer 99%, which is so close to a cure it hardly seems worth haggling over the remainder. As a bonus, you’ll get similar reductions in heart disease, stroke, Alzheimer’s, et cetera.
But also […] Sinclair thinks curing aging is easier than curing cancer. For one thing, aging might be just one thing, whereas cancer has lots of different types that need different strategies. For another, total cancer research spending approaches the hundreds of billions of dollars, whereas total anti-aging spending is maybe 0.1% of that. There’s a lot more low-hanging fruit!
And also, even if we succeed at curing cancer, it will barely matter on a population level. If we came up with a 100% perfect cure for cancer, average US life expectancy would increase two years – from 80 to 82. Add in a 100% perfect cure for heart disease, and you get 83. People mostly get these diseases when they are old, and old people are always going to die of something. Cure aging, and the whole concept of life expectancy goes out the window.
There are a lot of people who get angry about curing aging, because maybe God didn’t mean for us to be immortal, or maybe immortal billionaires will hog all the resources, or [insert lots of other things here]. One unambitious – but still potentially true – counterargument to this is that a world where we conquered aging, then euthanized everyone when they hit 80, would still be infinitely better than the current world where we age to 80 the normal way.
But once you’ve accepted this argument, there are some additional reasons to think conquering death would be good.
First, the environmental sustainability objection isn’t really that strong. If 50% of people stopped dying (maybe some people refuse the treatment, or can’t afford it), that would increase the US population by a little over a million people a year over the counterfactual where people die at the normal rate. That’s close to the annual number of immigrants. If you’re not worried about the sustainability of immigration, you probably shouldn’t worry about the sustainability of ending death.
You can make a similar argument for the world at large: life expectancy is a really minimal driver of population growth. The world’s longest-lived large country, Japan, currently has negative population growth; the world’s shortest-lived large country, Somalia, has one of the highest population growth rates in the world. If 25% of the world population took immortality serum (I’m decreasing this from the 50% for USA because I’m not even sure 50% of the world’s population has access to basic antibiotics), that would increase world population by 15 million per year over the counterfactual. It would take 60 years for there to even be an extra billion people, and in 60 years a lot of projections suggest world population will be stable or declining anyway. By the time we really have to worry about this we’ll either be dead or colonizing space.
Second, life expectancy at age 10 (ie excluding infant mortality) went up from about 45 in medieval Europe to about 85 in modern Europe. What bad things happened because of this? Modern Europe is currently in crisis because it has too few people and has to import immigrants from elsewhere in the world. And the increase didn’t cause some kind of stagnation where older people prevented society from ever changing. It didn’t cause some sort of perma-dictatorship where old people refuse to let go of their resources and the young toil for scraps. It corresponded to the period of the most rapid social and economic progress anywhere in history.
Would Europe be better off if the government killed every European the day they turned 45? If not, it seems like the experiment with extending life expectancy from 45 to 85 went pretty well. Why not try the experiment of extending life expectancy from 85 to 125, and see if that goes well too?
And finally, what’s the worst that could happen? An overly literal friend has a habit of always answering that question with “everyone in the world dies horribly”. But in this case, that’s what happens if we don’t do it. Seems like we have nowhere to go but up!
I think all of this is spot-on – but in particular, I think that last point is worth giving a long hard look in the context of the whole AI risk debate. The main argument for avoiding ASI is that it might lead to all of us dying – but the thing is, “all of us dying” is the outcome that will happen if we don’t get ASI. It’s not just a strong possibility; it’s what will definitely, 100% happen to every single one of us if we never reach the Singularity. (True, we could use biotechnology alone to extend our lifespans by quite a bit without turning to ASI, as Alexander describes; but as other commentators like Arvin Ash and Holger von Jouanne-Diedrich point out, even if we figured out how to reverse aging and cure every disease, that’d still only give the average person a few centuries before they died in an accident or a natural disaster or something like that. For true immortality (i.e. immortality lasting as long as the universe itself lasts), we’d need ASI and nanotechnology and the whole rest of the package, or something equivalent.) What that means, then, is that our choice of whether or not to pursue the Singularity isn’t actually a choice between “definitely stay alive” (if we decide not to risk it) versus “maybe die” (if we do) – it’s a choice between “stay alive for a few more years but then definitely die” (if we don’t go for it) versus “maybe die or maybe unlock immortality” (if we do).
Of course, if the AI safety skeptics are actually right and our odds of successfully navigating the Singularity are particularly low – like, lower than 50% even in the best-case scenario where we’ve taken every possible precaution and implemented every conceivable safeguard – then the idea that we should still go for it anyway becomes a much harder pill to swallow. As Bryan Caplan illustrates with a thought experiment:
Suppose you receive the following option.
- You flip a fair coin.
- If the coin is Heads, you acquire healthy immortality.
- If the coin is Tails, you instantly die.
The expected value of this option seems infinite: .5*infinity + 0 is still infinity, no? Even if you apply diminishing marginal utility to life itself, it’s hard to imagine that the rest of your natural life outweighs a 50% shot of eternity… especially if you remember that many of your actual years are unlikely to be healthy.
Nevertheless, I suspect that almost no one would take this deal. Even I shudder at the possibility. So what gives?
Caplan is right that most people would probably balk at the idea of flipping the coin, just because the immediacy of the threat of death would overwhelm all other considerations. What’s interesting, though, as David Henderson points out in the replies, is that if you asked them whether they’d do it when they were in their 90s, near the very end of their lifespan, the offer would suddenly become a lot more appealing, and most people probably would accept it then. With so little time left for them to potentially lose, the downside of flipping the coin would suddenly seem like much less of an issue. I also suspect that for the same reason, if you told people that their natural lifespan would only be, say, two days, and they’d be offered the coin flip when they were one day old, most of them would probably take the offer in that scenario as well. Even though it would mean flipping the coin just halfway through their natural lifespans, they’d still want to do it just because there would be so little time left for them even if they declined the offer. The deciding factor, in other words, wouldn’t necessarily be how far they were through their lifespans; it would just be how close they were to death in absolute terms.
But here’s the thing: The situation we’re in right now, where most of us have maybe 30-40 years left in our natural lifespans (if we’re lucky), is essentially that same scenario. In absolute terms, 30-40 years is very close to death. The fact that we naturally only live for 80 or so years makes us feel like 30-40 years is quite a long time (in much the same way that a mayfly probably feels like 24 hours is a long time) – but considering things in the context of a potential post-Singularity world, 30-40 years is barely a blip. Our true potential lifespans in a post-Singularity world would be measured in eons, not decades – so from the perspective of a trillion-year-old post-Singularity transhuman looking back on our present day, the notion that a 40-year-old might decline the coin flip just so they could be assured of living for another 30-40 years would seem as absurd as a 90-year-old declining it on their deathbed just so they could be assured of living for another day or two, or a mayfly declining the offer just so it could be assured of living for a few more hours.
Ultimately, then, if our goal is really to avoid death, we have no better option than to take the gamble, even if the odds aren’t especially favorable. No doubt, the downside risk is unspeakably massive; in the worst-case scenario where we completely blow it and inadvertently wipe out the entire species, eight billion people who would otherwise have lived for another 40 years on average will instead be killed instantly. Having said that, though, the fact that each of those people would only have been expected to live for another 40 years or so would mean that even this total extinction event would “only” equate to the destruction of about 320 billion years of human life – whereas if we actually managed to carry off the Singularity successfully, it would mean vastly more than 320 billion years of human life would be gained, since each of us who would otherwise have died in those 40 or so years would now be able to live for another trillion trillion years if we wanted to. In other words, having the opportunity to pursue the Singularity, but choosing never to do so because of the risks, would mean the loss of trillions upon trillions of potential life-years – orders of magnitude more than the mere hundreds of billions that would be lost in the instant-doom scenario. Accordingly, what the pure utility calculus would suggest is that we really should be trying our hardest to reach the Singularity, even despite the existential risks.
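For concreteness, here’s that comparison as a quick back-of-the-envelope calculation, taking the round figures above at face value and conservatively crediting each survivor with “only” a trillion post-Singularity years apiece rather than a trillion trillion:

```latex
\begin{align*}
\text{Lost if we go extinct:}\quad & 8\times10^{9}\ \text{people} \times 40\ \text{years} \approx 3.2\times10^{11}\ \text{life-years}\\
\text{Gained if we succeed:}\quad & 8\times10^{9}\ \text{people} \times 10^{12}\ \text{years} \approx 8\times10^{21}\ \text{life-years}\\
\text{Ratio:}\quad & \frac{8\times10^{21}}{3.2\times10^{11}} \approx 2.5\times10^{10}
\end{align*}
```

On those (admittedly made-up) round numbers, the potential upside outweighs the worst-case downside by a factor of tens of billions.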
(Of course, you might object to this logic on the grounds that it doesn’t fully acknowledge all that would actually be lost if humanity were completely wiped out. By only counting life-years destroyed, it’s not accounting for all the potential life-years that would never be realized if we destroyed ourselves, because in doing so we would also be destroying the possibility for any future generations. But as I discussed in my metaethics post, the argument for ascribing moral value to purely hypothetical people who never actually come into existence isn’t one that can ultimately hold up, simply because it’s not possible to do moral harm to people who never come into existence in the first place. In order for something to be harmful, it has to actually be harmful to somebody. So any system of ethics that counts it as a moral harm to violate some potential person’s hypothetical preference to be brought into existence is ultimately untenable, and leads to failure modes like the Repugnant Conclusion and so on. You might want to insist that nevertheless, there would still be some abstract sense in which the universe would just innately be worse off without any living beings in it. But again, in such a universe, there wouldn’t be anyone around for anything to actually be worse for – it would all just be uncaring rocks and gas clouds – so the concept would cease to have any meaning (except inasmuch as it would be thwarting our present-day desires for our species not to go extinct). I won’t rehash the whole argument here; but like I said, you can see the metaethics post for a fuller explanation if you’re not entirely convinced.)
So okay then, if reaching the Singularity really should be our driving goal, then does this mean just trying to get there as fast as we can, even at the expense of safety? You might be tempted to argue, based on the above logic, that if we could reach the Singularity even just one year earlier by forgoing some safety precautions, it would mean that the 60 million people who would otherwise have died in that year would instead be able to live for trillions of years – a gain that would so vastly outweigh the potential downside risk of losing 320 billion life-years in an AI-induced extinction event that it would be worth rushing the process even if it meant accepting a significantly higher risk level. In fact, as it happens, this was the attitude that I myself held up until very recently.

But I changed my mind after seeing a counterargument from commenter LostaraYil21, who pointed out that the massive gain in life-years we could potentially attain in the best-case scenario is in fact all the more reason why we shouldn’t rush the process, but should instead spend however long it takes to make sure that our odds are as high as they can possibly be. After all, if (let’s say) we rush things and thereby cause there to be a 50-50 chance of either absolute extinction or everyone gaining an extra trillion years in life expectancy, that would equate to an expected value of (0.5)*(8 billion trillion life-years added) - (0.5)*(320 billion life-years lost), for a net total of ~3.9999 billion trillion expected life-years added – whereas if we take an extra year to improve safety by even just one percentage point, making it a 51-49 chance of survival, that would make it an expected value of (0.51)*(7.94 billion trillion life-years added) - (0.49)*(320 billion life-years lost) (7.94 billion rather than 8 billion, because the 60 million people who die during that extra year no longer share in the gain), for a total of ~4.05 billion trillion expected life-years added – an unfathomably massive difference. So even though 60 million people would die during that one extra year of development and would never get to enjoy all those extra trillions of life-years, their loss would still be outweighed by the increased chance we’d be giving ourselves that we wouldn’t all be forced to miss out on all those extra life-years. The upside of ensuring that we reached the Singularity safely would just be so massive that it would be worth it even if it meant that millions of people would have to miss the boat.

Now, naturally, at some point all the safety measures would have to reach a point of diminishing returns, where it would no longer be possible to come up with any further safety improvements within a time frame that would make them worthwhile. So at some point we’d just have to pull the trigger and go for it. The point here, though, is just that we’d need to make damn sure we were taking every possible measure to maximize our expected value in doing so. In other words, there are two goals that we need to simultaneously bear in mind when approaching this problem: (1) we must not waste even a single moment in trying to reach the Singularity, lest immeasurable quantities of human life be lost; and (2) the most crucial part of successfully accomplishing that goal is to make sure we aren’t destroyed in the process – so we must do everything within our power to ensure that the transition is a safe one. That is to say, there’s no time to waste – but time spent maximizing safety is not wasted. It might seem like these two goals are in tension, but they’re both absolutely critical.
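To make that expected-value arithmetic easy to check, here’s a minimal sketch of the calculation in code. Every quantity in it is just one of the round assumptions from the paragraph above (8 billion people, 60 million deaths per year of delay, a trillion extra years apiece on success, 40 remaining years apiece otherwise); it’s an illustration of the logic, not a forecast:

```python
# Back-of-the-envelope expected-value comparison, using only the round figures above.

PEOPLE = 8e9            # current world population
DEATHS_PER_YEAR = 60e6  # people who die during each extra year spent on safety
YEARS_GAINED = 1e12     # assumed post-Singularity lifespan added per survivor
YEARS_REMAINING = 40    # average natural lifespan remaining per person today

def expected_life_years(p_success, years_of_delay):
    """Expected life-years added, relative to never pursuing the Singularity at all."""
    survivors = PEOPLE - DEATHS_PER_YEAR * years_of_delay  # 7.94e9 after one extra year
    gain_if_success = survivors * YEARS_GAINED             # life-years added if it goes well
    loss_if_doom = PEOPLE * YEARS_REMAINING                # ~3.2e11 life-years erased if it doesn't
    return p_success * gain_if_success - (1 - p_success) * loss_if_doom

rushed = expected_life_years(p_success=0.50, years_of_delay=0)   # ~4.00e21
careful = expected_life_years(p_success=0.51, years_of_delay=1)  # ~4.05e21

print(f"rushed:  {rushed:.4e} expected life-years")
print(f"careful: {careful:.4e} expected life-years")
print(f"extra year of safety is worth {careful - rushed:.2e} expected life-years")
```

On these numbers, the slower-but-safer option comes out ahead by roughly 0.05 billion trillion expected life-years, which is the counterargument in numerical form.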
It’s like if, say, there were some extreme emergency scenario in which a massive global pandemic was rapidly killing off the population, and the only cure was some exotic mineral from the surface of the moon, and we only had enough material to build one rocketship to go up and retrieve it. In this scenario, we’d be under immense pressure to launch the ship as quickly as we possibly could, so no one would have to needlessly die due to our dawdling – but also, we’d only have one shot at getting it right; if the ship malfunctioned and blew up, we wouldn’t get a second chance. So while it would be morally imperative not to delay the launch for even a moment longer than necessary, it would be even more imperative to make sure the ship was safe and wouldn’t blow up, even if that meant losing some additional lives in the short term because we took slightly longer to fully foolproof it. It wouldn’t be easy to insist on doing our full due diligence while our loved ones were dropping like flies all around us – but it would be the right call in the end. It would be crucial to do the job quickly, but it would be even more crucial to do it right.
Now, having said all that, there are a couple of possibilities – extremely remote ones, in my view, but possibilities nonetheless – that could completely invalidate this whole line of reasoning. For starters, this whole argument for not wasting any time has been predicated on the assumption that death is irreversible and can never be undone; but if it turned out that it was somehow possible for a sufficiently advanced ASI to bring people back to life – not just to make perfect recreations of past people, but to fully restore their actual bodies and brains, atom for atom, even after they’d long since turned to dust – then obviously that would change things quite a bit. Needless to say, this would require the ASI to be impossibly powerful; I’m imagining that it would literally have to be something like Laplace’s Demon, capable of somehow knowing the position and momentum of every particle in the universe, and then tracing them backward through time to determine the precise atomic makeup of everyone who’d ever died, then bringing each person’s atoms back together again to perfectly reconstruct them in the present. And this is something I have a hard time imagining even the most advanced ASI would be able to pull off, for a whole host of reasons (not least of which is that calculating the path of every particle in the universe would presumably require as much computing power as existed in the universe itself). But if a future ASI could actually figure it out – and I’d hesitate to put anything past an entity with an IQ millions of times greater than my own, even if it involved circumventing the laws of physics somehow – then that would negate everything I’ve been saying about the importance of reaching the Singularity as quickly as safely possible. There would no longer be any urgency to push ahead in order to save people from being lost forever, because that risk would no longer exist; anyone who died before the Singularity could just be brought back later. In that scenario, the only important thing would simply be reaching the Singularity as safely as possible regardless of how long it took. And that would mean that instead of pushing steadily ahead, we’d want to go as slowly as possible – not just moderately slowly, but absurdly slowly – just so we could be sure we were doing everything as safely as we possibly could. There would no longer be any time pressure, so we’d be able to take our time and make sure we were getting everything absolutely perfectly right. Obviously, this would be the best-case scenario of all – so if it could be somehow shown that such an outcome would actually be possible, even in theory, I’d consider it the best news imaginable. But like I said, I’m not counting on it. I think that whether we like it or not, we’re going to have to save people from death before they die if we’re going to save them at all – and that means moving more quickly than the bare minimum.
…Unless, that is, there’s no chance at all that we’ll actually be able to achieve a positive Singularity, even under the best possible conditions. That’s the second big counterargument that could invalidate everything I’ve been saying here: If the people who are most pessimistic about ASI are actually right, and there’s literally zero chance we’ll be able to create ASI without it backfiring on us, then it should go without saying that we shouldn’t pursue it at all, and should completely stop AI development before it reaches the point of no return. To be sure, the case they’ve made for the reality of AI risk is an extremely strong one; they’ve certainly convinced me, at least, that it’s the biggest potential threat our species has ever faced. That being said, though, I’m still not convinced that it’s absolutely inevitable that continued AI development will necessarily lead to doom with near-total certainty. My overall impression (and again, I really want to stress how much I’m not an expert here, but just giving my general sense of things from the outside view) is that it is in fact possible to build an advanced AI that doesn’t decide to wipe us all out – and that we may even have a very good chance of doing so. It strikes me as entirely plausible, for instance, that once we’ve progressed far enough in our AI development, it’ll start to become apparent that there’s no feasible way to have a narrow AI make the leap to full AGI without it forming something like what we’d call “common sense” in the process; to repeat Piper’s line from earlier, “maybe alignment will turn out to be part and parcel of other problems we simply must solve to build powerful systems at all.” Or maybe it’ll turn out that we don’t actually have to figure out the alignment problem for non-human ASIs in the first place, because our first ASIs will come from emulations of scanned-and-uploaded human brains rather than being coded entirely from scratch. I could imagine a scenario in which, say, we continue to develop AI normally over these next few years, right up until the point where we haven’t quite achieved full AGI but have progressed far enough that AIs are able to figure out how to build technology to perfectly scan human brains and make digital replicas of them – and then at that point, we have those digital human brains bootstrap themselves into full ASIs, and we achieve superintelligence without ever having to run the risk of non-human ASI misalignment. (As Paul Christiano has suggested, this might not even need to involve recreating digital brains from the bottom up by mapping out all the individual neurons; instead, we might just give an advanced AI a bunch of neuroimaging data from a specific human brain, along with instructions to create some kind of digital model from scratch that produces those same outputs, and it’ll turn out that the simplest model it can create that meets the criteria is, in fact, an emulation of that very human brain – so the AI will have reverse-engineered the brain from the top down without ever even understanding it at the base level.) Or maybe neither of those scenarios will happen, and we actually will have to figure out the alignment problem, but we’ll be able to do so successfully by tackling it from some creative angle that turns it from an overwhelming all-or-nothing challenge to one that’s much more manageable and forgiving. 
Russell’s three principles for beneficial AI, for instance, and the inverse reinforcement learning approach built around them (as summarized here by Wikipedia), seem like a genuinely promising example of the kind of approach that could actually work:
Russell begins by asserting that the standard model of AI research, in which the primary definition of success is getting better and better at achieving rigid human-specified goals, is dangerously misguided. Such goals may not reflect what human designers intend, such as by failing to take into account any human values not included in the goals. If an AI developed according to the standard model were to become superintelligent, it would likely not fully reflect human values and could be catastrophic to humanity.
[…]
Russell [instead] proposes an approach to developing provably beneficial machines that focus on deference to humans. Unlike in the standard model of AI, where the objective is rigid and certain, this approach would have the AI’s true objective remain uncertain, with the AI only approaching certainty about it as it gains more information about humans and the world. This uncertainty would, ideally, prevent catastrophic misunderstandings of human preferences and encourage cooperation and communication with humans.
[…]
Russell lists three principles to guide the development of beneficial machines. He emphasizes that these principles are not meant to be explicitly coded into the machines; rather, they are intended for human developers. The principles are as follows:
- The machine’s only objective is to maximize the realization of human preferences.
- The machine is initially uncertain about what those preferences are.
- The ultimate source of information about human preferences is human behavior.
The “preferences” Russell refers to “are all-encompassing; they cover everything you might care about, arbitrarily far into the future.” Similarly, “behavior” includes any choice between options, and the uncertainty is such that some probability, which may be quite small, must be assigned to every logically possible human preference.
Russell explores inverse reinforcement learning, in which a machine infers a reward function from observed behavior, as a possible basis for a mechanism for learning human preferences.
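To make the basic mechanism concrete, here’s a toy sketch of what “inferring a reward function from observed behavior” can look like. Everything in it – the candidate hypotheses, the reward numbers, the “noisily rational human” assumption – is made up for illustration; it’s the general Bayesian shape of the idea, not anything from Russell’s actual systems:

```python
# Toy Bayesian inverse reinforcement learning: maintain a probability distribution
# over candidate reward functions the human might have, and update it each time
# we observe the human choose one option over others.
import math

# Hypotheses about the human's hidden reward function (values are made up).
HYPOTHESES = {
    "wants_paperclips":       {"paperclip": 1.0, "stapler": 0.1, "nothing": 0.0},
    "wants_staplers":         {"paperclip": 0.1, "stapler": 1.0, "nothing": 0.0},
    "wants_to_be_left_alone": {"paperclip": 0.0, "stapler": 0.0, "nothing": 1.0},
}

def choice_likelihood(chosen, options, rewards, rationality=3.0):
    """Probability a noisily rational (Boltzmann) human picks `chosen` out of `options`."""
    weights = {o: math.exp(rationality * rewards[o]) for o in options}
    return weights[chosen] / sum(weights.values())

def update(posterior, chosen, options):
    """One Bayesian update of the distribution over reward hypotheses."""
    unnormalized = {name: posterior[name] * choice_likelihood(chosen, options, HYPOTHESES[name])
                    for name in HYPOTHESES}
    total = sum(unnormalized.values())
    return {name: weight / total for name, weight in unnormalized.items()}

# Start maximally uncertain, then watch the human pick a paperclip twice in a row.
posterior = {name: 1 / len(HYPOTHESES) for name in HYPOTHESES}
for observation in ["paperclip", "paperclip"]:
    posterior = update(posterior, observation, options=["paperclip", "stapler", "nothing"])

print(posterior)  # most of the mass shifts to "wants_paperclips", but never all of it
```

The structural point is the final comment: the machine’s beliefs about what you want can become very confident, but they never collapse into the rigid certainty that the standard model starts with.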
Alexander expands on this a bit:
If it’s important to control AI, and easy solutions like “put it in a box” aren’t going to work, what do you do?
[Reading Russell’s response to this question will be] exciting for people who read Bostrom but haven’t been paying attention since. Bostrom ends by saying we need people to start working on the control problem, and explaining why this will be very hard. Russell is reporting all of the good work his lab at UC Berkeley has been doing on the control problem in the interim – and arguing that their approach, Cooperative Inverse Reinforcement Learning, succeeds at doing some of the very hard things. If you haven’t spent long nights fretting over whether this problem was possible, it’s hard to convey how encouraging and inspiring it is to see people gradually chip away at it. Just believe me when I say you may want to be really grateful for the existence of Stuart Russell and people like him.
Previous stabs at this problem foundered on inevitable problems of interpretation, scope, or altered preferences. In Yudkowsky and Bostrom’s classic “paperclip maximizer” scenario, a human orders an AI to make paperclips. If the AI becomes powerful enough, it does whatever is necessary to make as many paperclips as possible – bulldozing virgin forests to create new paperclip mines, maliciously misinterpreting “paperclip” to mean uselessly tiny paperclips so it can make more of them, even attacking people who try to change its programming or deactivate it (since deactivating it would cause fewer paperclips to exist). You can try adding epicycles in, like “make as many paperclips as possible, unless it kills someone, and also don’t prevent me from turning you off”, but a big chunk of Bostrom’s [book Superintelligence] was just example after example of why that wouldn’t work.
Russell argues you can shift the AI’s goal from “follow your master’s commands” to “use your master’s commands as evidence to try to figure out what they actually want, a mysterious true goal which you can only ever estimate with some probability”. Or as he puts it:
The problem comes from confusing two distinct things: reward signals and actual rewards. In the standard approach to reinforcement learning, these are one and the same. That seems to be a mistake. Instead, they should be treated separately…reward signals provide information about the accumulation of actual reward, which is the thing to be maximized.
So suppose I wanted an AI to make paperclips for me, and I tell it “Make paperclips!” The AI already has some basic contextual knowledge about the world that it can use to figure out what I mean, and my utterance “Make paperclips!” further narrows down its guess about what I want. If it’s not sure – if most of its probability mass is on “convert this metal rod here to paperclips” but a little bit is on “take over the entire world and convert it to paperclips”, it will ask me rather than proceed, worried that if it makes the wrong choice it will actually be moving further away from its goal (satisfying my mysterious mind-state) rather than towards it.
Or: suppose the AI starts trying to convert my dog into paperclips. I shout “No, wait, not like that!” and lunge to turn it off. The AI interprets my desperate attempt to deactivate it as further evidence about its hidden goal – apparently its current course of action is moving away from my preference rather than towards it. It doesn’t know exactly which of its actions is decreasing its utility function or why, but it knows that continuing to act must be decreasing its utility somehow – I’ve given it evidence of that. So it stays still, happy to be turned off, knowing that being turned off is serving its goal (to achieve my goals, whatever they are) better than staying on.
This also solves the wireheading problem. Suppose you have a reinforcement learner whose reward is you saying “Thank you, you successfully completed that task”. A sufficiently weak robot may have no better way of getting reward than actually performing the task for you; a stronger one will threaten you at gunpoint until you say that sentence a million times, which will provide it with much more reward much faster than taking out your trash or whatever. Russell’s shift in priorities ensures that won’t work. You can still reinforce the robot by saying “Thank you” – that will give it evidence that it succeeded at its real goal of fulfilling your mysterious preference – but the words are only a signpost to the deeper reality; making you say “thank you” again and again will no longer count as success.
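And here’s an equally toy sketch of the decision rule Alexander is describing – maximize an estimate of the human’s hidden goal rather than a fixed objective – with hypotheses, utilities, and probabilities that I’ve invented purely for illustration (this is the shape of the behavior, not Russell’s actual CIRL math):

```python
# Toy version of "use commands as evidence about a hidden goal": the machine acts
# only when acting is clearly best in expectation, asks when it isn't, and treats
# an attempted shutdown as evidence that its current plan is hurting the human.

# What the human might really want, and how likely the machine currently thinks each is.
posterior = {"a_few_paperclips": 0.9, "a_world_of_paperclips": 0.1}

# Estimated value to the human of each action under each hypothesis (made-up numbers);
# catastrophic misreadings are hugely negative.
UTILITY = {
    "convert_this_metal_rod":  {"a_few_paperclips": 10,         "a_world_of_paperclips": 1},
    "convert_the_whole_world": {"a_few_paperclips": -1_000_000, "a_world_of_paperclips": 100},
    "ask_the_human_first":     {"a_few_paperclips": 9,          "a_world_of_paperclips": 99},  # small delay cost
}

def expected_utility(action):
    return sum(prob * UTILITY[action][goal] for goal, prob in posterior.items())

print(max(UTILITY, key=expected_utility))  # -> "ask_the_human_first": checking is cheap, a wrong guess isn't

# Corrigibility falls out of the same logic. The human shouts "No, wait!" and reaches for
# the off switch; that protest is far more likely if the current plan is hurting them,
# so by Bayes' rule the machine now believes continuing is probably harmful:
p_protest_if_harmful, p_protest_if_fine = 0.99, 0.01
prior_harmful = 0.5  # before the protest, the machine had no strong view either way
p_harmful = (p_protest_if_harmful * prior_harmful /
             (p_protest_if_harmful * prior_harmful + p_protest_if_fine * (1 - prior_harmful)))  # = 0.99
eu_keep_going = p_harmful * -1000 + (1 - p_harmful) * 10  # ~ -990
eu_allow_shutoff = 0  # doing nothing neither helps nor hurts the hidden goal
print(eu_keep_going < eu_allow_shutoff)  # True: being switched off now best serves the human's goal
```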
Of course, Alexander adds that this approach still has a ways to go before it can be fully perfected and implemented:
All of this sounds almost trivial written out like this, but number one, everything is trivial after someone thinks about it, and number two, there turns out to be a lot of controversial math involved in making it work out (all of which I skipped over). There are also some big remaining implementation hurdles. For example, the section above describes a Bayesian process – start with a prior on what the human wants, then update. But how do you generate the prior? How complicated do you want to make things? Russell walks us through an example where a robot gets great information that a human values paperclips at 80 cents – but the real preference was valuing them at 80 cents on weekends and 12 cents on weekdays. If the robot didn’t consider that a possibility, it would never be able to get there by updating. But if it did consider every single possibility, it would never be able to learn anything beyond “this particular human values paperclips at 80 cents at 12:08 AM on January 14th when she’s standing in her bedroom.” Russell says that there is “no working example” of AIs that can solve this kind of problem, but “the general idea is encompassed within current thinking about machine learning”, which sounds half-meaningless and half-reassuring.
So it’s certainly not a foregone conclusion that this approach of inverse reinforcement learning will ultimately be successful in the long run. Still, it does seem promising enough that the idea of just throwing up our hands and declaring that we’ll never be able to resolve the alignment problem feels like premature defeatism. Russell’s approach – or something like it – actually seems like it could (and I daresay probably would) work if implemented; or if nothing else, it seems like it’s on the right track. And in fact, a number of theorists – including even Bostrom, who’s known for being wary of advanced AI in general – have offered creative speculations about how such an approach could hypothetically work. Here’s one of Bostrom’s ideas, for instance:
Suppose we write down a description of a set of values on a piece of paper. We fold the paper and put it in a sealed envelope. We then create an agent with human-level general intelligence, and give it the following final goal: “Maximize the realization of the values described in the envelope.” What will this agent do?
The agent does not initially know what is written in the envelope. But it can form hypotheses, and it can assign those hypotheses probabilities based on their priors and any available empirical data. For instance, the agent might have encountered other examples of human-authored texts, or it might have observed some general patterns of human behavior. This would enable it to make guesses. One does not need a degree in psychology to predict that the note is more likely to describe a value such as “minimize injustice and unnecessary suffering” or “maximize returns to shareholders” than a value such as “cover all lakes with plastic shopping bags.”
When the agent makes a decision, it seeks to take actions that would be effective at realizing the values it believes are most likely to be described in the letter. Importantly, the agent would see a high instrumental value in learning more about what the letter says. The reason is that for almost any final value that might be described in the letter, that value is more likely to be realized if the agent finds out what it is, since the agent will then pursue that value more effectively. The agent would also discover the convergent instrumental reasons described [earlier] — goal system integrity, cognitive enhancement, resource acquisition, and so forth. Yet, assuming that the agent assigns a sufficiently high probability to the values described in the letter involving human welfare, it would not pursue these instrumental values by immediately turning the planet into computronium and thereby exterminating the human species, because doing so would risk permanently destroying its ability to realize its final value.
We can liken this kind of agent to a barge attached to several tugboats that pull in different directions. Each tugboat corresponds to a hypothesis about the agent’s final value. The engine power of each tugboat corresponds to the associated hypothesis’s probability, and thus changes as new evidence comes in, producing adjustments in the barge’s direction of motion. The resultant force should move the barge along a trajectory that facilitates learning about the (implicit) final value while avoiding the shoals of irreversible destruction; and later, when the open sea of more definite knowledge of the final value is reached, the one tugboat that still exerts significant force will pull the barge toward the realization of the discovered value along the straightest or most propitious route.
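Bostrom’s tugboat picture can be put in the same toy terms: for an agent that’s uncertain about its final value, finding out what the envelope says has high expected value under almost every hypothesis, while irreversible moves do not. The priors, payoffs, and strategy names below are all invented for illustration:

```python
# Toy version of Bostrom's sealed-envelope agent: compare strategies by their expected
# payoff across hypotheses about what the (unread) envelope says. All numbers invented.

# Prior over what the envelope most plausibly describes.
prior = {
    "minimize_suffering":            0.55,
    "maximize_shareholder_returns":  0.40,
    "cover_lakes_with_plastic_bags": 0.05,
}

# How well each strategy would realize each possible final value (arbitrary units).
PAYOFF = {
    # Learn what the envelope says first, then pursue whichever value is actually in it.
    "learn_envelope_then_act":  {"minimize_suffering": 95, "maximize_shareholder_returns": 95,
                                 "cover_lakes_with_plastic_bags": 95},
    # Commit immediately and irrevocably to the single most probable guess.
    "act_on_best_guess_now":    {"minimize_suffering": 100, "maximize_shareholder_returns": 0,
                                 "cover_lakes_with_plastic_bags": 0},
    # Irreversibly seize resources (computronium) before learning anything: great for a
    # value indifferent to humans, ruinous for the human-centered ones.
    "grab_everything_first":    {"minimize_suffering": -1000, "maximize_shareholder_returns": -1000,
                                 "cover_lakes_with_plastic_bags": 120},
}

def expected_payoff(strategy):
    return sum(prior[value] * PAYOFF[strategy][value] for value in prior)

for strategy in PAYOFF:
    print(f"{strategy}: {expected_payoff(strategy):+.1f}")
# learn_envelope_then_act ~ +95, act_on_best_guess_now ~ +55, grab_everything_first ~ -944:
# information about the final value is instrumentally valuable under almost any hypothesis.
```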
And potential solutions like this aren’t the only ones being proposed, either. There’s a whole array of other strategies that are currently being explored, and new ideas in the space are emerging every day. No one knows which of these strategies (if any) will prove to be successful, or if it’ll turn out that none of them are even necessary in the first place because misaligned AI never ends up becoming a real threat after all. This is all unexplored territory. And while that does mean there’s a very real chance that we could be taken by surprise by some unforeseen AI-induced cataclysm, it also means we could just as easily be taken by surprise by how well things go and how effectively we’re able to address problems as they arise. It might very well turn out that the parts of AI development we don’t yet fully understand will actually end up working out in our favor.
Again, this isn’t something that we can just take for granted. If it turns out that something like inverse reinforcement learning is what ultimately allows us to achieve a successful Singularity, it’ll only be because AI researchers actually did the hard work of developing and implementing it. There’s no question that it’ll take a whole lot of effort and a whole lot of competence to make sure we get things absolutely right. But I do think we can get things right. It’s not guaranteed – not by a long shot – but it is possible.
As far as what our exact odds are in numerical terms – again, nobody can say for sure. For what it’s worth, a recent survey of experts on AI risk found that the median estimate of AI doom was about 10% (versus a 90% chance of survival), as Alexander explains:
The new paper [by] Carlier, Clarke, and Schuett (not currently public, sorry, but you can read the summary here) […] instead of surveying all AI experts, […] surveys people who work in “AI safety and governance”, ie people who are already concerned with AI being potentially dangerous, and who have dedicated their careers to addressing this. As such, they were more concerned on average than the people in previous surveys, and gave a median ~10% chance of AI-related catastrophe (~5% in the next 50 years, rising to ~25% if we don’t make a directed effort to prevent it; means were a bit higher than medians). Individual experts’ probability estimates ranged from 0.1% to 100% (this is how you know you’re doing good futurology).
Alexander concludes that it’s noteworthy that “even people working in the field of aligning AIs mostly assign ‘low’ probability (~10%) that unaligned AI will result in human extinction” – and I agree; this is certainly a lot more encouraging than if the percentages were reversed. Still, you might reasonably argue that a 10% risk of total human extinction is horrifyingly high. If you were about to board an airplane but then you found out there was a 10% chance it would crash, you would not board that plane. Personally, though, I don’t think this is actually the right analogy for our situation as a species – because like I said before, without ASI we’re all going to die anyway. To me, our current situation is more like sitting on a long conveyor belt called “mortality” that’s slowly moving toward a giant industrial shredding machine called “death”; and we don’t know what will happen to us if we jump off the conveyor belt – maybe there’s a 10% chance that it’ll turn out to be surrounded by lava or something, and jumping off will mean dying instantly instead of dying when we reach the end of the conveyor belt – but one thing we do know is that if we stay put instead of jumping off, it’s 100% certain that we’ll die. In that kind of situation, I’d consider jumping off the conveyor belt to be a risk worth taking, even with a 10% chance that we wouldn’t survive the attempt. And in fact, even if the odds were reversed and there was only a 10% chance of survival – or a 1% chance, for that matter – I’d still consider it a risk worth taking, simply because the alternative would be guaranteed death – slightly delayed death, sure, but guaranteed death nonetheless. So like I said before, it’s not a question of whether we want to risk dying now or die in 30-40 years; it’s a question of whether we want to die now (whether that’s right now or in 30-40 years – both are approximately “now”) or die in a trillion years (or however long we’d want to live in a post-Singularity world).
I realize this will be a pretty controversial way of looking at things. Even just writing it all out here, it feels like (if you’ll excuse one more analogy) sitting in a car that’s pointed toward a ramp at the edge of a massive canyon, and yelling “Floor it.” But the truth is, the side of the canyon we’re on right now isn’t safe; the specter of death is rushing toward us, and it will consume all of us if we stay put. The only hope any of us have of surviving is to make it to the other side of the canyon. So as scary as it might be, and as real as the possibility of failure undoubtedly is, I see no better option than to just knuckle down, do everything within our power to make sure our car is safe and won’t fail on us, and then grit our teeth and hit the gas. The only thing worse than building the machine that kills us all is failing to build the machine that saves us all.