Tuesday, April 12, 2016

Self-aware robots and Newcomb's paradox

In my last blog post, I discussed the Newcomb paradox and made some jokes about introducing new Swedish personal pronouns for one-boxers and two-boxers.

For those of you who didn't read that, Newcomb's problem lets you take the contents of either or both of two boxes: a small one, which is known to contain $\$$1000, and a large one, which contains either nothing or $\$$1,000,000. The catch is that a superintelligent being capable of simulating your brain has prepared the boxes, and has put a million dollars in the large box only if it predicted that you will not take the small one.

Newcomb's paradox seems to be a purely hypothetical argument about a situation that obviously cannot occur. Even a superintelligent alien or AI will never be able to predict people's choices with that accuracy. Some claim that it's fundamentally impossible for reasons of quantum randomness (nobody I talked to dismissed it by referring to free will though...).

In this post I argue that the Newcomb experiment is not only feasible, but might very well be significant to the design of intelligent autonomous robots (at least in the context of what Nick Bostrom imagines they could do in this TED talk). Finally, as requested by some readers of my previous post, I will reveal my own Newcombian orientation.

So here's the idea: instead of thinking of ourselves as the subject of the experiment, we can put ourselves in the role of the predictor. The subject will then have to be something that we can simulate with perfect accuracy. It could be a computer program or an abstract model written down on paper, but one way or another we must be able to compute its decision.

This looks simple enough. Just like a chess program chooses among the available move options and gets abstractly punished or rewarded for its decisions, we might let a computer program play Newcomb's game, trying to optimize its payoff. But that model is too simple to be of any interest. If the Newcomb game is all there is, then the experiment will favor one-boxers, punish two-boxers, and nothing more will come out of it.
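As a minimal sketch of that simple model, here is what such a program could look like in Python. The names `one_boxer`, `two_boxer`, and `play_newcomb` are made up for illustration, and the predictor is assumed to "simulate" the agent simply by calling the same deterministic function ahead of time.

```python
from typing import Callable

SMALL = 1_000        # known content of the small box
LARGE = 1_000_000    # possible content of the large box

# An agent is modeled as a deterministic function returning "one" or "two".
Agent = Callable[[], str]


def one_boxer() -> str:
    """Always takes only the large box."""
    return "one"


def two_boxer() -> str:
    """Always takes both boxes."""
    return "two"


def play_newcomb(agent: Agent) -> int:
    """Fill the boxes based on a perfect prediction, then let the agent choose."""
    # The predictor "simulates" the agent by running its code in advance.
    predicted = agent()
    large_content = LARGE if predicted == "one" else 0

    # The agent now makes its actual choice; being deterministic, it repeats itself.
    choice = agent()
    return large_content + (SMALL if choice == "two" else 0)
```

Under these assumptions, `play_newcomb(one_boxer)` returns 1,000,000 and `play_newcomb(two_boxer)` returns 1,000, which is all this model has to say.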

It becomes more interesting in the context of autonomous agents with general artificial intelligence. A robot moving around in the physical world under the control of an internal computer will have to be able to understand not only its environment, but also its own role in that environment. An intriguing question is how an autonomous artificial intelligence would perceive itself. Should it somehow be aware that it is a robot governed by a computer program?

One thing that makes a computerized intellect very different from a human is that it might be able to "see" its own program in a way that humans can't. There is no way that a human brain would be able to understand in every detail how a human brain works, and it's easy to fall into the trap of thinking that this is some sort of deep truth about how things must be. But an artificial intelligence, even a powerful one, need not necessarily have a large source code. Perhaps its code would be only a few thousand lines, maybe a few million. Its memory will almost certainly be much larger, but will have a simple organization.

So an AI could very well be able to load its source code into its memory, inspect it, and in some sense know that "this is me". The actual computations that the program is able to perform (like building a neural network based on everything on the internet) could require several orders of magnitude more memory than is taken up by the source code. Still, an AI might very well have a basic organization simple enough for complete introspection.

So should the robot be "self-aware"? The first answer that comes to mind might be yes, because that seems to make it more intelligent, and perhaps better at interacting with people. After all, we require it to have a basic understanding of other physical objects and agents, so why should it have a blind spot stopping it from understanding itself?

But suppose the robot is asked to take a decision in order to optimize something. Then it had better believe that there are different options. If a self-driving car becomes aware of itself and the role of its program, then it might (correctly!) deduce that if it chooses to crash the car, then that choice was implicitly determined by its code, and therefore that's what it was supposed to do. And by the way don't tell me it could reprogram itself, because it couldn't (although Eliezer Yudkowsky at Less Wrong seems to have a different opinion). Whatever it was programmed to do doesn't count as reprogramming itself. Also it doesn't need to, because of Turing universality.

It's not that the AI would think that crashing the car was better than any other option. The problem is that if it becomes aware of being a deterministic process, then the concept of having different options loses its meaning. It's true that this argument doesn't provide a reason for crashing the car rather than doing something else, but I'm not sure I would feel safe in a vehicle driven by an AI that starts chatting about free will versus determinism.

Perhaps an agent with self-awareness and reasoning capabilities must be programmed with a counterfactual idea of having "free will" in order to prevent this sort of disaster? Spooky...

There are other problems too with self-aware algorithms. One is the old "Gödelization" trick. Suppose an AI has the task of inspecting computer programs and warning if there is a possibility that they might "freeze" by going into an infinite computation. This is a famous unsolvable problem (the halting problem, and yes that is unsolvable, not unsolved), but we can instruct the AI that it has the right to answer "I don't know" if it can't figure out the answer, just as long as it is always right when it does answer.
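To see that the right to answer "I don't know" makes the task perfectly doable, here is a deliberately weak checker sketched in Python. It is only an illustration under my own assumptions: programs are modeled as generators that yield once per computation step, the names `cautious_check`, `counts_down`, and `loops_forever` are hypothetical, and the checker looks at a single run rather than at all possible inputs. It runs the program for a bounded number of steps and only commits to an answer when it has actually seen the program finish.

```python
from typing import Callable, Iterator


def counts_down(n: int = 5) -> Iterator[None]:
    """A program that finishes after n steps."""
    while n > 0:
        n -= 1
        yield


def loops_forever() -> Iterator[None]:
    """A program that never finishes."""
    while True:
        yield


def cautious_check(program: Callable[[], Iterator[None]], max_steps: int = 1000) -> str:
    """Return "halts" only if the program was observed to finish within max_steps."""
    steps = program()
    for _ in range(max_steps):
        try:
            next(steps)
        except StopIteration:
            return "halts"       # we watched it finish, so this answer is always right
    return "don't know"          # out of patience: no claim is made either way
```

Such a checker is never wrong when it does answer, but it is also not very clever; the trouble described next starts when a smarter checker begins to rely on its own reliability.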

This seems fine, but a tiny amount of self-awareness is all it takes to goof up. Suppose the internal organization of the AI is such that it might first prove that program X never freezes, then go on to do something else for a while, and finally when giving the answer, remember only that "I proved a moment ago that program X will never freeze". Then it seems reasonable for it to trust its own previous conclusion and answer that program X never freezes, rather than going through the proof again. After all, if it can't ever trust that what it did a moment ago was correct, how could it ever be confident in anything?

So somehow its self-awareness leads it to accept the principle that "When I have proved something, it must be true". And this is already way out on a slippery slope.

Because now the AI is asked to inspect a program that we might call Perfect Simulator. Perfect Simulator is basically a so-called universal Turing machine, capable of simulating any other computer program. It takes input in two parts, which we can write as $(A, B)$. Part $A$ is interpreted as a computer program written in some universal programming language (it doesn't matter which), and part $B$ is what we can call auxiliary input. Then it simulates what the program encoded in part $A$ would have done when run with $(A, B)$ as input. You might have thought it should simulate what program $A$ does when run with input $B$, because that would already make it a universal computer (meaning that it can do whatever any computer program can do, if only it gets the right input). But it's even funnier to let it simulate $A$ run with input $(A, B)$.

In order to make Perfect Simulator safe in the sense of never freezing, we first let it inspect the code in part $A$ using the AI's own method, effectively simulating the AI. This is easy, since the AI's program code is available. If the simulated version of the AI says that the program $A$ is fine (will never freeze on any input), then Perfect Simulator goes on to run it with input $(A, B)$, otherwise it stops.

So what happens when the self-aware AI inspects Perfect Simulator? Easy: Perfect Simulator is OK, since it obviously only runs programs that are safe. If it is asked to run a program that wouldn't get the approval of the AI, then it stops without simulating it. So provided the AI's method is correct, Perfect Simulator will never freeze. Notice that the AI can see that the method used by Perfect Simulator to inspect programs before running them is the same as the one the AI itself is using.

Being aware of its own correctness, the AI therefore concludes that Perfect Simulator will never freeze. Which is incorrect.

Because now we run Perfect Simulator with two copies of its own source code as input, schematically \[\text{PerfectSimulator}(\text{PerfectSimulator}, \text{PerfectSimulator}).\] Perfect Simulator begins by using the self-aware AI to certify that the program specified in the first part of the input (which happens to be Perfect Simulator itself) is safe. Then it starts simulating what that program (that is, Perfect Simulator) would have done when run with input $(\text{PerfectSimulator}, \text{PerfectSimulator})$. And we are back to square one, because Perfect Simulator now starts simulating itself, and in that simulation it starts simulating itself, and so on. In short, it freezes.
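The regress can be made visible with a toy model in Python. This is a sketch under several assumptions of mine: programs are plain Python functions rather than source code, `ai_approves` is a hypothetical stand-in that hard-codes the self-trusting AI's (mistaken) verdict, and a `limit` parameter is added so that the demonstration gives up after a fixed number of nested simulations instead of actually freezing.

```python
class StillNesting(Exception):
    """Raised when the sketch stops following the endless self-simulation."""


def ai_approves(program) -> bool:
    # Hypothetical stand-in for the self-trusting AI: as argued above, it
    # certifies Perfect Simulator as safe. For every other program this toy
    # version answers "I don't know", which counts as no approval.
    return program is perfect_simulator


def perfect_simulator(a, b, depth: int = 0, limit: int = 100):
    """Run program a on input (a, b), but only if the AI certifies a as safe."""
    if not ai_approves(a):
        return "stopped"
    if depth == limit:
        # A real run would keep going forever; the sketch bails out here.
        raise StillNesting(f"{depth} nested simulations and counting")
    # "Simulating" a on input (a, b) is modeled as simply calling it.
    return a(a, b, depth + 1, limit)


def harmless(a, b, depth: int = 0, limit: int = 100):
    """An ordinary program that the toy AI has no opinion about."""
    return "done"
```

Calling `perfect_simulator(harmless, "anything")` stops immediately, while `perfect_simulator(perfect_simulator, perfect_simulator)` keeps nesting until the artificial limit is reached and `StillNesting` is raised. Without the limit, nothing in the program would ever end the chain.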

If you haven't seen this argument before, it might seem a bit complicated. Today it's considered standard, but it took geniuses like Kurt Gödel and Alan Turing to figure it out in the first place. If you read Swedish you can take a look at a little story I wrote in some of my first blog posts, outlining how the game of rock-paper-scissors could have led to the discovery of Gödel's and Turing's theorems.

The strange moral here is that once the program becomes aware of its own correctness (the fact that it never concludes that a program is safe if it isn't), it becomes incorrect! Also notice that we have no reason to think that the AI would be unable to follow this argument. It is not only aware of its own correctness, but also aware of the fact that if it thinks it is correct, it isn't. So an equally conceivable scenario is that it ends up in complete doubt of its abilities, and answers that it can't know anything.

The Newcomb problem seems to be just one of several ways in which a self-aware computer program can end up in a logical meltdown. Faced with the two boxes, it knows that taking both gives more than taking one, and at the same time that taking one box gives $\$$1,000,000 and taking both gives $\$$1000.
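The two conflicting lines of reasoning can be read off from the payoff table, where the rows are the choices and the columns are the possible contents of the large box:

\[
\begin{array}{l|cc}
 & \text{large box full} & \text{large box empty} \\ \hline
\text{take only the large box} & \$1{,}000{,}000 & \$0 \\
\text{take both boxes} & \$1{,}001{,}000 & \$1000
\end{array}
\]

Reading column by column, taking both boxes is always $\$$1000 better. But if the predictor is never wrong, only the upper-left and lower-right entries can actually occur, and of those two, taking only the large box wins.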

It might even end up taking none of the boxes: It shouldn't take the small one because that bungles a million dollars. And taking only the large one will give you a thousand dollars less than taking both, which gives you $\$$1000. Ergo, there is no way of getting any money!?

The last paragraph illustrates the danger of erroneous conclusions. They diffuse through the system. You can't have a little contradiction in an AI capable of reasoning. If you believe that $0=1$, you will also believe that you are the pope (how many popes are identical to you?).
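Spelled out, the pope argument is a one-line instance of the principle that anything follows from a contradiction. Writing $n$ for the number of popes identical to you (the symbol $n$ is just notation introduced here):

\[ n = 0 \quad\text{and}\quad 0 = 1 \quad\Longrightarrow\quad n = 1, \]

so exactly one pope is identical to you, which means that you are the pope.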

Roughly a hundred years ago, there was a "foundational crisis" in mathematics triggered by the Russell paradox. The idea of a mathematical proof had become clear enough that people tried to formalize it and write down the "rules" of theorem proving. But because of the paradoxes in naive set theory, it turned out that it wasn't so easy. Eventually the dust settled, and we got seemingly consistent first-order axiomatizations of set theory as well as various type theories. But if we want robots to be "self-aware" and capable of reasoning about their own computer programs, we might be facing similar problems again.

Finally, what about my own status, do I one-box or two-box? Honestly I think it depends on the amounts of money. In the standard formulation I would take one box, because I'm rich enough that a thousand dollars wouldn't matter in the long run, but a million would. On the other hand if the small box instead contains $\$$999,000, I take both boxes, even though the problem is qualitatively the same.
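A rough calculation shows how the amounts enter. This is only a sketch under assumptions that are not part of the standard problem: suppose the predictor is right with probability $p$, write $s$ for the amount in the small box and $L$ for the million in the large one, and compare naive expected payoffs (the kind of calculation a convinced two-boxer would reject as ignoring causality):

\[ E[\text{one box}] = pL, \qquad E[\text{two boxes}] = s + (1-p)L, \]

so that one-boxing comes out ahead exactly when

\[ pL > s + (1-p)L \iff p > \frac{1}{2} + \frac{s}{2L}. \]

With $s = 1000$ and $L = 1{,}000{,}000$ the threshold is $p > 0.5005$; with $s = 999{,}000$ it is $p > 0.9995$. So even by this reckoning, the second version demands a far more reliable predictor before one-boxing looks attractive.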

There is a very neat argument in a blog post by Scott Aaronson called "Dude, it's like you read my mind", brought to my attention by Olle Häggström. Aaronson's argument is that in order for a predictor to be able to predict your choice with the required certainty, it would have to simulate you with such a precision that the simulated agent would be indistinguishable from you. So indistinguishable that you might as well be that simulation. And if you are, then your choice when facing the (simulated) Newcomb boxes will have causal effects on the payoff of the "real" you. So you should take one box.

Although I like this argument, I don't think it holds. If you are a simulation, then you are a mathematical object (that might in particular have been simulated before on billions of other computers in parallel universes), so the idea of your choice having causal effects on this mathematical object is just as weird as the idea of the "real" you causing things to have already happened. I don't actually dismiss this idea (and a self-aware robot will have to deal with being a mathematical object). I just think that Aaronson's argument fails to get rid of the problem (causality) with the original "naive" argument for one-boxing.

Moreover, the idea that you might be the predictor's simulation seems to violate the conditions of the problem. If you don't know that the predictor is following the rules, then the problem changes. If for instance you can't exclude being in some sort of Truman show where the thousand people subjected to the experiment before you were just actors, then the setup is no longer the same. And if you can't even exclude the idea that the whole universe as you know it might cease to exist once you have made your choice (because it was just a simulation of the "real" you), then it's even worse.

So if I dismiss this argument, why do I even consider taking just one box? "Because I want a million dollars" is not a valid explanation, since the two-boxers too want a million dollars (it's not their fault that they can't get it).

At the moment I don't think I can come up with anything better than:

"I know it's better to take both boxes, I'm just tired of being right all the time. Now give me a million dollars!"

As far as I know, Newcomb's problem is still open. More importantly, and harder to sweep under the rug, so is the quest for a consistent system for a robot reasoning about itself, its options, and its program code, including the consequences of other agents (not just Newcomb's hypothetical predictor) knowing that code.

Should it take one box because that causes it to have a program that the predictor will have rewarded with a million dollars? The form of that argument is suspiciously similar to the argument of the self-driving car that running a red light and hitting another car will cause it to have software that forced it to do so.


  1. I really appreciated your reasoning about the self-awareness of artificially intelligent entities, but there are a few things I want to straighten out.

    Firstly I want to point out that in your formulation of the paradox you allow the player to choose between the options of taking none of the boxes, the large box, the small box or both of the boxes. This is not the common formulation of the paradox, where the only alternatives are taking either the large box or both boxes. The formulation of the paradox that you present is not compatible with the one-box and two-box solutions, since there are then two alternatives of one-boxing.

    Secondly I’d like to address your answer on whether you’d choose the one-box or two-box strategy. It’s a common misconception that one-boxing is a viable strategy because those extra thousand dollars don’t really matter. It’s very important to stress that the whole problem is based on the premise that you should maximize the amount of money you win. So an argument based on your current living standards is invalid. To anyone who starts reasoning in this way I usually suggest that we change the game to be about human lives instead of money, where the small box contains the key to saving the life of one of your parents and the large box might contain the keys to saving the lives of your ten closest friends. How do you act then? I like this setting because then it’s hard to argue that the box with the key to saving the life of your parent doesn’t matter.

    I believe that Newcomb’s paradox most certainly is possible to perform in reality. You don’t even need a very good predictor for the paradox to occur; it only needs to perform slightly better than random.

    1. Hi Anders, thank you for your comments! Yes I am aware that in the original formulation it is assumed that you have to take the large box, so that there are only two options. I just thought it might be easier to understand the problem if all four options are open a priori. Otherwise your first thought might be that you are supposed to choose between the two boxes. As far as I can see, the options involving not taking the large box are dominated by the option of taking both boxes no matter how one thinks about it. So the problem is essentially the same, and after dismissing the ridiculous options we can still speak about one-boxing and two-boxing.

      The arguments based on current living standards only serve to expose myself as not a true, die-hard n-boxer for any value of n. The annoying thing of course is that the standard arguments work just as well no matter the proportion between the two amounts. Would you stick with your choice, whatever it is, all the way from 1 dollar to 999,999 dollars in the small box?

      Changing the payoffs to saving lives will only confuse things in my opinion. It becomes less clear how one would value the potential outcomes, and the question might change from “would you...” to “is it morally right to...”, but it doesn't add anything to the paradox.

      I’m not sure what you mean by “performing” the paradox in reality. Of course you can run a TV show where you invite believers in parapsychological mumbo-jumbo and give them a million dollars each, while at the same time inviting some supposedly rational (and not too rich) people that you pay a thousand dollars. But is this a paradox? The rational people know that the large box is empty, and will be no matter their choice. If you put too little in the small box, they will call your bluff. Wait, this sort of already exists, doesn't it?