Wednesday, February 6, 2019

Card play with a unicellular brain

In an earlier blog post I described the simplest shedding game I can think of. I called it Shed, and it's perhaps more of a mathematical simplification than a real card game. But go ahead and try it with 6 or 7 cards; you might have a brain meltdown when you realize you're playing a game simpler than tic-tac-toe and you still can't get it right...

Quick recap of the rules: Two players take turns playing cards numbered 1, ..., N to a pile. The card you play must be higher than all the cards currently in the pile. If you can't play, or don't want to, you must instead pick up the whole pile. The goal is to force the opponent to pick up all the cards. At the start of the game, the cards are in an open talon available to both players.

It's possible to play the game with a standard deck with different suits (see the blog post), but it already seems challenging with just one suit.

I first thought of this game around the same time that I wrote about simple models of whist, almost 20 years ago. Already at that time I had a short program in C that computed the outcome under optimal play for N up to about 13, and with a rewritten hack on my current laptop I have now pushed that to N = 15.

Since each card can be in 4 different locations (in the hand of the player about to move, in the hand of the player who just moved, in the pile, or in the talon), there are (at most) $4^N$ positions to consider. The program creates a "table-base" by repeatedly looping through those $4^N$ positions, checking if it can determine their value under optimal play using values already stored. As soon as there is a full iteration without any more values being assigned, it concludes that the remaining positions are drawn and stops the search.
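To make the search concrete, here is a minimal C sketch of a retrograde solver of this kind. It is not my actual program: it stores all $4^N$ positions rather than only the quiet ones, and it assumes my reading of the rules, namely that a card may be played from your hand or from the open talon, and that picking up counts as a move only when the pile is nonempty. A quiet position with all N cards in one hand is scored as a win for the player to move (the natural terminal in one case, and in the other the convention discussed in the technical remark further down).

```c
#include <stdio.h>

#define N 10          /* cards 1..N; the table has 4^N one-byte entries */
#define UNKNOWN 0
#define WIN  1        /* win for the player about to move  */
#define LOSS 2        /* loss for the player about to move */

/* Base-4 position code: digit i gives the location of card i+1,
   0 = mover's hand, 1 = opponent's hand, 2 = pile, 3 = talon. */
static unsigned char val[1L << (2 * N)];

static int loc(long p, int i) { return (int)((p >> (2 * i)) & 3); }

static long put(long p, int i, int l)
{
    return (p & ~(3L << (2 * i))) | ((long)l << (2 * i));
}

/* Re-encode a position from the other player's point of view. */
static long swap_hands(long p)
{
    long q = p;
    for (int i = 0; i < N; i++) {
        if (loc(p, i) == 0) q = put(q, i, 1);
        else if (loc(p, i) == 1) q = put(q, i, 0);
    }
    return q;
}

int main(void)
{
    long size = 1L << (2 * N);
    int changed = 1;
    while (changed) {             /* sweep until a full pass assigns nothing */
        changed = 0;
        for (long p = 0; p < size; p++) {
            if (val[p] != UNKNOWN) continue;
            int handM = 0, handO = 0, pile = 0, talon = 0, top = 0;
            for (int i = 0; i < N; i++) {
                switch (loc(p, i)) {
                case 0: handM++; break;
                case 1: handO++; break;
                case 2: pile++; if (i + 1 > top) top = i + 1; break;
                default: talon++; break;
                }
            }
            /* Quiet position with all N cards in one hand: a win for the
               player to move (see the technical remark in the text). */
            if (pile == 0 && talon == 0 && (handM == N || handO == N)) {
                val[p] = WIN; changed = 1; continue;
            }
            int found_win = 0, all_known = 1;
            /* Play any card above the pile, from our hand or the talon. */
            for (int i = top; i < N; i++) {
                if (loc(p, i) != 0 && loc(p, i) != 3) continue;
                unsigned char v = val[swap_hands(put(p, i, 2))];
                if (v == LOSS) found_win = 1;
                else if (v == UNKNOWN) all_known = 0;
            }
            /* Or pick up the whole pile, if there is one. */
            if (pile > 0) {
                long q = p;
                for (int i = 0; i < N; i++)
                    if (loc(q, i) == 2) q = put(q, i, 0);
                unsigned char v = val[swap_hands(q)];
                if (v == LOSS) found_win = 1;
                else if (v == UNKNOWN) all_known = 0;
            }
            if (found_win)      { val[p] = WIN;  changed = 1; }
            else if (all_known) { val[p] = LOSS; changed = 1; }
        }
    }
    /* Positions still UNKNOWN at this point are drawn. */
    long start = 0;
    for (int i = 0; i < N; i++) start = put(start, i, 3);
    printf("Start position: %s\n",
           val[start] == WIN  ? "first player wins" :
           val[start] == LOSS ? "first player loses" : "draw");
    return 0;
}
```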

It can be done much more efficiently, but for N = 15 my program requires something like 30 iterations through most of the approximately 1,000,000,000 positions, and more on a minority of them.

Since at any time only a limited number of moves can be played before someone picks up the pile (in theory N, but often far fewer), we can trade space for time by only storing the values of the $3^N$ quiet positions, those where the pile is empty. That's what my program does, but for N = 15 that's still more than 14 million values.

We can simplify even more by first considering only endgames - those positions where the talon is empty (and will therefore remain empty for the rest of the game). If we focus on quiet endgame positions we are down to $2^N$ - about 32,000 in the case of 15 cards.

I wanted to see if it's possible to play optimally using much less data by somehow assigning a point-value to each card and summing those values. For instance, the values of the 32 quiet endgame positions for N = 5 can be compactly described by giving the cards 1, 2, 3, 4, 5 the point-values $-2, -1, 0, 1, 2$ respectively. To figure out if we're winning or not, we sum the values of our own cards, subtract the total value of our opponent's cards, and finally add or subtract 1 point for having or not having the move respectively. If the final score is positive we have a winning strategy, and if it's negative, we're losing.
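For example, suppose we hold the 1 and the 5, our opponent holds the 2, 3 and 4, and we have the move. Then the calculation is
\[\underbrace{(-2 + 2)}_{\text{our } 1 \text{ and } 5} - \underbrace{(-1 + 0 + 1)}_{\text{their } 2, 3, 4} + \underbrace{1}_{\text{the move}} = 1 > 0,\] so the point-count says we have a winning strategy, if only by a slim margin.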

For N = 6 this sort of thing still works, but we have to tinker a bit to get the point-values right. One solution is to set them to $-7, -5, -2, 3, 6$, and $8$, with a correction of $\pm 4$ for having or not having the move. Technical remark to ensure reproducibility of my results: The position where the player to move has all cards in their hand is regarded as winning for that player (since they can play their cards from the bottom up and force the opponent to pick up). But that position is impossible to reach.

So how do we find a set of point-values that makes this work, and can we do it for more than 6 cards?

To cast this in the terminology of machine learning, let's not call it the point-value of a card, but instead the weight. So we want to find weights $w_1, w_2, \dots, w_N$ for the cards, and also a suitable bias $b$. That's what we now call the correction term for having or not having the move.

Since I tend to want to see things from the perspective of the player who just moved, I insist that the bias $b$ should be added when we're not making the next move and subtracted when we are, which means $b$ will have to be negative if having the move is an advantage.

In order to find the weights for the case N = 6, I used (buckle up, here are some AI buzzwords) gradient descent and a sigmoid neuron. There's also something called Support Vector Machines, but let's go with the sigmoid neuron here. The picture (well, for N = 7) looks like this:
[Figure: a single sigmoid neuron. Seven input nodes feed one output node through weights $w_1, \dots, w_7$, with a bias $b$ added.]

There are N input nodes (labeled 1, ..., 7 in the picture) that we can call $c_1, \dots, c_N$ (c for card) and that take the value $+1$ if we (who made the last move) have that card, and $-1$ if the other player has it. Each input is multiplied by the corresponding weight $w_i$, and the bias $b$ is added to the sum. The result is then passed to the sigmoid neuron, which computes the function
\[\sigma(x) = \frac1{1+e^{-x}}.\] The sigmoid function outputs a number between 0 and 1 which is close to 0 if the input is clearly negative and close to 1 if the input is clearly positive, with a smooth transition for inputs around 0. In our case we can think of the output as a tentative answer to how likely it is that the player who just moved is winning. Ideally, the answer should be 1 if that player actually has a forced win, and 0 otherwise (there exist drawn positions, but they are quite rare, so let's for the moment go with just classifying positions into winning or not winning).
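In code, the forward pass is just a handful of lines. Here is a minimal C sketch (my own, not taken from my old solver; compile with -lm), using as a test case the $N = 6$ weights quoted further down:

```c
#include <math.h>
#include <stdio.h>

/* Forward pass of the single sigmoid neuron.  c[i] is +1 if we (who just
   moved) hold card i+1 and -1 otherwise; w[] and b are weights and bias. */
double neuron_output(const double *w, double b, const int *c, int n)
{
    double score = b;                  /* the point-score of the position */
    for (int i = 0; i < n; i++)
        score += w[i] * c[i];
    return 1.0 / (1.0 + exp(-score)); /* sigma(score), strictly in (0,1) */
}

int main(void)
{
    /* Test case: the N = 6 weights and bias found later in the post. */
    double w[6] = {-1.3, -0.9, -0.3, 0.5, 1.1, 1.5}, b = -0.7;
    int c[6] = {+1, -1, +1, -1, +1, -1};       /* we hold 1, 3, 5 */
    printf("%f\n", neuron_output(w, b, c, 6)); /* well below 1/2 here */
    return 0;
}
```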

The output will never be exactly 0 or 1, but we'd like to correctly classify each card distribution in the sense of having an output larger than 1/2 if the player who just moved is winning, and smaller than 1/2 otherwise. Since $\sigma(0) = 1/2$, this puts the threshold at zero, so what we're asking for is that the point-score (which is sent as input to the sigmoid function) should be positive if we're winning and negative if we're losing.

We can start by setting all the weights and the bias to 0 (or whatever, really). Then we use the table-base of known values to train the system (using known values like this is what's called supervised learning). We feed the system an input vector (a card distribution) and calculate the output. Then we make small adjustments to $w_1,\dots, w_N$ and $b$ that all push the output a little bit in the direction of making it more correct.  

Here it actually makes sense to adjust the bias $b$ by
\[ b := b + 0.1 \cdot \left(\text{correct value} - \text{actual output}\right).\] This pushes $b$ in the right direction (increasing it if we want a larger output, decreasing it if we want a smaller one), but it looks almost too simple. In case you like partial derivatives of error functions and stuff, there's actually a theoretical justification for updating $b$ in exactly this way, and a key phrase is cross-entropy error. The learning rate $0.1$ is admittedly a bit arbitrary though.

The weights $w_i$ are updated in the same way, except that the sign of the correction is determined by $c_i$ (the input value, which is $\pm 1$ depending on who has the $i$-th card):
\[ w_i := w_i + 0.1 \cdot c_i \cdot \left(\text{correct value} - \text{actual output}\right).\] In the case N = 6 this works like a charm, and already after one epoch of training (going through each of the 64 input vectors once and then making the total adjustments), the system finds the weights $-1.3, -0.9, -0.3, 0.5, 1.1, 1.5$ and the bias $-0.7$, which correctly classify each of the 64 card distributions. These decimal numbers give the integer point-values that I suggested earlier if we multiply everything by 5 and round away from 0.
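For the record, here is a C sketch of the whole training loop as I have described it: batch updates applied once per epoch, learning rate 0.1, inputs $\pm 1$. The target array is a placeholder that would have to be filled in from the table-base, and the variable names are my own.

```c
#include <math.h>
#include <stdio.h>

#define NCARDS 6
#define NPOS (1 << NCARDS)   /* 64 quiet endgame card distributions */

/* Placeholder: target[p] should be 1.0 if the player who just moved wins
   the distribution encoded by the bitmask p, and 0.0 otherwise.  In real
   use these values come from the table-base. */
static double target[NPOS];

int main(void)
{
    double w[NCARDS] = {0.0}, b = 0.0;
    for (int epoch = 0; epoch < 100; epoch++) {
        double dw[NCARDS] = {0.0}, db = 0.0;
        for (int p = 0; p < NPOS; p++) {
            int c[NCARDS];
            double score = b;
            for (int i = 0; i < NCARDS; i++) {
                c[i] = ((p >> i) & 1) ? +1 : -1; /* bit set: we hold card i+1 */
                score += w[i] * c[i];
            }
            double out = 1.0 / (1.0 + exp(-score));
            double err = target[p] - out;   /* correct value - actual output */
            for (int i = 0; i < NCARDS; i++)
                dw[i] += 0.1 * c[i] * err;  /* weight update from the text */
            db += 0.1 * err;                /* bias update from the text   */
        }
        /* One epoch done: apply the accumulated adjustments in one batch. */
        for (int i = 0; i < NCARDS; i++) w[i] += dw[i];
        b += db;
    }
    for (int i = 0; i < NCARDS; i++) printf("w%d = %.2f\n", i + 1, w[i]);
    printf("b  = %.2f\n", b);
    return 0;
}
```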

But when we try the same thing for N = 7, we run into a bit of trouble. Our single neuron brilliantly classifies 124 out of the 128 quiet endgame positions, but has difficulty with the remaining four. With a bit of improvised notation again, those four positions are
\[\left(\text{7-6-5-4-3-2-1}^+ \,\, \square \,\, - \right) \, \mapsto \, 0,\] \[\left(\text{5}^+ \,\, \square \,\, \text{7-6-4-3-2-1}\right) \, \mapsto 0,\] \[\left(\text{7-5-3-2}^+ \,\, \square \,\, \text{6-4-1}\right) \, \mapsto 1, \text{ and}\] \[\left(\text{6-5-4-1}^+ \,\, \square \,\, \text{7-3-2}\right) \, \mapsto 1.\] Here the hands are written with a little square table between them as in bridge diagrams. The whole position (within parentheses) is then mapped by the ideal value to 1 (left player winning) or 0 (right player winning). The plus-signs indicate that a certain player just added cards to their hand (and the other is about to play).

We can now see that it's impossible for a single neuron to correctly classify these positions: If we sum the weights (and biases) involved for the first two positions, we get $2w_5 + 2b$ (the card 5 is on "our" hand in both of them, and all other cards have changed places). If we do the same thing for the last two, we get the same result: $2w_5 + 2b$. Demanding a negative point-score for the first two positions and at the same time a positive point-score for the last two is clearly asking too much: The sum of two negative numbers can't be the same as the sum of two positive ones.
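Spelled out, the point-scores of the first two positions add up to
\[(w_1 + w_2 + \dots + w_7 + b) + (w_5 - w_1 - w_2 - w_3 - w_4 - w_6 - w_7 + b) = 2w_5 + 2b,\] and the scores of the last two add up to the same thing, since every weight other than $w_5$ appears once with each sign.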

It's true that the first of the four difficult positions is the final position of the game, where the left player has already lost, and that if we remove that position from the training data, the system will learn to correctly classify the remaining 127.

We could argue more generally that the problem is that we ask the system to compare hands with different numbers of cards in them. It makes sense therefore to try one set of weights for each pattern, so that for instance we would have a specific set of weights for the pattern 4-3, obtained by training only on positions where the player who just picked up has 4 cards and the other player has 3. This works fine for quiet endgames of 7 cards, and it would be a feasible design also for much larger $N$, since even if we count the patterns of quiet but non-endgame positions, there are only around $N^2/2$ of them.
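The bookkeeping for this is cheap. A sketch of the data layout (my own naming, under the assumption that we simply keep a separate parameter set per pattern):

```c
#define N 10

/* One weight vector and one bias per pattern (m, n), where the player who
   just picked up holds m cards and the opponent holds n.  Entries with
   m + n > N can never occur and simply stay unused. */
double w[N + 1][N + 1][N];
double b[N + 1][N + 1];
```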

But this approach too has its limitations, even for quiet endgames. For 10 cards there are drawn endgames, but even if those positions are removed from the training data, it turns out to be impossible for a single neuron to correctly classify all of the remaining ones.

For the pattern 6-4 there are 210 positions of which 4 are drawn. When trained on the remaining 206 positions, the system keeps misclassifying two of them. It turns out that the values assigned to the remaining positions show clearly where the problem lies. If after 100 training epochs we list those positions that are assigned a value more than 0.3 away from the correct value, we get the following:
\[\left(\text{10-9-6-5-2-1}^+ \,\, \square \,\, \text{8-7-4-3} \right) \, \mapsto \, 0 \quad \neq 0.49,\] \[\left(\text{10-8-6-5-3-1}^+ \,\, \square \,\, \text{9-7-4-2}\right) \, \mapsto 0 \quad \neq 0.38,\] \[\left(\text{9-7-6-5-4-2}^+ \,\, \square \,\, \text{10-8-3-1}\right) \, \mapsto 0 \quad \neq 0.46,\] \[\left(\text{8-7-6-5-4-3}^+ \,\, \square \,\, \text{10-9-2-1}\right) \, \mapsto 0 \quad \neq 0.36,\] \[\left(\text{10-7-6-5-3-2}^+ \,\, \square \,\, \text{9-8-4-1}\right) \, \mapsto 1 \quad \neq 0.42,\] \[\left(\text{9-8-6-5-4-1}^+ \,\, \square \,\, \text{10-7-3-2}\right) \, \mapsto 1 \quad \neq 0.42.\] The numbers to the right are the outputs of the system. Here we can see that the average point-score of the first four (losing) positions will always be the same as that of the last two (winning) ones, namely $w_6+w_5 + b$. And the average of a list of positive numbers can't equal the average of a list of negative ones.

If we keep training, the system will provide values converging correctly to 0 or 1 for all other positions, while adapting as best it can to the impossible constraints given by the six positions listed, by giving each of them a value (in the limit) of 1/3. That is just what minimizing the cross-entropy error suggests: four targets of 0 and two targets of 1, forced to share a common point-score on average, are best served by the value $2/6 = 1/3$.

This isn't really a failure of the approach. We can't possibly expect a small set of linear functions to flawlessly tell winning positions from losing ones, any more than we expect point-values for chess pieces to solve the riddle of chess, or high-card points to provide a perfect bidding system in bridge.

It's interesting that the single-neuron system discovers these examples, and it actually gets weirder. Once we have set up the "architecture" of the system, we can train it on non-endgame positions as well. We can choose to let an input value of $c_i = 0$ represent that the $i$-th card is still in the talon, and again train the system on a fixed pattern $(m, n)$, meaning that the player who just picked up has $m$ cards and the other player has $n$ cards.

When we get to N = 6 and the pattern 2-0, we find something strange: In a few rounds the system learns to correctly classify all 15 positions, but it insists that $w_2$ must be smaller than $w_1$! And indeed the only time (for this particular pattern) it matters whether you have the 1 or the 2 is if the other card is the 5, and in that case it's better to have the 1 than to have the 2!  

For N = 7 and the pattern 2-2, the single neuron again fails to correctly classify all positions, and among the difficult ones we find \[\left(\text{7-2}^+ \,\, \square \,\, \text{6-3} \right) \, \mapsto \, 1.\] Here the left player is winning, but if we replace the 3 in the right hand with the 1, left is losing; and if we moreover replace the 2 in the left hand with the 3 that we removed from the right hand, left is still losing: \[\left(\text{7-3}^+ \,\, \square \,\, \text{6-1} \right) \, \mapsto \, 0.\]
Alright, that's all for now!
