As a trained mathematician, I really shouldn't be surprised by this business of length scale, area scale, and volume scale. An are is 10 meters square — that is, a 10 by 10 meter square, so 100 square meters. A hectare is 100 times as much, the area of a square with a side of 100 meters. That, in turn, is just one hundredth of a square kilometer.
Then there is the matter of volumes. Zlatan Ibrahimović has won the Guldbollen 11 times. If you actually made a ball out of all the gold in the world, it would be about 25 meters in diameter and would pretty much fit under this ball run.
In higher dimensions it gets even stranger. A bowling ball is apparently supposed to have a diameter of 22 cm, and would of course just barely fit in a cubic box with an inner side of 22 cm. But how many balls of diameter 20 cm would fit in such a box — that is, when the side of the cube is 10% larger than the diameter of the ball? Surely still no more than one, you might think, and in three dimensions that's true. But in 100 dimensions, two of them fit! And if the number of spatial dimensions crosses a kind of threshold at around 600, you can toss in not just two but almost any number of balls at random, and they will bounce around in the box and almost never collide!
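The threshold at 100 dimensions can be checked with a one-line computation: two balls of diameter $d$ fit in a cube of side $s$ exactly when their centers, each confined to an inner cube of side $s-d$, can be a full diameter apart — and the farthest apart two such centers can be is along the inner cube's diagonal, of length $(s-d)\sqrt{n}$. A quick sketch in Python (the function name is my own):

```python
import math

def min_dim_for_two_balls(side, diameter):
    """Smallest dimension n in which two balls of the given diameter
    fit in a cube of the given side.  The centers must stay in an
    inner cube of side (side - diameter); placing them in opposite
    corners of that inner cube puts them (side - diameter) * sqrt(n)
    apart, and we need that distance to be at least one diameter."""
    margin = side - diameter
    return math.ceil((diameter / margin) ** 2)

print(min_dim_for_two_balls(22, 20))  # the bowling-ball box: 100
```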
In the linear algebra course I usually teach in the fall, I like to draw how I picture cubes in high dimensions. This is what it looks like in my old notes:
As long as the ball has to sit in the middle, only one fits, but as soon as you can poke them a little way into one of the spikes, you can fit in almost any number!
"Can you visualize 10 dimensions?" the students ask. "Can you visualize three?" I ask back, half in jest.
But only half. I already knew all of this, and I also knew how much the Earth weighs. Still, I was somehow a little surprised when, in an earlier blog post, I tried to illustrate how unfathomably much $5 \cdot 10^{21}$ is. The weight of the Earth in tonnes just didn't sound like that much.
The Earth is a ball, roughly 12,732 km in diameter. The metric system, by the way, was originally based on the distance along the Earth's surface from the equator to a pole being 10,000 km, and the diameter is $4/\pi\approx 1.273$ times that distance.
It's big, but still not unfathomably big, I think. In this picture you can see southern Sweden and, at the same time, the curvature of the Earth. One day last fall I drove the route Linköping–Rödeby–Göteborg. An awful lot of forest rushed by, and it took many hours, mostly on 80 km/h roads. But it was done in a day, with time to spare for a division 1 chess match against Rödeby (in Blekinge). If you look at a globe, or imagine zooming out from this picture, you realize that the distance I drove is rather small compared to the whole planet. Yet it's not so small that it doesn't show. Pointing at my daughter's beach-ball globe, I can explain to her that here is Stockholm, here is Göteborg, and here is Poland, where her great-grandmother and great-grandfather live.
The distances between the stars, on the other hand, I regard as beyond my direct comprehension. If I'm to picture them, I have to rescale several times.
It's said to be possible, under ideal conditions, to see Uranus with the naked eye, but Saturn is the most distant object I know I have seen that doesn't shine by its own light. I've seen it many times, but I particularly remember a magical late-winter evening about 15 years ago with my chess friend and amateur astronomer Arno Platau, who has since passed away. He pointed out Mercury, Venus, Mars, Jupiter, and Saturn to the small party that had gathered in Landeryd. We could see all five, and the Moon as well, at the same time. I don't know how often that is possible, but I'd guess it's quite rare. I, my eldest daughter, and the others took turns at the telescope, looking at Jupiter's moons, Saturn's rings, rocky little Mercury, the craters of the Moon, and more.
Saturn marked the outer edge of the solar system as it was known in the Middle Ages.
Saturn is bigger than the Earth but still appears as just a tiny dot without binoculars, so you can tell it must be far away. When it is at its closest and we see it in the south at midnight, it is a hundred thousand Earth diameters away. At the scale set by the beach ball, that's a few tens of kilometers. Let's change scale again, to 1:1,000,000,000,000,000 — one to a thousand trillion. Now the medieval universe is a little sphere two or three millimeters across, in which not even the Sun is big enough to make out. The sphere fits comfortably inside one square of the graph paper I drew that cube on.
At that scale, a light-year is 10 meters. The triple star system Alpha Centauri lies just over 40 meters away. Closest to us right now is Proxima Centauri, with at least one planet in tow, and a couple of meters from it, the double star system of Rigil Kentaurus and Toliman.
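For the record, here is the arithmetic behind the scale model. The astronomical distances are my own round figures, not from the original post:

```python
scale = 1e-15                     # the 1 : 1,000,000,000,000,000 model

light_year = 9.461e15             # meters
saturn_at_opposition = 1.28e12    # meters, roughly (it varies with the orbits)
alpha_centauri = 4.37 * light_year

print(2 * saturn_at_opposition * scale)  # medieval universe: a few mm across
print(light_year * scale)                # a light-year: about 10 m
print(alpha_centauri * scale)            # Alpha Centauri: about 41 m
```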
At this scale, our galaxy, the Milky Way, is still, in very round numbers, the size of all of Sweden. The stars, each with its own system of planets, moons, and comets, lie scattered like microscopic grains a few tens of meters apart. In the darkness between them, countless rogue planets and the occasional black hole no doubt drift about, but mostly it is empty. Our solar system has been swirling around out here for billions of years without ever passing close enough to another heavy celestial body to disturb the planetary orbits in those few millimeters nearest the Sun. The occasional boulder has fallen down and redirected the course of life on Earth, but hasn't wiped it out.
It can't be very dusty out there either. If we look up on a clear night we see the stars, but in a bit of mist we don't. Starlight can travel through space for thousands of years, only to be stopped by a few droplets of water in the Earth's atmosphere.
If I'm to picture the Milky Way, I have to rescale, and rescale, and rescale again. Otherwise it just remains incomprehensibly large. But the Earth I feel I can grasp. Or do I, really?
From the North Pole to the equator is 10,000 km. The Earth's volume then comes out to essentially \[\frac{32}{3\pi^2} \approx 1.08\] times the volume of a cube with a side of 10,000 km (the radius is $20{,}000/\pi \approx 6{,}366$ km, and $\frac43\pi r^3$ divided by $10{,}000^3$ simplifies to $32/(3\pi^2)$). That's close enough. Now let's imagine slicing that volume into cubic meters and laying the meter-sized cubes out in one long row instead. How far does the row reach?
Ten thousand kilometers is 10 million meters, so the cubic meters along one edge of the cube stretch exactly 10,000 km — nearly an Earth diameter. Take one "layer" of meter cubes, a square face of the ten-thousand-kilometer cube, and laid end to end they reach 10 million Earth diameters, 100 times the medieval universe out to Saturn. That's one decimeter at the one-to-a-thousand-trillion scale I just mentioned — a hundredth of a light-year. But that was just one layer, and there are 10 million such layers. Lay out all the cubic meters one after another and they reach a hundred thousand light-years, clear across the dizzying Milky Way!
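Skeptical readers can check the whole chain of numbers in a few lines (using $9.461 \cdot 10^{15}$ m per light-year):

```python
import math

side = 1e7                          # pole-to-equator distance in meters
radius = 2 * side / math.pi         # quarter-circumference equals side
earth_volume = 4 / 3 * math.pi * radius ** 3

ratio = earth_volume / side ** 3    # 32 / (3 * pi^2), about 1.08
meter_cubes = side ** 3             # 10^21 cubes in the reference cube

light_year = 9.461e15               # meters
layer = side ** 2 / light_year      # one layer laid out: ~0.01 light-years
row = meter_cubes / light_year      # all of them: ~105,700 light-years
```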
I said I needed to rescale, rescale, and rescale again to grasp how big the Milky Way is. Three substantial rescalings, however I go about it. So perhaps it isn't really that strange that the Earth in cubic meters behaves this way. One rescaling turns the Earth into a beach ball, and there are three dimensions. Ten to the power of seven, and that in turn raised to the third power, is $10^{21}$.
Still, I find it striking. And we can go further, because a cubic meter is hardly the smallest thing we can imagine. If we cut the Earth into cubic decimeters, they reach 100 times as far (area scale!). That puts us far beyond the Andromeda galaxy. And with cubic centimeters we get a billion light-years out.
Finally, we can imagine the Earth as a ball of yarn wound from a single long fiber-optic cable. Not just the thin fiber, that is, but the whole bundle, sheath and all — thick enough to handle, yet thin enough to pass through one square of my graph paper. Suppose a signal set out from one end of the cable when the Earth formed, more than 4 billion years ago. The signal has raced along at the unfathomable speed of light, 30 times the side of the ten-thousand-kilometer cube every second, while continents have drifted, waves have ground down mountains, and the descendants of fish have grown legs and walked up onto land. And it still hasn't reached the other end of the cable!
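How long the signal needs depends, of course, on how thick we make the cable. Here is the estimate under my own assumption of a square cross-section 5 mm on a side, about what would pass through a standard 5 mm grid square:

```python
earth_volume = 1.08e21        # cubic meters, as computed above
cross_section = 0.005 ** 2    # assumed 5 mm x 5 mm cable, in square meters
cable_length = earth_volume / cross_section

light_year = 9.461e15         # meters; light covers one per year
travel_time_years = cable_length / light_year
print(travel_time_years)      # roughly 4.6 billion years
```

With that cross-section, the trip takes about as long as the Earth has existed — which is presumably the point.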
Monday, February 11, 2019
Wednesday, February 6, 2019
Card play with a unicellular brain
In an earlier blog post I described the simplest shedding game I can think of. I called it Shed and it's perhaps more of a mathematical simplification than a real card game. But go ahead and try it with 6 or 7 cards, you might have a brain meltdown when you realize you're playing a game simpler than tic-tac-toe and you still can't get it right...
Quick recap of the rules: Two players take turns playing cards numbered 1,..., N to a pile. The card you play must be higher than all the cards currently in the pile. If you can't or don't want to play, you can (must) instead pick up the whole pile. The goal is to force the opponent to pick up all the cards. At the start of the game, the cards are in an open talon available to both players.
It's possible to play the game with a standard deck with different suits (see the blog post), but it seems challenging already with just one suit.
I first thought of this game around the same time that I wrote about simple models of whist, almost 20 years ago. Already at that time I had a short program in C that computed the outcome under optimal play for N up to about 13, and with a rewritten hack on my current laptop I now pushed that to N = 15.
Since each card can be in 4 different locations (in the hand of the player about to move, in the hand of the player who just moved, in the pile, or in the talon), there are (at most) $4^N$ positions to consider. The program creates a "table-base" by repeatedly looping through those $4^N$ positions, checking if it can determine their value under optimal play using values already stored. As soon as there is a full iteration without any more values being assigned, it concludes that the remaining positions are drawn and stops the search.
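In outline, the scheme looks like this. The sketch below is not the author's C program, just a minimal illustration of the same fixed-point idea on an abstract game graph: sweep until a full pass assigns nothing new, and declare everything still unvalued a draw.

```python
def solve(moves, lost):
    """Retrograde analysis by repeated sweeps.

    moves: dict mapping each state to its successor states
           (from the viewpoint of the player to move).
    lost:  terminal states where the player to move has lost.
    A state is a win if some successor is a loss (for the opponent,
    who moves there), and a loss if all successors are wins.  When a
    full sweep changes nothing, every remaining state is a draw."""
    value = {s: "loss" for s in lost}
    changed = True
    while changed:
        changed = False
        for s in moves:
            if s in value:
                continue
            succ = [value.get(t) for t in moves[s]]
            if "loss" in succ:
                value[s] = "win"
                changed = True
            elif succ and all(v == "win" for v in succ):
                value[s] = "loss"
                changed = True
    for s in moves:
        value.setdefault(s, "draw")
    return value

# A three-state chain ending in a loss, plus a two-state cycle (a draw):
example = solve({"a": ["b"], "b": ["end"], "c": ["d"], "d": ["c"]},
                lost={"end"})
print(example)
```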
It can be done much more efficiently, but for N = 15 my program requires something like 30 iterations through most of the approximately 1,000,000,000 positions, and more on a minority of them.
Since at any time only a limited number of moves can be played before someone picks up the pile (in theory N but often much fewer), we can trade space for time by only storing the values of the $3^N$ quiet positions - those where the pile is empty. That's what my program does, but for N = 15 that's still more than 14 million values.
We can simplify even more by first considering only endgames - those positions where the talon is empty (and will therefore remain empty for the rest of the game). If we focus on quiet endgame positions we are down to $2^N$ - about 32,000 in the case of 15 cards.
I wanted to see if it's possible to play optimally using much less data by somehow assigning a point-value to each card and summing those values. For instance, the values of the 32 quiet endgame positions for N = 5 can be compactly described by giving the cards 1, 2, 3, 4, 5 the point-values $-2, -1, 0, 1, 2$ respectively. To figure out if we're winning or not, we sum the values of our own cards, subtract the total value of our opponent's cards, and finally add or subtract 1 point for having or not having the move respectively. If the final score is positive we have a winning strategy, and if it's negative, we're losing.
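Spelled out in code, the rule is just a signed sum. The function below is my transcription of it, not taken from the original program:

```python
POINT_VALUES = {1: -2, 2: -1, 3: 0, 4: 1, 5: 2}  # the N = 5 values above

def score(our_cards, their_cards, we_have_move):
    """Point-score for quiet N = 5 endgames: positive means the side
    holding our_cards has a winning strategy, negative that it is losing."""
    total = sum(POINT_VALUES[c] for c in our_cards)
    total -= sum(POINT_VALUES[c] for c in their_cards)
    return total + (1 if we_have_move else -1)

print(score({4, 5}, {1, 2, 3}, True))  # 3 - (-3) + 1 = 7: winning
```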
For N = 6 this sort of thing still works, but we have to tinker a bit to get the point-values right. One solution is to set them to $-7, -5, -2, 3, 6$, and $8$, with a correction of $\pm 4$ for having or not having the move. Technical remark to ensure reproducibility of my results: The position where the player to move holds all the cards is regarded as winning for that player (since they can play their cards from the bottom up and force the opponent to pick up). But that position is impossible to reach.
So how do we find a set of point-values that makes this work, and can we do it for more than 6 cards?
To cast this in the terminology of machine-learning, let's not call it the point-value of a card, but instead the weight. So we want to find weights $w_1, w_2, \dots, w_N$ for the cards, and also a suitable bias $b$. That's what we now call the correction term for having or not having the move.
Since I tend to want to see things from the perspective of the player who just moved, I insist that the bias $b$ should be added when we're not making the next move and subtracted when we are, which means $b$ will have to be negative if having the move is an advantage.
In order to find the weights for the case N = 6, I used (buckle up, here are some buzz words of AI) gradient descent and a sigmoid neuron. There's also something called Support Vector Machines, but let's go with the sigmoid neuron here. The picture (well, for N = 7) looks like this:
There are N input nodes (labeled 1,..., 7 in the picture) that we can call $c_1,\dots, c_N$ (c for card) and that take the value $+1$ if we (who made the last move) have that card, and $-1$ if the other player has it. Each input is multiplied by the corresponding weight $w_i$, and the bias $b$ is added to the sum. The result is then passed to the sigmoid neuron, which computes the function
\[\sigma(x) = \frac1{1+e^{-x}}.\] The sigmoid function outputs a number between 0 and 1 which is close to 0 if the input is clearly negative and close to 1 if the input is clearly positive, with a smooth transition for inputs around 0. In our case we can think of the output as a tentative answer to how likely it is that the player who just moved is winning. Ideally, the answer should be 1 if that player actually has a forced win, and 0 otherwise (there exist drawn positions, but they are quite rare, so let's for the moment go with just classifying positions into winning or not winning).
The output will never be exactly 0 or 1, but we'd like to correctly classify each card distribution in the sense of having an output larger than 1/2 if the player who just moved is winning, and smaller than 1/2 otherwise. Since $\sigma(0) = 1/2$, this puts the threshold at zero, so what we're asking for is that the point-score (which is sent as input to the sigmoid function) should be positive if we're winning and negative if we're losing.
We can start by setting all the weights and the bias to 0 (or whatever, really). Then we use the table-base of known values to train the system (using known values like this is what's called supervised learning). We feed the system an input vector (a card distribution) and calculate the output. Then we make small adjustments to $w_1,\dots, w_N$ and $b$ that all push the output a little bit in the direction of making it more correct.
Here it actually makes sense to adjust the bias $b$ by
\[ b := b + 0.1 \cdot \left(\text{correct value} - \text{actual output}\right).\] This pushes $b$ in the right direction (increasing it if we want a larger output, decreasing it if we want smaller), but looks a bit too simple. But in case you like partial derivatives of error functions and stuff, there's actually a theoretical justification for updating $b$ in this particular way, and a key word is cross entropy error. The learning rate $0.1$ is admittedly a bit arbitrary though.
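For the curious, that justification is short. Write $a = \sigma(z)$ for the output, with $z = \sum_i w_i c_i + b$, and let $y$ be the correct value. The cross-entropy error of a single example is \[E = -\,y \ln a - (1-y)\ln(1-a),\] and since $\sigma'(z) = \sigma(z)\,(1-\sigma(z))$, the chain rule gives \[\frac{\partial E}{\partial b} = \left(\frac{1-y}{1-a} - \frac{y}{a}\right) a\,(1-a) = a - y.\] The gradient-descent step $b := b - \eta\,\partial E/\partial b$ with learning rate $\eta = 0.1$ is then exactly the update above: the sigmoid's derivative cancels against the error's, leaving just the difference between correct value and actual output.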
The weights $w_i$ are updated in the same way, except that the sign of the correction is determined by $c_i$ (the input vector, which is $\pm 1$ depending on who has the i:th card):
\[ w_i := w_i + 0.1 \cdot c_i \cdot \left(\text{correct value} - \text{actual output}\right).\] In the case N = 6 this works like a charm, and already after one epoch of training (going through each of the 64 input vectors once and then making the total adjustments), the system finds the weights $-1.3, -0.9, -0.3, 0.5, 1.1, 1.5$ and the bias $-0.7$, which correctly classify each of the 64 card distributions. These decimal numbers give the integer point-values that I suggested earlier if we multiply everything by 5 and round away from 0.
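The update rule itself is easy to replicate. The toy run below is my own example rather than the Shed data: it trains a single sigmoid neuron, with exactly these updates applied per example, on the majority function of three $\pm 1$ inputs, which, like the $N = 6$ endgames, is linearly separable:

```python
import itertools
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# All eight +/-1 input vectors, labeled 1 when the majority is positive.
data = [(c, 1 if sum(c) > 0 else 0)
        for c in itertools.product((-1, 1), repeat=3)]

w, b = [0.0, 0.0, 0.0], 0.0
for epoch in range(100):
    for c, y in data:
        out = sigmoid(sum(wi * ci for wi, ci in zip(w, c)) + b)
        b += 0.1 * (y - out)                 # the bias update above
        for i in range(3):
            w[i] += 0.1 * c[i] * (y - out)   # the weight update above

# Every input should now land on the right side of 1/2:
correct = all((sigmoid(sum(wi * ci for wi, ci in zip(w, c)) + b) > 0.5) == (y == 1)
              for c, y in data)
print(correct)  # True
```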
But when we try the same thing for N = 7, we run into a bit of trouble. Our single neuron brilliantly classifies 124 out of the 128 quiet endgame positions, but has difficulty with the remaining four. With a bit of improvised notation again, those four positions are
\[\left(\text{7-6-5-4-3-2-1}^+ \,\, \square \,\, - \right) \, \mapsto \, 0,\] \[\left(\text{5}^+ \,\, \square \,\, \text{7-6-4-3-2-1}\right) \, \mapsto 0,\] \[\left(\text{7-5-3-2}^+ \,\, \square \,\, \text{6-4-1}\right) \, \mapsto 1, \text{ and}\] \[\left(\text{6-5-4-1}^+ \,\, \square \,\, \text{7-3-2}\right) \, \mapsto 1.\] Here the hands are written with a little square table between them as in bridge diagrams. The whole position (within parentheses) is then mapped by the ideal value to 1 (left player winning) or 0 (right player winning). The plus-signs indicate that a certain player just added cards to their hand (and the other is about to play).
We can now see that it's impossible for a single neuron to correctly classify these positions: If we sum the weights (and biases) involved for the first two positions, we get $2w_5 + 2b$ (the card 5 is in "our" hand in both of them, and all the other cards have changed places). If we do the same thing for the last two, we get the same result: $2w_5 + 2b$. Demanding a negative point-score for the first two positions and at the same time a positive point-score for the last two is clearly asking too much: The sum of two negative numbers can't be the same as the sum of two positive ones.
It's true that the first of the four difficult positions is the final position of the game, where the left player has already lost, and that if we remove that position from the training data, the system will learn to correctly classify the remaining 127.
We could argue more generally that the problem is that we ask the system to compare hands with different numbers of cards in them. It makes sense therefore to try with one set of weights for each pattern, so that for instance we would have a specific set of weights for the pattern 4-3 obtained by training only on positions where the player who just picked up has 4 cards and the other player has 3. This works fine for quiet endgames of 7 cards, and it would be a feasible design also for much larger $N$, since even if we count the patterns of quiet but non-endgame positions, there would only be around $N^2/2$ of them.
But this approach too has its limitations, even for quiet endgames. For 10 cards there are drawn endgames, but even if those positions are removed from the training data, it turns out to be impossible for a single neuron to classify the remaining ones.
For the pattern 6-4 there are 210 positions of which 4 are drawn. When trained on the remaining 206 positions, the system keeps misclassifying two of them. It turns out that the values assigned to the remaining positions show clearly where the problem lies. If after 100 training epochs we list those positions that are assigned a value more than 0.3 away from the correct value, we get the following:
\[\left(\text{10-9-6-5-2-1}^+ \,\, \square \,\, \text{8-7-4-3} \right) \, \mapsto \, 0 \quad \neq 0.49,\] \[\left(\text{10-8-6-5-3-1}^+ \,\, \square \,\, \text{9-7-4-2}\right) \, \mapsto 0 \quad \neq 0.38,\] \[\left(\text{9-7-6-5-4-2}^+ \,\, \square \,\, \text{10-8-3-1}\right) \, \mapsto 0 \quad \neq 0.46,\] \[\left(\text{8-7-6-5-4-3}^+ \,\, \square \,\, \text{10-9-2-1}\right) \, \mapsto 0 \quad \neq 0.36,\] \[\left(\text{10-7-6-5-3-2}^+ \,\, \square \,\, \text{9-8-4-1}\right) \, \mapsto 1 \quad \neq 0.42,\] \[\left(\text{9-8-6-5-4-1}^+ \,\, \square \,\, \text{10-7-3-2}\right) \, \mapsto 1 \quad \neq 0.42.\] The numbers to the right are the outputs of the system. Here we can see that the average point-score of the first four (losing) positions will always be the same as that of the last two (winning) ones, namely $w_6+w_5 + b$. And the average of a list of positive numbers can't equal the average of a list of negative ones.
If we keep training, the system's values for all other positions converge correctly to 0 or 1, while it adapts as best it can to the impossible constraints posed by the 6 positions listed, giving each of them a limiting value of 1/3.
This isn't really a failure of the approach. We can't possibly expect a small set of linear functions to flawlessly tell winning positions from losing, any more than we expect point-values for chess pieces to solve the riddle of chess, or high-card points to provide a perfect bidding system in bridge.
It's interesting that the single-neuron system discovers these examples, and it actually gets weirder. Once we have set up the "architecture" of the system, we can train it also on non-endgame positions. We can choose to let an input value of $c_i = 0$ represent that the i:th card is still in the talon, and again train the system on a fixed pattern $(m, n)$, meaning that the player who just picked up has $m$ cards and the other player has $n$ cards.
When we get to N = 6 and the pattern 2-0, we find something strange: In a few rounds the system learns to correctly classify all 15 positions, but it insists that $w_2$ must be smaller than $w_1$! And indeed the only time (for this particular pattern) it matters whether you have the 1 or the 2 is if the other card is the 5, and in that case it's better to have the 1 than to have the 2!
For N = 7 and the pattern 2-2, again the single neuron fails to correctly classify all positions, and among the difficult positions we find \[\left(\text{7-2}^+ \,\, \square \,\, \text{6-3} \right) \, \mapsto \, 1.\] Here the left player is winning, but if we replace the 3 on the right hand by the 1, left is losing, and if moreover we replace the 2 on the left hand with the 3 that we removed from the right hand, left is still losing: \[\left(\text{7-3}^+ \,\, \square \,\, \text{6-1} \right) \, \mapsto \, 0.\]
Alright, that's all for now!