Simulations were performed to establish the relation between the rating difference R between backgammon players of various strengths and the error rates of the two players. This allows a match analysis to translate the error rate of each player into an estimate of the players' absolute ratings, once the rating of "perfect play" has been defined. Of course, this estimate is only as good as the quality of the analysis, and it breaks down when estimating the rating of players whose level is comparable to the level at which the analysis was performed. However, the measured relation does not depend on the analysis level and remains valid for very high levels of analysis such as rollouts.
Two sets of simulations were performed. In the first, the relation between the difference in normalized error rate per unforced decision (DEPD) and the rating difference R was measured. The main result of the first simulation set is an almost linear relation between DEPD and R, which depends somewhat on the match length. As a by-product of this investigation, the relative strengths of various playing levels of GNUBG were determined precisely in terms of rating.
The second set of simulations attempts to improve on the linear model of the first set, by separating out the effects of cube errors and chequer play errors. A bilinear fit was made of R to both the normalized chequer play error per unforced move (EPM) and the cube error per unforced cube decision (EPC). The resulting bilinear fit is shown to improve the prediction of the rating difference, especially for matches with very poor cube handling compared to chequer play.
The scripts used for the simulations and a readme file explaining their use are available here.
The final result is the following formula for the rating difference:
R = a2(N)*EPM+b(N)*EPC,
where
a2(N) = 8798 + 25526/N,
and
b(N) = 863 - 519/N.
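As a quick illustration, the final formula can be evaluated directly. The sketch below is ours (function name and example inputs are illustrative, not from the study):

```python
def rating_difference(n, epm, epc):
    """Estimate the rating difference R for an n-point match from the
    normalized chequer play error per unforced move (EPM) and the cube
    error per unforced cube decision (EPC), using the bilinear fit
    R = a2(N)*EPM + b(N)*EPC reported above."""
    a2 = 8798 + 25526 / n
    b = 863 - 519 / n
    return a2 * epm + b * epc

# Example: a 7-point match with EPM = 0.005 and EPC = 0.01
r = rating_difference(7, 0.005, 0.01)
```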
In order to estimate the rating difference between playing levels, we played a series of 1000 matches of a given match length between various levels and estimated the rating difference by analyzing the results. From the analysis we obtain the normalized error rate per unforced decision and the "luck adjusted result" (P), which is an estimate of the match winning probability using variance reduction. These numbers were averaged over the 1000 matches and the variance was also estimated. The rating difference R and a 90% confidence interval were then estimated from the 90% confidence interval of P using the FIBS rating formula, P = 1/(1 + 10^(R*sqrt(N)/2000)), with N the match length.
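Recovering R from the averaged luck adjusted result amounts to inverting the FIBS formula above. A minimal sketch (the function name is ours):

```python
import math

def rating_from_result(p, n):
    """Invert the FIBS rating formula P = 1/(1 + 10^(R*sqrt(N)/2000))
    to recover the rating difference R from the (luck adjusted) match
    winning probability p for a match of length n."""
    return 2000.0 / math.sqrt(n) * math.log10((1.0 - p) / p)
```

Applying the same inversion to the endpoints of the 90% confidence interval of P yields the confidence interval for R.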
Simulations were performed for match lengths of 1, 3, 5, 7, 11, and 23. For practical reasons it was necessary to perform the error and luck analysis at 0-ply (expert level of GNUBG) and consider only levels at the expert level (0-ply) and below (0-ply with noise addition). Equal noise was generated for chequer play and for cube decisions. The following levels (0-ply with given noise) were paired:
player 1 | player 2
0        | 0.005
0        | 0.015
0        | 0.02
0        | 0.04
0        | 0.06
0.04     | 0.06

Level pairings
The assumption that the rating difference between players depends only on DEPD, and not on the absolute level of play, was implicitly verified with this choice, as the last pairing has both players playing with noise addition. If the absolute EPD mattered, the last pairing would give results inconsistent with the first five; this is not the case, except for 1pt matches, as described in the results section.
With these settings, the simulation took about 100 hours on a 950 MHz Pentium III. Simulations were performed on Windows XP, using GAWK and sh scripts from the GNU utilities in combination with the no-GUI version of GNUBG (build 03 08 07). They should work equally well on Linux.
Three AWK scripts are used to create GNUBG scripts that play a given number of matches, analyze them, and collect statistics. Separate scripts were used in order to modularize the tasks, so the simulation can be interrupted and continued without loss of data. The resulting match statistics were then processed through a fourth AWK script, which computes the average rating R and the average DEPD over the set, as well as the 90% confidence intervals as estimated from the variance. The levels of the players and the analysis level are set in the .gnubgautorc file, outside the scripts. Further analysis was then performed in MATLAB.
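The statistics step amounts to averaging the per-match numbers and deriving a 90% confidence interval from the sample variance. A minimal sketch in Python of what the collection script computes (the AWK script itself is not reproduced here; 1.645 is the two-sided 90% quantile of the normal approximation):

```python
import math

def mean_and_ci90(values):
    """Average a list of per-match statistics (e.g. DEPD or the luck
    adjusted result) and return (mean, half-width of the 90% CI),
    using the sample variance and the normal approximation."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    half = 1.645 * math.sqrt(var / n)
    return mean, half
```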
The results of the simulations are given in the figure below for all match lengths.
[Figure: R versus DEPD for various match lengths]
More detailed results are given in the figures below for each specific match length. The measured data points are indicated by circles. The red line is a piecewise linear interpolation through the data points, the black lines indicate the 90% confidence interval, and the green line is a least-squares fit through the data points (minimizing the sum of squares of absolute rating differences).
[Figures: 3pt, 5pt, 7pt, 11pt, 23pt, and 1pt matches]
For all match lengths except 1, the linear fit seems quite good and we can approximate the rating difference by R = a(N)*DEPD, where N is the match length. The function N*a(N) is plotted in the figure below, along with a linear least-squares fit.
[Figure: coefficient N*a(N) versus match length]
The linear fit of N*a(N) gives the following rating formula:
R = (11971 + 23681/N) * DEPD.
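This linear model can be evaluated directly; the sketch below is ours (function name is illustrative):

```python
def rating_linear(n, depd):
    """Linear model from Part I: rating difference as a function of
    match length n and the difference in normalized error rate per
    unforced decision (DEPD), R = (11971 + 23681/N) * DEPD."""
    return (11971 + 23681 / n) * depd
```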
In the figure below the formula is plotted along with the actual data points.
[Figure: rating versus DEPD approximated by R = (11971 + 23681/N) * DEPD. The dotted lines are the prediction; the solid lines are piecewise linear segments through the data.]
The bump in the curve for the 1pt match around DEPD = 0.015 is caused by the data point from the 0.04 - 0.06 (intermediate versus beginner) pairing. Apparently the approximation that the rating difference does not depend on the absolute values of the error rates is less accurate for 1pt matches.
Note that, for the purpose of measuring DEPD between simulated players, it is a reasonable approximation to analyze a match at expert level if both players are playing at expert level or worse. This is because we are only interested in the difference between the error rates. When a match is re-analyzed at World Class level (2-ply) the individual error rates increase, but the difference remains approximately constant. To verify this explicitly, we ran two sets of 40 matches between the 0-ply expert level and the 0-ply advanced level (0.015 noise) and analyzed them both at 0-ply and at 2-ply. The results are given in the table below.
0-ply  | 2-ply
0.0075 | 0.0080
0.0068 | 0.0072
We see that the error in the 0-ply estimation appears to be on the order of 5%. As an additional verification, a set of 100 3pt matches was played between the World Class level (2-ply) and the expert level, and analyzed at 2-ply. The resulting DEPD and rating difference R were well predicted by the model. The data point is indicated by the 'X' on the figure for the 3pt match above and fits well on the curve.
The data is well approximated by a linear relation between rating difference and error-per-decision difference of the form R = a(N)*DEPD, where N is the match length. An excellent approximation for a(N) is a(N) = 23681/N + 11971. The approximation is worst for N=1. The assumption that R does not depend on the absolute EPD but only on the difference was validated, except for N=1. The assumption that DEPD for matches between players at or below the GNUBG expert level can be approximated by a 0-ply analysis was verified by re-analyzing some matches at 2-ply, with only a small effect on the DEPD. An explicit simulation of a series of 2-ply versus 0-ply matches resulted in an estimated R which fitted the model very well, supporting the assumption that the linear relation between R and DEPD is valid for any level of analysis.
In the above we have assumed that the rating depends only on DEPD = (totMoveErrors + totCubeErrors)/totalDecisions. In reality the rating difference will depend on some combination of chequer play skill and cube skill, for which the above formula makes an educated guess. A more powerful predictor, taking into account the separate effects of cube errors and chequer play errors, can be obtained by simulating matches with independently varied cube and chequer play noise and fitting the rating to a bilinear form R = a2(N)*EPM + b(N)*EPC, with EPM = moveErr/unforcedMoves and EPC = cubeErr/unforcedCubeDecisions.
For match lengths of 3, 5, 7, 11, and 23, 500 matches between GNUBG 0-ply and GNUBG 0-ply with 16 different noise settings were played. The chequer play noise settings were 0, 0.02, 0.04, 0.06 and the cube play noise settings were 0.05, 0.1, 0.4, 0.8, 16 combinations in total. For each set of 500 matches the EPM and EPC were averaged. The rating difference was estimated as in Part I by using the luck adjusted result.
For each match length the 16 data points were fitted to R = a2(N)*EPM + b(N)*EPC with a least-squares method, minimizing the sum of squares of the rating differences. For N=1 there are no cube decisions, and a2(1) = a(1) = 37916.
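The bilinear fit itself is a standard least-squares problem without an intercept. A sketch with NumPy on made-up illustrative data (the actual 16-point data sets are not reproduced here; the arrays below are constructed from known coefficients purely to demonstrate the fitting step):

```python
import numpy as np

# Illustrative per-setting averages: EPM, EPC, and the rating
# difference R, here generated exactly from a2 = 12000, b = 800
# so the fit can be checked (not actual simulation output).
epm = np.array([0.0, 0.01, 0.02, 0.03])
epc = np.array([0.005, 0.01, 0.02, 0.04])
r = 12000 * epm + 800 * epc

# Least-squares fit of R = a2*EPM + b*EPC (no intercept, as in the model).
A = np.column_stack([epm, epc])
(a2, b), *_ = np.linalg.lstsq(A, r, rcond=None)
```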
In the figures below we plot R against DEPD. The green "O" marks the measured rating differences, the green dots the 90% confidence interval of the data, the red "X" the linear fit using the formula from Part I, and the blue "X" the prediction from the bilinear fit with the coefficients a2 and b as indicated.
[Figures: measurements and fit for match lengths 3, 5, 7, 11, and 23]
The coefficients a2(N) and b(N) as a function of match length are well approximated by a linear function of 1/N. We show these coefficients and a linear fit in the two figures below.
[Figures: fitting N*a2(N) and N*b(N) with linear functions]
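Since a2(N) = c + d/N is linear in 1/N, this fitting step is a one-line polynomial fit. A sketch, here regenerating the coefficients from the reported fit a2(N) = 8798 + 25526/N so the recovery can be checked:

```python
import numpy as np

# Coefficient values per match length, generated from the reported
# fit a2(N) = 8798 + 25526/N (illustrative, not the raw measurements).
n = np.array([3.0, 5.0, 7.0, 11.0, 23.0])
a2_meas = 8798 + 25526 / n

# a2(N) = c + d/N is linear in 1/N, so a degree-1 fit in 1/N
# recovers the slope d and intercept c.
d, c = np.polyfit(1.0 / n, a2_meas, 1)
```

The same procedure applies to b(N) with the cube-error coefficients.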
The bilinear fit improves on the linear fit, especially for large cube errors. Usually the linear estimate is too high, reflecting too much weight given to the cube errors, which the bilinear formula corrects. The data is well approximated by
R = a2(N)*EPM + b(N)*EPC,
where
a2(N) = 8798 + 25526/N,
and
b(N) = 863 - 519/N.
Player1      | Player2                 | Match length | R1-R2         | DEPD (e2-e1) | EPM      | EPC
expInt       | intExp                  | 5pt          | 472 [451 493] | -            | -0.0344  | -0.01
expBeg       | advExp                  | 5pt          | 115 [103 127] | -            | -0.00865 | +0.0176
expExp       | expBeg                  | 5pt          | 31 [21 41]    | -            | 0        | -0.0185
expExp       | expBeg                  | 11pt         | 29 [23 36]    | -            | 0        | -0.0165
expExp       | expBeg2 (1.0 cubenoise) | 5pt          | 191 [176 206] | -            | 0        | -0.192
wclassWclass | wclassExp               | 5pt          | 6 [-15 27]    | -            | -        | -
1-ply        | 0-ply                   | 5pt          | 13 [-4 30]    | -            | -        | -

Some miscellaneous measurements. expBeg, for example, means chequer play at expert level and cube play at beginner level. The rating interval is the 90% confidence interval.