Computer Chess Club Archives



Subject: Re: planning a SSE-optimized chess engine

Author: Gerd Isenberg

Date: 13:11:36 01/13/05



<snip>
>Having said that it would be nice if I could show straightforward vectorization
>of your code. Alas, things are not that simple (and I hope to get new insights
>in this forum). Let’s start with a slight simplification (pre-compute the shift
>factors and use a 32-bit bitboard):
>
>unsigned int bits32[64];  /* precomputed shifts */
>
>int dotProduct32(unsigned int bb, unsigned char weight[])
>{
> int i;
> unsigned int sum = 0;
>#pragma vector aligned   /* <- used assuming weight is 16-byte aligned */

I see MOVDQA versus MOVDQU.

> for (i=0; i < 32; i++) {
>    if (bb & bits32[i]) sum += weight[i];
> }
> return sum;
>}
>
>This will vectorize using the Intel compiler (also note that your “hint” on
>masking the reduction is not required):


Hi Aart,

Wow, really more than I expected: branchless SSE code!

I guess your main focus with SSE/SSE2/SSE3 is more on float or double arithmetic
than on integers: signed and unsigned chars, shorts, ints and even __int64, with
a rather non-orthogonal instruction set and a lot of special cases.

One of these special cases applies here: using the psadbw instruction to add
unsigned chars. As far as I can see, you process four conditional adds per
iteration instead of the 16 possible with psadbw and a zero second operand.
The idea is not to zero-extend the bytes to dwords (assuming xmm1 is zero), but
to make eight unpacked copies of each bitboard byte and mask them with
0x8040201008040201. Of course this is very specific to unsigned chars, while
your output is more general and works for all integer types.

Such nasty and error-prone tricks, like assuming weight bytes are <= 63 and
performing three vertical saturating adds (for safety) before one horizontal
add, are still the domain of assembly programmers. I guess there is no pragma to
tell the compiler about a reduced value range ;-)
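
To make the psadbw idea concrete, here is a minimal sketch with SSE2 intrinsics. It is not the original assembly routine: the function names are hypothetical, unaligned loads are used for simplicity, and it assumes every weight byte is <= 63 so that four bytes added vertically still fit into one unsigned char.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Scalar reference: sum of weights[i] over all set bits i of bb. */
static int dot_product64_ref(uint64_t bb, const unsigned char weights[64])
{
    int i, sum = 0;
    for (i = 0; i < 64; i++)
        if ((bb >> i) & 1) sum += weights[i];
    return sum;
}

/* Hypothetical name. Assumes each weight is <= 63 (4 * 63 = 252 <= 255). */
static int dot_product64_sse2(uint64_t bb, const unsigned char weights[64])
{
    const __m128i zero = _mm_setzero_si128();
    /* Byte j of each quadword selects bit j: 0x01, 0x02, ..., 0x80. */
    const __m128i bitmask = _mm_set1_epi64x((long long)0x8040201008040201ULL);
    __m128i x0, x1, x2, x3;

    /* Expand the 8 bitboard bytes so that each byte appears 8 times. */
    x0 = _mm_loadl_epi64((const __m128i *)&bb);
    x0 = _mm_unpacklo_epi8(x0, x0);    /* b0 b0 b1 b1 ... b7 b7 */
    x2 = _mm_unpackhi_epi16(x0, x0);   /* b4 x4 ... b7 x4 */
    x0 = _mm_unpacklo_epi16(x0, x0);   /* b0 x4 ... b3 x4 */
    x1 = _mm_unpackhi_epi32(x0, x0);   /* b2 x8, b3 x8 */
    x0 = _mm_unpacklo_epi32(x0, x0);   /* b0 x8, b1 x8 */
    x3 = _mm_unpackhi_epi32(x2, x2);   /* b6 x8, b7 x8 */
    x2 = _mm_unpacklo_epi32(x2, x2);   /* b4 x8, b5 x8 */

    /* pand + pcmpeqb: 0xFF in byte i if bit i of bb is set, else 0x00. */
    x0 = _mm_cmpeq_epi8(_mm_and_si128(x0, bitmask), bitmask);
    x1 = _mm_cmpeq_epi8(_mm_and_si128(x1, bitmask), bitmask);
    x2 = _mm_cmpeq_epi8(_mm_and_si128(x2, bitmask), bitmask);
    x3 = _mm_cmpeq_epi8(_mm_and_si128(x3, bitmask), bitmask);

    /* The "and" multiplication: keep the weight only where the bit is set. */
    x0 = _mm_and_si128(x0, _mm_loadu_si128((const __m128i *)(weights +  0)));
    x1 = _mm_and_si128(x1, _mm_loadu_si128((const __m128i *)(weights + 16)));
    x2 = _mm_and_si128(x2, _mm_loadu_si128((const __m128i *)(weights + 32)));
    x3 = _mm_and_si128(x3, _mm_loadu_si128((const __m128i *)(weights + 48)));

    /* Three vertical saturating adds, then one horizontal add via psadbw. */
    x0 = _mm_adds_epu8(_mm_adds_epu8(x0, x1), _mm_adds_epu8(x2, x3));
    x0 = _mm_sad_epu8(x0, zero);       /* two partial sums, one per quadword */
    return _mm_cvtsi128_si32(x0) + _mm_extract_epi16(x0, 4);
}
```

With weights above 63 the saturating adds clamp at 255 instead of wrapping, so the result is then too small rather than random garbage.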

>
>[C:/temp] icl -Fa -Qunroll0 -nologo -QxP -c dot32.c
>dot32.c
>dot32.c(10) : (col. 2) remark: LOOP WAS VECTORIZED.
>
>In its “rerolled” form (for simplicity I used -Qunroll0), the generated code
>looks like:
>
>         <setup>
>L:      movdqa    xmm4, XMMWORD PTR _bits32[0+eax*4]
>        pand      xmm4, xmm0
>        pcmpeqd   xmm4, xmm1
>        movd      xmm3, DWORD PTR [eax+edx]
>        punpcklbw xmm3, xmm1
>        punpcklwd xmm3, xmm1

I guess xmm1 is zero and you zero-extend four bytes to four dwords.

>        add       eax, 4
>        cmp       eax, 32
>        pandn     xmm4, xmm3
>        paddd     xmm2, xmm4
>        jb        L
>        <compute partial sums>
>
>
>Your 64-bit version gives my vectorizer more headaches however. Let me ponder
>about this some more to see what can be improved in the Intel compiler (looks
>like I am back at my job rather than focusing on the chess engine :-).

Sorry for that ;-)
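
For readers who prefer intrinsics to assembly, the quoted 32-bit loop (pand, pcmpeqd, punpcklbw, punpcklwd, pandn, paddd) corresponds roughly to the sketch below. The function name is hypothetical, the bits table is assumed to hold 1 << i, and unaligned loads replace movdqa; the final reduction stands in for the elided "compute partial sums" step and is only one possible way to do it.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <string.h>     /* memcpy */

/* Hypothetical name; bits[i] is assumed to hold 1u << i. */
static int dot_product32_sse2(unsigned int bb,
                              const unsigned int bits[32],
                              const unsigned char weight[32])
{
    const __m128i zero = _mm_setzero_si128();         /* plays the role of xmm1 */
    const __m128i vbb  = _mm_set1_epi32((int)bb);     /* bb in all four dwords */
    __m128i sum = zero;
    int i;

    for (i = 0; i < 32; i += 4) {
        int w4;
        __m128i m, w;

        m = _mm_loadu_si128((const __m128i *)(bits + i));
        m = _mm_and_si128(m, vbb);       /* pand: isolate the tested bits */
        m = _mm_cmpeq_epi32(m, zero);    /* pcmpeqd: all ones where bit is CLEAR */

        memcpy(&w4, weight + i, 4);      /* movd: load four weight bytes */
        w = _mm_cvtsi32_si128(w4);
        w = _mm_unpacklo_epi8(w, zero);  /* punpcklbw: bytes to words */
        w = _mm_unpacklo_epi16(w, zero); /* punpcklwd: words to dwords */

        /* pandn: (~m) & w keeps the weight only where the bit is set. */
        sum = _mm_add_epi32(sum, _mm_andnot_si128(m, w));
    }

    /* Add the four partial sums horizontally. */
    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 8));
    sum = _mm_add_epi32(sum, _mm_srli_si128(sum, 4));
    return _mm_cvtsi128_si32(sum);
}
```

The inverted compare followed by pandn is what makes the loop branchless: the comparison against zero marks the lanes to drop, and pandn flips that mask while applying it.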

The assembly routine was, by the way, a kind of teamwork here. Tony Werten asked
an initial question about a weighted popcount, IIRC. Bob Hyatt mentioned Cray's
vector instructions, and that gave me the idea for the sequence copy, pand,
pcmp, pand, padd. The initial routine simply made eight copies of a bitboard
using one punpcklqdq and three movdqa, and masked them with eight quadwords from
0x8080..., ..., 0x0101...
The byte array was therefore rotated by 90 degrees. It took a few cycles less,
about 34 on my amd64 box. Anthony Cozzie had the idea of using the unpack
sequence.
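
A rough sketch of that initial variant, again with intrinsics rather than the original assembly. The names are hypothetical, and the exact orientation of the rotated byte array is an assumption: here the weights are stored transposed, so that wrot[8*k + j] holds the weight of bit 8*j + k. Weights are assumed to be <= 63 as before.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: transposed layout, wrot[8*k + j] = w[8*j + k]. */
static void rotate_weights(const unsigned char w[64], unsigned char wrot[64])
{
    int j, k;
    for (k = 0; k < 8; k++)
        for (j = 0; j < 8; j++)
            wrot[8 * k + j] = w[8 * j + k];
}

/* Hypothetical name. Assumes each weight is <= 63. */
static int dot_product64_rotated(uint64_t bb, const unsigned char wrot[64])
{
    const __m128i zero = _mm_setzero_si128();
    /* Two copies of the bitboard per register; reused for all four masks. */
    const __m128i copies = _mm_set1_epi64x((long long)bb);
    __m128i sum = zero;
    int k;

    for (k = 0; k < 8; k += 2) {
        /* Low quadword tests bit k of every byte, high quadword bit k+1:
           0x0101..., 0x0202..., up to 0x8080... */
        const __m128i mask = _mm_set_epi64x(
            (long long)(0x0101010101010101ULL << (k + 1)),
            (long long)(0x0101010101010101ULL << k));
        /* pand + pcmpeqb: 0xFF where the tested bit is set. */
        __m128i hit = _mm_cmpeq_epi8(_mm_and_si128(copies, mask), mask);
        __m128i wv  = _mm_loadu_si128((const __m128i *)(wrot + 8 * k));
        /* pand + saturating padd: accumulate the selected weights bytewise. */
        sum = _mm_adds_epu8(sum, _mm_and_si128(hit, wv));
    }

    sum = _mm_sad_epu8(sum, zero);     /* horizontal add of each quadword */
    return _mm_cvtsi128_si32(sum) + _mm_extract_epi16(sum, 4);
}
```

No unpack sequence is needed here because the bitboard is only copied, never expanded; the price is that the weight table has to be laid out to match the masks.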

I still prefer the rotated version with precalculated const weight arrays: 64
for center plus controls near the own king and 64 for controls near the opposite
king, indexed by the appropriate king squares (for one game state) and added
bytewise on the fly before performing the "and" multiplication.
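
The bytewise add on the fly could look like the sketch below. The helper name and the idea of passing the two selected tables as plain pointers are assumptions for illustration; the real tables and their indexing are not shown in this post. The precalculated values must be chosen so that each summed byte stays within the range the later adds expect (<= 63 in the scheme above), because paddb wraps on overflow.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: out[i] = a[i] + b[i] for 64 weight bytes, where a and b
   would be the tables selected by the own and the opposite king square. */
static void add_weights64(const unsigned char a[64],
                          const unsigned char b[64],
                          unsigned char out[64])
{
    int i;
    for (i = 0; i < 64; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* paddb: 16 byte additions at once, no saturation. */
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi8(va, vb));
    }
}
```

In a real routine the four sums would probably stay in registers and feed the masking step directly instead of being stored to memory first.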

Gerd




Last modified: Thu, 15 Apr 21 08:11:13 -0700
