Author: Gerd Isenberg
Date: 13:11:36 01/13/05
<snip>

>Having said that it would be nice if I could show straightforward vectorization
>of your code. Alas, things are not that simple (and I hope to get new insights
>in this forum). Let's start with a slight simplification (pre-compute the shift
>factors and use a 32-bit bitboard):
>
>unsigned int bits32[64]; /* precomputed shifts */
>
>int dotProduct32(unsigned int bb, unsigned char weight[])
>{
>  int i;
>  unsigned int sum = 0;
>#pragma vector aligned /* <- used assuming weight is 16-byte aligned */

I see MOVDQA versus MOVDQU.

>  for (i = 0; i < 32; i++) {
>    if (bb & bits32[i]) sum += weight[i];
>  }
>  return sum;
>}
>
>This will vectorize using the Intel compiler (also note that your "hint" on
>masking the reduction is not required):

Hi Aart,

wow - really more than I expected - branchless SSE code!

I guess your main focus with SSE/2/3 is more on float and double arithmetic
than on the integer types - signed and unsigned chars, shorts, ints and even
__int64 - with their rather unorthogonal instruction set and a lot of special
cases. One of those special cases applies here: using the psadbw instruction
to add unsigned chars. As far as I can see, you process four conditional adds
per iteration instead of the 16 that would be possible using psadbw with a
zero second operand. The idea is not to zero-extend the bytes to dwords
(assuming xmm1 is zero), but to make eight unpacked copies of each byte and
mask them with 0x8040201008040201. Of course this is very specific to
unsigned chars, while your output is more general for all integer types.
Such nasty and error-prone tricks - assuming the weight bytes are <= 63 and
performing three vertical saturating adds (for safety) before one horizontal
add - are still the domain of assembler programmers. I guess there is no
pragma to tell the compiler about a reduced value range ;-)

>[C:/temp] icl -Fa -Qunroll0 -nologo -QxP -c dot32.c
>dot32.c
>dot32.c(10) : (col. 2) remark: LOOP WAS VECTORIZED.
>
>In its "rerolled" form (for simplicity I used -Qunroll0), the generated code
>looks like:
>
>      <setup>
>L:    movdqa    xmm4, XMMWORD PTR _bits32[0+eax*4]
>      pand      xmm4, xmm0
>      pcmpeqd   xmm4, xmm1
>      movd      xmm3, DWORD PTR [eax+edx]
>      punpcklbw xmm3, xmm1
>      punpcklwd xmm3, xmm1

I guess xmm1 is zero and you zero-extend four bytes to four dwords.

>      add       eax, 4
>      cmp       eax, 32
>      pandn     xmm4, xmm3
>      paddd     xmm2, xmm4
>      jb        L
>      <compute partial sums>
>
>Your 64-bit version gives my vectorizer more headaches however. Let me ponder
>about this some more to see what can be improved in the Intel compiler (looks
>like I am back at my job rather than focusing on the chess engine :-).

Sorry for that ;-)

The assembly routine was btw. a kind of teamwork here. Tony Werten asked the
initial question about a weighted popcount, iirc. Bob Hyatt told something
about Cray's vector instructions, which gave me the flash for the idea:
copy, pand, pcmp, pand, padd. The initial routine simply made eight copies
of the bitboard, using one punpcklqdq and three movdqa, and masked them with
eight quad words from 0x8080.. down to 0x0101... The byte array was therefore
rotated by 90 degrees. A few cycles less - about 34 on my amd64 box. Anthony
Cozzie had the idea of using the unpack sequence.

I still prefer the rotated version with precalculated const weight arrays -
64 for the center plus controls near the own king, and 64 for controls near
the opposite king, indexed by the appropriate king squares (for one
gamestate) and added bytewise on the fly before performing the
"and"-multiplication.

Gerd
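
PS: To make the byte-replication trick concrete, here is a rough C sketch
with SSE2 intrinsics (my reconstruction for illustration, not the exact team
routine - the name dotProduct64 is made up). It assumes weight[] is 16-byte
aligned and every weight <= 63, so the three vertical saturating byte adds
cannot overflow before the horizontal psadbw:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Weighted popcount of a 64-bit bitboard with byte weights.
 * Sketch only: assumes weight[] is 16-byte aligned, all weights <= 63. */
unsigned dotProduct64(uint64_t bb, const uint8_t weight[64])
{
    const __m128i zero = _mm_setzero_si128();
    /* per-byte selection masks 0x01..0x80, repeated for both halves */
    const __m128i sel = _mm_set_epi8(
        (char)0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01,
        (char)0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01);

    /* unpack sequence: make eight copies of every bitboard byte */
    __m128i v    = _mm_loadl_epi64((const __m128i*)&bb);
    __m128i d2   = _mm_unpacklo_epi8 (v,  v);      /* each byte twice   */
    __m128i d4lo = _mm_unpacklo_epi16(d2, d2);     /* b0..b3 four times */
    __m128i d4hi = _mm_unpackhi_epi16(d2, d2);     /* b4..b7 four times */
    __m128i r01  = _mm_unpacklo_epi32(d4lo, d4lo); /* b0 x8 | b1 x8 */
    __m128i r23  = _mm_unpackhi_epi32(d4lo, d4lo); /* b2 x8 | b3 x8 */
    __m128i r45  = _mm_unpacklo_epi32(d4hi, d4hi); /* b4 x8 | b5 x8 */
    __m128i r67  = _mm_unpackhi_epi32(d4hi, d4hi); /* b6 x8 | b7 x8 */

    /* bit set -> 0xFF byte mask -> weight byte, else 0 */
    const __m128i *w = (const __m128i*)weight;
    __m128i a0 = _mm_and_si128(_mm_cmpeq_epi8(_mm_and_si128(r01, sel), sel),
                               _mm_load_si128(w + 0));
    __m128i a1 = _mm_and_si128(_mm_cmpeq_epi8(_mm_and_si128(r23, sel), sel),
                               _mm_load_si128(w + 1));
    __m128i a2 = _mm_and_si128(_mm_cmpeq_epi8(_mm_and_si128(r45, sel), sel),
                               _mm_load_si128(w + 2));
    __m128i a3 = _mm_and_si128(_mm_cmpeq_epi8(_mm_and_si128(r67, sel), sel),
                               _mm_load_si128(w + 3));

    /* three vertical saturating adds, then one horizontal psadbw */
    __m128i sum8 = _mm_adds_epu8(_mm_adds_epu8(a0, a1), _mm_adds_epu8(a2, a3));
    __m128i sad  = _mm_sad_epu8(sum8, zero);   /* two 16-bit partial sums */
    return (unsigned)(_mm_cvtsi128_si32(sad)
                    + _mm_cvtsi128_si32(_mm_srli_si128(sad, 8)));
}

Each pcmpeqb/pand pair handles 16 conditional adds instead of four, and the
reduction is two psadbw halves instead of a dword tree.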
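PPS: For completeness, a sketch of the rotated variant as well - again my
reconstruction with made-up names (initRotated, rweight) and the same <= 63
assumption on the weights. The weight array is stored rotated by 90 degrees,
so one punpcklqdq duplicating the whole bitboard replaces the per-byte unpack
sequence. The masks here run upward from 0x0101.. instead of down from
0x8080.., which does not change the sum:

#include <emmintrin.h>
#include <stdint.h>

/* Rotated weight layout: register r tests bit 2r of every bitboard byte in
 * its low qword and bit 2r+1 in its high qword, so the weight for square
 * 8*j + 2*r lands in lane j and the weight for square 8*j + 2*r + 1 in
 * lane 8+j. Precompute once per weight set. */
static uint8_t rweight[4][16];

void initRotated(const uint8_t weight[64])
{
    int r, j;
    for (r = 0; r < 4; r++)
        for (j = 0; j < 8; j++) {
            rweight[r][j]     = weight[8*j + 2*r];
            rweight[r][8 + j] = weight[8*j + 2*r + 1];
        }
}

unsigned dotProduct64rotated(uint64_t bb)
{
    const __m128i zero = _mm_setzero_si128();
    /* pairs of per-byte bit masks: (0x01,0x02), (0x04,0x08), ... */
    const __m128i m0 = _mm_set_epi32(0x02020202, 0x02020202, 0x01010101, 0x01010101);
    const __m128i m1 = _mm_set_epi32(0x08080808, 0x08080808, 0x04040404, 0x04040404);
    const __m128i m2 = _mm_set_epi32(0x20202020, 0x20202020, 0x10101010, 0x10101010);
    const __m128i m3 = _mm_set_epi32((int)0x80808080, (int)0x80808080, 0x40404040, 0x40404040);

    /* one punpcklqdq: the whole bitboard duplicated in both qwords */
    __m128i v = _mm_loadl_epi64((const __m128i*)&bb);
    v = _mm_unpacklo_epi64(v, v);

    /* unaligned loads, since plain C gives no alignment guarantee here */
    __m128i a0 = _mm_and_si128(_mm_cmpeq_epi8(_mm_and_si128(v, m0), m0),
                               _mm_loadu_si128((const __m128i*)rweight[0]));
    __m128i a1 = _mm_and_si128(_mm_cmpeq_epi8(_mm_and_si128(v, m1), m1),
                               _mm_loadu_si128((const __m128i*)rweight[1]));
    __m128i a2 = _mm_and_si128(_mm_cmpeq_epi8(_mm_and_si128(v, m2), m2),
                               _mm_loadu_si128((const __m128i*)rweight[2]));
    __m128i a3 = _mm_and_si128(_mm_cmpeq_epi8(_mm_and_si128(v, m3), m3),
                               _mm_loadu_si128((const __m128i*)rweight[3]));

    __m128i sum8 = _mm_adds_epu8(_mm_adds_epu8(a0, a1), _mm_adds_epu8(a2, a3));
    __m128i sad  = _mm_sad_epu8(sum8, zero);
    return (unsigned)(_mm_cvtsi128_si32(sad)
                    + _mm_cvtsi128_si32(_mm_srli_si128(sad, 8)));
}

The precalculated rotated arrays can also be added bytewise (paddb) on the
fly before the "and"-multiplication, as described above for the king-area
weight tables.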