Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: planning a SSE-optimized chess engine

Author: Gerd Isenberg

Date: 23:39:16 01/12/05

Go up one level in this thread


On January 12, 2005 at 23:31:34, Aart J.C. Bik wrote:

>Hi Folks,
>This is my first post to this forum. I work in the Intel Compiler Lab on
>automatic vectorization for multimedia extensions (see
>http://www.intelcompiler.com for some details), but in my free time I have
>started working on a chess engine that will be optimized for the
>Streaming-SIMD-extensions (SSE/SSE2/SSE3). Currently I am making my first steps
>on getting my own Universal Chess Interface implementation to work with my
>favorite chess environment: the Fritz interface. I noticed that some things
>seeem to work a little different in Fritz than in the formal description (is
>this a known deviation?), but I am pretty confident I will figure it out.
>Are there other ongoing research projects that try to exploit the
>Streaming-SIMD-extensions? If so, I would like to hear about this. Also,
>although initially I want to focus on the engine, eventually I may also want to
>implement my own book and access to Nalimov tablebases during the search. Stefan
>Meyer-Kahlen kindly pointed me to the Crafty download to figure out how the
>tablebases can be accessed. Will incorporating some Crafty source into my own
>engine have any impact on licensing?
>Nice "meeting you folks" and looking forward to some nice discussions!
>Aart Bik
>http://www.aartbik.com


Hi Aart,

very interesting. I do a let say private "research" using SIMD instructions in
my bitboard based chess program, in my current program MMX inline assembly. But
i am working on an x86-64 approach using SSE2 with own C++ wrapper classes - a
base for memory layout and two derivations, one with SSE2-intrinsics and one
SWAR derivation in plain C++. Both derivations have most operators overloaded.

Main targets of SIMD-Instructions are fill based attack generation using
parallel prefix Kogge-Stone, introduced here some time ago by Steffan Westcott
(you may use the CCC search engine for further information).

For instance i can write Kogge-Stone fills in C++ with template type <XMM> or
<GP>, controlling the register incarnations of local variables for each fill
direction. So one can mix different disjoint direction fills with different
register files.

template <class T> __forceinline
void leftAttacks(sTarget* pTarget, const sSource* pSource)
{
 T gl(pSource->rooks);
 T pl(pSource->occup);
 pl = (~pl).notH();
 gl |= pl & (gl>>1);
 pl &= pl>>1;
 gl |= pl & (gl>>2);
 pl &= pl>>2;
 gl |= pl & (gl>>4);
 (gl>>1).notH().store(&pTarget->le);
}

This is not a domain of SSE2, but SSE2 may be used here to gain some additional
throughtput in conjunction with independent 2*64-bit gp instruction chains.

Another interesting applications using SSE2 in chess programs are IMHO
dot-products of vectors of 32/64-bits with 32/64 bytes or short ints. That might
be usefull for evaluation purposes, eg. to do some weighted population count of
several attack bitboards. Something like this with inline asm:

int dotProduct(BitBoard bb, BYTE weightsOfMax0x3f[] /* XMM_ALIGN */)
{
 static const BitBoard XMM_ALIGN bits[4] =
{0x8040201008040201,0x8040201008040201, 0, 0};
 __asm
 {
  movq      xmm0, [bb]  ; 00000000000000008040201008040201
  punpcklbw xmm0, xmm0  ; 80804040202010100808040402020101
  lea       edx,  [bits]
  mov       eax,  [weightsOfMax0x3f]
  movdqa    xmm2, xmm0
  punpcklwd xmm0, xmm0  ; 08080808040404040202020201010101
  punpckhwd xmm2, xmm2  ; 80808080404040402020202010101010
  movdqa    xmm1, xmm0
  movdqa    xmm3, xmm2
  punpckldq xmm0, xmm0  ; 02020202020202020101010101010101
  punpckhdq xmm1, xmm1  ; 08080808080808080404040404040404
  punpckldq xmm2, xmm2  ; 20202020202020201010101010101010
  punpckhdq xmm3, xmm3  ; 80808080808080804040404040404040
  pand      xmm0, [edx] ; mask the bits
  pand      xmm1, [edx]
  pand      xmm2, [edx]
  pand      xmm3, [edx]
  pcmpeqb   xmm0, [edx]  ; extend bits to bytes
  pcmpeqb   xmm1, [edx]
  pcmpeqb   xmm2, [edx]
  pcmpeqb   xmm3, [edx]
  pand      xmm0, [eax+0*16] ; multiply by "and" with -1 or 0
  pand      xmm1, [eax+1*16]
  pand      xmm2, [eax+2*16]
  pand      xmm3, [eax+3*16]
  paddusb   xmm0, xmm1    ; add all bytes (with saturation)
  paddusb   xmm2, xmm3    ; saturation is used to avoid "accidental" overflows
  paddusb   xmm0, xmm2	  ; eg. 64+64+64+64
  psadbw    xmm0, [edx+16]; horizontal add 2 * 8 byte
  pextrw    edx,  xmm0, 4 ; extract both intermediate sums to gp
  pextrw    eax,  xmm0, 0
  add       eax,  edx     ; final add
 }
}

It would be interesting to see how intel compiler may generate such code.

Cheers,
Gerd



This page took 0 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.