Author: Gerd Isenberg
Date: 23:39:16 01/12/05
Go up one level in this thread
On January 12, 2005 at 23:31:34, Aart J.C. Bik wrote: >Hi Folks, >This is my first post to this forum. I work in the Intel Compiler Lab on >automatic vectorization for multimedia extensions (see >http://www.intelcompiler.com for some details), but in my free time I have >started working on a chess engine that will be optimized for the >Streaming-SIMD-extensions (SSE/SSE2/SSE3). Currently I am making my first steps >on getting my own Universal Chess Interface implementation to work with my >favorite chess environment: the Fritz interface. I noticed that some things >seeem to work a little different in Fritz than in the formal description (is >this a known deviation?), but I am pretty confident I will figure it out. >Are there other ongoing research projects that try to exploit the >Streaming-SIMD-extensions? If so, I would like to hear about this. Also, >although initially I want to focus on the engine, eventually I may also want to >implement my own book and access to Nalimov tablebases during the search. Stefan >Meyer-Kahlen kindly pointed me to the Crafty download to figure out how the >tablebases can be accessed. Will incorporating some Crafty source into my own >engine have any impact on licensing? >Nice "meeting you folks" and looking forward to some nice discussions! >Aart Bik >http://www.aartbik.com Hi Aart, very interesting. I do a let say private "research" using SIMD instructions in my bitboard based chess program, in my current program MMX inline assembly. But i am working on an x86-64 approach using SSE2 with own C++ wrapper classes - a base for memory layout and two derivations, one with SSE2-intrinsics and one SWAR derivation in plain C++. Both derivations have most operators overloaded. Main targets of SIMD-Instructions are fill based attack generation using parallel prefix Kogge-Stone, introduced here some time ago by Steffan Westcott (you may use the CCC search engine for further information). For instance i can write Kogge-Stone fills in C++ with template type <XMM> or <GP>, controlling the register incarnations of local variables for each fill direction. So one can mix different disjoint direction fills with different register files. template <class T> __forceinline void leftAttacks(sTarget* pTarget, const sSource* pSource) { T gl(pSource->rooks); T pl(pSource->occup); pl = (~pl).notH(); gl |= pl & (gl>>1); pl &= pl>>1; gl |= pl & (gl>>2); pl &= pl>>2; gl |= pl & (gl>>4); (gl>>1).notH().store(&pTarget->le); } This is not a domain of SSE2, but SSE2 may be used here to gain some additional throughtput in conjunction with independent 2*64-bit gp instruction chains. Another interesting applications using SSE2 in chess programs are IMHO dot-products of vectors of 32/64-bits with 32/64 bytes or short ints. That might be usefull for evaluation purposes, eg. to do some weighted population count of several attack bitboards. Something like this with inline asm: int dotProduct(BitBoard bb, BYTE weightsOfMax0x3f[] /* XMM_ALIGN */) { static const BitBoard XMM_ALIGN bits[4] = {0x8040201008040201,0x8040201008040201, 0, 0}; __asm { movq xmm0, [bb] ; 00000000000000008040201008040201 punpcklbw xmm0, xmm0 ; 80804040202010100808040402020101 lea edx, [bits] mov eax, [weightsOfMax0x3f] movdqa xmm2, xmm0 punpcklwd xmm0, xmm0 ; 08080808040404040202020201010101 punpckhwd xmm2, xmm2 ; 80808080404040402020202010101010 movdqa xmm1, xmm0 movdqa xmm3, xmm2 punpckldq xmm0, xmm0 ; 02020202020202020101010101010101 punpckhdq xmm1, xmm1 ; 08080808080808080404040404040404 punpckldq xmm2, xmm2 ; 20202020202020201010101010101010 punpckhdq xmm3, xmm3 ; 80808080808080804040404040404040 pand xmm0, [edx] ; mask the bits pand xmm1, [edx] pand xmm2, [edx] pand xmm3, [edx] pcmpeqb xmm0, [edx] ; extend bits to bytes pcmpeqb xmm1, [edx] pcmpeqb xmm2, [edx] pcmpeqb xmm3, [edx] pand xmm0, [eax+0*16] ; multiply by "and" with -1 or 0 pand xmm1, [eax+1*16] pand xmm2, [eax+2*16] pand xmm3, [eax+3*16] paddusb xmm0, xmm1 ; add all bytes (with saturation) paddusb xmm2, xmm3 ; saturation is used to avoid "accidental" overflows paddusb xmm0, xmm2 ; eg. 64+64+64+64 psadbw xmm0, [edx+16]; horizontal add 2 * 8 byte pextrw edx, xmm0, 4 ; extract both intermediate sums to gp pextrw eax, xmm0, 0 add eax, edx ; final add } } It would be interesting to see how intel compiler may generate such code. Cheers, Gerd
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.