Computer Chess Club Archives


Search

Terms

Messages

Subject: Re: Matt Taylor's magic de Bruijn Constant

Author: Bas Hamstra

Date: 03:57:48 07/13/03

Go up one level in this thread


Coicidentally I am working a bit on LastOne too right now. I also have a Athlon
XP (like you I believe). So BSF isn't too fast. The question is: can you
outperform it? At first it seems so, in a zillion times loop the following code
appeared to be nearly a factor 3 faster than the BSF routine below:

const char Tabel[64] =
		{	0, 0, 0,15, 0, 1,28, 0,16, 0, 0, 0, 2,21,29, 0,
			0, 0,19,17,10, 0,12, 0, 0, 3, 0, 6, 0,22,30, 0,
			14, 0,27, 0, 0, 0,20, 0,18, 9,11, 0, 5, 0, 0,13,
			26, 0, 0, 8, 0, 4, 0,25, 0, 7,24, 0,23, 0,31, 0
		};


__forceinline int LastOne2(unsigned __int64 Bits)

{	unsigned int *p = (unsigned int*) &Bits;
	unsigned int Short;

	if(p[0])
	{	Short = p[0];
		Short = Short ^ (Short-1);
		Short *= 7*255*255*25;
		Short >>= 26;
		return Tabel[Short&63];
	}
	Short = p[1];
	Short = Short ^ (Short-1);
	Short *= 7*255*255*255;
	Short >>= 26;
	return Tabel[Short&63]+32;
}

VC produces very efficient asm code for it, as far as I can judge. However how
about the overall performance in the program? It seems to be noticably slower
than plain BSF, like below:

__forceinline int LastOne(BB M)
{   __asm
    {       mov EAX, dword ptr [M]
            XOR EDX, EDX
            cmp EAX, 0
            jnz L2
            mov EAX, dword ptr [M+4]
            add EDX, 32
        L2: bsf EAX, EAX
            ADD EAX, EDX
    }
}

Both routines have an extra branch to test the first 32 bits, so that is not the
problem!?


Regards,
Bas.








On July 13, 2003 at 06:03:00, Gerd Isenberg wrote:

>Hi all,
>
>Matt Taylor's invention of the super magic de Bruijn Constant 0x78291ACF is of
>course usefull for yet another, but fast and portable _firstOne_ Bitscan routine
>with only one 32bit multiplication!
>
>
>int firstOneKey(BitBoard bb)
>{
>	BitBoard lsb = bb ^ (bb - 1);
>	unsigned int foldedLSB = ((int) lsb) ^ ((int)(lsb>>32));
>	return (foldedLSB * 0x78291ACF) >> (32-6); // range is 0..63
>       // to get the bit index, a table lookup is required
>}
>
>
>Because i'am still looking for a fast key-fuction for two single populated move
>bitboards, i tried the obvious approach to use Matt's routine twice. It works
>like firstOneKey(from)*64 + firtsOneKey(to) and therefore maps to a 64*64-range.
>
>int getMoveKey(BitBoard frbit, BitBoard tobit)
>{
>	frbit -= 1; tobit -= 1;
>	unsigned int foldedfr = ((int) frbit)  ^ ((int)(frbit>>32));
>	unsigned int foldedto = ((int) tobit)  ^ ((int)(tobit>>32));
>	return ((foldedfr*0x78291ACF) ^ ((foldedto*0x78291ACF) >> 6)) >> 20;
>}
>
>x86 assembler output of this routine:
>
>00401435 83 C6 FF             add         esi,0FFFFFFFFh
>00401438 83 D1 FF             adc         ecx,0FFFFFFFFh
>0040143B 83 C2 FF             add         edx,0FFFFFFFFh
>0040143E 83 D0 FF             adc         eax,0FFFFFFFFh
>00401441 33 CE                xor         ecx,esi
>00401443 33 C2                xor         eax,edx
>00401445 69 C9 CF 1A 29 78    imul        ecx,ecx,78291ACFh
>0040144B 69 C0 CF 1A 29 78    imul        eax,eax,78291ACFh
>00401451 C1 E8 06             shr         eax,6
>00401454 33 C1                xor         eax,ecx
>00401456 C1 E8 14             shr         eax,14h
>
>only 36 Bytes - but two 32-bit multiplications.
>
>Two questions:
>Any ideas to it better, eg. only by one multiplication?
>Why are the unsigned multiplications translated to "imul" by the compiler?
>
>Thanks in advance,
>Gerd



This page took 0.05 seconds to execute

Last modified: Thu, 15 Apr 21 08:11:13 -0700

Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.