Author: Bas Hamstra
Date: 03:57:48 07/13/03
Go up one level in this thread
Coicidentally I am working a bit on LastOne too right now. I also have a Athlon XP (like you I believe). So BSF isn't too fast. The question is: can you outperform it? At first it seems so, in a zillion times loop the following code appeared to be nearly a factor 3 faster than the BSF routine below: const char Tabel[64] = { 0, 0, 0,15, 0, 1,28, 0,16, 0, 0, 0, 2,21,29, 0, 0, 0,19,17,10, 0,12, 0, 0, 3, 0, 6, 0,22,30, 0, 14, 0,27, 0, 0, 0,20, 0,18, 9,11, 0, 5, 0, 0,13, 26, 0, 0, 8, 0, 4, 0,25, 0, 7,24, 0,23, 0,31, 0 }; __forceinline int LastOne2(unsigned __int64 Bits) { unsigned int *p = (unsigned int*) &Bits; unsigned int Short; if(p[0]) { Short = p[0]; Short = Short ^ (Short-1); Short *= 7*255*255*25; Short >>= 26; return Tabel[Short&63]; } Short = p[1]; Short = Short ^ (Short-1); Short *= 7*255*255*255; Short >>= 26; return Tabel[Short&63]+32; } VC produces very efficient asm code for it, as far as I can judge. However how about the overall performance in the program? It seems to be noticably slower than plain BSF, like below: __forceinline int LastOne(BB M) { __asm { mov EAX, dword ptr [M] XOR EDX, EDX cmp EAX, 0 jnz L2 mov EAX, dword ptr [M+4] add EDX, 32 L2: bsf EAX, EAX ADD EAX, EDX } } Both routines have an extra branch to test the first 32 bits, so that is not the problem!? Regards, Bas. On July 13, 2003 at 06:03:00, Gerd Isenberg wrote: >Hi all, > >Matt Taylor's invention of the super magic de Bruijn Constant 0x78291ACF is of >course usefull for yet another, but fast and portable _firstOne_ Bitscan routine >with only one 32bit multiplication! > > >int firstOneKey(BitBoard bb) >{ > BitBoard lsb = bb ^ (bb - 1); > unsigned int foldedLSB = ((int) lsb) ^ ((int)(lsb>>32)); > return (foldedLSB * 0x78291ACF) >> (32-6); // range is 0..63 > // to get the bit index, a table lookup is required >} > > >Because i'am still looking for a fast key-fuction for two single populated move >bitboards, i tried the obvious approach to use Matt's routine twice. It works >like firstOneKey(from)*64 + firtsOneKey(to) and therefore maps to a 64*64-range. > >int getMoveKey(BitBoard frbit, BitBoard tobit) >{ > frbit -= 1; tobit -= 1; > unsigned int foldedfr = ((int) frbit) ^ ((int)(frbit>>32)); > unsigned int foldedto = ((int) tobit) ^ ((int)(tobit>>32)); > return ((foldedfr*0x78291ACF) ^ ((foldedto*0x78291ACF) >> 6)) >> 20; >} > >x86 assembler output of this routine: > >00401435 83 C6 FF add esi,0FFFFFFFFh >00401438 83 D1 FF adc ecx,0FFFFFFFFh >0040143B 83 C2 FF add edx,0FFFFFFFFh >0040143E 83 D0 FF adc eax,0FFFFFFFFh >00401441 33 CE xor ecx,esi >00401443 33 C2 xor eax,edx >00401445 69 C9 CF 1A 29 78 imul ecx,ecx,78291ACFh >0040144B 69 C0 CF 1A 29 78 imul eax,eax,78291ACFh >00401451 C1 E8 06 shr eax,6 >00401454 33 C1 xor eax,ecx >00401456 C1 E8 14 shr eax,14h > >only 36 Bytes - but two 32-bit multiplications. > >Two questions: >Any ideas to it better, eg. only by one multiplication? >Why are the unsigned multiplications translated to "imul" by the compiler? > >Thanks in advance, >Gerd
This page took 0.05 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.