Author: Matt Taylor
Date: 14:05:08 01/22/03
Go up one level in this thread
On January 22, 2003 at 10:40:20, Joshua Haglund wrote: >On January 22, 2003 at 08:28:11, Matt Taylor wrote: > >>I'm exploring the last couple possibilities in those bitscan routines. What I >>have decided is that I will do bit = (bb & -bb) - 1 to get a mask that looks >>like this: >> >>bit index mask (byte) >>----------------------- >> 1 0 00000000 >> 2 1 00000001 >> 4 2 00000011 >> 8 3 00000111 >> 16 4 00001111 >> 32 5 00011111 >> 64 6 00111111 >>128 7 01111111 >> >>(Extrapolate for 32-bits; I'm lazy!) >> >>Now, this is convenient for me because x86 will record the last bit shifted out >>into the carry flag. Furthermore, I can accomplish the index = (index << 1) | >>((bit & mask) != 0) with a simple add with carry. That's pretty ideal for C -- 2 >>instructions for 2 lines of code. Also, in the process, it modifies mask in a >>potentially valuable way. >> >>All I need to make this a clever and slick way to eat bits. After the first >>iteration, the masks for the second half of the table should be identical to the >>masks in the first half. In other words, if the index >= 4, shift left 4, else >>no-op. (A cmovz will sort've do it.) >> >>Alternatively, I can use subtraction to mess with the numbers and use the carry. >> >>Any suggestion would be welcome at any level of thought; higher-level concepts >>may help me make a connection that I did not formerly see. Also, even more >>complex expressions can sometimes be reduced very nicely. >> >>-Matt > > >Sounds like a good idea!! > >What's this for again ? hehehehe ;) > >Good Day, > >Joshua >toneewa@yahoo.com A routine to find the least-significant bit set in a 64-bit number. Obviously useful for bitboard decoding. Athlon can issue 3 ops/cycle. Ideally each "round" of the binary search algorithm can be executed in 1 cycle. The add with carry is extremely convenient here because it only eats 1 instruction and does a shift + or for me. (I add a number with itself -- same as a shift left by 1.) I have two remaining cycles to compute the carry and fold the result. If I can achieve this optimum 3 instructions/round, I can execute the search in 5 cycles, and the setup cost is 6 cycles. This gives it a good speed/space tradeoff. The code is something like 100 bytes long, but it doesn't pollute the data cache, and it would be tied for fastest bitscan. -Matt -Matt
This page took 0 seconds to execute
Last modified: Thu, 15 Apr 21 08:11:13 -0700
Current Computer Chess Club Forums at Talkchess. This site by Sean Mintz.