Author: Gerd Isenberg
Date: 11:18:23 07/16/03
<snip>
>Remember this test is intended for 64 bits platforms not for outdated IA32.
>At my K7 i also saw that the 64 bit mod was very slow.
>
>I do not know the number of cycles that 64 bit mod costs but it was trivial that
>at each cpu it would cost a different number of clock cycles. The loop is even
>more expensive. Up to 65 ns is no exception there. However it gets
>first measured and after that the latency gets *reduced* by the costs of that
>64 bits mod.
>
>That is a very correct way of doing it.

That's the question. I don't know what side effects happen with the remaining code in the loop body if a cache miss occurs.

On the other hand, what I don't understand here (and that's a lot) is about latency hiding:

  do {
      BITBOARD index = RanrotA() % nents;
      dummyres ^= hashtable[index];
  } while( i++ < n );

With my naive picture of processor architecture and latency hiding I would expect the following scenario: if the processor is waiting for hashtable[index] to do the xor, it may already execute i++ out of order, correctly predict the branch back into the loop body, and do (parts of) the next RanrotA and even the 64-bit mod. The pending xor may then be executed immediately before the next hash read starts, so the huge latency covers the remaining instructions in the loop. But obviously the opposite happens.

Does a pending read stall all the other load/store units, even if the other data (e.g. randbuffer and locals) are already in cache? Is there really some additional latency from the RAM's hardware interface opening rows or columns when you pass random "worst case" addresses?

Btw.: Does P4's hyperthreading "hide" memory latencies, in the sense that if one virtual thread waits for a read, the other may run at up to 100%?

Regards,
Gerd
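P.S.: For anyone who wants to reproduce this kind of measurement, here is a minimal, self-contained sketch of the loop above. It is only an illustration under some assumptions: RanrotA is not reproduced here, a simple xorshift64 generator stands in for it, and the table size, iteration count and clock()-based timing are arbitrary choices rather than those of the original test.

  /* Sketch of the random-access latency loop under discussion.
   * Assumptions: xorshift64 as a stand-in PRNG for RanrotA,
   * arbitrary table size and iteration count, clock() timing. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  typedef unsigned long long BITBOARD;

  static BITBOARD rngState = 88172645463325252ULL;

  /* Stand-in PRNG (xorshift64), not RanrotA */
  static BITBOARD prng(void) {
      rngState ^= rngState << 13;
      rngState ^= rngState >> 7;
      rngState ^= rngState << 17;
      return rngState;
  }

  int main(void) {
      const BITBOARD nents = 1ULL << 22;   /* 4M entries = 32 MB, larger than cache */
      const long n = 10000000;
      BITBOARD *hashtable = malloc(nents * sizeof(BITBOARD));
      if (!hashtable) return 1;
      for (BITBOARD j = 0; j < nents; j++) hashtable[j] = j;

      BITBOARD dummyres = 0;
      long i;

      /* Pass 1: PRNG plus 64-bit mod only, to measure the loop overhead */
      clock_t t0 = clock();
      for (i = 0; i < n; i++)
          dummyres ^= prng() % nents;
      clock_t t1 = clock();

      /* Pass 2: same loop, but the index is used for a random table load */
      for (i = 0; i < n; i++) {
          BITBOARD index = prng() % nents;
          dummyres ^= hashtable[index];
      }
      clock_t t2 = clock();

      double overhead = (double)(t1 - t0) / CLOCKS_PER_SEC;
      double withload = (double)(t2 - t1) / CLOCKS_PER_SEC;
      printf("overhead %.3f s, with loads %.3f s, difference %.3f s (dummy %llu)\n",
             overhead, withload, withload - overhead, dummyres);
      free(hashtable);
      return 0;
  }

The first pass measures the PRNG-plus-mod overhead alone, so it can be subtracted from the second pass that adds the random table reads, in the spirit of the quote above where the latency gets reduced by the cost of the 64-bit mod.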