Author: Eugene Nalimov
Date: 21:57:11 07/14/03
On July 15, 2003 at 00:29:03, Robert Hyatt wrote:

>On July 14, 2003 at 16:32:20, Eugene Nalimov wrote:
>
>>On July 14, 2003 at 16:07:27, Robert Hyatt wrote:
>>
>>>On July 14, 2003 at 15:33:37, Gerd Isenberg wrote:
>>>
>>>>On July 14, 2003 at 10:54:49, Vincent Diepeveen wrote:
>>>>
>>>>>On July 13, 2003 at 17:10:10, Russell Reagan wrote:
>>>>>
>>>>>>On July 13, 2003 at 13:17:56, Bas Hamstra wrote:
>>>>>>
>>>>>>>It is used *extremely* intensively. Therefore I assumed that most of the time the
>>>>>>>table sits in cache. But apparently not... Makes you wonder about other simple
>>>>>>>lookups. A lot of 10 cycle penalties, it seems.
>>>>>>
>>>>>>Hi Bas,
>>>>>>
>>>>>>Why do you say "10 cycles"? I thought memory latency was many more cycles (~75 -
>>>>>>150+).
>>>>>
>>>>>Random read from memory at dual P4 or dual K7 is like nearly 400 nanoseconds.
>>>>>So that's at 2GHz around 800 cycles.
>>>>>
>>>>>Best regards,
>>>>>Vincent
>>>>
>>>>Hi Vincent,
>>>>
>>>>puhh... that's about 1/2 microsecond. I remember the days with
>>>>2MHz - 8085 or Z80 CPU - can't believe it. A few questions...
>>>
>>>
>>>Don't believe it because it is _wrong_. Run "lm-bench" on your computer.
>>>It will very accurately measure random access latency. The slowest I have
>>>seen is 150ns on my dual, using registered DDRAM. My laptop uses SDRAM and
>>>clocks in around 120ns. My quad xeons are all around 125ns.
>>>
>>>I've not seen any 400+ ns numbers although it is very possible that rambus
>>>might be that slow on latency, although it is very fast on bandwidth.
>>>
>>>>
>>>>I'm not familiar with dual-architectures. Is it a kind of shared memory via
>>>>pci-bus? How do you access such ram - are there some alloc-like api-functions?
>>>>What happens if one processor writes this memory through cache - what about
>>>>possible cache copies of this address in the other processor, or in general how
>>>>do the several processor caches synchronise?
>>>>I guess each processor has its own local main-memory.
>>>>
>>>
>>>
>>>No. Each processor sits on the same bus with memory. So both can access
>>>it independently. However, cache coherency is a problem, but in the Intel
>>>world it is handled by some clever cache design so that the cache controllers
>>>are aware of what is being done by the "other cache" and know when the other
>>>cache modifies a value that is in the local cache. It's messy, but it works.
>>>
>>>Caches still use write-back update policy so that memory is not updated until
>>>the cache line (Modified cache line) is about to be overwritten. However, if
>>>two caches have the same cache line (memory addresses) and one modifies any of
>>>the cache line, the other invalidates its copy so the next read will refresh
>>>things correctly.
>>>
>>>
>>>>Do you know the read latencies of single processor P4 or K7 with state of the
>>>>art chipsets?
>>>
>>>
>>>Typical numbers are in the 120-150ns range. Lower for non-registered type
>>>memory. Registered memory is mainly used in duals that are set up as servers,
>>>for higher reliability.
>>>
>>>Aaron has a sub-75ns latency machine that is overclocked. That's the fastest
>>>PC latency I have ever seen. In fact, it is probably the fastest latency of
>>>any kind I have seen, period.
>>>
>>>
>>>>
>>>>1.) if data is already in 1. level cache
>>>
>>>This is a one-cycle deal.
>>>
>>>
>>>>2.) if data is in 2. level cache but not in 1.
>>>
>>>This is something like 6 cycles but I don't think there is a standard
>>>"number" here since processor speeds vary so much.
>>>
>>>
>>>>3.) in worst case, if data is only in main memory but in no cache
>>>
>>>125ns is a good first approximation.
>>
>>I had seen 700+ ns on a 32-way system. But that was a worst case, and changes in
>>the program helped -- read-only data was moved into a separate cache line, and
>>the algorithm was changed to allow each CPU to have its own writable data that is
>>merged together from time to time.
>>
>>Thanks,
>>Eugene
>
>I haven't played with any 32 cpu PC-type machines. However, I have played
>with several NUMA machines and obviously the farther away the memory, the
>longer the latency. This was a characteristic of the first Connection
>Machine, for example.

That wasn't a PC-class machine :-) And latency under high load was much worse than the documented one, probably because of all the extra traffic on the buses, cache coherence overhead, etc.

Thanks,
Eugene

>However, I have no idea where vincent gets his 450ns for duals. The slowest
>I have seen is 150.
>
>>
>>>You can answer _all_ of the above by running lm-bench. It will tell
>>>you each one of those numbers, plus others.
>>>
>>>
>>>>
>>>>Thanks in advance,
>>>>Gerd
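
For anyone wondering what the change mentioned above (read-only data on its own cache line, each CPU with its own writable data, merged from time to time) might look like, here is a rough sketch in C11. It is not the code from the 32-way program; every name in it (shared_config, cpu_slot, search_state, count_node, merge_nodes) is made up for illustration, and the 64-byte line size and 32-CPU limit are assumptions you would need to check for your hardware. It only shows the memory layout. In a real multithreaded program the merge would also need atomics or some other synchronization.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */
#define MAX_CPUS   32   /* hypothetical upper bound on CPUs */

/* Read-only after setup. It gets a cache line of its own, so writes to
   other data never invalidate the copies held in other CPUs' caches. */
struct shared_config {
    alignas(CACHE_LINE) uint32_t hash_mask;
    uint32_t max_depth;
};

/* One writable slot per CPU. The alignment also pads the struct to a
   full line, so two CPUs never write to the same cache line. */
struct cpu_slot {
    alignas(CACHE_LINE) uint64_t nodes;
};

struct search_state {
    struct shared_config cfg;
    struct cpu_slot slot[MAX_CPUS];
};

static_assert(sizeof(struct cpu_slot) == CACHE_LINE,
              "each per-CPU slot should fill exactly one line");
static_assert(offsetof(struct search_state, slot) % CACHE_LINE == 0,
              "slots should start on a line boundary");

/* Hot path: a CPU touches only its own slot. */
static inline void count_node(struct search_state *s, unsigned cpu)
{
    s->slot[cpu].nodes++;
}

/* Called only from time to time: walk all slots and add them up. */
static uint64_t merge_nodes(const struct search_state *s, unsigned ncpus)
{
    uint64_t total = 0;
    for (unsigned i = 0; i < ncpus && i < MAX_CPUS; i++)
        total += s->slot[i].nodes;
    return total;
}
```

Each thread would call count_node() with its own index while searching, and one thread would call merge_nodes() occasionally to get the total. The cost is some wasted memory per slot; the gain is that the hot path no longer bounces a cache line between CPUs.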
Last modified: Thu, 15 Apr 21 08:11:13 -0700