Computer Chess Club Archives


Subject: Re: Source code to measure it

Author: Robert Hyatt

Date: 06:33:39 07/15/03


On July 15, 2003 at 06:24:58, Vincent Diepeveen wrote:

>On July 14, 2003 at 16:07:27, Robert Hyatt wrote:
>
>You measure the latency with those benches of sequential reads.

No.  lm-bench does _random_ reads and computes the _random-access_
latency.

Don't know why you have a problem grasping that.


>So already opened cache lines you can get data faster from than
>random reads to memory.

That also makes no sense.  Perhaps you mean "already opened memory
rows"?


>
>Random reads to memory are about 280 ns at single cpu P4 and about 400ns at dual
>P4s.

No they aren't.


>
>I will now post my source code here to measure it. this works both with
>visual c++ as well as at *nix systems.
>
>Compile it and run it for example with a buffer of 500MB and 2 processors:

No wonder you can't compute random latency correctly.  "Two processors"?  What
are you measuring?  Hint:  It isn't what you think.

>
>c:\win2000\> latency 500000000 2
>
>
>/*-----------------10-6-2003 3:48-------------------*
> *
> * This program rasml.c measures the Random Average Shared Memory Latency
> * (RASML).
> * Thanks to Agner Fog for his excellent random number generator.
> *
> * This testset uses a heavily optimized version of Agner Fog's ranrot
> * generator, modified to 64 bits.
> *
> * Created by Vincent Diepeveen, who is the author of this and therefore has
> * the copyright.
> *
> * Nevertheless i encourage persons to use this test UNMODIFIED. Its intention
> * is to measure the average latency to read and write data to shared memory
> * at all the processors at the same time.
> *
> * What it does is allocate a big block of memory (gigabytes or terabytes
> * preferably), and then n processes read from that memory in a RANDOM way;
> * another test is reading AND writing in a random way. All the processors
> * perform the same action. They keep the results and write them back to
> * shared memory. Then all the processes except P0 quit. P0 then calculates
> * the average over all the processors and prints it to the screen,
> * expressed in nanoseconds.
> *
> * Of course the smallest datasize used in this testset is 64 bits.
> * I wouldn't know how else to access more than 2^32 bytes.
> *
> * There are many things to consider when doing such tests, like Level1
> * cache, Level2 cache, caches at routers and another big bunch of tricks.
> * The caches i clearly mention here because a lookup might by accident
> * already have been done before by the same processor or by another
> * processor in the same node that uses the same RAM.
> *
> * Another influence on the times calculated is caused by the random number
> * generator.
> *
> * Currently it gets initialized in a very primitive way.
> *
> * There is a big need for this test i feel. In the future more and more
> * Artificial Intelligence and/or searching software will be there. They all
> * will be busy doing a lot of random accesses to the RAM.
> *
> * The original reason to create this testset is very sad.
> *
> *       "The paper supports everything"
> *                                      (Arturo Ochoa at Caracas, Venezuela)
> *
> * Especially of course when you never actually test the latency. A few quick
> * searches at the internet already show that paper supports everything with
> * regards to latency.
> *
> * Copyrights: i have extensively searched the past year for 'random average
> * shared memory latencies'. I found nothing that has to do with memory
> * latencies in general even *approaching* the reality that programmers,
> * despite all the paper latencies, must deal with.
> *
> * Therefore i claim unconditional definition rights at 'random average shared
> * memory latency' (RASML). In order to measure and publish random memory
> * latencies, this source code may not get modified without written
> * permission by me.
> *
> * In that way i avoid the usual problems that are there in supercomputing
> * currently, where marketing managers use their own definition of the word
> * 'latency'.
> *
> * Currently the word latency by marketing managers is most likely 'the speed
> * that i imagine my product might be able to achieve at a certain component
> * of a smaller version of the machine, without taking into account inferior
> * parts of the computer which prevent such fantastic latency numbers in
> * practice'.
> *
> * Vincent Diepeveen                 diep@xs4all.nl
> * Veenendaal, The Netherlands       10 june 2003
> *
> * first a few lines about the random number generator. Note that I modified it
> * very slightly. Basically its initialization has been done better and some
> * dead slow FPU code was removed.
> */
>
>#define UNIX 0  /* set to 1 when you are under unix or using gcc-like compilers */
>#define IRIX 0  /* this value only matters when UNIX is set to 1. For Linux set
>                 * it to 0. Allocating shared memory in linux is done in a
>                 * pretty buggy way in its kernel.
>                 *
>                 * Therefore you might want to do 'cat /proc/sys/kernel/shmmax'
>                 * and look for yourself how much shared memory YOU can
>                 * allocate in linux.
>                 *
>                 * If that is not enough to benchmark this program then try
>                 * modifying it with:
>                 *    echo <newsize> > /proc/sys/kernel/shmmax
>                 * Be sure you are root when doing that, and redo it each time
>                 * the system boots.
>                 */
>#define FREEBSD 1 /* be sure to not use more than 2 GB memory with freebsd with this test. sorry. */
>
>
>#if UNIX
>  #include <pthread.h>
>  #include <sys/ipc.h>
>  #include <sys/shm.h>
>  #include <sys/times.h>
>  #include <sys/time.h>
>  #include <unistd.h>
>#else
>  #include <windows.h>
>  #include <winbase.h> // for GetTickCount()
>  #include <process.h> // _spawnl
>#endif
>
>#include <stdio.h>
>#include <string.h>
>#include <stdlib.h>
>#include <math.h>
>#include <time.h>
>
>#define SWITCHTIME       300000 /* in milliseconds. Modify this to let a test
>                                 * run longer. Basically it is a good idea to
>                                 * use about the cpu number times thousand for
>                                 * this. 30 seconds is fine for PC's, but a very
>                                 * bad idea for supercomputers. I recommend
>                                 * several minutes there. Of course that lets a
>                                 * test take way way longer.
>                                 */
>#define MAXPROCESSES     2048   /* this test can go up to this amount of processes to be tested */
>#define CACHELINELENGTH  128    /* cache line length at the machine. Modify this if you want to */
>
>
>#if UNIX
>  #include <memory.h>
>  #define FORCEINLINE       __inline
>  /* UNIX and such this is 64 bits unsigned variable: */
>  #define BITBOARD                     unsigned long long
>#else
>  #define FORCEINLINE       __forceinline
>  /* in WINDOWS we also want to be 64 bits: */
>  #define BITBOARD                     unsigned _int64
>#endif
>
>#define     STATUS_NOTSTARTED    0
>#define     STATUS_READ          1
>#define     STATUS_MEASUREREAD   2
>#define     STATUS_MEASUREDREAD  3
>
>#define     STATUS_QUIT         10
>
>struct ProcessState {
>  volatile int status; /*  0  = not started yet
>                        *  1  = ready to start reading
>                        *
>                        *  10 = quitted
>                        * */
>
>  /* now the numbers each cpu gathers. The name of the first number is what
>   * cpu0 is doing and the second name what all the other cpu's were doing at
>   * that time
>   */
>  volatile BITBOARD readread; /* */
>  char dummycacheline[CACHELINELENGTH];
>};
>
>typedef struct {
>  BITBOARD nentries; // number of entries of 64 bits used for cache.
>  struct ProcessState ps[MAXPROCESSES];
>} GlobalTree;
>
>void     RanrotAInit(void);
>float    ToNano(BITBOARD);
>int      GetClock(void);
>float    TimeRandom(void);
>
>void     ParseBuffer(BITBOARD);
>void     ClearHash(void);
>void     DeAllocate(void);
>int      DoNrng(BITBOARD);
>int      DoNreads(BITBOARD);
>int      DoNreadwrites(BITBOARD);
>void     TestLatency(float);
>int      AllocateTree(void);
>void     InitTree(int);
>void     WaitForStatus(int,int);
>void     PutStatus(int,int);
>int      CheckAllStatus(int,int);
>void     Slapen(int);
>float    LoopRandom(void);
>
>
>
>/* define parameters (R1 and R2 must be smaller than the integer size): */
>#define KK  17
>#define JJ  10
>#define R1   5
>#define R2   3
>
>/* global variables Ranrot */
>BITBOARD randbuffer[KK+3] = { /* history buffer filled with some random numbers
>*/
>
>0x92930cb295f24dab,0x0d2f2c860b685215,0x4ef7b8f8e76ccae7,0x03519154af3ec239,0x195e36fe715fad23,
>
>0x86f2729c24a590ad,0x9ff2414a69e4b5ef,0x631205a6bf456141,0x6de386f196bc1b7b,0x5db2d651a7bdf825,
>
>0x0d2f2c86c1de75b7,0x5f72ed908858a9c9,0xfb2629812da87693,0xf3088fedb657f9dd,0x00d47d10ffdc8a9f,
>
>0xd9e323088121da71,0x801600328b823ecb,0x93c300e4885d05f5,0x096d1f3b4e20cd47,0x43d64ed75a9ad5d9
>
>/*0xa05a7755512c0c03,0x960880d9ea857ccd,0x7d9c520a4cc1d30f,0x73b1eb7d8891a8a1,0x116e3fc3a6b7aadb*/
>};
>int r_p1, r_p2;          /* indexes into history buffer */
>
>/* global variables RASML */
>BITBOARD *hashtable,nentries,globaldummy=0;
>GlobalTree *tree;
>int ProcessNumber;
>#if UNIX
>int shm_tree,shm_hash;
>#endif
>char rasmexename[2048];
>
> /******************************************************** AgF 1999-03-03 *
> *  Random Number generator 'RANROT' type B                               *
> *  by Agner Fog                                                          *
> *                                                                        *
> *  This is a lagged-Fibonacci type of random number generator with       *
> *  rotation of bits.  The algorithm is:                                  *
> *  X[n] = ((X[n-j] rotl r1) + (X[n-k] rotl r2)) modulo 2^b               *
> *                                                                        *
> *  The last k values of X are stored in a circular buffer named          *
> *  randbuffer.                                                           *
> *                                                                        *
> *  This version works with any integer size: 16, 32, 64 bits etc.        *
> *  The integers must be unsigned. The resolution depends on the integer  *
> *  size.                                                                 *
> *                                                                        *
> *  Note that the function RanrotAInit must be called before the first    *
> *  call to RanrotA or iRanrotA                                           *
> *                                                                        *
> *  The theory of the RANROT type of generators is described at           *
> *  www.agner.org/random/ranrot.htm                                       *
> *                                                                        *
> *************************************************************************/
>
>FORCEINLINE BITBOARD rotl(BITBOARD x,int r) {return(x<<r)|(x>>(64-r));}
>
>/* returns a random number of 64 bits unsigned */
>FORCEINLINE BITBOARD RanrotA(void) {
>  /* generate next random number */
>  BITBOARD x = randbuffer[r_p1] = rotl(randbuffer[r_p2],R1) +
>rotl(randbuffer[r_p1], R2);
>  /* rotate list pointers */
>  if( --r_p1 < 0)
>    r_p1 = KK - 1;
>  if( --r_p2 < 0 )
>    r_p2 = KK - 1;
>  return x;
>}
>
>/* this function initializes the random number generator.      */
>void RanrotAInit(void) {
>  int i;
>
>  /* one can fill the randbuffer here with possible other values here */
>
>  /* initialize pointers to circular buffer */
>  r_p1 = 0;
>  r_p2 = JJ;
>
>  /* randomize */
>  for( i = 0; i < 300; i++ )
>    (void)RanrotA();
>}
>
>/* Now the RASML code */
>char *To64(BITBOARD x) {
>  static char buf[256];
>  char *sb;
>
>  sb = &buf[0];
>  #if UNIX
>  sprintf(buf,"%llu",x);
>  #else
>  sprintf(buf,"%I64u",x);
>  #endif
>  return sb;
>}
>
>int GetClock(void) {
>/* The accuracy is measured in milliseconds. The function used is very
> * accurate according to the NT team, way more accurate nowadays than mentioned
> * in the MSDN manual. The accuracy for linux or unix we can only guess. Too
> * many experts there.
> */
>  #if UNIX
>  struct timeval timeval;
>  struct timezone timezone;
>  gettimeofday(&timeval, &timezone);
>  return((int)(timeval.tv_sec*1000+(timeval.tv_usec/1000)));
>  #else
>  return((int)GetTickCount());
>  #endif
>}
>
>float ToNano(BITBOARD nps) {
>  /* convert something from times a second to nanoseconds.
>   * NOTE THAT THERE IS COMPILER BUGS SOMETIMES AT OLD COMPILERS
>   * SO THAT'S WHY MY CODE ISN'T A 1 LINE RETURN HERE. PLEASE DO
>   * NOT MODIFY THIS CODE */
>  float tn;
>  tn = 1000000000/(float)nps;
>  return tn;
>}
>
>float TimeRandom(void) {
>  /* timing the random number generator is very easy of course. Returns
>   * number of random numbers a second that can get generated
>   */
>  BITBOARD bb=0,i,value,nps;
>  float ns_rng;
>  int t1,t2,took;
>
>  printf("Benchmarking Pseudo Random Number Generator speed, RanRot type 'B'!\n");
>  printf("Speed depends upon CPU and compile options from RASML,\n therefore we benchmark the RNG\n");
>  printf("Please wait a few seconds.. "); fflush(stdout);
>  value = 100000;
>  took  = 0;
>  while( took < 3000 ) {
>    value <<= 2; //  x4
>    t1 = GetClock();
>
>    for( i = 0; i < value; i++ ) {
>      bb ^= RanrotA();
>    }
>    t2 = GetClock();
>    took = t2-t1;
>  }
>
>  nps = (1000*value)/(BITBOARD)took;
>
>  #if UNIX
>  printf("..took %i milliseconds to generate %llu numbers\n",took,value);
>  printf("Speed of RNG = %llu numbers a second\n",nps);
>  #else
>  printf("..took %i milliseconds to generate %I64u numbers\n",took,value);
>  printf("Speed of RNG = %I64u numbers a second\n",nps);
>  #endif
>
>  ns_rng = ToNano(nps);
>  printf("So 1 RNG call takes %f nanoseconds\n",ns_rng);
>
>
>  return ns_rng;
>}
>
>void ParseBuffer(BITBOARD nbytes) {
>  tree->nentries = nbytes/sizeof(BITBOARD);
>  #if UNIX
>  printf("Trying to allocate %llu entries. ",tree->nentries);
>  printf("In total %llu bytes\n",tree->nentries*(BITBOARD)sizeof(BITBOARD));
>  #else
>  printf("Trying to allocate %s entries. ",To64(tree->nentries));
>  printf("In total %s bytes\n",To64(tree->nentries*(BITBOARD)sizeof(BITBOARD)));
>  #endif
>}
>
>void ClearHash(void) {
>  BITBOARD i,nentries = tree->nentries;
>  /* clearing hashtable */
>  printf("Clearing hashtable\n");
>  for( i = 0 ; i < nentries ; i++ ) /* very unoptimized way of clearing */
>    hashtable[i] = i;
>}
>
>void DeAllocate(void) {
>  #if UNIX
>  shmctl(shm_tree,IPC_RMID,0);
>  shmctl(shm_hash,IPC_RMID,0);
>  #else
>  UnmapViewOfFile(tree);
>  UnmapViewOfFile(hashtable);
>  #endif
>}
>
>int DoNrng(BITBOARD n) {
>  BITBOARD i=1,dummyres,nents;
>  int t1,t2;
>
>  nents = nentries; /* hopefully this gets into a register */
>  dummyres = globaldummy;
>
>  t1 = GetClock();
>  do {
>    BITBOARD index = RanrotA()%nents;
>    dummyres ^= index;
>  } while( i++ < n );
>  t2 = GetClock();
>
>  globaldummy = dummyres;
>  return(t2-t1);
>}
>
>int DoNreads(BITBOARD n) {
>  BITBOARD i=1,dummyres,nents;
>  int t1,t2;
>
>  nents = nentries; /* hopefully this gets into a register */
>  dummyres = globaldummy;
>
>  t1 = GetClock();
>  do {
>    BITBOARD index = RanrotA()%nents;
>    dummyres ^= hashtable[index];
>  } while( i++ < n );
>  t2 = GetClock();
>
>  globaldummy = dummyres;
>
>  return(t2-t1);
>}
>
>int DoNreadwrites(BITBOARD n) {
>  BITBOARD i=1,dummyres,nents;
>  int t1,t2;
>
>  nents = nentries; /* hopefully this gets into a register */
>  dummyres = globaldummy;
>
>  t1 = GetClock();
>  do {
>    BITBOARD index = RanrotA()%nents;
>    dummyres ^= hashtable[index];
>    hashtable[index] = dummyres;
>  } while( i++ < n );
>  t2 = GetClock();
>
>  globaldummy = dummyres;
>
>  return(t2-t1);
>}
>
>void TestLatency(float ns_rng) {
>  BITBOARD n,nps_read,nps_rw,nps_rng;
>  float ns,fns;
>  int timetaken;
>
>  printf("Doing random RNG test. Please wait..\n");
>  n = 50000000; // 50 mln
>  timetaken = DoNrng(n);
>  nps_rng = (1000*n) / (BITBOARD)timetaken;
>  fns  = ToNano(nps_rng);
>  printf("Machine needs %f ns for RND loop\n",fns);
>
>  /* READING SINGLE CPU RANDOM ENTRIES */
>  printf("Doing random read tests single cpu. Please wait..\n");
>  n = 100000000; // 100 mln
>  timetaken = DoNreads(n);
>  nps_read = (1000*n) / (BITBOARD)timetaken;
>  ns  = ToNano(nps_read);
>  printf("Machine needs %f ns for single cpu random reads.\nExtrapolated=%f nanoseconds a read\n",ns,ns-fns);
>
>  /* READING AND THEN WRITING SINGLE CPU RANDOM ENTRIES */
>  printf("Doing random readwrite tests single cpu. Please wait..\n");
>  n = 100000000; // 100 mln
>  timetaken = DoNreadwrites(n);
>  nps_rw = (1000*n) / (BITBOARD)timetaken;
>  ns  = ToNano(nps_rw);
>  printf("Machine needs %f ns for single cpu random readwrites.\n",ns);
>  printf("Extrapolated=%f nanoseconds a readwrite (to the same slot)\n\n",ns-fns);
>
>  printf("So far the useless tests.\nBut we have vague read/write nodes a second numbers now\n");
>}
>
>int AllocateTree(void) { /* initialize the tree. returns 0 if error */
>  #if UNIX
>  shm_tree = shmget(
>              #if IRIX
>              ftok(".",'t'),
>              #else
>              IPC_PRIVATE,
>              #endif
>              sizeof(GlobalTree),IPC_CREAT|0777);
>  if( shm_tree == -1 )
>    return 0;
>  tree = (GlobalTree *)shmat(shm_tree,0,0);
>  if( tree == (GlobalTree *)-1 )
>    return 0;
>  #else /* so windows NT. This might even work under win98 and such crap OSes,
>but not win95 */
>  if( !ProcessNumber ) {
>    HANDLE TreeFileMap;
>    TreeFileMap = CreateFileMapping(INVALID_HANDLE_VALUE,NULL,PAGE_READWRITE,0,
>     (DWORD)sizeof(GlobalTree),"RASM_Tree");
>    if( TreeFileMap == NULL )
>      return 0;
>    tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
>    if( tree == NULL )
>      return 0;
>  }
>  else { /* Slaves attach also try to attach to the tree */
>    HANDLE TreeFileMap;
>    TreeFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Tree");
>    if( TreeFileMap == NULL )
>      return 0;
>    tree = (GlobalTree *)MapViewOfFile(TreeFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
>    if( tree == NULL )
>      return 0;
>  }
>  #endif
>  return 1;
>}
>
>int AllocateHash(void) { /* initialize the hashtable (cache). returns 0 if error
>*/
>  #if UNIX
>  shm_hash = shmget(
>              #if IRIX
>              ftok(".",'h'),
>              #else
>              IPC_PRIVATE,
>              #endif
>              tree->nentries*8,IPC_CREAT|0777);
>  if( shm_hash == -1 )
>    return 0;
>  hashtable = (BITBOARD *)shmat(shm_hash,0,0);
>  if( hashtable == (BITBOARD *)-1 )
>    return 0;
>  #else /* so windows NT. This might even work under win98 and such crap OSes,
>but not win95 */
>  if( !ProcessNumber ) {
>    HANDLE HashFileMap;
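>    /* pass the byte count as high and low 32-bit halves so sizes >= 4 GB don't wrap */
>    HashFileMap = CreateFileMapping(INVALID_HANDLE_VALUE,NULL,PAGE_READWRITE,
>     (DWORD)((tree->nentries*8)>>32),(DWORD)(tree->nentries*8),"RASM_Hash");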
>    if( HashFileMap == NULL )
>      return 0;
>    hashtable = (BITBOARD
>*)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
>    if( hashtable == NULL )
>      return 0;
>  }
>  else { /* Slaves attach also try to attach to the tree */
>    HANDLE HashFileMap;
>    HashFileMap = OpenFileMapping(FILE_MAP_ALL_ACCESS,FALSE,"RASM_Hash");
>    if( HashFileMap == NULL )
>      return 0;
>    hashtable = (BITBOARD
>*)MapViewOfFile(HashFileMap,FILE_MAP_ALL_ACCESS,0,0,0);
>    if( hashtable == NULL )
>      return 0;
>  }
>  #endif
>  return 1;
>}
>
>int StartProcesses(int ncpus) {
>  char buf[256];
>  int i;
>  /* returns 1 if ncpus-1 started ok */
>  if( ncpus == 1 )
>    return 1;
>
>  for( i = 1 ; i < ncpus ; i++ ) {
>    sprintf(buf,"%i_%i",i+1,ncpus);
>    #if UNIX
>    if( !fork() ) {
>      execl(rasmexename,rasmexename,buf,(char *)NULL);
>      _exit(1); /* only reached when execl failed; the child must not keep forking */
>    }
>    #else
>    (void)_spawnl(_P_NOWAIT,rasmexename,rasmexename,buf,NULL);
>     #endif
>  }
>  return 1;
>}
>
>void InitTree(int ncpus) {
>  int i;
>
>  for( i = 0 ; i < ncpus ; i++ ) {
>    tree->ps[i].status   = STATUS_NOTSTARTED;
>    tree->ps[i].readread = 0;
>  }
>}
>
>void WaitForStatus(int ncpus,int waitforstate) {
>  /* wait for all processors to have the same state */
>  int i,badluck=1;
>
>  while( badluck ) {
>    badluck = 0;
>    for( i = 0 ; i < ncpus ; i++ ) {
>      if( tree->ps[i].status != waitforstate )
>        badluck = 1;
>    }
>  }
>}
>
>void PutStatus(int ncpus,int statenew) {
>  int i;
>  for( i = 0 ; i < ncpus ; i++ ) {
>    tree->ps[i].status = statenew;
>  }
>}
>
>int CheckAllStatus(int ncpus,int status) {
>  /* Tries with a single loop to determine whether the other cpu's also finished
>   *
>   * returns:
>   *     true  ==> when all the processes have this status
>   *     false ==> when 1 or more are still busy measuring
>   */
>  int i,badluck=1;
>  for( i = 0 ; i < ncpus ; i++ ) {
>    if( tree->ps[i].status != status ) {
>      badluck = 0;
>      break;
>    }
>  }
>  return badluck;
>}
>
>void Slapen(int ms) {
>  #if UNIX
>  usleep(ms*1000); /* 0.050 000 seconds, it is in microseconds! */
>  #else
>  Sleep(ms);     /* 0.050 seconds, it is in milliseconds */
>  #endif
>}
>
>float LoopRandom(void) {
>  BITBOARD n,nps_rng;
>  float fns;
>  int timetaken;
>  printf("Benchmarking random RNG test. Please wait..\n");
>  n = 25000000; // 25 mln, doubled to 50 mln on the first pass
>  timetaken = 0;
>  while( timetaken < 500 ) {
>    n += n;
>    timetaken = DoNrng(n);
>  }
>  printf("timetaken=%i\n",timetaken);
>  nps_rng = (1000*n) / (BITBOARD)timetaken;
>  fns  = ToNano(nps_rng);
>  printf("Machine needs %f ns for RND loop\n",fns);
>  return fns;
>}
>
>
>/* Example showing how to use the random number generator: */
>int main(int argc,char *argv[]) {
>  /* allocate a big memory buffer parameter is in bytes.
>   * don't hesitate to MODIFY this to how many gigabytes
>   * you want to try.
>   * The more the better i keep saying to myself.
>   *
>   * Note that under linux your maximum shared memory limit can be set with:
>   *
>   * echo <size> > /proc/sys/kernel/shmmax
>   *
>   * and under IRIX it is usually 80% from the total RAM onboard that can get
>   * allocated
>   */
>
>  BITBOARD nbytes,firstguess;
>  float ns_rng,f_loop;
>  int cpus,tottimes,t1,t2;
>
>
>  if( argc <= 1 ) {
>    printf("Latency test usage is: latency <buffer> <cpus>\n");
>    printf("Where 'buffer' is the buffer in number of bytes to allocate\n");
>    printf("and where 'cpus' is the number of processes that this test will try to use (1 = default) \n");
>    return 1;
>  }
>
>  /* parse the input */
>  nbytes = 0;
>  cpus   = 1; // default
>
>  if( strchr(argv[1],'_') == NULL ) { /* main startup process */
>    int np = 0;
>    #if UNIX
>     #if FREEBSD
>     nbytes = (BITBOARD)atoi(argv[1]); // freebsd doesn't support > 2 GB memory
>     #else
>     nbytes = (BITBOARD)atoll(argv[1]);
>     #endif
>    #else
>    nbytes = (BITBOARD)_atoi64(argv[1]);
>    #endif
>
>    printf("Welcome to RASM Latency!\n");
>    printf("RASML measures the RANDOM AVERAGE SHARED MEMORY LATENCY!\n\n");
>
>    if( argc > 2 ) {
>      cpus = 0;
>      do {
>        cpus *= 10;
>        cpus += (int)(argv[2][np]-'1')+1;
>        np++;
>      } while( argv[2][np] >= '0' && argv[2][np] <= '9' );
>    }
>    //printf("Master: buffer = %s bytes. #CPUs = %i\n",To64(nbytes),cpus);
>    ProcessNumber = 0;
>
>    /* check whether we are not getting out of bounds */
>    if( cpus > MAXPROCESSES ) {
>      printf("Error: Recompile with a bigger stack for MAXPROCESSES. %i processors is too much\n",cpus);
>      return 1;
>    }
>
>    /* find out the file name */
>    #if UNIX
>    strcpy(rasmexename,argv[0]);
>    #else
>    GetModuleFileName(NULL,rasmexename,2044);
>    #endif
>    printf("Stored in rasmexename = %s\n",rasmexename);
>  }
>  else { //   latency 2_452  ==>  means processor 2 out of 452.
>    int np = 0;
>
>    ProcessNumber = 0;
>    do {
>      ProcessNumber *= 10;
>      ProcessNumber += (argv[1][np]-'1')+1;      // n
>      np++;
>    } while( argv[1][np] >= '0' && argv[1][np] <= '9' );
>
>    ProcessNumber--; // 1 less because of ProcessNumber ==> [0..n-1]
>
>    np++; // skip underscore
>
>    cpus = 0;
>    do {
>      cpus *= 10;
>      cpus += (argv[1][np]-'1')+1;      // n
>      np++;
>    } while( argv[1][np] >= '0' && argv[1][np] <= '9' );
>    //printf("Slave: ProcessNumber=%i cpus=%i\n",ProcessNumber,cpus);
>  }
>
>  /* first we setup the random number generator. */
>  RanrotAInit();
>
>  /* initialize shared memory tree; it gets used for communication between the
>processes */
>  if( !AllocateTree() ) {
>    printf("Error: ProcessNumber %i could not allocate the tree\n",ProcessNumber);
>    return 1;
>  }
>
>  if( !ProcessNumber )
>    ParseBuffer(nbytes);
>
>  nentries = tree->nentries;
>
>  /* Now some stuff only the Master has to do */
>  if( !ProcessNumber ) {
>    /* Master: now let's time the pseudo random generators speed in nanoseconds
>a call */
>    ns_rng = TimeRandom();
>    f_loop = LoopRandom();
>
>    printf("Trying to Allocate Buffer\n");
>    t1 = GetClock();
>    if( !AllocateHash() ) {
>      printf("Error: Could not allocate buffer!\n");
>      return 1;
>    }
>    t2 = GetClock();
>    printf("Took %i.%03i seconds to allocate Hash\n",(t2-t1)/1000,(t2-t1)%1000);
>    ClearHash();
>    t1 = GetClock();
>    printf("Took %i.%03i seconds to clear Hash\n",(t1-t2)/1000,(t1-t2)%1000);
>
>    /* so now hashtable is setup and we know quite some stuff. So it is time to
>     * start all other processes */
>    InitTree(cpus);
>
>    printf("Starting Other processes\n");
>    t1 = GetClock();
>    if( !StartProcesses(cpus) ) {
>      printf("Error: Could not start processes\n");
>      DeAllocate();
>    }
>  }
>  else { /* all Slaves do this */
>    if( !AllocateHash() ) {
>      printf("Error: slave %i Could not allocate buffer!\n",ProcessNumber);
>      return 1;
>    }
>  }
>
>  tree->ps[ProcessNumber].status = STATUS_READ;
>
>  if( !ProcessNumber ) {
>    WaitForStatus(cpus,STATUS_READ);
>    t2 = GetClock();
>    printf("Took %i milliseconds to start %i additional processes\n",t2-t1,cpus-1);
>    printf("Read latency measurement STARTS NOW using steps of 2 * %i.%03i seconds :\n",
>     (SWITCHTIME/1000),(SWITCHTIME%1000));
>  }
>
>  firstguess = 200000;
>  tottimes   = 0;
>
>  for( ;; ) {
>    int timetaken = 0;
>    if( tree->ps[ProcessNumber].status == STATUS_MEASUREREAD ) {
>      /* this really MEASURES the readread */
>      BITBOARD ntried = 0,avnumber;
>      int totaltime=0;
>      while( totaltime < SWITCHTIME ) { /* go measure around switchtime seconds
>*/
>        totaltime += DoNreads(firstguess);
>        ntried += firstguess;
>      }
>      /* now put the average number of readreads into the shared memory */
>      avnumber = (ntried*1000) / (BITBOARD)totaltime;
>      tree->ps[ProcessNumber].readread = avnumber;
>
>      /* show that it is finished */
>      tree->ps[ProcessNumber].status = STATUS_MEASUREDREAD;
>
>      /* now keep doing the same thing until status gets modified */
>      while( tree->ps[ProcessNumber].status == STATUS_MEASUREDREAD ) {
>        (void)DoNreads(firstguess);
>        if( !ProcessNumber ) {
>          if( CheckAllStatus(cpus,STATUS_MEASUREDREAD) ) {
>            PutStatus(cpus,STATUS_QUIT);
>            break;
>          }
>        }
>      }
>    }
>    else if( tree->ps[ProcessNumber].status == STATUS_READ ) {
>      BITBOARD nextguess;
>      /* now software must try to determine how many reads a second are
>       * possible for that process
>       */
>      //printf("proc=%i trying %s reads\n",ProcessNumber,To64(firstguess));
>      timetaken = DoNreads(firstguess);
>      /* try to guess such that next test takes 1 second, or if test was too
>       * inaccurate then double the number simply. also prevents divide by
>       * zero error ;)
>       */
>      if( timetaken < 400 )
>        nextguess = firstguess*2;
>      else
>        nextguess = (firstguess*1000)/(BITBOARD)timetaken;
>      firstguess = nextguess;
>      if( !ProcessNumber ) {
>        tottimes += timetaken;
>        if( tottimes >= SWITCHTIME ) { // 30 seconds to a few minutes
>          PutStatus(cpus,STATUS_MEASUREREAD);
>          //PutStatus(cpus,STATUS_QUIT);
>          tottimes = 0;
>        }
>      }
>    }
>    else if( tree->ps[ProcessNumber].status == STATUS_QUIT )
>      break;
>  }
>
>  /* now do the latency tests
>   */
>  //TestLatency(ns_rng);
>  tree->ps[ProcessNumber].status = STATUS_QUIT;
>  if( !ProcessNumber ) {
>    BITBOARD averagereadread;
>    int i;
>    averagereadread = 0;
>    WaitForStatus(cpus,STATUS_QUIT);
>    for( i = 0; i < cpus ; i++ ) {
>      averagereadread += tree->ps[i].readread;
>    }
>    averagereadread /= (BITBOARD)cpus;
>    printf("Raw Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread));
>    printf("Now for the final calculation it gets compensated:\n");
>    printf("  Average measured read read time at %i processes = %f ns\n",cpus,ToNano(averagereadread)-f_loop);
>  }
>
>  DeAllocate();
>  return 0;
>}
>
>/* EOF latency.c */




Last modified: Thu, 15 Apr 21 08:11:13 -0700
