Computer Chess Club Archives

Subject: Re: assembly--not really that fast

Author: Eugene Nalimov
Date: 15:23:31 01/14/02
int a, b, c, d, x;

void foo (void)
{
	a = b; x = 1; c = d;
}

_foo    PROC NEAR
; File c:\repro\c1.c
; Line 5
        mov     eax, DWORD PTR _b
        mov     ecx, DWORD PTR _d
        mov     DWORD PTR _a, eax
        mov     DWORD PTR _x, 1
        mov     DWORD PTR _c, ecx
; Line 6
        ret     0
_foo    ENDP

Regarding the 30% slowdown: I believe what happens here is that you are comparing
a carefully optimized assembly program (whose optimization took several years) with
a recently written C program that is not optimized yet. You have to profile your
program, look at the hot spots, play with "inline" ("__forceinline" for VC),
probably look at the assembly output, etc. I would not be surprised if you
have to change some algorithms, as you cannot code them efficiently in C. It
looks like you have written a correct program. Now you should make it faster --
don't expect the compiler to go all the way for you.
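
As a minimal sketch of the "play with inline" step, assuming a hypothetical pair of small helpers that the profiler showed to be hot: the macro name FORCE_INLINE and the functions below are made up for illustration, and the GCC/Clang branch is an assumption for readers not using VC. Whether forcing inlining helps has to be checked by measuring and by reading the assembly output.

```c
#include <assert.h>

/* Portable spelling of a strong inlining hint.
   __forceinline is the VC keyword; the attribute form is the
   GCC/Clang counterpart (an assumption, not part of the original post). */
#if defined(_MSC_VER)
#  define FORCE_INLINE static __forceinline
#elif defined(__GNUC__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#else
#  define FORCE_INLINE static inline
#endif

/* Hypothetical hot helper: file (0..7) of a square numbered 0..63. */
FORCE_INLINE int square_file(int sq)
{
    assert(sq >= 0 && sq < 64);   /* compiled out when NDEBUG is defined */
    return sq & 7;
}

/* Hypothetical hot helper: rank (0..7) of a square numbered 0..63. */
FORCE_INLINE int square_rank(int sq)
{
    assert(sq >= 0 && sq < 64);
    return sq >> 3;
}

/* Hypothetical caller from an inner loop: returns 1 when the two
   squares share a file or a rank, 0 otherwise. */
int same_file_or_rank(int from, int to)
{
    return square_file(from) == square_file(to)
        || square_rank(from) == square_rank(to);
}
```

Compiling with "cl /Ox /Fa" and checking that no call to the helpers remains in same_file_or_rank is a quick way to see whether the hint was honored.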

Intel's documentation: start from www.intel.com and choose "Software
developers". Specifically, the P4 optimization manual is located at
http://developer.intel.com/design/pentium4/manuals/248966.htm.
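
On the store-forwarding point (case #4 in the quoted text below), here is a rough sketch of one alternative to redefining separate chars as 32-bit words. It assumes the eight flags can live in one array inside a struct; the names ply_flags, clear_flags and save_flags are hypothetical. An array is guaranteed to be contiguous, and memset or a struct copy lets the compiler choose the store width, so the source never mixes byte-sized and word-sized accesses to the same bytes by hand. Treat it as an illustration to profile, not a guaranteed win.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical per-ply state: eight byte-sized flags kept in one
   array, so they form a single contiguous 8-byte block. */
struct ply_flags {
    unsigned char flag[8];
};

/* Zero all eight flags; the compiler is free to use wide stores. */
static void clear_flags(struct ply_flags *f)
{
    assert(f != NULL);
    memset(f->flag, 0, sizeof f->flag);
}

/* "Stack" the flags when going one ply deeper: a plain struct
   assignment copies the whole block. */
static void save_flags(struct ply_flags *dst, const struct ply_flags *src)
{
    assert(dst != NULL && src != NULL);
    *dst = *src;
}
```

Individual flags are still read and written as f->flag[i], so the rest of the code keeps using byte accesses only.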

Eugene

On January 14, 2002 at 16:54:41, Ed Schröder wrote:

>On January 14, 2002 at 13:38:06, Eugene Nalimov wrote:
>
>>On January 14, 2002 at 04:16:54, Ed Schröder wrote:
>>
>>>On January 13, 2002 at 23:36:19, Eugene Nalimov wrote:
>>>
>>>>Can you please send me the function that was so badly compiled (probably via
>>>>e-mail)? I'd like to find where VC screwed up. It's too late to fix it for VC7,
>>>>but probably we can do it for VC7.x.
>>>>
>>>>Eugene
>>>
>>>
>>>Screwed up is a big word, ASM being just 30% faster than C is a very good
>>>performance I would say. From memory, I recall the following cases:
>>>
>>>#1. a=b; c=d;
>>>
>>>The compiler will output something like:
>>>
>>>mov  EAX,b
>>>mov  a,EAX
>>>mov  EAX,d
>>>mov  c,EAX
>>>
>>>Whereas it should generate:
>>>
>>>mov  EAX,b
>>>mov  EBX,d
>>>mov  a,EAX
>>>mov  c,EBX
>>
>>---- File c1.c:
>>
>>int a, b, c, d;
>>
>>void foo (void)
>>{
>>	a = b; c = d;
>>}
>>
>>---- File c1.asm (compiled with "cl /Ox /Fa c1.c")
>>
>>[Some assembly stuff deleted]
>>
>>_foo	PROC NEAR
>>; File c:\repro\c1.c
>>; Line 5
>>	mov	eax, DWORD PTR _b
>>	mov	ecx, DWORD PTR _d
>>	mov	DWORD PTR _a, eax
>>	mov	DWORD PTR _c, ecx
>>; Line 6
>>	ret	0
>>_foo	ENDP
>
>
>That's good.
>
>Very well, but can the compiler for instance recognize:
>
>a = b; x=1; c = d;
>
>and do a good pipe-line job too?
>
>The combinations are endless of course.
>
>
>
>>>#2. Always these unavoidable MOVSX and MOVZX instructions. No compiler can
>>>optimize them away because it is impossible; only the ASM programmer knows what
>>>is allowed under the circumstances.
>>
>>Sometimes you can use C casts to avoid those... But yes, here assembly
>>programmer is definitely better.
>>
>>>#3. Register use, same story as (2). I for instance use EBP and even ESP when I
>>>am short on registers.
>>
>>VC, of course, uses EBP when it decides it's beneficial.
>>
>>>#4. "char" use in MSVC, for instance: char a1,a2,a3,a4,a5,a6,a7,a8;
>>>
>>>Will NOT produce the 8 characters as a sequential memory block. So in case I
>>>want to zero the 8 bytes I will be forced to write 8 instructions. Some other
>>>compilers do generate a sequential memory block so you can redefine a1 and a5 as
>>>32-bit and with 2 instructions zero them. This is pretty crucial in a chess
>>>program, at least in mine, also because I have to "stack" a lot of stuff when going
>>>one ply deeper in the tree or when climbing back.
>>
>>Never, never, do that on PIII and especially on P4. For the detailed explanation
>>look, for example, at "Intel Pentium 4 and Intel Xeon Processor Optimization
>>Reference Manual", Section 1-22 "Store Forwarding".
>
>Sounds alarming, my program is polluted with these kinds of juicy ASM tricks. Do
>you think it is a problem in ASM code too? And maybe you have a URL for the
>documentation at hand?
>
>Ed
>
>
>>Eugene
>>
>>>#5. Special stuff that no compiler is able to recognize, as only the ASM
>>>programmer knows it. I recently posted an example of how to use the "indirect
>>>jump" the processor offers, for instance when generating moves.
>>>
>>>So it is not about bugs, it is more about why no compiler will ever be able to
>>>beat an experienced ASM programmer. However I do think that there is room for
>>>improvement in cases (1) and (4), maybe even in (3).
>>>
>>>Ed
>>>
>>>
>>>
>>>
>>>>On January 13, 2002 at 18:51:02, Ed Schröder wrote:
>>>>
>>>>>On January 13, 2002 at 16:29:21, Tom Kerrigan wrote:
>>>>>
>>>>>>On January 13, 2002 at 07:05:02, Ed Schröder wrote:
>>>>>>
>>>>>>>I have to disagree, I have a MSVC6 version of Rebel and it runs 30% slower than
>>>>>>>the ASM version.
>>>>>>
>>>>>>What do you attribute this difference to? Is it simply not possible to write C
>>>>>>that produces the same assembly as your hand-written code? Or do you take
>>>>>>certain liberties in the C code (perhaps in the name of readability?) that are
>>>>>>slowing things down?
>>>>>>
>>>>>>-Tom
>>>>>
>>>>>Just have a look at the ASM code MSVC6 produces, it often is bad stuff. By
>>>>>re-writing (optimizing) this "bad ASM stuff" I got my +30%.
>>>>>
>>>>>One ambiguous remark, don't believe everything commercials are telling you :)
>>>>>
>>>>>Ed
Last modified: Thu, 15 Apr 21 08:11:13 -0700