The longest arithmetic operation
So far this is the absolute record for the binary size of one division/remainder/multiplication operation:

35 instructions, 87 bytes of code just to calculate a remainder of division by 2:
long long smod(long long x) { return x % 2; }
(compiled by gcc 3.4.4)
Anyone to come up with a longer single arithmeric operation?

Comments
MSVC is not really much smarter.
00000 6a 00 push 0
00002 6a 02 push 2
00004 ff 74 24 10 push DWORD PTR _a$[esp+8]
00008 ff 74 24 10 push DWORD PTR _a$[esp+8]
0000c e8 00 00 00 00 call __allrem
00011 c2 08 00 ret 8
And then if you expand _allrem, the code is actually longer than the case you mentioned :)
I wonder what's the smallest number of bytes you need to implement that signed modulo 2 function. Shortest one that I can come with is 18 bytes. (Uses _stdcall calling convention).
Posted by: Ludvig | December 23, 2005 03:32 PM
16 now :)
Posted by: Ludvig | December 23, 2005 03:38 PM
Well, as soon as you see the name "__allrem", you are done - you know the meaning of the code and can proceed.
With a long sequence of instructions without any names you are at your own.
Posted by: ilfak | December 31, 2005 08:11 AM
Outrageous. So it's back to hand-optimizing code? At least, we should check out the binary every once in a while.
Posted by: Alex | December 31, 2005 04:12 PM
In fact this is a very good code and runs almost optimally. Spending time on making it faster is not worth the trouble at all.
I completely agree with you that binaries should be checked from time to time. Some of them hide quite unexpected things...
Posted by: ilfak | December 31, 2005 05:35 PM
I always thought %2 should just return the lsb... But you say this code is almost optimal? What am I missing?
Posted by: Nonymous | January 1, 2006 03:21 PM
Thank you for posting that temp fix for the wmf vulnerability,my msn messenger went heywire, tried everything, antispy, anti-virus, without much success then i remembered reading about that exploit, so did some snooping and found your posting, i installed it and back to normal again,so do you think microsoft will post a fix in there upcomming malicious removal tool,did you say in your article to unistall the temp fix before running windows updates, the only trouble is you dont know if they have got it in the removal tool, any suggestions. Anyway nice to see there is guys who are savvy in coding fighting on the good side. Regards Mike!
Posted by: mike | January 1, 2006 04:27 PM
LSB will not work because
-1 % 2 = -1
So you have to preserve the sign.
Posted by: ilfak | January 1, 2006 04:40 PM
Hi Mike,
Thanks for your nice words! I think this comment is better in the other entry. I'll try to move it there if I find how.
Posted by: ilfak | January 1, 2006 04:41 PM
I see. Mathematically the sign needn't be preserved, I think, but I can see how it might be useful in programming. It still doesn't look very optimal to me, but ok ;)
Posted by: Nonymous | January 1, 2006 05:37 PM
Oh, you are absolutely right about optimization! I should have checked it once more before saying anything :)
Posted by: ilfak | January 2, 2006 03:32 AM
I can't believe people are saying this is anywhere near optimal. For starters, you can obviously delete redundant lines at 2c3, 2c9 and 2ce. And in reality before line eax only had one bit of useful information in it, the high bit. And all instructions from 2c7 to 2d1 either operate on edx (which is wiped out at 2de) or work on eax but don't alter the top bit. So you can delete all instructions from 2c7 to 2d1.
Personally, I came up with the following sequence. I haven't used Intel since before the 386, but I'm assuming the result comes out in edx:eax (H:L).
mov eax, dword ptr [ebp+arg_0]
and eax, 01h
mov edx, dword ptr [ebp+arg_0+4]
add edx, 0
jns lbl
neg eax
lbl: sar edx, 1Fh
retn
I don't know how long it is, and again, I'm not an Intel jock, so maybe it could be even better.
Posted by: how about this? | January 2, 2006 04:04 AM
You are right and I am wrong. The code is far from being optimal. I used the optimization flag in the command line but obviously I dropped it later.
Posted by: Anonymous | January 2, 2006 04:21 AM
The mentioned code sequence has the problem that it contains a jump, which is VERY BAD for performance.
While i agree that the code sequence is not optimal (though i've seen about the same function (%8 in that case) in about ~100 instructions on an Atmel AVR), the sign preserval is the real problem.
Solution: use "unsigned" not only to extend the range, but also to express what has to be done with negative numbers.
checking a user supplied array index "%size" would be a huge security risk, if the index is declared signed.
MSVC once had a pretty neat way of preserving the signs in ~4 instructions without any jumps (that was for 32bit, though). Watcom just used idiv.
Posted by: Felix | January 2, 2006 06:17 PM
Yeah, I know it has a jump. I tried to remove it, but I messed up, so I posted the other version.
Jump or no, this version is faster than the other version. Additionally, it's smaller, and smaller (overall) means less paging from disk, less cache utilization, less memory bandwidth needed to fetch instructions. That means it's even faster.
I know how to preserve sign on other architectures without jumping. I'll figure it out for Intel. It'll probably some from some kind of bittest (which goes into the carry on Intel) and then a carrying add to zero, thus putting the bit into the low part of a word.
Either way, gcc did a poor job here.
Posted by: Yeah, it has a jump | January 2, 2006 08:51 PM
Hmm, in 6502 code at least, %2 of a signed number would be:
cmp #$80 ;high bit -> carry
adc #$00 ;add carry to low bit, flipping it if set
and #$01 ;retain only the low bit
Should translate to x86 :)
Posted by: White Flame (aka David Holz) | January 11, 2006 06:58 PM
Unfortunately the code for 6502 gives
-1 % 2 = 0
which is incorrect.
Posted by: ilfak | January 12, 2006 12:58 PM
Well, I managed to get it down to 20 bytes (11 instructions). No jumps, no idiv.
Although to be honest, I can't quite see why you'd need the modulus of a negative number anyway...
I quite liked the fact that MSVC's code to peform the function call to __allrem is actually longer than my entire routine :-)
Ludvig, would you consider posting your 16-byte version? I'd like to see it.
_smod:
pop ecx ; 59
pop eax ; 58
pop edx ; 5a
rol edx ; d1 c2
and eax, 1 ; 83 e0 01
and edx, eax ; 21 c2
sub eax, edx ; 29 d0
sub eax, edx ; 29 d0
cdq ; 99
sub esp, 8 ; 83 ec 08
jmp ecx ; ff e1
Posted by: Kayamon | January 19, 2006 05:04 AM
arg, I seem to have gotten too caught up in the post above me, having been shown this page from elsewhere. :) Anyway, in restitution, what about this method?
Roll the upper bit into the lower bit:
SxxxxxxxxxxxxxL (S=sign bit, L=LSB, too few x's)
xxxxxxxxxxxxxLS
Then AND by 3 and do a table lookup, holding the table right after the function return.
(non-x86 guy goes elsewhere now) :)
Posted by: White Flame | January 23, 2006 10:09 AM