__udivmod{d,t}i4() Functions
__ashl{s,d,t}i3() Functions
__ashr{s,d,t}i3() Functions
__lshr{s,d,t}i3() Functions
__rotl{s,d,t}i3() and __rotr{s,d,t}i3() Functions
__cmp{s,d,t}i2() Functions
__ucmp{s,d,t}i2() Functions
__mul{s,d,t}i3() Functions
__neg{s,d,t}i2() Functions
__absv{s,d,t}i2() Functions
__addv{s,d,t}i3() Functions
__subv{s,d,t}i3() Functions
__mulv{s,d,t}i3() Functions
__negv{s,d,t}i2() Functions
__parity{s,d,t}i2() Functions
__builtin_parity() Builtin
__builtin_mul_overflow() Builtin
__builtin_sub_overflow() Builtin
__builtin_copysign() Builtin
-fsanitize=signed-integer-overflow Command Line Option
-ftrapv Command Line Option
Shell Game, or Undefined Behaviour in __bswapsi2() Function
Undefined Behaviour or Optimiser Failure?
The code generated by GCC for the runtime functions of its libgcc library is rather poor, up to an order of magnitude worse than properly written (optimised) code; the cases below demonstrate these shortcomings, especially of the rather poor code generator.
Note: I recommend reading Randall Hyde’s ACM article, especially the following part:
Observation #6: Software engineers have been led to believe that their time is more valuable than CPU time; therefore, wasting CPU cycles in order to reduce development time is always a win. They've forgotten, however, that the application users' time is more valuable than their time.
He uses the following example:
bool isWhitespace(char c)
{
return c == ' '
|| c == '\r'
|| c == '\n'
|| c == '\t';
}
He then shows the following code generated by
GCC 9.1 for the
AMD64 platform, which uses 7 instructions in 27 bytes:
isWhitespace(char):
        xor     eax, eax            ; result = false
        cmp     dil, 32             ; is c > 32?
        ja      .L4                 ; if so, exit with false
        movabs  rax, 4294977024     ; rax = 0x100002600
        shrx    rax, rax, rdi       ; rax >>= c
        and     eax, 1              ; result = rax & 1
.L4:    ret

This code is not optimal: the conditional branch impairs performance, and it should and can be avoided, as demonstrated by the following equivalent code, which uses 7 instructions in 26 bytes:
; Copyleft © 2020-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
isWhitespace:
movabs rax, 100002600h
shrx rax, rax, rdi
cmp dil, 33
sbb edi, edi
neg edi
and eax, edi
ret
Since the SHRX instruction is only available on newer CPUs, here is branch-free code that runs on all AMD64 processors, with 8 instructions in just 25 bytes:
; Copyleft © 2020-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
isWhitespace:
mov ecx, edi
movabs rax, 100002600h
shr rax, cl
cmp cl, 33
sbb ecx, ecx
neg ecx
and eax, ecx
ret
Finally, equivalent branch-free code that works on 35-year-old i386 processors, needing neither an AMD64 processor nor the SHRX instruction, with 10 instructions in 26 bytes when the preprocessor macro OPTIMIZE is not defined, else just 25 bytes:
; Copyleft © 2020-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
isWhitespace:
mov ecx, [esp+4]
#ifndef OPTIMIZE
mov eax, 2600h ; eax = (1 << '\r') | (1 << '\n') | (1 << '\t')
test cl, cl
setnz al ; eax |= (c != '\0')
#else
xor eax, eax
cmp eax, ecx ; CF = (c != '\0')
adc eax, 2600h ; eax = (c != '\0') | (1 << '\r') | (1 << '\n') | (1 << '\t')
#endif
shr eax, cl ; eax >>= (c % ' ')
xor edx, edx
cmp ecx, 33 ; CF = (c <= ' ')
adc edx, edx ; edx = (c <= ' ')
and eax, edx
ret
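For reference, the bit-mask test that all these variants build upon can also be written in portable C; the following sketch is mine (isWhitespaceBitmap() is my name, and the && guard stands in for the SBB/NEG masking of the assembly above):

#include <stdbool.h>

// Bits 9, 10, 13 and 32 of the mask 0x100002600 mark '\t', '\n', '\r' and ' '.
bool isWhitespaceBitmap(char c)
{
    unsigned u = (unsigned char) c;  // avoid negative values and shift counts
    return u <= 32 && ((0x100002600ULL >> u) & 1);
}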
__udivmod{d,t}i4() Functions

libgcc provides the __udivmoddi4() and __udivmodti4() functions for unsigned integer division:

- Runtime Function: unsigned long __udivmoddi4 (unsigned long a, unsigned long b, unsigned long *c)
- Runtime Function: unsigned long long __udivmodti4 (unsigned long long a, unsigned long long b, unsigned long long *c)

These functions calculate both the quotient and remainder of the unsigned division of a and b. The return value is the quotient, and the remainder is placed in the variable pointed to by c.

The last 5 statements of the following part of their source perform two multi-word left shifts on the variable pair d1, d0 holding the divisor and the variable triple n2, n1, n0 holding the dividend:
/* More subroutines needed by GCC output code on some machines. */
/* Compile this one with gcc. */
/* Copyright (C) 1989-2024 Free Software Foundation, Inc.
[…]
UDWtype
__udivmoddi4 (UDWtype n, UDWtype d, UDWtype *rp)
{
[…]
UWtype d0, d1, n0, n1, n2;
UWtype b, bm;
[…]
count_leading_zeros (bm, d1);
if (bm == 0)
[…]
else
{
UWtype m1, m0;
/* Normalize. */
b = W_TYPE_SIZE - bm;
d1 = (d1 << bm) | (d0 >> b);
d0 = d0 << bm;
n2 = n1 >> b;
n1 = (n1 << bm) | (n0 >> b);
n0 = n0 << bm;
[…]
For the AMD64 target platform,
GCC compiles this
sequence of shifts to the following unoptimised and rather
awful code:
[…]
0000000000000000 <__udivmodti4>:
[…]
  b0: 4c 0f bd da             bsr    %rdx,%r11
  b4: 49 83 f3 3f             xor    $0x3f,%r11
  b8: 45 85 db                test   %r11d,%r11d
  bb: 75 33                   jne    f0 <__udivmodti4+0xf0>
[…]
  f0: 49 63 c3                movslq %r11d,%rax
  f3: bd 40 00 00 00          mov    $0x40,%ebp
  f8: 44 89 d9                mov    %r11d,%ecx
  fb: 4d 89 ec                mov    %r13,%r12
  fe: 48 29 c5                sub    %rax,%rbp
 101: 48 d3 e2                shl    %cl,%rdx
 104: 49 89 f2                mov    %rsi,%r10
 107: 48 89 f8                mov    %rdi,%rax
 10a: 89 e9                   mov    %ebp,%ecx
 10c: 44 89 db                mov    %r11d,%ebx
 10f: 49 d3 ec                shr    %cl,%r12
 112: 44 89 d9                mov    %r11d,%ecx
 115: 49 d3 e5                shl    %cl,%r13
 118: 89 e9                   mov    %ebp,%ecx
 11a: 49 09 d4                or     %rdx,%r12
 11d: 49 d3 ea                shr    %cl,%r10
 120: 44 89 d9                mov    %r11d,%ecx
 123: 48 d3 e6                shl    %cl,%rsi
 126: 89 e9                   mov    %ebp,%ecx
 128: 4c 89 d2                mov    %r10,%rdx
 12b: 48 d3 e8                shr    %cl,%rax
 12e: 44 89 d9                mov    %r11d,%ecx
 131: 48 09 c6                or     %rax,%rsi
 134: 48 d3 e7                shl    %cl,%rdi
 137: 48 89 f0                mov    %rsi,%rax
 13a: 49 89 f9                mov    %rdi,%r9
[…]

Oops: the TEST instruction at address b8 is superfluous.
Ouch: the optimiser fails to recognise the common and well-known idiom of complementary shifts, for which it should generate the following properly optimised and significantly shorter code instead:
[…]
        mov     %r11d, %ecx
        xor     n2, n2
        shld    %cl, n1, n2
        shld    %cl, n0, n1
        shl     %cl, n0
        shld    %cl, d0, d1
        shl     %cl, d0
[…]

Note: the optimiser does recognise the following similar, also common and well-known idiom used for rotation:
unsigned long long __rotldi3(unsigned long long value, int count) {
return (value << (63 & count))
| (value >> (63 & -count));
}
The code generated by GCC for the __udivmodti4() function comprises 141 instructions in 439 bytes, while properly written and faster code uses only 85 instructions in just 232 bytes:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
# * The software is provided "as is" without any warranty, neither express
# nor implied.
# * In no event will the author be held liable for any damage(s) arising
# from the use of the software.
# * Redistribution of the software is allowed only in unmodified form.
# * Permission is granted to use the software solely for personal private
# and non-commercial purposes.
# * An individual's use of the software in his or her capacity or function
# as an agent, (independent) contractor, employee, member or officer of
# a business, corporation or organization (commercial or non-commercial)
# does not qualify as personal private and non-commercial purpose.
# * Without written approval from the author the software must not be used
# for a business, for commercial, corporate, governmental, military or
# organizational purposes of any kind, or in a commercial, corporate,
# governmental, military or organizational environment of any kind.
# Unix System V calling convention for AMD64 platform:
# - first 6 integer or pointer arguments (from left to right) are passed
# in registers RDI/R7, RSI/R6, RDX/R2, RCX/R1, R8 and R9
# (R10 is used as static chain pointer in case of nested functions);
# - surplus arguments are pushed on stack in reverse order (from right to
# left), 8-byte aligned;
# - 128-bit integer arguments are passed as pair of 64-bit integer arguments,
# low part before/below high part;
# - 128-bit integer result is returned in registers RAX/R0 (low part) and
# RDX/R2 (high part);
# - 64-bit integer or pointer result is returned in register RAX/R0;
# - 32-bit integer result is returned in register EAX;
# - registers RBX/R3, RSP/R4, RBP/R5, R12, R13, R14 and R15 must be
# preserved;
# - registers RAX/R0, RCX/R1, RDX/R2, RSI/R6, RDI/R7, R8, R9, R10 (in
# case of normal functions) and R11 are volatile and can be clobbered;
# - stack is 16-byte aligned: callee must decrement RSP by 8+n*16 bytes
# before calling other functions (CALL instruction pushes 8 bytes);
# - a "red zone" of 128 bytes below the stack pointer can be clobbered.
# NOTE: raises "division exception" when divisor is 0!
.file "udivmodti4.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = dividend
# rcx:rdx = divisor
__umodti3:
sub rsp, 24
mov r8, rsp # r8 = oword ptr remainder
call __udivmodti4
pop rax
pop rdx # rdx:rax = remainder
pop rcx
ret
# rsi:rdi = dividend
# rcx:rdx = divisor
__udivti3:
xor r8, r8
# rsi:rdi = dividend
# rcx:rdx = divisor
# r8 = oword ptr remainder
__udivmodti4:
cmp rdi, rdx
mov rax, rsi
sbb rax, rcx
jb .trivial # dividend < divisor?
mov r10, rdx
mov r11, rcx # r11:r10 = divisor
bsr rcx, rcx # rcx = index of most significant '1' bit
# in high qword of divisor
jnz .extended # high qword of divisor <> 0?
# remainder < divisor < 2**64
cmp rsi, rdx
jae .long # high qword of dividend >= (low qword of) divisor?
# dividend < divisor * 2**64: quotient < 2**64
# perform normal division
.normal:
mov rdx, rsi
mov rax, rdi # rdx:rax = dividend
div r10 # rax = (low qword of) quotient,
# rdx = (low qword of) remainder
test r8, r8
jz 0f # address of remainder = 0?
mov [r8], rdx
mov [r8+8], r11 # high qword of remainder = 0
0:
mov rdx, r11 # rdx:rax = quotient
ret
# dividend >= divisor * 2**64: quotient >= 2**64
# perform "long" alias "schoolbook" division
.long:
mov rdx, r11 # rdx = 0
mov rax, rsi # rdx:rax = high qword of dividend
div r10 # rax = high qword of quotient,
# rdx = high qword of remainder'
mov rcx, rax # rcx = high qword of quotient
mov rax, rdi # rax = low qword of dividend
div r10 # rax = low qword of quotient,
# rdx = (low qword of) remainder
test r8, r8
jz 1f # address of remainder = 0?
mov [r8], rdx
mov [r8+8], r11 # high qword of remainder = 0
1:
mov rdx, rcx # rdx:rax = quotient
ret
# dividend < divisor: quotient = 0, remainder = dividend
.trivial:
test r8, r8
jz 2f # address of remainder = 0?
mov [r8], rdi
mov [r8+8], rsi # remainder = dividend
2:
xor eax, eax
xor edx, edx # rdx:rax = quotient = 0
ret
# dividend >= divisor >= 2**64: quotient < 2**64
.extended:
xor ecx, 63 # ecx = number of leading '0' bits
# in (high qword of) divisor
jz .special # divisor >= 2**127?
# perform "extended & adjusted" division
shld r11, r10, cl
shl r10, cl # r11:r10 = divisor * 2**(index + 1)
# = divisor'
xor edx, edx
shld rdx, rsi, cl
shld rsi, rdi, cl
shl rdi, cl # rdx:rsi:rdi = dividend * 2**(index + 1)
# = dividend'
mov rax, rsi
div r11 # rax = quotient',
# rdx = high qword of remainder'
mov rsi, rdx # rsi:rdi = remainder'
mov r9, rax # r9 = quotient'
mul r10 # rdx:rax = quotient' * low qword of divisor'
# = remainder"
.ifnotdef JCCLESS
cmp rdx, rsi
ja 4f
jb 3f
cmp rax, rdi
ja 4f # remainder" > remainder'?
.else
push rbx
mov rbx, rdx
cmp rax, rdi
sbb rbx, rsi
pop rbx
ja 4f # quotient' > quotient?
.endif
3:
test r8, r8
jnz 5f # address of remainder <> 0?
xor edx, edx
mov rax, r9 # rdx:rax = quotient
ret
4:
dec r9 # r9 = quotient' - 1
# = (low qword of) quotient
test r8, r8
jz 6f # address of remainder = 0?
sub rax, r10
sbb rdx, r11 # rdx:rax = remainder" - divisor'
5:
sub rdi, rax
sbb rsi, rdx # rsi:rdi = remainder * 2**(index + 1)
shrd rdi, rsi, cl
shr rsi, cl # rsi:rdi = remainder
mov [r8], rdi
mov [r8+8], rsi
6:
xor edx, edx
mov rax, r9 # rdx:rax = quotient
ret
# dividend >= divisor >= 2**127:
# quotient = 1, remainder = dividend - divisor
.special:
test r8, r8
jz 7f # address of remainder = 0?
sub rdi, r10
sbb rsi, r11 # rsi:rdi = dividend - divisor
# = remainder
mov [r8], rdi
mov [r8+8], rsi
7:
xor eax, eax
xor edx, edx
inc eax # rdx:rax = quotient = 1
ret
.size __udivmodti4, .-__udivmodti4
.type __udivmodti4, @function
.global __udivmodti4
.size __udivti3, .-__udivti3
.type __udivti3, @function
.global __udivti3
.size __umodti3, .-__umodti3
.type __umodti3, @function
.global __umodti3
.end
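A quick plausibility check for such a replacement is to verify the division identity; a minimal test sketch (assuming GCC on AMD64, with the __udivmodti4() signature quoted above):

#include <stdio.h>

typedef unsigned __int128 u128;
extern u128 __udivmodti4(u128 a, u128 b, u128 *r);

int main(void)
{
    u128 a = ((u128) 0x0123456789abcdefULL << 64) | 0xfedcba9876543210ULL;
    u128 b = 0x100000001ULL;
    u128 r, q = __udivmodti4(a, b, &r);
    // check the division identity: a = q * b + r with r < b
    printf("%s\n", (r < b && q * b + r == a) ? "OK" : "FAIL");
    return 0;
}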
Note: exploration of the same unoptimised and awful code in the __divmodti4(), __divti3(), __modti3(), __udivti3() and __umodti3() as well as the __divmoddi4(), __divdi3(), __moddi3(), __udivmoddi4(), __udivdi3() and __umoddi3() functions is left as an exercise to the reader.
[…]
00000000 <__udivmoddi4>:
[…]
  90: 0f bd ef                bsr    %edi,%ebp
  93: 83 f5 1f                xor    $0x1f,%ebp
  96: 75 58                   jne    f0 <__udivmoddi4+0xf0>
[…]
  f0: b8 20 00 00 00          mov    $0x20,%eax
  f5: 89 e9                   mov    %ebp,%ecx
  f7: 89 c2                   mov    %eax,%edx
  f9: d3 e7                   shl    %cl,%edi
  fb: 89 f0                   mov    %esi,%eax
  fd: 29 ea                   sub    %ebp,%edx
  ff: 89 d1                   mov    %edx,%ecx
 101: 89 54 24 1c             mov    %edx,0x1c(%esp)
 105: d3 e8                   shr    %cl,%eax
 107: 89 e9                   mov    %ebp,%ecx
 109: 09 f8                   or     %edi,%eax
 10b: 89 f7                   mov    %esi,%edi
 10d: d3 e7                   shl    %cl,%edi
 10f: 89 44 24 0c             mov    %eax,0xc(%esp)
 113: 89 d1                   mov    %edx,%ecx
 115: 89 d8                   mov    %ebx,%eax
 117: d3 e8                   shr    %cl,%eax
 119: 89 e9                   mov    %ebp,%ecx
 11b: 89 7c 24 14             mov    %edi,0x14(%esp)
 11f: 89 c6                   mov    %eax,%esi
 121: 8b 44 24 10             mov    0x10(%esp),%eax
 125: d3 e3                   shl    %cl,%ebx
 127: 89 d1                   mov    %edx,%ecx
 129: 89 f2                   mov    %esi,%edx
 12b: d3 e8                   shr    %cl,%eax
 12d: 89 e9                   mov    %ebp,%ecx
 12f: 89 c7                   mov    %eax,%edi
 131: 8b 44 24 10             mov    0x10(%esp),%eax
 135: 09 df                   or     %ebx,%edi
 137: d3 e0                   shl    %cl,%eax
 139: 89 44 24 18             mov    %eax,0x18(%esp)
 13d: 89 f8                   mov    %edi,%eax
[…]
For the i386 platform, properly written code uses only 95 instructions in just 236 bytes instead of 168 instructions in 487 bytes:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
# Common "cdecl" calling and naming convention for i386 platform:
# - arguments are pushed on stack in reverse order (from right to left),
# 4-byte aligned;
# - 64-bit integer arguments are passed as pair of 32-bit integer arguments,
# low part below high part;
# - 64-bit integer result is returned in registers EAX (low part) and
# EDX (high part);
# - 32-bit integer or pointer result is returned in register EAX;
# - registers EAX, ECX and EDX are volatile and can be clobbered;
# - registers EBX, ESP, EBP, ESI and EDI must be preserved;
# - function names are prefixed with an underscore.
# NOTE: raises "division exception" when divisor is 0!
.file "udivmoddi4.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+20] = (optional) qword ptr remainder
# [esp+16] = high dword of divisor
# [esp+12] = low dword of divisor
# [esp+8] = high dword of dividend
# [esp+4] = low dword of dividend
__udivmoddi4:
mov ecx, [esp+8] # ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] # edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb .trivial # dividend < divisor?
bsr ecx, edx # ecx = index of most significant '1' bit
# in high dword of divisor
jnz .extended # high dword of divisor <> 0?
# remainder < divisor < 2**32
mov ecx, eax # ecx = (low dword of) divisor
mov eax, [esp+8] # eax = high dword of dividend
cmp eax, ecx
jae .long # high dword of dividend >= divisor?
# perform normal division
.normal:
mov edx, eax # edx = high dword of dividend
mov eax, [esp+4] # edx:eax = dividend
div ecx # eax = (low dword of) quotient,
# edx = (low dword of) remainder
mov ecx, [esp+20] # ecx = address of remainder
test ecx, ecx
jz 0f # address of remainder = 0?
mov [ecx], edx # [ecx] = (low dword of) remainder
xor edx, edx
mov [ecx+4], edx
0:
xor edx, edx # edx:eax = quotient
ret
# perform "long" alias "schoolbook" division
.long:
# xor edx, edx # edx:eax = high dword of dividend
div ecx # eax = high dword of quotient,
# edx = high dword of remainder'
push eax # [esp] = high dword of quotient
mov eax, [esp+8] # eax = low dword of dividend
div ecx # eax = low dword of quotient,
# edx = (low dword of) remainder
mov ecx, [esp+24] # ecx = address of remainder
test ecx, ecx
jz 1f # address of remainder = 0?
mov [ecx], edx # [ecx] = (low dword of) remainder
xor edx, edx
mov [ecx+4], edx
1:
pop edx # edx:eax = quotient
ret
# dividend < divisor: quotient = 0, remainder = dividend
.trivial:
mov ecx, [esp+20] # ecx = address of remainder
test ecx, ecx
jz 2f # address of remainder = 0?
mov eax, [esp+4]
mov edx, [esp+8] # edx:eax = dividend
mov [ecx], eax
mov [ecx+4], edx # [ecx] = remainder = dividend
2:
xor eax, eax
xor edx, edx # edx:eax = quotient = 0
ret
# dividend >= divisor >= 2**32: quotient < 2**32
.extended:
xor ecx, 31 # ecx = number of leading '0' bits
# in (high dword of) divisor
jz .special # divisor >= 2**63?
# perform "extended & adjusted" division
shld edx, eax, cl # edx = divisor / 2**(index + 1)
# = divisor'
# shl eax, cl
push ebx
mov ebx, edx # ebx = divisor'
.ifnotdef JCCLESS
xor eax, eax # eax = high dword of quotient' = 0
mov edx, [esp+12] # edx = high dword of dividend
cmp edx, ebx
jb 3f # high dword of dividend < divisor'?
# high dword of dividend >= divisor':
# subtract divisor' from high dword of dividend to prevent possible
# quotient overflow and set most significant bit of quotient"
sub edx, ebx # edx = high dword of dividend - divisor'
# = high dword of dividend'
inc eax # eax = high dword of quotient' = 1
3:
push eax # [esp] = high dword of quotient'
.else # JCCLESS
mov edx, [esp+12] # edx = high dword of dividend
cmp edx, ebx # CF = (high dword of dividend < divisor')
sbb eax, eax # eax = (high dword of dividend < divisor') ? -1 : 0
inc eax # eax = (high dword of dividend < divisor') ? 0 : 1
# = high dword of quotient'
push eax # [esp] = high dword of quotient'
.if 0
neg eax # eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
imul eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
sub edx, eax # edx = high dword of dividend
# - (high dword of dividend < divisor') ? 0 : divisor'
# = high dword of dividend'
.endif # JCCLESS
mov eax, [esp+12] # eax = low dword of dividend
# = low dword of dividend'
div ebx # eax = dividend' / divisor'
# = low dword of quotient',
# edx = remainder'
pop ebx # ebx = high dword of quotient'
shld ebx, eax, cl # ebx = quotient' / 2**(index + 1)
# = dividend / divisor'
# = quotient"
# shl eax, cl
push ebx # [esp] = quotient"
mov eax, [esp+20] # eax = low dword of divisor
mul ebx # edx:eax = low dword of divisor * quotient"
imul ebx, [esp+24] # ebx = high dword of divisor * quotient"
mov ecx, [esp+16] # ecx = high dword of dividend
sub ecx, ebx # ecx = high dword of dividend
# - high dword of divisor * quotient"
mov ebx, [esp+12] # ebx = low dword of dividend
sub ebx, eax
sbb ecx, edx # ecx:ebx = dividend - divisor * quotient"
# = remainder"
.ifnotdef JCCLESS
pop eax # eax = quotient"
jnb 4f # remainder" >= 0?
# (with borrow, it is off by divisor,
# and quotient" is off by 1)
add ebx, [esp+16]
adc ecx, [esp+20] # ecx:ebx = remainder" + divisor
# = remainder
dec eax # eax = quotient" - 1
# = (low dword of) quotient
4:
.else # JCCLESS
sbb eax, eax # eax = (remainder" < 0) ? -1 : 0
cdq # edx = (remainder" < 0) ? -1 : 0
add [esp], eax # [esp] = quotient" - (remainder" < 0)
# = (low dword of) quotient
and eax, [esp+20]
and edx, [esp+24] # edx:eax = (remainder" < 0) ? divisor : 0
add ebx, eax
adc ecx, edx # ecx:ebx = remainder" + divisor
# = remainder
pop eax # eax = (low dword of) quotient
.endif # JCCLESS
mov edx, [esp+24] # edx = address of remainder
test edx, edx
jz 5f # address of remainder = 0?
mov [edx], ebx
mov [edx+4], ecx # [edx] = remainder
5:
xor edx, edx # edx:eax = quotient
pop ebx
ret
# dividend >= divisor >= 2**63:
# quotient = 1, remainder = dividend - divisor
.special:
or ecx, [esp+20] # ecx = address of remainder
jz 6f # address of remainder = 0?
.if 0
neg edx
neg eax
sbb edx, 0 # edx:eax = -divisor
add eax, [esp+4]
adc edx, [esp+8] # edx:eax = dividend - divisor
# = remainder
.else
mov eax, [esp+4]
mov edx, [esp+8] # edx:eax = dividend
sub eax, [esp+12]
sbb edx, [esp+16] # edx:eax = dividend - divisor
# = remainder
.endif
mov [ecx], eax
mov [ecx+4], edx # [ecx] = remainder
6:
xor eax, eax
xor edx, edx
inc eax # edx:eax = quotient = 1
ret
.size __udivmoddi4, .-__udivmoddi4
.type __udivmoddi4, @function
.global __udivmoddi4
.end
For the __udivdi3() function, properly written code uses only 62 instructions in just 140 bytes instead of 112 instructions in 266 bytes:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: raises "division exception" when divisor is 0!
.file "udivdi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of divisor
# [esp+12] = low dword of divisor
# [esp+8] = high dword of dividend
# [esp+4] = low dword of dividend
__udivdi3:
mov ecx, [esp+8] # ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] # edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb .trivial # dividend < divisor?
bsr ecx, edx # ecx = index of most significant '1' bit
# in high dword of divisor
jnz .extended # high dword of divisor <> 0?
# remainder < divisor < 2**32
mov ecx, eax # ecx = (low dword of) divisor
mov eax, [esp+8] # eax = high dword of dividend
cmp eax, ecx
jae .long # high dword of dividend >= divisor?
# perform normal division
.normal:
mov edx, eax # edx = high dword of dividend
mov eax, [esp+4] # edx:eax = dividend
div ecx # eax = (low dword of) quotient,
# edx = (low dword of) remainder
xor edx, edx # edx:eax = quotient
ret
# perform "long" alias "schoolbook" division
.long:
# xor edx, edx # edx:eax = high dword of dividend
div ecx # eax = high dword of quotient,
# edx = high dword of remainder'
push eax # [esp] = high dword of quotient
mov eax, [esp+8] # eax = low dword of dividend
div ecx # eax = low dword of quotient,
# edx = (low dword of) remainder
pop edx # edx:eax = quotient
ret
# dividend < divisor: quotient = 0
.trivial:
xor eax, eax
xor edx, edx # edx:eax = quotient = 0
ret
# dividend >= divisor >= 2**32: quotient < 2**32
.extended:
xor ecx, 31 # ecx = number of leading '0' bits
# in (high dword of) divisor
jz .special # divisor >= 2**63?
# perform "extended & adjusted" division
shld edx, eax, cl # edx = divisor / 2**(index + 1)
# = divisor'
# shl eax, cl
push ebx
mov ebx, edx # ebx = divisor'
.ifnotdef JCCLESS
xor eax, eax # eax = high dword of quotient' = 0
mov edx, [esp+12] # edx = high dword of dividend
cmp edx, ebx
jb 0f # high dword of dividend < divisor'?
# high dword of dividend >= divisor':
# subtract divisor' from high dword of dividend to prevent possible
# quotient overflow and set most significant bit of quotient"
sub edx, ebx # edx = high dword of dividend - divisor'
# = high dword of dividend'
inc eax # eax = high dword of quotient' = 1
0:
push eax # [esp] = high dword of quotient'
.else # JCCLESS
mov edx, [esp+12] # edx = high dword of dividend
cmp edx, ebx # CF = (high dword of dividend < divisor')
sbb eax, eax # eax = (high dword of dividend < divisor') ? -1 : 0
inc eax # eax = (high dword of dividend < divisor') ? 0 : 1
# = high dword of quotient'
push eax # [esp] = high dword of quotient'
.if 0
neg eax # eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
imul eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
sub edx, eax # edx = high dword of dividend
# - (high dword of dividend < divisor') ? 0 : divisor'
# = high dword of dividend'
.endif # JCCLESS
mov eax, [esp+12] # eax = low dword of dividend
# = low dword of dividend'
div ebx # eax = dividend' / divisor'
# = low dword of quotient',
# edx = remainder'
pop ebx # ebx = high dword of quotient'
shld ebx, eax, cl # ebx = quotient' / 2**(index + 1)
# = dividend / divisor'
# = quotient"
# shl eax, cl
mov eax, [esp+16] # eax = low dword of divisor
mul ebx # edx:eax = low dword of divisor * quotient"
.ifnotdef JCCLESS
mov ecx, [esp+20] # ecx = high dword of divisor
imul ecx, ebx # ecx = high dword of divisor * quotient"
add edx, ecx # edx:eax = divisor * quotient"
jc 1f # divisor * quotient" >= 2**64?
mov ecx, [esp+12] # ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx # CF = (dividend < divisor * quotient")
# = (remainder" < 0)
1:
sbb eax, eax # eax = (remainder" < 0) ? -1 : 0
add eax, ebx # eax = quotient" - (remainder" < 0)
# = (low dword of) quotient
xor edx, edx # edx:eax = quotient
.else # JCCLESS
mov ecx, [esp+12] # ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx # ecx:... = dividend
# - low dword of divisor * quotient"
mov eax, [esp+20] # eax = high dword of divisor
imul eax, ebx # eax = high dword of divisor * quotient"
.if 0
sub ecx, eax # ecx:... = dividend - divisor * quotient"
# = remainder"
sbb eax, eax # eax = (remainder" < 0) ? -1 : 0
add eax, ebx # eax = quotient" - (remainder" < 0)
# = (low dword of) quotient
xor edx, edx # edx:eax = quotient
.else
xor edx, edx # edx = high dword of quotient = 0
sub ecx, eax # ecx:... = dividend - divisor * quotient"
# = remainder"
mov eax, ebx # eax = quotient"
sbb eax, edx # eax = quotient" - (remainder" < 0)
# = (low dword of) quotient
.endif
.endif # JCCLESS
pop ebx
ret
# dividend >= divisor >= 2**63: quotient = 1
.special:
xor eax, eax
xor edx, edx
inc eax # edx:eax = quotient = 1
ret
.size __udivdi3, .-__udivdi3
.type __udivdi3, @function
.global __udivdi3
.end
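The reasoning behind the .extended path above can be condensed into portable C: divide by the truncated, normalised high part of the divisor, shift the result down, and correct the possible off-by-one. The following sketch is mine (udiv64() is a hypothetical name, not a libgcc function); 64-bit C arithmetic stands in for the 32-bit DIV instruction, so the quotient-overflow adjustment of the assembly is not needed here:

#include <stdint.h>

// Precondition: high dword of d is non-zero, i.e. d >= 2**32.
uint64_t udiv64(uint64_t n, uint64_t d)
{
    if (n < d)
        return 0;                                  // trivial case

    int zeros = __builtin_clzll(d);                // leading '0' bits of d, 0..31 here
    if (zeros == 0)
        return 1;                                  // d >= 2**63 and n >= d: quotient is 1

    uint32_t d1 = (uint32_t) (d >> (32 - zeros));  // divisor': normalised high part of d
    uint64_t q  = (n / d1) >> (32 - zeros);        // quotient": n / d, possibly one too big
    if ((unsigned __int128) q * d > n)             // remainder" negative?
        q--;                                       // correct the off-by-one
    return q;
}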
For the __umoddi3() function, properly written code uses only 64 instructions in just 161 bytes instead of 138 instructions in 333 bytes:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: raises "division exception" when divisor is 0!
.file "umoddi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of divisor
# [esp+12] = low dword of divisor
# [esp+8] = high dword of dividend
# [esp+4] = low dword of dividend
__umoddi3:
mov ecx, [esp+8] # ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] # edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb .trivial # dividend < divisor?
bsr ecx, edx # ecx = index of most significant '1' bit
# in high dword of divisor
jnz .extended # high dword of divisor <> 0?
# remainder < divisor < 2**32
mov ecx, eax # ecx = (low dword of) divisor
mov eax, [esp+8] # eax = high dword of dividend
cmp eax, ecx
jae .long # high dword of dividend >= divisor?
# perform normal division
.normal:
mov edx, eax # edx = high dword of dividend
mov eax, [esp+4] # edx:eax = dividend
div ecx # eax = (low dword of) quotient,
# edx = (low dword of) remainder
mov eax, edx # eax = (low dword of) remainder
xor edx, edx # edx:eax = remainder
ret
# perform "long" alias "schoolbook" division
.long:
# xor edx, edx # edx:eax = high dword of dividend
div ecx # eax = high dword of quotient,
# edx = high dword of remainder'
mov eax, [esp+4] # eax = low dword of dividend
div ecx # eax = low dword of quotient,
# edx = (low dword of) remainder
mov eax, edx # eax = (low dword of) remainder
xor edx, edx # edx:eax = remainder
ret
# dividend < divisor: remainder = dividend
.trivial:
mov eax, [esp+4]
mov edx, [esp+8] # edx:eax = remainder = dividend
ret
# dividend >= divisor >= 2**32: quotient < 2**32
.extended:
xor ecx, 31 # ecx = number of leading '0' bits
# in (high dword of) divisor
jz .special # divisor >= 2**63?
# perform "extended & adjusted" division
shld edx, eax, cl # edx = divisor / 2**(index + 1)
# = divisor'
# shl eax, cl
push ebx
mov ebx, edx # ebx = divisor'
.ifnotdef JCCLESS
xor eax, eax # eax = high dword of quotient' = 0
mov edx, [esp+12] # edx = high dword of dividend
cmp edx, ebx
jb 0f # high dword of dividend < divisor'?
# high dword of dividend >= divisor':
# subtract divisor' from high dword of dividend to prevent possible
# quotient overflow and set most significant bit of quotient"
sub edx, ebx # edx = high dword of dividend - divisor'
# = high dword of dividend'
inc eax # eax = high dword of quotient' = 1
0:
push eax # [esp] = high dword of quotient'
.else # JCCLESS
mov edx, [esp+12] # edx = high dword of dividend
sub edx, ebx # CF = (high dword of dividend < divisor')
sbb eax, eax # eax = (high dword of dividend < divisor') ? -1 : 0
inc eax # eax = (high dword of dividend < divisor') ? 0 : 1
# = high dword of quotient'
push eax # [esp] = high dword of quotient'
.if 0
neg eax # eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
imul eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
sub edx, eax # edx = high dword of dividend
# - (high dword of dividend < divisor') ? 0 : divisor'
# = high dword of dividend'
.endif # JCCLESS
mov eax, [esp+12] # eax = low dword of dividend
# = low dword of dividend'
div ebx # eax = dividend' / divisor'
# = low dword of quotient',
# edx = remainder'
pop ebx # ebx = high dword of quotient'
shld ebx, eax, cl # ebx = quotient' / 2**(index + 1)
# = dividend / divisor'
# = quotient"
# shl eax, cl
mov eax, [esp+16] # eax = low dword of divisor
mul ebx # edx:eax = low dword of divisor * quotient"
imul ebx, [esp+20] # ebx = high dword of divisor * quotient"
mov ecx, [esp+12] # ecx = high dword of dividend
sub ecx, ebx # ecx = high dword of dividend
# - high dword of divisor * quotient"
mov ebx, [esp+8] # ebx = low dword of dividend
sub ebx, eax
sbb ecx, edx # ecx:ebx = dividend - divisor * quotient"
# = remainder"
.ifnotdef JCCLESS
jnb 1f # remainder" >= 0?
# (with borrow, it is off by divisor,
# and quotient" is off by 1)
add ebx, [esp+16]
adc ecx, [esp+20] # ecx:ebx = remainder" + divisor
# = remainder
1:
mov eax, ebx
mov edx, ecx # edx:eax = remainder
.else # JCCLESS
sbb eax, eax # eax = (remainder" < 0) ? -1 : 0
cdq # edx = (remainder" < 0) ? -1 : 0
and eax, [esp+16]
and edx, [esp+20] # edx:eax = (remainder" < 0) ? divisor : 0
add eax, ebx
adc edx, ecx # edx:eax = remainder" + divisor
# = remainder
.endif # JCCLESS
pop ebx
ret
# dividend >= divisor >= 2**63: remainder = dividend - divisor
.special:
.if 0
mov eax, [esp+4]
mov edx, [esp+8] # edx:eax = dividend
sub eax, [esp+12]
sbb edx, [esp+16] # edx:eax = dividend - divisor
# = remainder
.else
neg edx
neg eax
sbb edx, ecx # edx:eax = -divisor
add eax, [esp+4]
adc edx, [esp+8] # edx:eax = dividend - divisor
# = remainder
.endif
ret
.size __umoddi3, .-__umoddi3
.type __umoddi3, @function
.global __umoddi3
.end
For the __divdi3() function, properly written code uses only 86 instructions in just 199 bytes instead of 132 instructions in 330 bytes:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: raises "division exception" when divisor is 0!
.file "divdi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of divisor
# [esp+12] = low dword of divisor
# [esp+8] = high dword of dividend
# [esp+4] = low dword of dividend
__divdi3:
# determine sign of dividend and compute |dividend|
mov eax, [esp+8]
mov ecx, [esp+4] # eax:ecx = dividend
cdq # edx = (dividend < 0) ? -1 : 0
xor ecx, edx
xor eax, edx # eax:ecx = (dividend < 0) ? ~dividend : dividend
sub ecx, edx
sbb eax, edx # eax:ecx = (dividend < 0) ? -dividend : dividend
# = |dividend|
mov [esp+4], ecx # write |dividend| back on stack
mov [esp+8], eax
push edx # [esp] = (dividend < 0) ? -1 : 0
# determine sign of divisor and compute |divisor|
mov edx, [esp+20]
mov eax, [esp+16] # edx:eax = divisor
mov ecx, edx
sar ecx, 31 # ecx = (divisor < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx # edx:eax = (divisor < 0) ? ~divisor : divisor
sub eax, ecx
sbb edx, ecx # edx:eax = (divisor < 0) ? -divisor : divisor
# = |divisor|
mov [esp+16], eax # write |divisor| back on stack
mov [esp+20], edx
xor [esp], ecx # [esp] = (dividend < 0) ^ (divisor < 0) ? -1 : 0
# = (quotient < 0) ? -1 : 0
mov ecx, [esp+12] # ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx
jb .trivial # dividend < divisor?
bsr ecx, edx # ecx = index of most significant '1' bit
# in high dword of divisor
jnz .extended # high dword of divisor <> 0?
# remainder < divisor < 2**32
mov ecx, eax # ecx = (low dword of) divisor
mov eax, [esp+12] # eax = high dword of dividend
cmp eax, ecx
jae .long # high dword of dividend >= divisor?
# perform normal division
.normal:
mov edx, eax # edx = high dword of dividend
mov eax, [esp+8] # edx:eax = dividend
div ecx # eax = (low dword of) quotient,
# edx = (low dword of) remainder
# xor edx, edx # edx:eax = |quotient|
jmp .quotient
# perform "long" alias "schoolbook" division
.long:
# xor edx, edx # edx:eax = high dword of dividend
div ecx # eax = high dword of quotient,
# edx = high dword of remainder'
push eax # [esp] = high dword of quotient
mov eax, [esp+12] # eax = low dword of dividend
div ecx # eax = low dword of quotient,
# edx = (low dword of) remainder
pop edx # edx:eax = |quotient|
pop ecx # ecx = (quotient < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx
sub eax, ecx
sbb edx, ecx # edx:eax = quotient
ret
# dividend < divisor: quotient = 0
.trivial:
pop ecx # ecx = (quotient < 0) ? -1 : 0
xor eax, eax
xor edx, edx # edx:eax = quotient = 0
ret
# 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
.extended:
xor ecx, 31 # ecx = number of leading '0' bits
# in (high dword of) divisor
jz .special # divisor = 2**63?
# perform "extended & adjusted" division
shld edx, eax, cl # edx = divisor / 2**(index + 1)
# = divisor'
# shl eax, cl
push ebx
mov ebx, edx # ebx = divisor'
.ifnotdef JCCLESS
xor eax, eax # eax = high dword of quotient' = 0
mov edx, [esp+16] # edx = high dword of dividend
cmp edx, ebx
jb 0f # high dword of dividend < divisor'?
# high dword of dividend >= divisor':
# subtract divisor' from high dword of dividend to prevent possible
# quotient overflow and set most significant bit of quotient"
sub edx, ebx # edx = high dword of dividend - divisor'
# = high dword of dividend'
inc eax # eax = high dword of quotient' = 1
0:
push eax # [esp] = high dword of quotient'
.else # JCCLESS
mov edx, [esp+16] # edx = high dword of dividend
cmp edx, ebx # CF = (high dword of dividend < divisor')
sbb eax, eax # eax = (high dword of dividend < divisor') ? -1 : 0
inc eax # eax = (high dword of dividend < divisor') ? 0 : 1
# = high dword of quotient'
push eax # [esp] = high dword of quotient'
.if 0
neg eax # eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
imul eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
sub edx, eax # edx = high dword of dividend
# - (high dword of dividend < divisor') ? 0 : divisor'
# = high dword of dividend'
.endif # JCCLESS
mov eax, [esp+16] # eax = low dword of dividend
# = low dword of dividend'
div ebx # eax = dividend' / divisor'
# = low dword of quotient',
# edx = remainder'
pop ebx # ebx = high dword of quotient'
shld ebx, eax, cl # ebx = quotient' / 2**(index + 1)
# = dividend / divisor'
# = quotient"
# shl eax, cl
mov eax, [esp+20] # eax = low dword of divisor
mul ebx # edx:eax = low dword of divisor * quotient"
mov ecx, [esp+24] # ecx = high dword of divisor
imul ecx, ebx # ecx = high dword of divisor * quotient"
add edx, ecx # edx:eax = divisor * quotient"
# jc 1f # divisor * quotient" >= 2**64?
mov ecx, [esp+16] # ecx = high dword of dividend
cmp [esp+12], eax
sbb ecx, edx # CF = (dividend < divisor * quotient")
# = (remainder" < 0)
1:
sbb eax, eax # eax = (remainder" < 0) ? -1 : 0
add eax, ebx # eax = quotient" - (remainder" < 0)
# = (low dword of) |quotient|
# xor edx, edx # edx:eax = |quotient|
pop ebx
.quotient:
pop edx # edx = (quotient < 0) ? -1 : 0
xor eax, edx
sub eax, edx
sbb edx, edx # edx:eax = quotient
ret
# dividend = divisor = -2**63: quotient = 1
.special:
pop eax # eax = sign of quotient = 0
inc eax # eax = (low dword of) quotient = 1
xor edx, edx # edx:eax = quotient = 1
ret
.size __divdi3, .-__divdi3
.type __divdi3, @function
.global __divdi3
.end
For the __moddi3() function, properly written code uses only 84 instructions in just 204 bytes instead of 148 instructions in 401 bytes:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: raises "division exception" when divisor is 0!
.file "moddi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of divisor
# [esp+12] = low dword of divisor
# [esp+8] = high dword of dividend
# [esp+4] = low dword of dividend
__moddi3:
# determine sign of dividend and compute |dividend|
mov eax, [esp+8]
mov ecx, [esp+4] # eax:ecx = dividend
cdq # edx = (dividend < 0) ? -1 : 0
xor ecx, edx
xor eax, edx # ecx:eax = (dividend < 0) ? ~dividend : dividend
sub ecx, edx
sbb eax, edx # ecx:eax = (dividend < 0) ? -dividend : dividend
# = |dividend|
mov [esp+4], ecx # write |dividend| back on stack
mov [esp+8], eax
push edx # [esp] = (dividend < 0) ? -1 : 0
# determine sign of divisor and compute |divisor|
mov edx, [esp+20]
mov eax, [esp+16] # edx:eax = divisor
mov ecx, edx
sar ecx, 31 # ecx = (divisor < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx # edx:eax = (divisor < 0) ? ~divisor : divisor
sub eax, ecx
sbb edx, ecx # edx:eax = (divisor < 0) ? -divisor : divisor
# = |divisor|
mov [esp+16], eax # write |divisor| back on stack
mov [esp+20], edx
mov ecx, [esp+12] # ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx
jb .trivial # dividend < divisor?
bsr ecx, edx # ecx = index of most significant '1' bit
# in high dword of divisor
jnz .extended # high dword of divisor <> 0?
# remainder < divisor < 2**32
mov ecx, eax # ecx = (low dword of) divisor
mov eax, [esp+12] # eax = high dword of dividend
cmp eax, ecx
jae .long # high dword of dividend >= divisor?
# perform normal division
.normal:
mov edx, eax # edx = high dword of dividend
jmp .next
# perform "long" alias "schoolbook" division
.long:
# xor edx, edx # edx:eax = high dword of dividend
div ecx # eax = high dword of quotient,
# edx = high dword of remainder'
.next:
mov eax, [esp+8] # eax = low dword of dividend
div ecx # eax = low dword of quotient,
# edx = (low dword of) remainder
mov eax, edx # eax = (low dword of) |remainder|
# xor edx, edx # edx:eax = |remainder|
pop edx # edx = (remainder < 0) ? -1 : 0
xor eax, edx
sub eax, edx
sbb edx, edx # edx:eax = remainder
ret
# dividend < divisor: remainder = dividend
.trivial:
mov eax, [esp+8]
mov edx, [esp+12] # edx:eax = |remainder| = |dividend|
jmp .remainder
# 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
.extended:
xor ecx, 31 # ecx = number of leading '0' bits
# in (high dword of) divisor
jz .special # divisor = 2**63?
# perform "extended & adjusted" division
shld edx, eax, cl # edx = divisor / 2**(index + 1)
# = divisor'
# shl eax, cl
push ebx
mov ebx, edx # ebx = divisor'
.ifnotdef JCCLESS
xor eax, eax # eax = high dword of quotient' = 0
mov edx, [esp+16] # edx = high dword of dividend
cmp edx, ebx
jb 0f # high dword of dividend < divisor'?
# high dword of dividend >= divisor':
# subtract divisor' from high dword of dividend to prevent possible
# quotient overflow and set most significant bit of quotient"
sub edx, ebx # edx = high dword of dividend - divisor'
# = high dword of dividend'
inc eax # eax = high dword of quotient' = 1
0:
push eax # [esp] = high dword of quotient'
.else # JCCLESS
mov edx, [esp+16] # edx = high dword of dividend
sub edx, ebx # edx = high dword of dividend - divisor'
sbb eax, eax # eax = (high dword of dividend < divisor') ? -1 : 0
inc eax # eax = (high dword of dividend < divisor') ? 0 : 1
# = high dword of quotient'
push eax # [esp] = high dword of quotient'
.if 0
neg eax # eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
imul eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
sub edx, eax # edx = high dword of dividend
# - (high dword of dividend < divisor') ? 0 : divisor'
# = high dword of dividend'
.endif # JCCLESS
mov eax, [esp+16] # eax = low dword of dividend
# = low dword of dividend'
div ebx # eax = dividend' / divisor'
# = low dword of quotient',
# edx = remainder'
pop ebx # ebx = high dword of quotient'
shld ebx, eax, cl # ebx = quotient' / 2**(index + 1)
# = dividend / divisor'
# = quotient"
# shl eax, cl
mov eax, [esp+20] # eax = low dword of divisor
mul ebx # edx:eax = low dword of divisor * quotient"
imul ebx, [esp+24] # ebx = high dword of divisor * quotient"
add edx, ebx # edx:eax = divisor * quotient"
mov ecx, [esp+12]
mov ebx, [esp+16] # ebx:ecx = dividend
sub ecx, eax
sbb ebx, edx # ebx:ecx = dividend - divisor * quotient"
# = remainder"
.ifnotdef JCCLESS
jnb 1f # remainder" >= 0?
# (with borrow, it is off by divisor
# and quotient" is off by 1)
add ecx, [esp+20]
adc ebx, [esp+24] # ebx:ecx = remainder" + divisor
# = remainder
1:
mov eax, ecx
mov edx, ebx # edx:eax = |remainder|
.else # JCCLESS
sbb eax, eax # eax = (remainder" < 0) ? -1 : 0
cdq # edx = (remainder" < 0) ? -1 : 0
and eax, [esp+20]
and edx, [esp+24] # edx:eax = (remainder" < 0) ? divisor : 0
add eax, ecx
adc edx, ebx # edx:eax = remainder" + divisor
# = |remainder|
.endif # JCCLESS
pop ebx
.remainder:
pop ecx # ecx = (remainder < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx
sub eax, ecx
sbb edx, ecx # edx:eax = remainder
ret
# dividend = divisor = -2**63: remainder = 0
.special:
pop eax # eax = sign of remainder = -1
inc eax
xor edx, edx # edx:eax = remainder = 0
ret
.size __moddi3, .-__moddi3
.type __moddi3, @function
.global __moddi3
.end
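The sign handling that __divdi3() and __moddi3() share, i.e. divide the magnitudes and negate the result when the signs differ, reads as follows in C; a minimal sketch (sdiv64() is my name, not a libgcc function):

#include <stdint.h>

int64_t sdiv64(int64_t n, int64_t d)
{
    uint64_t un = n < 0 ? -(uint64_t) n : (uint64_t) n;  // |dividend|
    uint64_t ud = d < 0 ? -(uint64_t) d : (uint64_t) d;  // |divisor|
    uint64_t uq = un / ud;                               // |quotient|
    // negate the quotient when dividend and divisor have different signs
    return (int64_t) ((n < 0) != (d < 0) ? -uq : uq);
}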
__ashl{s,d,t}i3() Functions

libgcc provides the __ashlsi3(), __ashldi3() and __ashlti3() functions for integer shift operations:

- Runtime Function: int __ashlsi3 (int a, int b)
- Runtime Function: long __ashldi3 (long a, int b)
- Runtime Function: long long __ashlti3 (long long a, int b)

These functions return the result of shifting a left by b bits.

For the AMD64 target platform, GCC compiles __ashlti3() to the following unoptimised and awful code:
[…]
0000000000000000 <__ashlti3>:
   0: 48 85 d2                test   %rdx,%rdx
   3: 74 3b                   je     40 <__ashlti3+0x40>
   5: 41 b8 40 00 00 00       mov    $0x40,%r8d
   b: 49 29 d0                sub    %rdx,%r8
   e: 4d 85 c0                test   %r8,%r8
  11: 7e 1d                   jle    30 <__ashlti3+0x30>
  13: 89 d1                   mov    %edx,%ecx
  15: 48 89 f8                mov    %rdi,%rax
  18: 48 d3 e6                shl    %cl,%rsi
  1b: 48 d3 e0                shl    %cl,%rax
  1e: 44 89 c1                mov    %r8d,%ecx
  21: 48 89 f2                mov    %rsi,%rdx
  24: 48 d3 ef                shr    %cl,%rdi
  27: 48 09 fa                or     %rdi,%rdx
  2a: c3                      retq
  2b: 0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  30: 44 89 c1                mov    %r8d,%ecx
  33: 31 c0                   xor    %eax,%eax
  35: f7 d9                   neg    %ecx
  37: 48 d3 e7                shl    %cl,%rdi
  3a: 48 89 fa                mov    %rdi,%rdx
  3d: c3                      retq
  3e: 66 90                   xchg   %ax,%ax
  40: 48 89 f8                mov    %rdi,%rax
  43: 48 89 f2                mov    %rsi,%rdx
  46: c3                      retq
[…]

Oops: the TEST instruction at address e is superfluous.
Properly written branch-free code uses only 11 instructions in just 32 bytes instead of 26 instructions in 71 bytes and handles shift counts greater than 127:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "ashlti3.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = value
# rdx = count
__ashlti3:
mov ecx, edx
xor edx, edx
xor eax, eax
shld rsi, rdi, cl
shl rdi, cl # rsi:rdi = value << count % 64
cmp ecx, 127
cmovbe rdx, rdi
cmp ecx, 63
cmovbe rdx, rsi
cmovbe rax, rdi # rdx:rax = value << count
ret
.size __ashlti3, .-__ashlti3
.type __ashlti3, @function
.global __ashlti3
.end
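The same function can also be written with conditional branches instead of CMOVcc, for example: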
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "ashlti3.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = value
# rdx = count
__ashlti3:
cmp edx, 127
ja .zero # count > 127?
mov ecx, edx
cmp edx, 63
ja .high # count > 63?
.low:
shld rsi, rdi, cl
shl rdi, cl # rsi:rdi = value << count % 64
mov rdx, rsi
mov rax, rdi # rdx:rax = value << count
ret
.high:
shl rdi, cl
mov rdx, rdi
xor eax, eax # rdx:rax = value << count
ret
.zero:
xor edx, edx
xor eax, eax # rdx:rax = 0
# = value << count
ret
.size __ashlti3, .-__ashlti3
.type __ashlti3, @function
.global __ashlti3
.end
Note: exploration of the equally bad code generated
for the __ashldi3()
function on the i386
platform is left as an exercise to the reader.
[…]
00000000 <__ashldi3>:
   0: 56                      push   %esi
   1: 53                      push   %ebx
   2: 8b 5c 24 0c             mov    0xc(%esp),%ebx
   6: 8b 4c 24 14             mov    0x14(%esp),%ecx
   a: 8b 54 24 10             mov    0x10(%esp),%edx
   e: 89 d8                   mov    %ebx,%eax
  10: 85 c9                   test   %ecx,%ecx
  12: 74 15                   je     29 <__ashldi3+0x29>
  14: be 20 00 00 00          mov    $0x20,%esi
  19: 29 ce                   sub    %ecx,%esi
  1b: 85 f6                   test   %esi,%esi
  1d: 7e 11                   jle    30 <__ashldi3+0x30>
  1f: d3 e2                   shl    %cl,%edx
  21: d3 e0                   shl    %cl,%eax
  23: 89 f1                   mov    %esi,%ecx
  25: d3 eb                   shr    %cl,%ebx
  27: 09 da                   or     %ebx,%edx
  29: 5b                      pop    %ebx
  2a: 5e                      pop    %esi
  2b: c3                      ret
  2c: 8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
  30: 83 e9 20                sub    $0x20,%ecx
  33: 31 c0                   xor    %eax,%eax
  35: d3 e3                   shl    %cl,%ebx
  37: 89 da                   mov    %ebx,%edx
  39: 5b                      pop    %ebx
  3a: 5e                      pop    %esi
  3b: c3                      ret
[…]
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: applies shift count modulo 64!
.file "ashldi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+12] = count
# [esp+8] = high dword of value
# [esp+4] = low dword of value
__ashldi3:
mov ecx, [esp+12]
mov eax, [esp+4]
test cl, 32
jnz 0f # count > 31?
mov edx, [esp+8]
shld edx, eax, cl
shl eax, cl # edx:eax = value << count
ret
0:
mov edx, eax
shl edx, cl
xor eax, eax # edx:eax = value << count
ret
.size __ashldi3, .-__ashldi3
.type __ashldi3, @function
.global __ashldi3
.end
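In portable C, the double-dword shift implemented above reads as follows; this sketch is mine and mirrors the two cases of the assembly, including the shift count taken modulo 64:

unsigned long long ashldi3(unsigned long long value, int count)
{
    unsigned int lo = (unsigned int) value;
    unsigned int hi = (unsigned int) (value >> 32);

    count &= 63;                     // apply shift count modulo 64
    if (count & 32) {                // count 32..63: low dword moves to high dword
        hi = lo << (count & 31);
        lo = 0;
    } else if (count != 0) {         // count 1..31: SHLD-style double shift
        hi = (hi << count) | (lo >> (32 - count));
        lo <<= count;
    }
    return (unsigned long long) hi << 32 | lo;
}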
__ashr{s,d,t}i3() Functions

libgcc provides the __ashrsi3(), __ashrdi3() and __ashrti3() functions for integer shift operations:

- Runtime Function: int __ashrsi3 (int a, int b)
- Runtime Function: long __ashrdi3 (long a, int b)
- Runtime Function: long long __ashrti3 (long long a, int b)

These functions return the result of arithmetically shifting a right by b bits.

For the AMD64 target platform, GCC compiles __ashrti3() to the following unoptimised and awful code:
[…]
0000000000000000 <__ashrti3>:
   0: 48 89 d1                mov    %rdx,%rcx
   3: 48 85 d2                test   %rdx,%rdx
   6: 74 38                   je     40 <__ashrti3+0x40>
   8: 41 b8 40 00 00 00       mov    $0x40,%r8d
   e: 49 29 d0                sub    %rdx,%r8
  11: 48 89 f2                mov    %rsi,%rdx
  14: 4d 85 c0                test   %r8,%r8
  17: 7e 17                   jle    30 <__ashrti3+0x30>
  19: 48 d3 ef                shr    %cl,%rdi
  1c: 48 d3 fa                sar    %cl,%rdx
  1f: 48 89 f0                mov    %rsi,%rax
  22: 44 89 c1                mov    %r8d,%ecx
  25: 48 d3 e0                shl    %cl,%rax
  28: 48 09 f8                or     %rdi,%rax
  2b: c3                      retq
  2c: 0f 1f 40 00             nopl   0x0(%rax)
  30: 44 89 c1                mov    %r8d,%ecx
  33: 48 89 f0                mov    %rsi,%rax
  36: 48 c1 fa 3f             sar    $0x3f,%rdx
  3a: f7 d9                   neg    %ecx
  3c: 48 d3 f8                sar    %cl,%rax
  3f: c3                      retq
  40: 48 89 f8                mov    %rdi,%rax
  43: 48 89 f2                mov    %rsi,%rdx
  46: c3                      retq
[…]

Oops: the TEST instruction at address 14 is superfluous.
Properly written branch-free code uses only 12 instructions in just 36 bytes instead of 25 instructions in 71 bytes and handles shift counts greater than 127:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "ashrti3.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = value
# rdx = count
__ashrti3:
mov ecx, edx
mov rax, rsi
cqo
mov rax, rdx
shrd rdi, rsi, cl
sar rsi, cl # rsi:rdi = value >> count % 64
cmp ecx, 127
cmovbe rax, rsi
cmp ecx, 63
cmovbe rdx, rsi
cmovbe rax, rdi # rdx:rax = value >> count
ret
.size __ashrti3, .-__ashrti3
.type __ashrti3, @function
.global __ashrti3
.end
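Again the alternative with conditional branches: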
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "ashrti3.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = value
# rdx = count
__ashrti3:
cmp edx, 127
ja .sign # count > 127?
mov ecx, edx
cmp edx, 63
ja .high # count > 63?
.low:
shrd rdi, rsi, cl
sar rsi, cl # rsi:rdi = value >> count % 64
mov rdx, rsi
mov rax, rdi # rdx:rax = value >> count
ret
.high:
sar rsi, cl
mov rax, rsi
cqo # rdx:rax = value >> count
ret
.sign:
mov rax, rsi
cqo
mov rax, rdx # rdx:rax = (value < 0) ? -1 : 0
# = value >> count
ret
.size __ashrti3, .-__ashrti3
.type __ashrti3, @function
.global __ashrti3
.end
Note: exploration of the equally bad code generated
for the __ashrdi3()
function on the i386
platform is left as an exercise to the reader.
[…]
00000000 <__ashrdi3>:
   0: 56                      push   %esi
   1: 53                      push   %ebx
   2: 8b 5c 24 10             mov    0x10(%esp),%ebx
   6: 8b 4c 24 14             mov    0x14(%esp),%ecx
   a: 8b 44 24 0c             mov    0xc(%esp),%eax
   e: 89 da                   mov    %ebx,%edx
  10: 85 c9                   test   %ecx,%ecx
  12: 74 15                   je     29 <__ashrdi3+0x29>
  14: be 20 00 00 00          mov    $0x20,%esi
  19: 29 ce                   sub    %ecx,%esi
  1b: 85 f6                   test   %esi,%esi
  1d: 7e 11                   jle    30 <__ashrdi3+0x30>
  1f: d3 e8                   shr    %cl,%eax
  21: d3 fa                   sar    %cl,%edx
  23: 89 f1                   mov    %esi,%ecx
  25: d3 e3                   shl    %cl,%ebx
  27: 09 d8                   or     %ebx,%eax
  29: 5b                      pop    %ebx
  2a: 5e                      pop    %esi
  2b: c3                      ret
  2c: 8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
  30: 89 d8                   mov    %ebx,%eax
  32: 83 e9 20                sub    $0x20,%ecx
  35: c1 fa 1f                sar    $0x1f,%edx
  38: 5b                      pop    %ebx
  39: d3 f8                   sar    %cl,%eax
  3b: 5e                      pop    %esi
  3c: c3                      ret
[…]
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: applies shift count modulo 64!
.file "ashrdi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+12] = count
# [esp+8] = high dword of value
# [esp+4] = low dword of value
__ashrdi3:
mov ecx, [esp+12]
mov edx, [esp+8]
test cl, 32
jnz 0f # count > 31?
mov eax, [esp+4]
shrd eax, edx, cl
sar edx, cl # edx:eax = value >> count
ret
0:
mov eax, edx
sar eax, cl
sar edx, 31 # edx:eax = value >> count
ret
.size __ashrdi3, .-__ashrdi3
.type __ashrdi3, @function
.global __ashrdi3
.end
__lshr{s,d,t}i3() Functions

libgcc provides the __lshrsi3(), __lshrdi3() and __lshrti3() functions for integer shift operations:

- Runtime Function: int __lshrsi3 (int a, int b)
- Runtime Function: long __lshrdi3 (long a, int b)
- Runtime Function: long long __lshrti3 (long long a, int b)

These functions return the result of logically shifting a right by b bits.

For the AMD64 target platform, GCC compiles __lshrti3() to the following unoptimised and awful code:
[…]
0000000000000000 <__lshrti3>:
   0: 48 89 d1                mov    %rdx,%rcx
   3: 48 85 d2                test   %rdx,%rdx
   6: 74 38                   je     40 <__lshrti3+0x40>
   8: 41 b8 40 00 00 00       mov    $0x40,%r8d
   e: 49 29 d0                sub    %rdx,%r8
  11: 4d 85 c0                test   %r8,%r8
  14: 7e 1a                   jle    30 <__lshrti3+0x30>
  16: 48 89 f2                mov    %rsi,%rdx
  19: 48 d3 ef                shr    %cl,%rdi
  1c: 48 89 f0                mov    %rsi,%rax
  1f: 48 d3 ea                shr    %cl,%rdx
  22: 44 89 c1                mov    %r8d,%ecx
  25: 48 d3 e0                shl    %cl,%rax
  28: 48 09 f8                or     %rdi,%rax
  2b: c3                      retq
  2c: 0f 1f 40 00             nopl   0x0(%rax)
  30: 44 89 c1                mov    %r8d,%ecx
  33: 48 89 f0                mov    %rsi,%rax
  36: 31 d2                   xor    %edx,%edx
  38: f7 d9                   neg    %ecx
  3a: 48 d3 e8                shr    %cl,%rax
  3d: c3                      retq
  3e: 66 90                   xchg   %ax,%ax
  40: 48 89 f8                mov    %rdi,%rax
  43: 48 89 f2                mov    %rsi,%rdx
  46: c3                      retq
[…]

Oops: the TEST instruction at address 11 is superfluous.
Properly written branch-free code uses only 11 instructions in just 32 bytes instead of 25 instructions in 71 bytes and handles shift counts greater than 127:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "lshrti3.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = value
# rdx = count
__lshrti3:
mov ecx, edx
xor edx, edx
xor eax, eax
shrd rdi, rsi, cl
shr rsi, cl # rsi:rdi = value >> count % 64
cmp ecx, 127
cmovbe rax, rsi
cmp ecx, 63
cmovbe rdx, rsi
cmovbe rax, rdi # rdx:rax = value >> count
ret
.size __lshrti3, .-__lshrti3
.type __lshrti3, @function
.global __lshrti3
.end
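And once more the branching alternative: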
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "lshrti3.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = value
# rdx = count
__lshrti3:
cmp edx, 127
ja .zero # count > 127?
mov ecx, edx
cmp edx, 63
ja .high # count > 63?
.low:
shrd rdi, rsi, cl
shr rsi, cl # rsi:rdi = value >> count % 64
mov rdx, rsi
mov rax, rdi # rdx:rax = value >> count
ret
.high:
shr rsi, cl
xor edx, edx
mov rax, rsi # rdx:rax = value >> count
ret
.zero:
xor edx, edx
xor eax, eax # rdx:rax = 0
# = value >> count
ret
.size __lshrti3, .-__lshrti3
.type __lshrti3, @function
.global __lshrti3
.end
Note: exploration of the equally bad code generated
for the __lshrdi3()
function on the i386
platform is left as an exercise to the reader.
Note: the __lshldi3()
and
__lshlti3()
functions are equal to the
__ashldi3()
and
__ashlti3()
functions presented above.
[…]
00000000 <__lshrdi3>:
   0: 56                      push   %esi
   1: 53                      push   %ebx
   2: 8b 5c 24 10             mov    0x10(%esp),%ebx
   6: 8b 4c 24 14             mov    0x14(%esp),%ecx
   a: 8b 44 24 0c             mov    0xc(%esp),%eax
   e: 89 da                   mov    %ebx,%edx
  10: 85 c9                   test   %ecx,%ecx
  12: 74 15                   je     29 <__lshrdi3+0x29>
  14: be 20 00 00 00          mov    $0x20,%esi
  19: 29 ce                   sub    %ecx,%esi
  1b: 85 f6                   test   %esi,%esi
  1d: 7e 11                   jle    30 <__lshrdi3+0x30>
  1f: d3 e8                   shr    %cl,%eax
  21: d3 ea                   shr    %cl,%edx
  23: 89 f1                   mov    %esi,%ecx
  25: d3 e3                   shl    %cl,%ebx
  27: 09 d8                   or     %ebx,%eax
  29: 5b                      pop    %ebx
  2a: 5e                      pop    %esi
  2b: c3                      ret
  2c: 8d 74 26 00             lea    0x0(%esi,%eiz,1),%esi
  30: 89 d8                   mov    %ebx,%eax
  32: 83 e9 20                sub    $0x20,%ecx
  35: 31 d2                   xor    %edx,%edx
  37: 5b                      pop    %ebx
  38: d3 e8                   shr    %cl,%eax
  3a: 5e                      pop    %esi
  3b: c3                      ret
[…]
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: applies shift count modulo 64!
.file "lshrdi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+12] = count
# [esp+8] = high dword of value
# [esp+4] = low dword of value
__lshrdi3:
mov ecx, [esp+12]
mov edx, [esp+8]
test cl, 32
jnz 0f # count > 31?
mov eax, [esp+4]
shrd eax, edx, cl
shr edx, cl # edx:eax = value >> count
ret
0:
mov eax, edx
shr eax, cl
xor edx, edx # edx:eax = value >> count
ret
.size __lshrdi3, .-__lshrdi3
.type __lshrdi3, @function
.global __lshrdi3
.end
__rotl{s,d,t}i3() and __rotr{s,d,t}i3() Functions

libgcc fails to provide the __rotlsi3(), __rotldi3() and __rotlti3() as well as the __rotrsi3(), __rotrdi3() and __rotrti3() functions for (unsigned) integer rotation alias circular shift.
Create the text file case5.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
unsigned int __rotlsi3(unsigned int value, int count) {
return (value << (31 & count))
| (value >> (31 & -count));
}
unsigned int __rotrsi3(unsigned int value, int count) {
return (value >> (31 & count))
| (value << (31 & -count));
}
unsigned long long __rotldi3(unsigned long long value, int count) {
return (value << (63 & count))
| (value >> (63 & -count));
}
unsigned long long __rotrdi3(unsigned long long value, int count) {
return (value >> (63 & count))
| (value << (63 & -count));
}
#ifdef __amd64__
__uint128_t __rotlti3(__uint128_t value, int count) {
return (value << (127 & count))
| (value >> (127 & -count));
}
__uint128_t __rotrti3(__uint128_t value, int count) {
#ifndef OPTIMIZE
return (value >> (127 & count))
| (value << (127 & -count));
#else
__asm__("movq\t%[low], %%rax\n\t"
"shrdq\t%%cl, %[high], %%rax\n\t"
"shrdq\t%%cl, %[low], %[high]\n\t"
"movq\t%[high], %%rdx\n\t"
"test\t$64, %%cl\n\t"
"cmovnz\t%%rax, %%rdx\n\t"
"cmovnz\t%[high], %%rax"
:"=A" (value)
:"c" (count),
[low] "r" ((unsigned long long) (value & ~0ULL)),
[high] "r" ((unsigned long long) (value >> 64))
:"cc");
return value;
#endif // OPTIMIZE
}
#endif // __amd64__
Compile the source file case5.c with GCC, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -o- -Os -S case5.c
[…]
__rotlsi3:
        movl    %edi, %eax
        movl    %esi, %ecx
        roll    %cl, %eax
        ret
[…]
__rotrsi3:
        movl    %edi, %eax
        movl    %esi, %ecx
        rorl    %cl, %eax
        ret
[…]
__rotldi3:
        movq    %rdi, %rax
        movl    %esi, %ecx
        rolq    %cl, %rax
        ret
[…]
__rotrdi3:
        movq    %rdi, %rax
        movl    %esi, %ecx
        rorq    %cl, %rax
        ret
[…]
__rotlti3:
        movq    %rdi, %r9
        movq    %rsi, %r8
        andl    $127, %edx
        movq    %rdi, %rsi
        movl    %edx, %ecx
        movq    %r9, %rax
        movq    %r8, %rdx
        movq    %r8, %rdi
        shldq   %cl, %r9, %rdx
        salq    %cl, %rax
        xorl    %r8d, %r8d
        testb   $64, %cl
        cmovne  %rax, %rdx
        cmovne  %r8, %rax
        negl    %ecx
        xorl    %r9d, %r9d
        andl    $127, %ecx
        shrdq   %cl, %rdi, %rsi
        shrq    %cl, %rdi
        testb   $64, %cl
        cmovne  %rdi, %rsi
        cmovne  %r9, %rdi
        orq     %rsi, %rax
        orq     %rdi, %rdx
        ret
[…]
__rotrti3:
        movq    %rsi, %r8
        movq    %rdi, %r9
        andl    $127, %edx
        movq    %rdi, %rsi
        movl    %edx, %ecx
        movq    %r9, %rax
        movq    %r8, %rdx
        movq    %r8, %rdi
        shrdq   %cl, %r8, %rax
        shrq    %cl, %rdx
        xorl    %r8d, %r8d
        testb   $64, %cl
        cmovne  %rdx, %rax
        cmovne  %r8, %rdx
        negl    %ecx
        andl    $127, %ecx
        shldq   %cl, %r9, %rdi
        salq    %cl, %rsi
        xorl    %r9d, %r9d
        testb   $64, %cl
        cmovne  %rsi, %rdi
        cmovne  %r9, %rsi
        orq     %rdi, %rdx
        orq     %rsi, %rax
        ret
[…]
        .ident  "GCC: (GNU […]) 10.2.0"

Ouch: while the optimiser recognises the common and well-known idiom for rotation of 32-bit and 64-bit integers, it fails rather badly at 128-bit integers and generates awful code!
Properly written code uses only 9 instructions in just 28 bytes instead of 25 instructions in 77 bytes for each of the __rotlti3() and __rotrti3() functions:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "rolti3.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = value
# rdx = count
__rotlti3:
mov ecx, edx
mov rax, rdi
shld rax, rsi, cl
shld rsi, rdi, cl
mov rdx, rsi
test cl, 64
cmovnz rdx, rax
cmovnz rax, rsi
ret
.size __rotlti3, .-__rotlti3
.type __rotlti3, @function
.global __rotlti3
__rotrti3:
mov ecx, edx
mov rax, rdi
shrd rax, rsi, cl
shrd rsi, rdi, cl
mov rdx, rsi
test cl, 64
cmovnz rdx, rax
cmovnz rax, rsi
ret
.size __rotrti3, .-__rotrti3
.type __rotrti3, @function
.global __rotrti3
.end
Note: exploration of the equally bad code generated
for rotation of 64-bit integers on the i386 platform is
left as an exercise to the reader.
gcc -m32 -o- -Os -S case5.c
[…]
__rotlsi3:
        movl    8(%esp), %ecx
        movl    4(%esp), %eax
        roll    %cl, %eax
        ret
[…]
__rotrsi3:
        movl    8(%esp), %ecx
        movl    4(%esp), %eax
        rorl    %cl, %eax
        ret
[…]
__rotldi3:
        pushl   %edi
        pushl   %esi
        movl    12(%esp), %eax
        movl    16(%esp), %edx
        movl    20(%esp), %ecx
        movl    %eax, %esi
        andl    $63, %ecx
        movl    %edx, %edi
        shldl   %cl, %esi, %edi
        sall    %cl, %esi
        testb   $32, %cl
        je      L8
        movl    %esi, %edi
        xorl    %esi, %esi
L8:
        negl    %ecx
        andl    $63, %ecx
        shrdl   %cl, %edx, %eax
        shrl    %cl, %edx
        testb   $32, %cl
        je      L9
        movl    %edx, %eax
        xorl    %edx, %edx
L9:
        orl     %esi, %eax
        orl     %edi, %edx
        popl    %esi
        popl    %edi
        ret
[…]
__rotrdi3:
        pushl   %edi
        pushl   %esi
        movl    12(%esp), %eax
        movl    16(%esp), %edx
        movl    20(%esp), %ecx
        movl    %eax, %esi
        andl    $63, %ecx
        movl    %edx, %edi
        shrdl   %cl, %edi, %esi
        shrl    %cl, %edi
        testb   $32, %cl
        je      L12
        movl    %edi, %esi
        xorl    %edi, %edi
L12:
        negl    %ecx
        andl    $63, %ecx
        shldl   %cl, %eax, %edx
        sall    %cl, %eax
        testb   $32, %cl
        je      L13
        movl    %eax, %edx
        xorl    %eax, %eax
L13:
        orl     %esi, %eax
        orl     %edi, %edx
        popl    %esi
        popl    %edi
        ret
[…]
        .ident  "GCC: (GNU […]) 10.2.0"
__cmp{s,d,t}i2()
Functions
libgcc
provides the __cmpsi2()
,
__cmpdi2()
and __cmpti2()
functions for
signed integer comparison:
For the AMD64 target platform, GCC compiles
- Runtime Function: int __cmpsi2 (int a, int b)
- Runtime Function: int __cmpdi2 (long a, long b)
- Runtime Function: int __cmpti2 (long long a, long long b)
These functions perform a signed comparison of a and b. If a is less than b, they return 0; if a is greater than b, they return 2; and if a and b are equal they return 1.
__cmpti2()
to the following unoptimised and
awkward code using (superfluous) conditional
branches which impair performance:
[…] 0000000000000000 <__cmpti2>: 0: 31 c0 xor %eax,%eax 2: 48 39 ce cmp %rcx,%rsi 5: 7c 1c jl 23 <__cmpti2+0x23> 7: b8 02 00 00 00 mov $0x2,%eax c: 7f 15 jg 23 <__cmpti2+0x23> e: 31 c0 xor %eax,%eax 10: 48 39 d7 cmp %rdx,%rdi 13: 72 0e jb 23 <__cmpti2+0x23> 15: b8 01 00 00 00 mov $0x1,%eax 1a: ba 02 00 00 00 mov $0x2,%edx 1f: 48 0f 47 c2 cmova %rdx,%rax 23: c3 retq […]Properly written branch-free code uses 9 instructions in 21 bytes instead of 12 instructions in 36 bytes:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "cmpti2.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = left argument
# rcx:rdx = right argument
__cmpti2:
sub rdi, rdx
sbb rsi, rcx # rsi:rdi = left argument - right argument
setg ah # ah = (left argument > right argument) ? 1 : 0
setl al # al = (left argument < right argument) ? 1 : 0
stc
sbb al, ah # al = (left argument < right argument)
# - (left argument > right argument)
# - 1
# = {0, -1, -2}
movsx eax, al
neg eax # rax = (left argument > right argument)
# - (left argument < right argument)
# + 1
# = {0, 1, 2}
ret
.size __cmpti2, .-__cmpti2
.type __cmpti2, @function
.global __cmpti2
.end
Note: exploration of the equally bad code generated
for the __cmpdi2()
and __cmpsi2()
functions is left as an exercise to the reader.
[…] 00000000 <__cmpdi2>: 0: 53 push %ebx 1: 31 c0 xor %eax,%eax 3: 8b 4c 24 08 mov 0x8(%esp),%ecx 7: 8b 54 24 10 mov 0x10(%esp),%edx b: 8b 5c 24 14 mov 0x14(%esp),%ebx f: 39 5c 24 0c cmp %ebx,0xc(%esp) 13: 7c 1a jl 2f <__cmpdi2+0x2f> 15: b8 02 00 00 00 mov $0x2,%eax 1a: 7f 13 jg 2f <__cmpdi2+0x2f> 1c: 31 c0 xor %eax,%eax 1e: 39 d1 cmp %edx,%ecx 20: 72 0d jb 2f <__cmpdi2+0x2f> 22: b8 01 00 00 00 mov $0x1,%eax 27: ba 02 00 00 00 mov $0x2,%edx 2c: 0f 47 c2 cmova %edx,%eax 2f: 5b pop %ebx 30: c3 ret […]
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int __cmpsi2(int comparand, int comparator) {
return (comparand > comparator) - (comparand < comparator) + 1;
}
int __cmpdi2(long long comparand, long long comparator) {
return (comparand > comparator) - (comparand < comparator) + 1;
}
#ifdef __amd64__
int __cmpti2(__int128_t comparand, __int128_t comparator) {
return (comparand > comparator) - (comparand < comparator) + 1;
}
#endif
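A quick sanity check of the 0/1/2 mapping described above (a hypothetical test driver, not part of libgcc, linked against one of the implementations shown):
// exercises the stated contract: 0 = less, 1 = equal, 2 = greater
#include <assert.h>
int __cmpdi2(long long comparand, long long comparator);
int main(void) {
    assert(__cmpdi2(-1, 1) == 0);   // comparand less than comparator
    assert(__cmpdi2(42, 42) == 1);  // both equal
    assert(__cmpdi2(1, -1) == 2);   // comparand greater than comparator
    return 0;
}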
__ucmp{s,d,t}i2()
Functions
libgcc
provides the __ucmpsi2()
,
__ucmpdi2()
and __ucmpti2()
functions for
unsigned integer comparison:
For the AMD64 target platform, GCC compiles
- Runtime Function: int __ucmpsi2 (unsigned int a, unsigned int b)
- Runtime Function: int __ucmpdi2 (unsigned long a, unsigned long b)
- Runtime Function: int __ucmpti2 (unsigned long long a, unsigned long long b)
These functions perform an unsigned comparison of a and b. If a is less than b, they return 0; if a is greater than b, they return 2; and if a and b are equal they return 1.
__ucmpti2()
to the following unoptimised and
awkward code using (superfluous) conditional
branches which impair performance:
[…] 0000000000000000 <__ucmpti2>: 0: 31 c0 xor %eax,%eax 2: 48 39 ce cmp %rcx,%rsi 5: 72 1c jb 23 <__ucmpti2+0x23> 7: b8 02 00 00 00 mov $0x2,%eax c: 77 15 ja 23 <__ucmpti2+0x23> e: 31 c0 xor %eax,%eax 10: 48 39 d7 cmp %rdx,%rdi 13: 72 0e jb 23 <__ucmpti2+0x23> 15: b8 01 00 00 00 mov $0x1,%eax 1a: ba 02 00 00 00 mov $0x2,%edx 1f: 48 0f 47 c2 cmova %rdx,%rax 23: c3 retq […]Properly written branch-free code uses but only 6 instructions in just 15 bytes instead of 12 instructions in 36 bytes:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "ucmpti2.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = left argument
# rcx:rdx = right argument
__ucmpti2:
xor eax, eax # rax = 0
sub rdi, rdx
sbb rsi, rcx # rsi:rdi = left argument - right argument
seta al # rax = (left argument > right argument)
sbb eax, -1 # rax = (left argument > right argument)
# - (left argument < right argument)
# + 1
# = {0, 1, 2}
ret
.size __ucmpti2, .-__ucmpti2
.type __ucmpti2, @function
.global __ucmpti2
.end
Note: exploration of the equally bad code generated
for the __ucmpdi2()
and __ucmpsi2()
functions is left as an exercise to the reader.
[…] 00000000 <__ucmpdi2>: 0: 53 push %ebx 1: 31 c0 xor %eax,%eax 3: 8b 4c 24 08 mov 0x8(%esp),%ecx 7: 8b 54 24 10 mov 0x10(%esp),%edx b: 8b 5c 24 14 mov 0x14(%esp),%ebx f: 39 5c 24 0c cmp %ebx,0xc(%esp) 13: 72 1a jb 2f <__ucmpdi2+0x2f> 15: b8 02 00 00 00 mov $0x2,%eax 1a: 77 13 ja 2f <__ucmpdi2+0x2f> 1c: 31 c0 xor %eax,%eax 1e: 39 d1 cmp %edx,%ecx 20: 72 0d jb 2f <__ucmpdi2+0x2f> 22: b8 01 00 00 00 mov $0x1,%eax 27: ba 02 00 00 00 mov $0x2,%edx 2c: 0f 47 c2 cmova %edx,%eax 2f: 5b pop %ebx 30: c3 ret […]
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int __ucmpsi2(unsigned int comparand, unsigned int comparator) {
return (comparand > comparator) - (comparand < comparator) + 1;
}
int __ucmpdi2(unsigned long long comparand, unsigned long long comparator) {
return (comparand > comparator) - (comparand < comparator) + 1;
}
#ifdef __amd64__
int __ucmpti2(__uint128_t comparand, __uint128_t comparator) {
return (comparand > comparator) - (comparand < comparator) + 1;
}
#endif
__mul{s,d,t}i3()
Functions
libgcc
provides the __mulsi3()
,
__muldi3()
and __multi3()
functions for
integer multiplication:
For the AMD64 target platform, GCC compiles
- Runtime Function: int __mulsi3 (int a, int b)
- Runtime Function: long __muldi3 (long a, long b)
- Runtime Function: long long __multi3 (long long a, long long b)
These functions return the product of a and b.
__multi3()
to the following code, which uses a
LEA
instruction instead
of the appropriate (shorter) ADD
instruction to compute the high part of the product, and two
superfluous MOV
instructions:
[…] 0000000000000000 <__multi3>: 0: 49 89 f0 mov %rsi,%r8 3: 48 89 f8 mov %rdi,%rax 6: 48 89 d6 mov %rdx,%rsi 9: 48 0f af f9 imul %rcx,%rdi d: 49 0f af f0 imul %r8,%rsi 11: 48 f7 e2 mul %rdx 14: 48 01 d7 add %rdx,%rdi 17: 48 8d 14 37 lea (%rdi,%rsi,1),%rdx 1b: c3 retq […]Properly written code uses but 7 instructions in 21 bytes instead of 9 instructions in 28 bytes:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "multi3.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = multiplicand
# rcx:rdx = multiplier
__multi3:
mov rax, rdi # rax = low qword of multiplicand
imul rsi, rdx # rsi = high qword of multiplicand
# * low qword of multiplier
imul rdi, rcx # rdi = low qword of multiplicand
# * high qword of multiplier
mul rdx # rdx:rax = low qword of multiplicand
# * low qword of multiplier
add rdi, rsi
add rdx, rdi # rdx:rax = product % 2**128
ret
.size __multi3, .-__multi3
.type __multi3, @function
.global __multi3
.end
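The decomposition performed by this code can be spelled out in portable C as follows (a sketch assuming a compiler with unsigned __int128 support; the function name is made up):
#include <stdint.h>
// product % 2**128: the high qword is the carry out of the low
// partial product plus both cross products; the ah * bh term would
// only contribute beyond bit 127 and is therefore dropped
unsigned __int128 multi3_sketch(unsigned __int128 a, unsigned __int128 b) {
    uint64_t al = (uint64_t) a, ah = (uint64_t) (a >> 64);
    uint64_t bl = (uint64_t) b, bh = (uint64_t) (b >> 64);
    unsigned __int128 low = (unsigned __int128) al * bl;
    uint64_t high = (uint64_t) (low >> 64) + ah * bl + al * bh;
    return ((unsigned __int128) high << 64) | (uint64_t) low;
}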
Note: exploration of the equally bad code generated
for the __muldi3()
function on the i386
platform is left as an exercise to the reader.
[…] 00000000 <__muldi3>: 0: 53 push %ebx 1: 8b 4c 24 08 mov 0x8(%esp),%ecx 5: 8b 5c 24 10 mov 0x10(%esp),%ebx 9: 89 c8 mov %ecx,%eax b: 0f af 4c 24 14 imul 0x14(%esp),%ecx 10: f7 e3 mul %ebx 12: 0f af 5c 24 0c imul 0xc(%esp),%ebx 17: 01 d1 add %edx,%ecx 19: 8d 14 19 lea (%ecx,%ebx,1),%edx 1c: 5b pop %ebx 1d: c3 ret […]
Properly written code uses 12 instructions in 29 bytes instead of 11 instructions in 30 bytes:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "muldi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of multiplier
# [esp+12] = low dword of multiplier
# [esp+8] = high dword of multiplicand
# [esp+4] = low dword of multiplicand
__muldi3:
push ebx
mov edx, [esp+16] # edx = low dword of multiplier
mov ecx, [esp+12] # ecx = high dword of multiplicand
imul ecx, edx # ecx = high dword of multiplicand
# * low dword of multiplier
mov eax, [esp+8] # eax = low dword of multiplicand
mov ebx, [esp+20] # ebx = high dword of multiplier
imul ebx, eax # ebx = high dword of multiplier
# * low dword of multiplicand
mul edx # edx:eax = low dword of multiplicand
# * low dword of multiplier
add ecx, ebx # ecx = high dword of multiplicand
# * low dword of multiplier
# + high dword of multiplier
# * low dword of multiplicand
add edx, ecx # edx:eax = product % 2**64
pop ebx
ret
.size __muldi3, .-__muldi3
.type __muldi3, @function
.global __muldi3
.end
__neg{s,d,t}i2()
Functions
libgcc
provides the __negsi2()
,
__negdi2()
and __negti2()
functions for
signed integer negation:
For the AMD64 target platform, GCC compiles
- Runtime Function: int __negsi2 (int a)
- Runtime Function: long __negdi2 (long a)
- Runtime Function: long long __negti2 (long long a)
These functions return the negation of a.
__negti2()
to the following unoptimised and
awkward code:
[…] 0000000000000000 <__negti2>: 0: 48 f7 de neg %rsi 3: 48 89 f8 mov %rdi,%rax 6: 48 f7 d8 neg %rax 9: 48 89 f2 mov %rsi,%rdx c: 48 83 ff 01 cmp $0x1,%rdi 10: 48 83 d2 ff adc $0xffffffffffffffff,%rdx 14: c3 retq […]Properly written code uses but only 5 instructions in just 10 bytes instead of 7 instructions in 21 bytes:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "negti2.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = argument
__negti2:
xor eax, eax
cdq # rdx:rax = 0
sub rax, rdi
sbb rdx, rsi # rdx:rax = -argument
ret
.size __negti2, .-__negti2
.type __negti2, @function
.global __negti2
.end
Note: exploration of the equally bad code generated
for the __negdi2()
function on the i386
platform is left as an exercise to the reader.
[…] 00000000 <__negdi2>: 0: 8b 4c 24 04 mov 0x4(%esp),%ecx 4: 8b 54 24 08 mov 0x8(%esp),%edx 8: 89 c8 mov %ecx,%eax a: f7 da neg %edx c: f7 d8 neg %eax e: 83 f9 01 cmp $0x1,%ecx 11: 83 d2 ff adc $0xffffffff,%edx 14: c3 ret […]
Properly written code uses but only 5 instructions in just 12 bytes instead of 8 instructions in 21 bytes:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "negdi2.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+8] = high dword of argument
# [esp+4] = low dword of argument
__negdi2:
xor eax, eax
cdq # edx:eax = 0
sub eax, [esp+4]
sbb edx, [esp+8] # edx:eax = -argument
ret
.size __negdi2, .-__negdi2
.type __negdi2, @function
.global __negdi2
.end
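The borrow propagation performed by the SUB/SBB pair can be written out in C like this (a sketch with invented names, operating on the two halves directly):
#include <stdint.h>
// negate a 64-bit value given as two 32-bit halves: the high half
// takes a borrow whenever the low half is non-zero
void negdi2_sketch(uint32_t *low, uint32_t *high) {
    uint32_t borrow = (*low != 0);
    *low  = 0u - *low;
    *high = 0u - *high - borrow;
}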
__absv{s,d,t}i2()
Functions
libgcc
provides the __absvsi2()
,
__absvdi2()
and __absvti2()
functions:
For the AMD64 target platform, GCC compiles
- Runtime Function: int __absvsi2 (int a)
- Runtime Function: long __absvdi2 (long a)
- Runtime Function: long long __absvti2 (long long a)
These functions return the absolute value of a.
__absvti2()
to the following unoptimised and
awkward code:
[…] 0000000000000000 <__absvti2>: 0: 48 89 f8 mov %rdi,%rax 3: 48 89 f2 mov %rsi,%rdx 6: 48 85 f6 test %rsi,%rsi 9: 78 05 js 10 <__absvti2+0x10> b: c3 retq c: 0f 1f 40 00 nopl 0x0(%rax) 10: 48 83 ec 08 sub $0x8,%rsp 14: 48 f7 d8 neg %rax 17: 48 83 d2 00 adc $0x0,%rdx 1b: 48 f7 da neg %rdx 1e: 48 85 d2 test %rdx,%rdx 21: 0f 88 00 00 00 00 js … 27: 48 83 c4 08 add $0x8,%rsp 2b: c3 retq […]Oops: the
TEST
instruction at address 1e
is superfluous.
Properly written (and almost branch-free) code uses but only 11 instructions in just 26 bytes instead of 14 instructions in 44 bytes:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "absvti2.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = argument
__absvti2:
mov rax, rsi
cqo # rdx = (argument < 0) ? -1 : 0
add rdi, rdx
adc rsi, rdx # rsi:rdi = (argument < 0) ? argument - 1 : argument
jo .overflow
mov rax, rdx
xor rax, rdi
xor rdx, rsi # rdx:rax = (argument < 0) ? -argument : argument
# = |argument|
ret
.overflow:
int 4
ret
.size __absvti2, .-__absvti2
.type __absvti2, @function
.global __absvti2
.end
Note: exploration of the equally bad code generated
for the __absvdi2()
and __absvsi2()
functions is left as an exercise to the reader.
[…] 00000000 <__absvdi2>: 0: 83 ec 0c sub $0xc,%esp 3: 8b 54 24 14 mov 0x14(%esp),%edx 7: 8b 44 24 10 mov 0x10(%esp),%eax b: 85 d2 test %edx,%edx d: 78 09 js 18 <__absvdi2+0x18> f: 83 c4 0c add $0xc,%esp 12: c3 ret 13: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi 17: 90 nop 18: f7 d8 neg %eax 1a: 83 d2 00 adc $0x0,%edx 1d: f7 da neg %edx 1f: 85 d2 test %edx,%edx 21: 0f 88 00 00 00 00 js … 27: 83 c4 0c add $0xc,%esp 2a: c3 ret […]
Properly written branch-free code uses but only 10 instructions in just 21 bytes instead of 16 instructions in 43 bytes:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "absvdi2.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+8] = high dword of argument
# [esp+4] = low dword of argument
__absvdi2:
mov eax, [esp+8]
mov ecx, [esp+4] # eax:ecx = argument
cdq # edx = (argument < 0) ? -1 : 0
add ecx, edx
adc eax, edx # eax:ecx = (argument < 0) ? argument - 1 : argument
into
xor ecx, edx
xor edx, eax
mov eax, ecx # edx:eax = (argument < 0) ? -argument : argument
# = |argument|
ret
.size __absvdi2, .-__absvdi2
.type __absvdi2, @function
.global __absvdi2
.end
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef ALTERNATE
int __absvsi2(int value) {
if (value < 0)
value = 0 - (unsigned int) value;
if (value < 0)
__builtin_trap();
return value;
}
long long __absvdi2(long long value) {
if (value < 0)
value = 0 - (unsigned long long) value;
if (value < 0)
__builtin_trap();
return value;
}
#ifdef __amd64__
__int128_t __absvti2(__int128_t value) {
if (value < 0)
value = 0 - (__uint128_t) value;
if (value < 0)
__builtin_trap();
return value;
}
#endif
#else
int __absvsi2(int value) {
const int sign = 0 - (value < 0);
if (__builtin_add_overflow(sign, value, &value))
__builtin_trap();
return sign ^ value;
}
long long __absvdi2(long long value) {
const long long sign = 0 - (value < 0);
if (__builtin_add_overflow(sign, value, &value))
__builtin_trap();
return sign ^ value;
}
#ifdef __amd64__
__int128_t __absvti2(__int128_t value) {
const __int128_t sign = 0 - (value < 0);
if (__builtin_add_overflow(sign, value, &value))
__builtin_trap();
return sign ^ value;
}
#endif
#endif // ALTERNATE
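Both the assembly above and the ALTERNATE variant rely on the same identity: with s = -1 for negative arguments and s = 0 otherwise, (value ^ s) - s yields |value|. A sketch (the arithmetic right shift of a negative value is implementation-defined in C, but behaves as expected on all mainstream compilers):
// |value| without a branch; overflows for INT_MIN, which is exactly
// the case the trapping __absvsi2() must catch
int abs_sketch(int value) {
    const int s = value >> 31;  // -1 if negative, else 0
    return (value ^ s) - s;     // ~value + 1 = -value if negative
}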
__addv{s,d,t}i3()
Functions
libgcc
provides the __addvsi3()
,
__addvdi3()
and __addvti3()
functions for
overflow-trapping signed integer addition:
For the AMD64 target platform, GCC compiles
- Runtime Function: int __addvsi3 (int a, int b)
- Runtime Function: long __addvdi3 (long a, long b)
- Runtime Function: long long __addvti3 (long long a, long long b)
These functions return the sum of a and b; that is
a + b
.
__addvti3()
to the following unoptimised and
awful code:
[…] 0000000000000000 <__addvti3>: 0: 49 89 f9 mov %rdi,%r9 3: 49 89 d0 mov %rdx,%r8 6: 48 83 ec 08 sub $0x8,%rsp a: 48 89 f2 mov %rsi,%rdx d: 4c 89 c8 mov %r9,%rax 10: 48 89 f7 mov %rsi,%rdi 13: 4c 01 c0 add %r8,%rax 16: 48 11 ca adc %rcx,%rdx 19: 48 85 c9 test %rcx,%rcx 1c: 78 12 js 30 <__addvti3+0x30> 1e: 4c 39 c8 cmp %r9,%rax 21: 48 89 d1 mov %rdx,%rcx 24: 48 19 f1 sbb %rsi,%rcx 27: 7c 18 jl 41 <__addvti3+0x41> 29: 48 83 c4 08 add $0x8,%rsp 2d: c3 retq 2e: 66 90 xchg %ax,%ax 30: 49 39 c1 cmp %rax,%r9 33: 48 19 d7 sbb %rdx,%rdi 36: 0f 8c 00 00 00 00 jl … 3c: 48 83 c4 08 add $0x8,%rsp 40: c3 retq 41: e9 00 00 00 00 jmpq … […]Properly written (and almost branch-free) code uses but only 8 instructions in just 18 bytes instead of 23 instructions in 70 bytes:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "addvti3.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = augend
# rcx:rdx = addend
__addvti3:
add rdi, rdx
adc rsi, rcx # rsi:rdi = augend + addend
# = sum
jo .overflow
mov rax, rdi
mov rdx, rsi # rdx:rax = sum
ret
.overflow:
int 4
ret
.size __addvti3, .-__addvti3
.type __addvti3, @function
.global __addvti3
.end
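These runtime functions back GCC's -ftrapv option: with it, a plain signed addition is compiled into a call of the matching __addv*i3() function (historically; newer GCC versions may emit inline checks instead). A hypothetical demo, using INT_MAX from <limits.h>:
// compile with 'gcc -ftrapv demo.c' and run: the overflowing
// addition traps instead of wrapping around
#include <limits.h>
int main(void) {
    volatile int x = INT_MAX;
    return x + 1;   // signed overflow: trapped, not wrapped
}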
Note: exploration of the equally bad code generated
for the __addvdi3()
and __addvsi3()
functions is left as an exercise to the reader.
[…] 00000000 <__addvdi3>: 0: 56 push %esi 1: 53 push %ebx 2: 83 ec 04 sub $0x4,%esp 5: 8b 4c 24 10 mov 0x10(%esp),%ecx 9: 8b 5c 24 14 mov 0x14(%esp),%ebx d: 8b 74 24 1c mov 0x1c(%esp),%esi 11: 89 c8 mov %ecx,%eax 13: 89 da mov %ebx,%edx 15: 03 44 24 18 add 0x18(%esp),%eax 19: 13 54 24 1c adc 0x1c(%esp),%edx 1d: 85 f6 test %esi,%esi 1f: 78 0f js 30 <__addvdi3+0x30> 21: 39 c8 cmp %ecx,%eax 23: 89 d6 mov %edx,%esi 25: 19 de sbb %ebx,%esi 27: 7c 17 jl 40 <__addvdi3+0x40> 29: 83 c4 04 add $0x4,%esp 2c: 5b pop %ebx 2d: 5e pop %esi 2e: c3 ret 2f: 90 nop 30: 39 c1 cmp %eax,%ecx 32: 19 d3 sbb %edx,%ebx 34: 0f 8c 00 00 00 00 jl … 3a: 83 c4 04 add $0x4,%esp 3d: 5b pop %ebx 3e: 5e pop %esi 3f: c3 ret 40: e9 00 00 00 00 jmp … […]
Properly written branch-free code uses but only 6 instructions in just 18 bytes instead of 29 instructions in 69 bytes:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "addvdi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of addend
# [esp+12] = low dword of addend
# [esp+8] = high dword of augend
# [esp+4] = low dword of augend
__addvdi3:
mov eax, [esp+4]
mov edx, [esp+8] # edx:eax = augend
add eax, [esp+12]
adc edx, [esp+16] # edx:eax = augend + addend
# = sum
into
ret
.size __addvdi3, .-__addvdi3
.type __addvdi3, @function
.global __addvdi3
.end
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef ALTERNATE
// overflow if both augend and addend have opposite sign of sum,
// which is equivalent to augend has sign of addend
// and addend has opposite sign of sum
// (or augend has opposite sign of sum)
int __addvsi3(int augend, int addend) {
const int sum = (unsigned int) augend + (unsigned int) addend;
if (((augend ^ sum) & (addend ^ sum)) < 0)
__builtin_trap();
return sum;
}
long long __addvdi3(long long augend, long long addend) {
const long long sum = (unsigned long long) augend + (unsigned long long) addend;
if (((augend ^ sum) & (addend ^ sum)) < 0)
__builtin_trap();
return sum;
}
#ifdef __amd64__
__int128_t __addvti3(__int128_t augend, __int128_t addend) {
const __int128_t sum = (__uint128_t) augend + (__uint128_t) addend;
if (((augend ^ sum) & (addend ^ sum)) < 0)
__builtin_trap();
return sum;
}
#endif
#else
int __addvsi3(int augend, int addend) {
int sum;
if (__builtin_add_overflow(augend, addend, &sum))
__builtin_trap();
return sum;
}
long long __addvdi3(long long augend, long long addend) {
long long sum;
if (__builtin_add_overflow(augend, addend, &sum))
__builtin_trap();
return sum;
}
#ifdef __amd64__
__int128_t __addvti3(__int128_t augend, __int128_t addend) {
__int128_t sum;
if (__builtin_add_overflow(augend, addend, &sum))
__builtin_trap();
return sum;
}
#endif
#endif // ALTERNATE
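The sign rule stated in the comment above can be exercised directly; the following is a hypothetical check (not part of libgcc) mirroring the first __addvsi3() variant:
#include <limits.h>
// overflow iff both operands differ in sign from the wrapped sum
int add_overflows(int augend, int addend) {
    const int sum = (int) ((unsigned int) augend + (unsigned int) addend);
    return ((augend ^ sum) & (addend ^ sum)) < 0;
}
// add_overflows(INT_MAX, 1) == 1, add_overflows(INT_MAX, -1) == 0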
__subv{s,d,t}i3()
Functions
libgcc
provides the __subvsi3()
,
__subvdi3()
and __subvti3()
functions for
overflow-trapping signed integer subtraction:
For the AMD64 target platform, GCC compiles
- Runtime Function: int __subvsi3 (int a, int b)
- Runtime Function: long __subvdi3 (long a, long b)
- Runtime Function: long long __subvti3 (long long a, long long b)
These functions return the difference between a and b; that is
a - b
.
__subvti3()
to the following unoptimised and
awful code:
[…] 0000000000000000 <__subvti3>: 0: 49 89 f9 mov %rdi,%r9 3: 49 89 d0 mov %rdx,%r8 6: 48 83 ec 08 sub $0x8,%rsp a: 48 89 f2 mov %rsi,%rdx d: 4c 89 c8 mov %r9,%rax 10: 4c 29 c0 sub %r8,%rax 13: 48 19 ca sbb %rcx,%rdx 16: 48 85 c9 test %rcx,%rcx 19: 78 15 js 30 <__subvti3+0x30> 1b: 49 39 c1 cmp %rax,%r9 1e: 48 89 f1 mov %rsi,%rcx 21: 48 19 d1 sbb %rdx,%rcx 24: 7c 1e jl 44 <__subvti3+0x44> 26: 48 83 c4 08 add $0x8,%rsp 2a: c3 retq 2b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) 30: 4c 39 c8 cmp %r9,%rax 33: 48 89 d1 mov %rdx,%rcx 36: 48 19 f1 sbb %rsi,%rcx 39: 0f 8c 00 00 00 00 jl … 3f: 48 83 c4 08 add $0x8,%rsp 43: c3 retq 44: e9 00 00 00 00 jmpq … […]Properly written (and almost branch-free) code uses but only 8 instructions in just 18 bytes instead of 23 instructions in 73 bytes:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "subvti3.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = minuend
# rcx:rdx = subtrahend
__subvti3:
sub rdi, rdx
sbb rsi, rcx # rsi:rdi = minuend - subtrahend
# = difference
jo .overflow
mov rax, rdi
mov rdx, rsi # rdx:rax = difference
ret
.overflow:
int 4
ret
.size __subvti3, .-__subvti3
.type __subvti3, @function
.global __subvti3
.end
Note: exploration of the equally bad code generated
for the __subvdi3()
and __subvsi3()
functions is left as an exercise to the reader.
[…] 00000000 <__subvdi3>: 0: 56 push %esi 1: 53 push %ebx 2: 83 ec 04 sub $0x4,%esp 5: 8b 4c 24 10 mov 0x10(%esp),%ecx 9: 8b 5c 24 14 mov 0x14(%esp),%ebx d: 8b 74 24 1c mov 0x1c(%esp),%esi 11: 89 c8 mov %ecx,%eax 13: 89 da mov %ebx,%edx 15: 2b 44 24 18 sub 0x18(%esp),%eax 19: 1b 54 24 1c sbb 0x1c(%esp),%edx 1d: 85 f6 test %esi,%esi 1f: 78 0f js 30 <__subvdi3+0x30> 21: 39 c1 cmp %eax,%ecx 23: 19 d3 sbb %edx,%ebx 25: 7c 1b jl 42 <__subvdi3+0x42> 27: 83 c4 04 add $0x4,%esp 2a: 5b pop %ebx 2b: 5e pop %esi 2c: c3 ret 2d: 8d 76 00 lea 0x0(%esi),%esi 30: 39 c8 cmp %ecx,%eax 32: 89 d6 mov %edx,%esi 34: 19 de sbb %ebx,%esi 36: 0f 8c 00 00 00 00 jl … 3c: 83 c4 04 add $0x4,%esp 3f: 5b pop %ebx 40: 5e pop %esi 41: c3 ret 42: e9 00 00 00 00 jmp … […]
Properly written branch-free code uses but only 6 instructions in just 18 bytes instead of 29 instructions in 71 bytes:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "subvdi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of subtrahend
# [esp+12] = low dword of subtrahend
# [esp+8] = high dword of minuend
# [esp+4] = low dword of minuend
__subvdi3:
mov eax, [esp+4]
mov edx, [esp+8] # edx:eax = minuend
sub eax, [esp+12]
sbb edx, [esp+16] # edx:eax = minuend - subtrahend
# = difference
into
ret
.size __subvdi3, .-__subvdi3
.type __subvdi3, @function
.global __subvdi3
.end
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef ALTERNATE
// overflow if minuend has opposite sign of subtrahend
// and minuend has opposite sign of difference
// (or subtrahend has sign of difference)
int __subvsi3(int minuend, int subtrahend) {
const int difference = (unsigned int) minuend - (unsigned int) subtrahend;
if (((minuend ^ subtrahend) & (minuend ^ difference)) < 0)
__builtin_trap();
return difference;
}
long long __subvdi3(long long minuend, long long subtrahend) {
const long long difference = (unsigned long long) minuend - (unsigned long long) subtrahend;
if (((minuend ^ subtrahend) & (minuend ^ difference)) < 0)
__builtin_trap();
return difference;
}
#ifdef __amd64__
__int128_t __subvti3(__int128_t minuend, __int128_t subtrahend) {
const __int128_t difference = (__uint128_t) minuend - (__uint128_t) subtrahend;
if (((minuend ^ subtrahend) & (minuend ^ difference)) < 0)
__builtin_trap();
return difference;
}
#endif
#else
int __subvsi3(int minuend, int subtrahend) {
int difference;
if (__builtin_sub_overflow(minuend, subtrahend, &difference))
__builtin_trap();
return difference;
}
long long __subvdi3(long long minuend, long long subtrahend) {
long long difference;
if (__builtin_sub_overflow(minuend, subtrahend, &difference))
__builtin_trap();
return difference;
}
#ifdef __amd64__
__int128_t __subvti3(__int128_t minuend, __int128_t subtrahend) {
__int128_t difference;
if (__builtin_sub_overflow(minuend, subtrahend, &difference))
__builtin_trap();
return difference;
}
#endif
#endif // ALTERNATE
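The analogous rule for subtraction, likewise as a hypothetical check:
#include <limits.h>
// overflow iff minuend and subtrahend differ in sign and the wrapped
// difference has the sign of the subtrahend
int sub_overflows(int minuend, int subtrahend) {
    const int difference = (int) ((unsigned int) minuend - (unsigned int) subtrahend);
    return ((minuend ^ subtrahend) & (minuend ^ difference)) < 0;
}
// sub_overflows(INT_MIN, 1) == 1, sub_overflows(INT_MIN, -1) == 0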
__mulv{s,d,t}i3()
Functions
libgcc
provides the __mulvsi3()
,
__mulvdi3()
and __mulvti3()
functions for
overflow-trapping signed integer multiplication:
For the AMD64 target platform, GCC compiles
- Runtime Function: int __mulvsi3 (int a, int b)
- Runtime Function: long __mulvdi3 (long a, long b)
- Runtime Function: long long __mulvti3 (long long a, long long b)
These functions return the product of a and b; that is
a * b
.
__mulvti3()
to the following horrible
code:
[…] 0000000000000000 <__mulvti3>: 0: 41 55 push %r13 2: 49 89 cb mov %rcx,%r11 5: 48 89 d0 mov %rdx,%rax 8: 49 89 d2 mov %rdx,%r10 b: 41 54 push %r12 d: 49 89 fc mov %rdi,%r12 10: 48 89 d1 mov %rdx,%rcx 13: 49 89 f0 mov %rsi,%r8 16: 4c 89 e2 mov %r12,%rdx 19: 49 89 f5 mov %rsi,%r13 1c: 53 push %rbx 1d: 48 89 fe mov %rdi,%rsi 20: 48 c1 fa 3f sar $0x3f,%rdx 24: 48 c1 f8 3f sar $0x3f,%rax 28: 4c 89 df mov %r11,%rdi 2b: 4c 39 c2 cmp %r8,%rdx 2e: 75 18 jne 48 <__mulvti3+0x48> 30: 4c 39 d8 cmp %r11,%rax 33: 75 6b jne a0 <__mulvti3+0xa0> 35: 4c 89 e0 mov %r12,%rax 38: 49 f7 ea imul %r10 3b: 5b pop %rbx 3c: 41 5c pop %r12 3e: 41 5d pop %r13 40: c3 retq 41: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) 48: 4c 39 d8 cmp %r11,%rax 4b: 0f 85 a7 00 00 00 jne f8 <__mulvti3+0xf8> 51: 4c 89 e0 mov %r12,%rax 54: 49 f7 e2 mul %r10 57: 49 89 c1 mov %rax,%r9 5a: 4c 89 c0 mov %r8,%rax 5d: 48 89 d7 mov %rdx,%rdi 60: 49 f7 e2 mul %r10 63: 4d 85 c0 test %r8,%r8 66: 79 09 jns 71 <__mulvti3+0x71> 68: 49 89 d0 mov %rdx,%r8 6b: 49 29 c8 sub %rcx,%r8 6e: 4c 89 c2 mov %r8,%rdx 71: 48 85 c9 test %rcx,%rcx 74: 79 06 jns 7c <__mulvti3+0x7c> 76: 4c 29 e0 sub %r12,%rax 79: 4c 19 ea sbb %r13,%rdx 7c: 49 89 fa mov %rdi,%r10 7f: 45 31 db xor %r11d,%r11d 82: 49 01 c2 add %rax,%r10 85: 4c 89 d0 mov %r10,%rax 88: 49 11 d3 adc %rdx,%r11 8b: 48 c1 f8 3f sar $0x3f,%rax 8f: 4c 39 d8 cmp %r11,%rax 92: 0f 85 00 00 00 00 jne … 98: 4c 89 c8 mov %r9,%rax 9b: 4c 89 d2 mov %r10,%rdx 9e: eb 9b jmp 3b <__mulvti3+0x3b> a0: 4c 89 d0 mov %r10,%rax a3: 49 f7 e4 mul %r12 a6: 49 89 c0 mov %rax,%r8 a9: 4c 89 d8 mov %r11,%rax ac: 48 89 d3 mov %rdx,%rbx af: 49 f7 e4 mul %r12 b2: 4d 85 db test %r11,%r11 b5: 79 09 jns c0 <__mulvti3+0xc0> b7: 48 89 d7 mov %rdx,%rdi ba: 4c 29 e7 sub %r12,%rdi bd: 48 89 fa mov %rdi,%rdx c0: 48 85 f6 test %rsi,%rsi c3: 79 06 jns cb <__mulvti3+0xcb> c5: 4c 29 d0 sub %r10,%rax c8: 4c 19 da sbb %r11,%rdx cb: 48 89 de mov %rbx,%rsi ce: 31 ff xor %edi,%edi d0: 48 01 c6 add %rax,%rsi d3: 48 89 f0 mov %rsi,%rax d6: 48 11 d7 adc %rdx,%rdi d9: 48 c1 f8 3f sar $0x3f,%rax dd: 48 39 f8 cmp %rdi,%rax e0: 0f 85 00 00 00 00 jne … e6: 4c 89 c0 mov %r8,%rax e9: 48 89 f2 mov %rsi,%rdx ec: e9 4a ff ff ff jmpq 3b <__mulvti3+0x3b> f1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax) f8: 4d 85 c0 test %r8,%r8 fb: 78 23 js 120 <__mulvti3+0x120> fd: 4d 85 db test %r11,%r11 100: 78 4d js 14f <__mulvti3+0x14f> 102: 4c 09 c7 or %r8,%rdi 105: 0f 85 00 00 00 00 jne … 10b: 4c 89 e0 mov %r12,%rax 10e: 49 f7 e2 mul %r10 111: 48 85 d2 test %rdx,%rdx 114: 0f 89 21 ff ff ff jns 3b <__mulvti3+0x3b> 11a: e9 00 00 00 00 jmpq … 11f: 90 nop 120: 4d 85 db test %r11,%r11 123: 78 57 js 17c <__mulvti3+0x17c> 125: 0f 85 00 00 00 00 jne … 12b: 49 83 f8 ff cmp $0xffffffffffffffff,%r8 12f: 0f 85 00 00 00 00 jne … 135: 4c 89 e0 mov %r12,%rax 138: 49 f7 e2 mul %r10 13b: 49 89 d0 mov %rdx,%r8 13e: 4d 29 d0 sub %r10,%r8 141: 0f 89 00 00 00 00 jns … 147: 4c 89 c2 mov %r8,%rdx 14a: e9 ec fe ff ff jmpq 3b <__mulvti3+0x3b> 14f: 49 83 fb ff cmp $0xffffffffffffffff,%r11 153: 0f 85 00 00 00 00 jne … 159: 4d 85 c0 test %r8,%r8 15c: 0f 85 00 00 00 00 jne … 162: 4c 89 d0 mov %r10,%rax 165: 49 f7 e4 mul %r12 168: 48 89 d7 mov %rdx,%rdi 16b: 4c 29 e7 sub %r12,%rdi 16e: 0f 89 00 00 00 00 jns … 174: 48 89 fa mov %rdi,%rdx 177: e9 bf fe ff ff jmpq 3b <__mulvti3+0x3b> 17c: 4c 21 c7 and %r8,%rdi 17f: 48 83 ff ff cmp $0xffffffffffffffff,%rdi 183: 0f 85 00 00 00 00 jne … 189: 4c 89 d0 mov %r10,%rax 18c: 4c 09 e0 or %r12,%rax 18f: 0f 84 00 00 00 00 je … 195: 4c 89 e0 mov %r12,%rax 198: 49 f7 e2 mul %r10 19b: 4c 29 e2 
sub %r12,%rdx 19e: 49 89 c0 mov %rax,%r8 1a1: 48 89 d6 mov %rdx,%rsi 1a4: 4c 29 d6 sub %r10,%rsi 1a7: 0f 88 00 00 00 00 js … 1ad: e9 34 ff ff ff jmpq e6 <__mulvti3+0xe6> […]Properly written (and almost branch-free) code uses but only 45 instructions in just 119 bytes instead of 130 instructions in 434 bytes:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "mulvti3.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = multiplicand
# rcx:rdx = multiplier
__mulvti3:
mov r10, rdx # r10 = low qword of multiplier
mov r11, rcx # r11 = high qword of multiplier
mov rax, rcx # rax = high qword of multiplier
cqo # rdx = (multiplier < 0) ? -1 : 0
mov r9, rdx # r9 = (multiplier < 0) ? -1 : 0
xor r10, rdx
xor r11, rdx # r11:r10 = (multiplier < 0) ? ~multiplier : multiplier
sub r10, rdx
sbb r11, rdx # r11:r10 = (multiplier < 0) ? -multiplier : multiplier
# = |multiplier|
mov rax, rsi # rax = high qword of multiplicand
cqo # rdx = (multiplicand < 0) ? -1 : 0
xor r9, rdx # r9 = (multiplicand < 0) <> (multiplier < 0) ? -1 : 0
# = (product < 0) ? -1 : 0
xor rdi, rdx
xor rsi, rdx # rsi:rdi = (multiplicand < 0) ? ~multiplicand : multiplicand
sub rdi, rdx
sbb rsi, rdx # rsi:rdi = (multiplicand < 0) ? -multiplicand : multiplicand
# = |multiplicand|
xor ecx, ecx
cmp rcx, rsi
sbb edx, edx # edx = (high qword of |multiplicand| = 0) ? 0 : -1
# = (|multiplicand| < 2**64) ? 0 : -1
cmp rcx, r11
sbb ecx, ecx # ecx = (high qword of |multiplier| = 0) ? 0 : -1
# = (|multiplier| < 2**64) ? 0 : -1
and ecx, edx # ecx = (|multiplicand| < 2**64)
# & (|multiplier| < 2**64) ? 0 : -1
# = (|product| < 2**128) ? 0 : -1
mov rax, rsi
mul r10 # rdx:rax = high qword of |multiplicand|
# * low qword of |multiplier|
adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
mov rsi, rax
mov rax, rdi
mul r11 # rdx:rax = low qword of |multiplicand|
# * high qword of |multiplier|
adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
add rsi, rax # rsi = high qword of |multiplicand|
# * low qword of |multiplier|
# + low qword of |multiplicand|
# * high qword of |multiplier|
# adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
mov rax, rdi
mul r10 # rdx:rax = low qword of |multiplicand|
# * low qword of |multiplier|
add rdx, rsi # rdx:rax = |product % 2**128|
# = |product| % 2**128
adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
.if 0
xor rax, r9
xor rdx, r9 # rdx:rax = (product < 0)
# ? product % 2**128 - 1 : product % 2**128
sub rax, r9
sbb rdx, r9 # rdx:rax = product % 2**128
xor r9, rdx # r9 = (product % 2**128 < -2**127)
# | (product % 2**128 >= 2**127)
# ? negative : positive
add r9, r9
.else
add rax, r9
adc rdx, r9 # rdx:rax = (product < 0)
# ? ~product % 2**128 : product % 2**128
mov rsi, rdx # rsi = (product % 2**128 < -2**127)
# | (product % 2**128 >= 2**127)
# ? negative : positive
xor rax, r9
xor rdx, r9 # rdx:rax = product % 2**128
add rsi, rsi
.endif
adc ecx, ecx # ecx = (-2**127 <= product < 2**127) ? 0 : *
jnz .overflow # product < -2**127?
# product >= 2**127?
ret
.overflow:
int 4
ret
.size __mulvti3, .-__mulvti3
.type __mulvti3, @function
.global __mulvti3
.end
Note: exploration of the equally horrible code
generated for the __mulvdi3()
and
__mulvsi3()
functions is left as an exercise to the
reader.
[…] 00000000 <__mulvdi3>: 0: 55 push %ebp 1: 57 push %edi 2: 56 push %esi 3: 53 push %ebx 4: 83 ec 1c sub $0x1c,%esp 7: 8b 44 24 38 mov 0x38(%esp),%eax b: 8b 54 24 3c mov 0x3c(%esp),%edx f: 8b 7c 24 30 mov 0x30(%esp),%edi 13: 8b 6c 24 34 mov 0x34(%esp),%ebp 17: 89 54 24 0c mov %edx,0xc(%esp) 1b: 89 c2 mov %eax,%edx 1d: 89 c3 mov %eax,%ebx 1f: c1 fa 1f sar $0x1f,%edx 22: 89 44 24 08 mov %eax,0x8(%esp) 26: 89 f9 mov %edi,%ecx 28: 89 d0 mov %edx,%eax 2a: 8b 54 24 0c mov 0xc(%esp),%edx 2e: 89 14 24 mov %edx,(%esp) 31: 89 fa mov %edi,%edx 33: c1 fa 1f sar $0x1f,%edx 36: 39 ea cmp %ebp,%edx 38: 75 16 jne 50 <__mulvdi3+0x50> 3a: 3b 04 24 cmp (%esp),%eax 3d: 75 61 jne a0 <__mulvdi3+0xa0> 3f: 89 f8 mov %edi,%eax 41: f7 eb imul %ebx 43: 83 c4 1c add $0x1c,%esp 46: 5b pop %ebx 47: 5e pop %esi 48: 5f pop %edi 49: 5d pop %ebp 4a: c3 ret 4b: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi 4f: 90 nop 50: 89 ee mov %ebp,%esi 52: 3b 04 24 cmp (%esp),%eax 55: 0f 85 95 00 00 00 jne f0 <__mulvdi3+0xf0> 5b: 89 f8 mov %edi,%eax 5d: f7 e3 mul %ebx 5f: 89 04 24 mov %eax,(%esp) 62: 89 44 24 08 mov %eax,0x8(%esp) 66: 89 e8 mov %ebp,%eax 68: 89 54 24 04 mov %edx,0x4(%esp) 6c: f7 e3 mul %ebx 6e: 85 ed test %ebp,%ebp 70: 79 06 jns 78 <__mulvdi3+0x78> 72: 89 d6 mov %edx,%esi 74: 29 de sub %ebx,%esi 76: 89 f2 mov %esi,%edx 78: 85 db test %ebx,%ebx 7a: 79 04 jns 80 <__mulvdi3+0x80> 7c: 29 f8 sub %edi,%eax 7e: 19 ea sbb %ebp,%edx 80: 8b 4c 24 04 mov 0x4(%esp),%ecx 84: 31 db xor %ebx,%ebx 86: 01 c1 add %eax,%ecx 88: 89 c8 mov %ecx,%eax 8a: 11 d3 adc %edx,%ebx 8c: c1 f8 1f sar $0x1f,%eax 8f: 39 d8 cmp %ebx,%eax 91: 0f 85 00 00 00 00 jne … 97: 8b 44 24 08 mov 0x8(%esp),%eax 9b: 89 ca mov %ecx,%edx 9d: eb a4 jmp 43 <__mulvdi3+0x43> 9f: 90 nop a0: 89 d8 mov %ebx,%eax a2: f7 e7 mul %edi a4: 8b 3c 24 mov (%esp),%edi a7: 89 c6 mov %eax,%esi a9: 8b 04 24 mov (%esp),%eax ac: 89 d5 mov %edx,%ebp ae: f7 e1 mul %ecx b0: 89 c3 mov %eax,%ebx b2: 85 ff test %edi,%edi b4: 79 0c jns c2 <__mulvdi3+0xc2> b6: 89 d0 mov %edx,%eax b8: 29 c8 sub %ecx,%eax ba: 89 04 24 mov %eax,(%esp) bd: 8b 14 24 mov (%esp),%edx c0: 89 d8 mov %ebx,%eax c2: 85 c9 test %ecx,%ecx c4: 79 08 jns ce <__mulvdi3+0xce> c6: 2b 44 24 08 sub 0x8(%esp),%eax ca: 1b 54 24 0c sbb 0xc(%esp),%edx ce: 89 e9 mov %ebp,%ecx d0: 31 db xor %ebx,%ebx d2: 01 c1 add %eax,%ecx d4: 89 c8 mov %ecx,%eax d6: 11 d3 adc %edx,%ebx d8: c1 f8 1f sar $0x1f,%eax db: 39 d8 cmp %ebx,%eax dd: 0f 85 00 00 00 00 jne … e3: 89 f0 mov %esi,%eax e5: 89 ca mov %ecx,%edx e7: e9 57 ff ff ff jmp 43 <__mulvdi3+0x43> ec: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi f0: 85 ed test %ebp,%ebp f2: 78 24 js 118 <__mulvdi3+0x118> f4: 8b 04 24 mov (%esp),%eax f7: 85 c0 test %eax,%eax f9: 78 45 js 140 <__mulvdi3+0x140> fb: 09 e8 or %ebp,%eax fd: 0f 85 00 00 00 00 jne … 103: 89 f8 mov %edi,%eax 105: f7 e3 mul %ebx 107: 85 d2 test %edx,%edx 109: 0f 89 34 ff ff ff jns 43 <__mulvdi3+0x43> 10f: e9 00 00 00 00 jmp … 114: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi 118: 8b 04 24 mov (%esp),%eax 11b: 85 c0 test %eax,%eax 11d: 78 48 js 167 <__mulvdi3+0x167> 11f: 0f 85 00 00 00 00 jne … 125: 83 fd ff cmp $0xffffffff,%ebp 128: 0f 85 00 00 00 00 jne … 12e: 89 f8 mov %edi,%eax 130: f7 e3 mul %ebx 132: 89 d1 mov %edx,%ecx 134: 89 c6 mov %eax,%esi 136: 29 d9 sub %ebx,%ecx 138: 78 a9 js e3 <__mulvdi3+0xe3> 13a: e9 00 00 00 00 jmp … 13f: 90 nop 140: 83 3c 24 ff cmpl $0xffffffff,(%esp) 144: 0f 85 00 00 00 00 jne … 14a: 85 ed test %ebp,%ebp 14c: 0f 85 00 00 00 00 jne … 152: 89 d8 mov %ebx,%eax 154: f7 e7 mul %edi 156: 89 d3 mov %edx,%ebx 158: 29 cb sub %ecx,%ebx 
15a: 0f 89 00 00 00 00 jns … 160: 89 da mov %ebx,%edx 162: e9 dc fe ff ff jmp 43 <__mulvdi3+0x43> 167: 23 34 24 and (%esp),%esi 16a: 83 fe ff cmp $0xffffffff,%esi 16d: 0f 85 00 00 00 00 jne … 173: 89 d8 mov %ebx,%eax 175: 09 f8 or %edi,%eax 177: 0f 84 00 00 00 00 je … 17d: 89 f8 mov %edi,%eax 17f: f7 e3 mul %ebx 181: 89 c6 mov %eax,%esi 183: 89 d0 mov %edx,%eax 185: 29 c8 sub %ecx,%eax 187: 29 d8 sub %ebx,%eax 189: 89 c1 mov %eax,%ecx 18b: 0f 88 00 00 00 00 js … 191: e9 4d ff ff ff jmp e3 <__mulvdi3+0xe3> […]
Properly written (and almost branch-free) code uses but only 55 instructions in either 127 bytes or just 121 bytes instead of 149 instructions in 406 bytes:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "mulvdi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of multiplier
# [esp+12] = low dword of multiplier
# [esp+8] = high dword of multiplicand
# [esp+4] = low dword of multiplicand
__mulvdi3:
push ebx
mov eax, [esp+20] # eax = high dword of multiplier
mov ecx, [esp+16] # ecx = low dword of multiplier
cdq # edx = (multiplier < 0) ? -1 : 0
mov ebx, edx # ebx = (multiplier < 0) ? -1 : 0
xor ecx, edx
xor eax, edx # eax:ecx = (multiplier < 0) ? ~multiplier : multiplier
sub ecx, edx
sbb eax, edx # eax:ecx = (multiplier < 0) ? -multiplier : multiplier
# = |multiplier|
mov [esp+16], ecx
mov [esp+20], eax # multiplier = |multiplier|
mov eax, [esp+12] # eax = high dword of multiplicand
mov ecx, [esp+8] # ecx = low dword of multiplicand
cdq # edx = (multiplicand < 0) ? -1 : 0
xor ebx, edx # ebx = (multiplier < 0) ^ (multiplicand < 0) ? -1 : 0
# = (product < 0) ? -1 : 0
xor ecx, edx
xor eax, edx # eax:ecx = (multiplicand < 0) ? ~multiplicand : multiplicand
sub ecx, edx
sbb eax, edx # eax:ecx = (multiplicand < 0) ? -multiplicand : multiplicand
# = |multiplicand|
mov [esp+8], ecx
mov [esp+12], eax # multiplicand = |multiplicand|
push ebx # [esp] = (product < 0) ? -1 : 0
xor edx, edx # edx = 0
# mov eax, [esp+16] # eax = high dword of |multiplicand|
cmp edx, eax
sbb ebx, ebx # ebx = (high dword of |multiplicand| = 0) ? 0 : -1
# = (|multiplicand| < 2**32) ? 0 : -1
mov ecx, [esp+24] # ecx = high dword of |multiplier|
cmp edx, ecx
sbb edx, edx # edx = (high dword of |multiplier| = 0) ? 0 : -1
# = (|multiplier| < 2**32) ? 0 : -1
and ebx, edx # ebx = (|multiplicand| < 2**32)
# & (|multiplier| < 2**32) ? 0 : -1
# = (|product| < 2**64) ? 0 : -1
.ifdef INTO
shr ebx, 1
into
mov ebx, [esp+20] # ebx = low dword of |multiplier|
mul ebx # edx:eax = high dword of |multiplicand|
# * low dword of |multiplier|
into
.if 0
xchg ecx, eax # ecx = high dword of |multiplicand|
# * low dword of |multiplier|,
# eax = high dword of |multiplier|
mov edx, [esp+12] # edx = low dword of |multiplicand|
.else
mov edx, ecx # edx = high dword of |multiplier|
mov ecx, eax # ecx = high dword of |multiplicand|
# * low dword of |multiplier|
mov eax, [esp+12] # eax = low dword of |multiplicand|
.endif
mul edx # edx:eax = high dword of |multiplier|
# * low dword of |multiplicand|
into
add ecx, eax # ecx = high dword of |multiplicand|
# * low dword of |multiplier|
# + high dword of |multiplier|
# * low dword of |multiplicand|
# sbb eax, eax # eax = (|product| < 2**64) ? 0 : -1
# shr eax, 1
# into
mov eax, [esp+12] # eax = low dword of |multiplicand|
mul ebx # edx:eax = low dword of |multiplicand|
# * low dword of |multiplier|
add edx, ecx # edx:eax = |product % 2**64|
# = |product| % 2**64
sbb ecx, ecx # ecx = (|product| < 2**64) ? 0 : -1
shr ecx, 1
into
pop ebx # ebx = (product < 0) ? -1 : 0
add eax, ebx
adc edx, ebx # edx:eax = (product < 0) ? ~product % 2**64 : product % 2**64
mov ecx, edx
shr ecx, 1
into
xor eax, ebx
xor edx, ebx # edx:eax = product % 2**64
pop ebx
ret
.else
mov edx, [esp+20] # edx = low dword of |multiplier|
mul edx # edx:eax = high dword of |multiplicand|
# * low dword of |multiplier|
adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : *
.if 0
xchg ecx, eax # ecx = high dword of |multiplicand|
# * low dword of |multiplier|,
# eax = high dword of |multiplier|
mov edx, [esp+12] # edx = low dword of |multiplicand|
.else
mov edx, ecx # edx = high dword of |multiplier|
mov ecx, eax # ecx = high dword of |multiplicand|
# * low dword of |multiplier|
mov eax, [esp+12] # eax = low dword of |multiplicand|
.endif
mul edx # edx:eax = high dword of |multiplier|
# * low dword of |multiplicand|
adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : *
add ecx, eax # ecx = high dword of |multiplicand|
# * low dword of |multiplier|
# + high dword of |multiplier|
# * low dword of |multiplicand|
# adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : *
mov eax, [esp+12] # eax = low dword of |multiplicand|
mov edx, [esp+20] # edx = low dword of |multiplier|
mul edx # edx:eax = low dword of |multiplicand|
# * low dword of |multiplier|
add edx, ecx # edx:eax = |product % 2**64|
# = |product| % 2**64
adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : *
pop ecx # ecx = (product < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx # edx:eax = (product < 0) ? product % 2**64 - 1 : product % 2**64
sub eax, ecx
sbb edx, ecx # edx:eax = product % 2**64
xor ecx, edx # ecx = (multiplicand < 0)
# ^ (multiplier < 0)
# ^ (product < 0) ? negative : positive
add ecx, ecx
adc ebx, ebx # ebx = (-2**63 <= product < 2**63) ? 0 : *
jnz .overflow # product < -2**63?
# product >= 2**63?
pop ebx
ret
.overflow:
int 4
ret
.endif # INTO
.size __mulvdi3, .-__mulvdi3
.type __mulvdi3, @function
.global __mulvdi3
.end
Properly written code uses but only 47 instructions in just 114 bytes instead of 149 instructions in 406 bytes:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "mulvdi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of multiplier
# [esp+12] = low dword of multiplier
# [esp+8] = high dword of multiplicand
# [esp+4] = low dword of multiplicand
__mulvdi3:
push ebx
mov ebx, [esp+16] # ebx = low dword of multiplier
mov eax, [esp+8] # eax = low dword of multiplicand
mul ebx # edx:eax = low dword of multiplicand
# * low dword of multiplier
push eax
mov ecx, edx # ecx = low dword of "inner" product
mov eax, [esp+16] # eax = high dword of multiplicand
mul ebx # edx:eax = high dword of multiplicand
# * low dword of multiplier
xor ebx, ebx # ebx = high dword of "inner" product
add ecx, eax
adc ebx, edx # ebx:ecx = intermediate "inner" product
mov eax, [esp+12] # eax = low dword of multiplicand
mul dword ptr [esp+24]
# edx:eax = low dword of multiplicand
# * high dword of multiplier
add ecx, eax
adc ebx, edx # ebx:ecx = final "inner" product
push ecx
sbb ecx, ecx # ecx = 0 - carry from "inner" product
mov eax, [esp+20] # eax = high dword of multiplicand
mul dword ptr [esp+28]
# edx:eax = high dword of multiplicand
# * high dword of multiplier
neg ecx # ecx = carry from "inner" product
add eax, ebx
adc edx, ecx # edx:eax = high dword of multiplicand
# * high dword of multiplier
# + high dword of "inner" product
# + carry from "inner" product
# = high qword of (unsigned) product"
xor ebx, ebx # ebx = sign of multiplier = 0
cmp ebx, [esp+28]
jle 0f # (high dword of) multiplier >= 0?
not ebx # ebx = 0 - sign of multiplier = -1
sub eax, [esp+16]
sbb edx, [esp+20] # edx:eax = high qword of product"
# - multiplicand
# = high qword of product'
0:
xor ecx, ecx
cmp ecx, [esp+20]
jle 1f # (high dword of) multiplicand >= 0?
not ebx # ebx = (0 - sign of multiplier)
# ^ (0 - sign of multiplicand)
sub eax, [esp+24]
sbb edx, [esp+28] # edx:eax = high qword of product'
# - multiplier
# = high qword of (signed) product
1:
xor eax, ebx
xor edx, ebx # edx:eax = high qword of product
# ^ (0 - sign of multiplier)
# ^ (0 - sign of multiplicand)
or eax, edx # eax = high qword of product
# <> (0 - sign of multiplier)
# ^ (0 - sign of multiplicand)
pop edx # edx = high dword of (low qword of) product
shld ecx, edx, 1 # ecx = sign of product
add ebx, ecx # ebx = sign of product
# <> sign of multiplier
# ^ sign of multiplicand
or eax, ebx
jnz .overflow
pop eax # edx:eax = (low qword of) product
pop ebx
ret
.overflow:
int 4
ret
.size __mulvdi3, .-__mulvdi3
.type __mulvdi3, @function
.global __mulvdi3
.end
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int __mulvsi3(int multiplicand, int multiplier) {
int product;
if (__builtin_mul_overflow(multiplicand, multiplier, &product))
__builtin_trap();
return product;
}
long long __mulvdi3(long long multiplicand, long long multiplier) {
long long product;
if (__builtin_mul_overflow(multiplicand, multiplier, &product))
__builtin_trap();
return product;
}
#ifdef __amd64__
__int128_t __mulvti3(__int128_t multiplicand, __int128_t multiplier) {
__int128_t product;
if (__builtin_mul_overflow(multiplicand, multiplier, &product))
__builtin_trap();
return product;
}
#endif
__negv{s,d,t}i2()
Functions
libgcc
provides the __negvsi2()
,
__negvdi2()
and __negvti2()
functions for
overflow-trapping signed integer negation:
For the AMD64 target platform, GCC compiles
- Runtime Function: int __negvsi2 (int a)
- Runtime Function: long __negvdi2 (long a)
- Runtime Function: long long __negvti2 (long long a)
These functions return the negation of a; that is
-a
.
__negvti2()
to the following unoptimised and
awful code:
[…] 0000000000000000 <__negvti2>: 0: 48 89 f8 mov %rdi,%rax 3: 48 83 ec 08 sub $0x8,%rsp 7: 48 89 f2 mov %rsi,%rdx a: 48 f7 d8 neg %rax d: 48 83 d2 00 adc $0x0,%rdx 11: 48 f7 da neg %rdx 14: 48 85 f6 test %rsi,%rsi 17: 78 17 js 30 <__negvti2+0x30> 19: 31 c9 xor %ecx,%ecx 1b: 48 39 c1 cmp %rax,%rcx 1e: 48 19 d1 sbb %rdx,%rcx 21: 7c 1b jl 3e <__negvti2+0x3e> 23: 48 83 c4 08 add $0x8,%rsp 27: c3 retq 28: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1) 2f: 00 30: 48 85 d2 test %rdx,%rdx 33: 0f 88 00 00 00 00 js … 39: 48 83 c4 08 add $0x8,%rsp 3d: c3 retq 3e: e9 00 00 00 00 jmpq … […]Properly written code uses but only 8 instructions in just 15 bytes instead of 20 instructions in 67 bytes:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "negvti2.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = argument
__negvti2:
xor eax, eax
cdq # rdx:rax = 0
sub rax, rdi
sbb rdx, rsi # rdx:rax = -argument
jo .overflow
ret
.overflow:
int 4
ret
.size __negvti2, .-__negvti2
.type __negvti2, @function
.global __negvti2
.end
Note: exploration of the equally bad code generated
for the __negvdi2()
and __negvsi2()
functions is left as an exercise to the reader.
[…] 00000000 <__negvdi2>: 0: 53 push %ebx 1: 83 ec 08 sub $0x8,%esp 4: 8b 4c 24 10 mov 0x10(%esp),%ecx 8: 8b 5c 24 14 mov 0x14(%esp),%ebx c: 89 c8 mov %ecx,%eax e: 89 da mov %ebx,%edx 10: f7 d8 neg %eax 12: 83 d2 00 adc $0x0,%edx 15: f7 da neg %edx 17: 85 db test %ebx,%ebx 19: 78 15 js 30 <__negvdi2+0x30> 1b: 31 c9 xor %ecx,%ecx 1d: 39 c1 cmp %eax,%ecx 1f: 19 d1 sbb %edx,%ecx 21: 7c 1a jl 3d <__negvdi2+0x3d> 23: 83 c4 08 add $0x8,%esp 26: 5b pop %ebx 27: c3 ret 28: 8d b4 26 00 00 00 00 lea 0x0(%esi,%eiz,1),%esi 2f: 90 nop 30: 85 d2 test %edx,%edx 32: 0f 88 00 00 00 00 js … 38: 83 c4 08 add $0x8,%esp 3b: 5b pop %ebx 3c: c3 ret 3d: e9 00 00 00 00 jmp … […]
Properly written branch-free code uses but only 6 instructions in just 13 bytes instead of 26 instructions in 66 bytes:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "negvdi2.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+8] = high dword of argument
# [esp+4] = low dword of argument
__negvdi2:
xor eax, eax
cdq # edx:eax = 0
sub eax, [esp+4]
sbb edx, [esp+8] # edx:eax = -argument
into
ret
.size __negvdi2, .-__negvdi2
.type __negvdi2, @function
.global __negvdi2
.end
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef ALTERNATE
// overflow if negend and negation are negative
int __negvsi2(int negend) {
const int negation = 0 - (unsigned int) negend;
if ((negend & negation) < 0)
__builtin_trap();
return negation;
}
long long __negvdi2(long long negend) {
const long long negation = 0 - (unsigned long long) negend;
if ((negend & negation) < 0)
__builtin_trap();
return negation;
}
#ifdef __amd64__
__int128_t __negvti2(__int128_t negend) {
const __int128_t negation = 0 - (__uint128_t) negend;
if ((negend & negation) < 0)
__builtin_trap();
return negation;
}
#endif
#else
int __negvsi2(int negend) {
int negation;
if (__builtin_sub_overflow(0, negend, &negation))
__builtin_trap();
return negation;
}
long long __negvdi2(long long negend) {
long long negation;
if (__builtin_sub_overflow(0, negend, &negation))
__builtin_trap();
return negation;
}
#ifdef __amd64__
__int128_t __negvti2(__int128_t negend) {
__int128_t negation;
if (__builtin_sub_overflow(0, negend, &negation))
__builtin_trap();
return negation;
}
#endif
#endif // ALTERNATE
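The rule in the comment above singles out exactly one argument: signed negation overflows only for the most negative value, whose negation is not representable. A hypothetical check using <limits.h>:
#include <limits.h>
// both negend and its negation are negative only for INT_MIN
int neg_overflows(int negend) {
    const int negation = (int) (0u - (unsigned int) negend);
    return (negend & negation) < 0;
}
// neg_overflows(INT_MIN) == 1, neg_overflows(-1) == 0, neg_overflows(0) == 0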
__parity{s,d,t}i2()
Functions
libgcc
provides the __paritysi2()
,
__paritydi2()
and __parityti2()
functions:
For the AMD64 target platform, GCC compiles
- Runtime Function: int __paritysi2 (unsigned int a)
- Runtime Function: int __paritydi2 (unsigned long a)
- Runtime Function: int __parityti2 (unsigned long long a)
These functions return the value zero if the number of bits set in a is even, and the value one otherwise.
__parityti2()
to the following unoptimised and
awful code:
[…] 0000000000000000 <__parityti2>: 0: 48 31 f7 xor %rsi,%rdi 3: b8 96 69 00 00 mov $0x6996,%eax 8: 48 89 fe mov %rdi,%rsi b: 48 89 f9 mov %rdi,%rcx e: 48 c1 ee 20 shr $0x20,%rsi 12: 48 31 f1 xor %rsi,%rcx 15: 48 89 ce mov %rcx,%rsi 18: 48 c1 ee 10 shr $0x10,%rsi 1c: 48 31 ce xor %rcx,%rsi 1f: 48 89 f1 mov %rsi,%rcx 22: 48 c1 e9 08 shr $0x8,%rcx 26: 48 31 ce xor %rcx,%rsi 29: 48 89 f1 mov %rsi,%rcx 2c: 48 c1 e9 04 shr $0x4,%rcx 30: 48 31 f1 xor %rsi,%rcx 33: 83 e1 0f and $0xf,%ecx 36: d3 f8 sar %cl,%eax 38: 83 e0 01 and $0x1,%eax 3b: c3 retq […]Properly written code uses but only 9 instructions in just 24 bytes instead of 19 instructions in 60 bytes:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "parityti2.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = argument
__parityti2:
xor rdi, rsi
shld rdx, rdi, 32
xor edx, edi
shld edx, edi, 16
xor edx, edi
xor eax, eax
xor dl, dh
setpo al
ret
.size __parityti2, .-__parityti2
.type __parityti2, @function
.global __parityti2
.end
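The folding strategy used above translates to C as follows (a sketch; the final nibble lookup uses the same 0x6996 magic constant, the 16-bit truth table of the parity function, that appears in the libgcc code shown above):
#include <stdint.h>
// XOR-fold the argument down to a nibble whose parity equals the
// parity of the whole value, then look the result up in 0x6996
int parity_sketch(uint64_t value) {
    value ^= value >> 32;
    value ^= value >> 16;
    value ^= value >> 8;
    value ^= value >> 4;
    return (0x6996 >> (value & 15)) & 1;
}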
Note: exploration of the equally bad code generated
for the __paritydi2()
and __paritysi2()
functions is left as an exercise to the reader.
[…] 00000000 <__paritydi2>: 0: 8b 44 24 08 mov 0x8(%esp),%eax 4: 33 44 24 04 xor 0x4(%esp),%eax 8: 89 c1 mov %eax,%ecx a: c1 e8 10 shr $0x10,%eax d: 31 c8 xor %ecx,%eax f: 89 c1 mov %eax,%ecx 11: c1 e9 08 shr $0x8,%ecx 14: 31 c8 xor %ecx,%eax 16: 89 c1 mov %eax,%ecx 18: c1 e9 04 shr $0x4,%ecx 1b: 31 c1 xor %eax,%ecx 1d: b8 96 69 00 00 mov $0x6996,%eax 22: 83 e1 0f and $0xf,%ecx 25: d3 f8 sar %cl,%eax 27: 83 e0 01 and $0x1,%eax 2a: c3 ret […]
__builtin_parity()
Builtin
Create the source file case16.c
with the following
content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int __paritysi2(unsigned int value) {
return __builtin_parity(value);
}
int __paritydi2(unsigned long long value) {
return __builtin_parityll(value);
}
Compile the source file case16.c
with
GCC, engaging its
optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case16.c
[…] __paritysi2: movl 4(%esp), %eax movl %eax, %edx shrl $16, %eax xorl %eax, %edx xorl %eax, %eax xorb %dh, %dl setnp %al ret […] __paritydi2: pushl %ebx movl 8(%esp), %eax movl 12(%esp), %edx movl %edx, %ebx xorl %eax, %ebx movl %ebx, %edx shrl $16, %ebx xorl %ebx, %edx xorl %eax, %eax xorb %dh, %dl setnp %al popl %ebx ret […] .ident "GCC: (GNU […]) 10.2.0"Oops: the code for the
__paritydi2()
function clobbers register EBX
instead of the volatile
register ECX
, thus wasting a
PUSH
and a POP
instruction which perform 2 superfluous memory accesses.
__builtin_mul_overflow()
Builtin
Create the source file case17.c
with the following
content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef OPTIMIZE
__int128_t __mulvti3(__int128_t multiplicand, __int128_t multiplier) {
__int128_t product;
if (__builtin_mul_overflow(multiplicand, multiplier, &product))
__builtin_trap();
return product;
}
__uint128_t __umulvti3(__uint128_t multiplicand, __uint128_t multiplier) {
__uint128_t product;
if (__builtin_mul_overflow(multiplicand, multiplier, &product))
__builtin_trap();
return product;
}
#else
__int128_t __mulvti3(__int128_t multiplicand, __int128_t multiplier) {
__uint128_t product, sign = 0 - (multiplicand < 0), tmp = 0 - (multiplier < 0);
if (__builtin_mul_overflow((multiplicand ^ sign) - sign,
(multiplier ^ tmp) - tmp,
&product))
__builtin_trap();
sign ^= tmp;
product = (product ^ sign) - sign;
if ((__int128_t) (product ^ sign) < 0)
__builtin_trap();
return product;
}
__uint128_t __umulvti3(__uint128_t multiplicand, __uint128_t multiplier) {
union {
__uint128_t all;
struct {
unsigned long long low, high;
};
} product;
unsigned long long tmp;
product.all = (multiplicand & ~0ULL) * (multiplier & ~0ULL);
if ((((multiplicand >> 64) != 0) && ((multiplier >> 64) != 0))
|| __builtin_umulll_overflow(multiplicand & ~0ULL, multiplier >> 64, &tmp)
|| __builtin_uaddll_overflow(tmp, product.high, &product.high)
|| __builtin_umulll_overflow(multiplicand >> 64, multiplier & ~0ULL, &tmp)
|| __builtin_uaddll_overflow(tmp, product.high, &product.high))
__builtin_trap();
return product.all;
}
#endif // OPTIMIZE
Compile the source file case17.c
with
GCC, engaging its
optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -o- -Os -S case17.c
[…] __mulvti3: pushq %rbp movq %rsi, %r10 movq %rdx, %rsi movq %rdi, %rdx movq %rsi, %rax sarq $63, %rdx movq %rcx, %r11 sarq $63, %rax movq %rsp, %rbp pushq %r15 pushq %r14 xorl %r14d, %r14d pushq %r13 pushq %r12 pushq %rbx cmpq %r10, %rdx jne .L4 cmpq %rcx, %rax jne .L5 movq %rdi, %rax imulq %rsi movq %rax, %rbx movq %rdx, %r8 jmp .L2 .L5: movq %rsi, %rcx movq %r11, %rbx movq %r11, %r8 movq %rdi, %r15 jmp .L6 .L4: cmpq %rcx, %rax jne .L7 movq %rdi, %rcx movq %r10, %rbx movq %r10, %r8 movq %rsi, %r15 .L6: movq %rdi, %rax mulq %rsi movq %rax, %r12 movq %r15, %rax movq %rdx, %r13 mulq %r8 testq %r8, %r8 jns .L8 xorl %r8d, %r8d subq %r8, %rax sbbq %r15, %rdx .L8: testq %r15, %r15 jns .L9 subq %rcx, %rax sbbq %rbx, %rdx .L9: movq %r13, %r8 xorl %r9d, %r9d movq %r12, %rbx addq %rax, %r8 adcq %rdx, %r9 movq %r8, %rdx sarq $63, %rdx cmpq %r9, %rdx je .L2 imulq %rsi, %r10 movq %rdi, %rax imulq %rdi, %r11 mulq %rsi leaq (%r10,%r11), %r8 addq %rdx, %r8 movq %rax, %rbx jmp .L3 .L7: movq %r10, %r8 movq %rcx, %rdx movq %rdi, %rax imulq %rdi, %rdx imulq %rsi, %r8 addq %rdx, %r8 mulq %rsi addq %rdx, %r8 leaq 1(%r10), %rdx movq %rax, %rbx cmpq $1, %rdx ja .L3 leaq 1(%rcx), %rdx cmpq $1, %rdx ja .L3 cmpq %rcx, %r10 jne .L11 cmpq %rbx, %r14 movq %r14, %rax sbbq %r8, %rax jl .L2 jmp .L3 .L11: testq %r8, %r8 js .L2 .L3: movl $1, %r14d .L2: testq %r14, %r14 je .L1 ud2 .L1: movq %rbx, %rax movq %r8, %rdx popq %rbx popq %r12 popq %r13 popq %r14 popq %r15 popq %rbp ret […] __umulvti3: pushq %rbp movq %rsi, %r10 movq %rcx, %r11 movq %rdx, %rsi movq %rsp, %rbp pushq %r12 xorl %r12d, %r12d pushq %rbx testq %r10, %r10 jne .L18 testq %rcx, %rcx jne .L19 movq %rdi, %rax mulq %rdx movq %rax, %rbx movq %rdx, %r8 jmp .L16 .L19: movq %rcx, %r8 movq %rdi, %r9 jmp .L20 .L18: testq %rcx, %rcx jne .L22 movq %r10, %r8 movq %rdx, %r9 .L20: movq %rdi, %rax mulq %rsi movq %rax, %rcx movq %r9, %rax movq %rdx, %rbx xorl %r9d, %r9d mulq %r8 movq %rbx, %r8 movq %rcx, %rbx addq %rax, %r8 adcq %rdx, %r9 testq %r9, %r9 je .L16 .L22: imulq %rsi, %r10 movq %rdi, %rax movl $1, %r12d imulq %rdi, %r11 mulq %rsi leaq (%r10,%r11), %r8 addq %rdx, %r8 movq %rax, %rbx .L16: testq %r12, %r12 je .L15 ud2 .L15: movq %rbx, %rax movq %r8, %rdx popq %rbx popq %r12 popq %rbp ret […] .ident "GCC: (GNU […]) 10.2.0"OUCH: 104 instructions in 295 bytes for the
__mulvti3()
function – still horrible, but not as
bad as the
__mulvti3()
function provided in libgcc
any more!
Ouch: 54 instructions in 148 bytes for the
__umulvti3()
function – also rather terrible!
Compile the source file case17.c
a second time with
GCC, now with the
preprocessor macro OPTIMIZE
defined, and display the
generated assembly code:
gcc -DOPTIMIZE -m64 -o- -Os -S case17.c
[…] __mulvti3: pushq %rbp movq %rcx, %rax movq %rdi, %r10 movq %rsi, %r11 movq %rax, %rdi sarq $63, %rdi movq %rsp, %rbp pushq %r14 movq %rdi, %r14 pushq %r13 pushq %r12 pushq %rbx movq %rsi, %rbx movslq %edi, %rsi sarq $63, %rbx movslq %ebx, %rcx movq %rbx, %r12 movq %rcx, %r8 sarq $63, %rcx xorq %r8, %r10 movq %rcx, %r9 movq %rsi, %rcx sarq $63, %rsi movq %rsi, %rbx xorq %r9, %r11 movq %r10, %rsi subq %r8, %rsi movq %r11, %rdi sbbq %r9, %rdi xorq %rcx, %rdx xorq %rbx, %rax movq %rdx, %r10 movq %rax, %r11 subq %rcx, %r10 sbbq %rbx, %r11 xorl %r13d, %r13d movq %r11, %r8 testq %rdi, %rdi jne .L4 testq %r11, %r11 jne .L5 movq %rsi, %rax mulq %r10 movq %rax, %rcx movq %rdx, %r8 jmp .L2 .L5: movq %rsi, %r9 jmp .L6 .L4: testq %r11, %r11 jne .L8 movq %rdi, %r8 movq %r10, %r9 .L6: movq %rsi, %rax mulq %r10 movq %rax, %rcx movq %r9, %rax movq %rdx, %rbx xorl %r9d, %r9d mulq %r8 movq %rbx, %r8 addq %rax, %r8 adcq %rdx, %r9 testq %r9, %r9 je .L2 .L8: movq %rdi, %r8 movq %r11, %rax movl $1, %r13d imulq %rsi, %rax imulq %r10, %r8 addq %rax, %r8 movq %rsi, %rax mulq %r10 addq %rdx, %r8 movq %rax, %rcx .L2: testq %r13, %r13 je .L9 .L10: ud2 .L9: xorl %r14d, %r12d movslq %r12d, %r12 movq %r12, %rsi sarq $63, %r12 xorq %rsi, %rcx xorq %r12, %r8 movq %r12, %rbx movq %rcx, %rax movq %r8, %rdx subq %rsi, %rax sbbq %r12, %rdx xorq %rdx, %rbx js .L10 popq %rbx popq %r12 popq %r13 popq %r14 popq %rbp ret […] __umulvti3: movq %rdi, %rax movq %rdx, %r8 mulq %rdx movq %rdx, %r11 movq %rax, %r9 testq %rsi, %rsi je .L14 testq %rcx, %rcx je .L14 .L17: ud2 .L14: movq %rdi, %rax mulq %rcx movq %rax, %rdi jo .L17 addq %r11, %rdi jc .L17 movq %rsi, %rax mulq %r8 movq %rax, %rsi jo .L17 addq %rdi, %rsi movq %rsi, %rdx jc .L17 movq %r9, %rax ret […] .ident "GCC: (GNU […]) 10.2.0"Oops: the optimised implementation coaxes GCC to generate 96 instructions in 273 bytes for the
__mulvti3()
function – a little shorter and faster than the code generated
for __builtin_mul_overflow()
, but more than twice the
size of the proper assembly code shown in
case 13
above, and less than half as fast!
OOPS: 25 instructions in 66 bytes for the
__umulvti3()
function – also not optimal, but
much better than the rather terrible code generated
for __builtin_mul_overflow()
.
Note: exploration of the equally bad code generated for checked multiplication of signed and unsigned 64-bit integers on the i386 platform is left as an exercise to the reader.
__builtin_sub_overflow() Builtin
Create the text file case18.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
__int128_t __absvti2(__int128_t value) {
__int128_t sign = 0 - (value < 0);
if (__builtin_sub_overflow(value ^ sign, sign, &value))
__builtin_trap();
return value;
}
__int128_t __negvti2(__int128_t negend) {
__int128_t negation;
if (__builtin_sub_overflow(0, negend, &negation))
__builtin_trap();
return negation;
}
__int128_t __subvti3(__int128_t minuend, __int128_t subtrahend) {
__int128_t difference;
if (__builtin_sub_overflow(minuend, subtrahend, &difference))
__builtin_trap();
return difference;
}
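Note: a minimal test harness – an illustrative addition, not part of case18.c – demonstrates the overflow case: negating the most negative 128-bit integer must trap.
// Hedged sketch: main() and the test value are illustrative; the
// conversion from __uint128_t is implementation-defined, but GCC
// defines it as expected.
__int128_t __negvti2(__int128_t);

int main(void) {
    __int128_t minimum = (__int128_t) ((__uint128_t) 1 << 127); // most negative value
    __negvti2(minimum); // -minimum is not representable: __builtin_trap() fires (UD2)
    return 0;
}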
Compile the source file case18.c
with
GCC, engaging its
optimiser, targeting the AMD64 platform, and display
the generated assembly code:
gcc -m64 -o- -Os -S case18.c
[…] __absvti2: movq %rsi, %rax movq %rdi, %r8 movq %rsi, %rcx sarq $63, %rax cltq movq %rax, %rsi sarq $63, %rax movq %rax, %rdi xorq %rsi, %r8 xorq %rdi, %rcx movq %r8, %rax movq %rcx, %rdx subq %rsi, %rax sbbq %rdi, %rdx jno .L1 ud2 .L1: ret […] __negvti2: movq %rdi, %r9 movq %rsi, %r8 movq %rdi, %rsi xorl %edx, %edx movq %r8, %rdi movq %r9, %r8 movl $1, %eax negq %r8 movq %rdi, %r9 adcq $0, %r9 salq $63, %rax negq %r9 cmpq %rax, %rdi jne .L7 xorl %edx, %edx testq %rsi, %rsi sete %dl .L7: testq %rdx, %rdx je .L6 ud2 .L6: movq %r8, %rax movq %r9, %rdx ret […] __subvti3: movq %rsi, %r8 movq %rdi, %rsi movq %rsi, %rax movq %r8, %rdi movq %rdx, %r8 subq %r8, %rax movq %rdi, %rdx sbbq %rcx, %rdx jno .L11 ud2 .L11: ret […] .ident "GCC: (GNU […]) 10.2.0"OUCH: terrible code for the
__absvti2()
and __negvti2()
functions!
OOPS: awful code with 4 superfluous
MOV
instructions in a total of 11
instructions for the __subvti3()
function.
__builtin_copysign() Builtin
Create the text file case19.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
double copysign(double destination, double source) {
return __builtin_copysign(destination, source);
}
Compile the source file case19.c
with
GCC, engaging its
optimiser, targeting the i386 platform, and display
the generated assembly code:
gcc -m32 -o- -O3 -S case19.c
[…] copysign: fldl 12(%esp) fxam fnstsw %ax fstp %st(0) fldl 4(%esp) fabs testb $2, %ah je L1 fchs L1: ret […] .ident "GCC: (GNU […]) 10.2.0"Ouch: 10 instructions in 24 bytes, including a conditional branch which impairs performance!
Properly written code uses only 5 instructions in just 19 bytes, without a conditional branch:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "copysign.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
copysign:
mov eax, [esp+16]
shld [esp+8], eax, 1
ror [esp+8], 1
fld qword ptr [esp+4]
ret
.size copysign, .-copysign
.type copysign, @function
.global copysign
.end
Compile the source file case19.c
a second time with
GCC, now
targeting the AMD64 platform, and display the
generated assembly code:
gcc -m64 -o- -O3 -S case19.c
[…] copysign: pushq %rbp andpd .LC1(%rip), %xmm0 movapd %xmm1, %xmm2 andpd .LC0(%rip), %xmm2 movq %rsp, %rbp orpd %xmm2, %xmm0 popq %rbp ret .section .rdata,"dr" .align 16 .LC0: .long 0 .long -2147483648 .long 0 .long 0 .align 16 .LC1: .long -1 .long 2147483647 .long 0 .long 0 […] .ident "GCC: (GNU […]) 10.2.0"OUCH: 8 instructions in 30 bytes, plus 2 constants in 32 bytes, of which either 4 instructions or (at least) one of the 2 constants are superfluous!
Properly written code uses only 6 instructions in either 28 or just 24 bytes, and performs no memory access:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "copysign.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# xmm0 = destination
# xmm1 = source
copysign:
.ifnotdef OPTIMIZE
mov rax, 0x7FFFFFFFFFFFFFFF
movd xmm2, rax
andpd xmm0, xmm2
andnpd xmm2, xmm1
orpd xmm0, xmm2
.else
movd rcx, xmm0
movd rdx, xmm1
shld rcx, rdx, 1
ror rcx, 1
movd xmm0, rcx
.endif
ret
.size copysign, .-copysign
.type copysign, @function
.global copysign
.end
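The ANDPD/ANDNPD/ORPD variant above implements the classic sign-transfer bit twiddling; a hedged C sketch of the same operation (the helper name and the use of memcpy() for type punning are illustrative, not part of case19.c):
#include <stdint.h>
#include <string.h>

double copysign_sketch(double destination, double source) {
    uint64_t d, s;
    memcpy(&d, &destination, sizeof d); // type-pun without undefined behaviour
    memcpy(&s, &source, sizeof s);
    d = (d & 0x7FFFFFFFFFFFFFFFull)     // magnitude of destination
      | (s & 0x8000000000000000ull);    // sign of source
    memcpy(&destination, &d, sizeof d);
    return destination;
}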
Note: exploration of the equally bad code generated
for the copysignf()
and copysignl()
functions is left as an exercise to the reader.
-fsanitize=signed-integer-overflow Command Line Option
Create the text file case20.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int si2(int v) {
return v << 31;
}
long long di2(long long v) {
return v << 63;
}
#ifdef __amd64__
__int128_t ti2(__int128_t v) {
return v << 127;
}
#endif
Compile the source file case20.c
with
GCC, enabling
signed overflow detection, engaging its optimiser, targeting the
AMD64 platform, and display the generated assembly
code:
gcc -fsanitize=signed-integer-overflow -m64 -o- -Os -S case20.c
[…] si2: movl %edi, %eax sall $31, %eax ret […] di2: movq %rdi, %rax salq $63, %rax ret […] ti2: movq %rdi, %rdx xorl %eax, %eax salq $63, %rdx ret […] .ident "GCC: (GNU […]) 10.2.0"OUCH: GCC fails to generate code for integer overflow detection – which occurs for arguments not equal 0 here!
Note: exploration of this bug on the i386 platform is left as an exercise to the reader.
gcc -fsanitize=signed-integer-overflow -m32 -o- -Os -S case20.c
[…] _si2: movl 4(%esp), %eax sall $31, %eax ret […] _di2: movl 4(%esp), %edx xorl %eax, %eax sall $31, %edx ret […] .ident "GCC: (GNU […]) 10.2.0"
-ftrapv Command Line Option
Create the text file case21.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int si3(int v, int w) {
volatile int p = v * w, q = v / w, r = v % w, s = v + w, t = v - w, u = -v;
return v < 0 ? -v : v;
}
long long di3(long long v, long long w) {
volatile long long p = v * w, q = v / w, r = v % w, s = v + w, t = v - w, u = -v;
return v < 0 ? -v : v;
}
#ifdef __amd64__
__int128_t ti3(__int128_t v, __int128_t w) {
volatile __int128_t p = v * w, q = v / w, r = v % w, s = v + w, t = v - w, u = -v;
return v < 0 ? -v : v;
}
#endif
Compile the source file case21.c
with
GCC, enabling
overflow-trapping, engaging its optimiser, targeting the
AMD64 platform, and display the generated assembly
code:
gcc -ftrapv -m64 -o- -Os -S case21.c
[…] .globl __mulvsi3 .globl __addvsi3 .globl __subvsi3 .globl __negvsi2 .globl si3 .type si3, @function si3: pushq %rbp movl %esi, %ebp pushq %rbx movl %edi, %ebx subq $40, %rsp call __mulvsi3 movl %ebp, %esi movl %ebx, %edi movl %eax, 8(%rsp) movl %ebx, %eax cltd idivl %ebp movl %eax, 12(%rsp) movl %edx, 16(%rsp) call __addvsi3 movl %ebp, %esi movl %ebx, %edi movl %eax, 20(%rsp) call __subvsi3 movl %ebx, %edi movl %eax, 24(%rsp) call __negvsi2 movl %ebx, %esi movl %ebx, %edi sarl $31, %esi movl %eax, 28(%rsp) xorl %esi, %edi call __subvsi3 addq $40, %rsp popq %rbx popq %rbp ret […] .globl __mulvdi3 .globl __addvdi3 .globl __subvdi3 .globl __negvdi2 .globl di3 .type di3, @function di3: pushq %rbp movq %rsi, %rbp pushq %rbx movq %rdi, %rbx subq $56, %rsp call __mulvdi3 movq %rbp, %rsi movq %rbx, %rdi movq %rax, (%rsp) movq %rbx, %rax cqto idivq %rbp movq %rax, 8(%rsp) movq %rdx, 16(%rsp) call __addvdi3 movq %rbp, %rsi movq %rbx, %rdi movq %rax, 24(%rsp) call __subvdi3 movq %rbx, %rdi movq %rax, 32(%rsp) call __negvdi2 movq %rbx, %rsi movq %rbx, %rdi sarq $63, %rsi movq %rax, 40(%rsp) xorq %rsi, %rdi call __subvdi3 addq $56, %rsp popq %rbx popq %rbp ret […] .globl __mulvti3 .globl __divti3 .globl __modti3 .globl __addvti3 .globl __subvti3 .globl __negvti2 .globl ti3 .type ti3, @function ti3: pushq %r14 movq %rcx, %r14 pushq %r13 movq %rdx, %r13 pushq %rbp movq %rdi, %rbp pushq %rbx movq %rsi, %rbx subq $104, %rsp call __mulvti3 movq %r14, %rcx movq %rbp, %rdi movq %rbx, %rsi movq %rax, (%rsp) movq %rdx, 8(%rsp) movq %r13, %rdx call __divti3 movq %r14, %rcx movq %rbp, %rdi movq %rbx, %rsi movq %rax, 16(%rsp) movq %rdx, 24(%rsp) movq %r13, %rdx call __modti3 movq %r14, %rcx movq %rbp, %rdi movq %rbx, %rsi movq %rax, 32(%rsp) movq %rdx, 40(%rsp) movq %r13, %rdx call __addvti3 movq %r14, %rcx movq %rbp, %rdi movq %rbx, %rsi movq %rax, 48(%rsp) movq %rdx, 56(%rsp) movq %r13, %rdx call __subvti3 movq %rbp, %rdi movq %rbx, %rsi movq %rax, 64(%rsp) movq %rdx, 72(%rsp) call __negvti2 movq %rbx, %rcx movq %rbp, %rdi sarq $63, %rcx movq %rax, 80(%rsp) xorq %rcx, %rbx movq %rdx, 88(%rsp) xorq %rcx, %rdi movq %rcx, %rdx movq %rbx, %rsi call __subvti3 addq $104, %rsp popq %rbx popq %rbp popq %r13 popq %r14 ret […] .ident "GCC: (GNU […]) 10.2.0"Oops: instead to generate a single call of the library function
__divmodti4()
which returns both
quotient and remainder,
GCC generates
separate calls of the library functions __divti3()
and __modti3()
, doubling the execution time.
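A hedged sketch of the combined call (the prototype is an assumption here, mirroring the __udivmoddi4() interface shown at the beginning of this article; whether libgcc exports __divmodti4() on a given target must be verified):
// Assumed prototype, following the __udivmoddi4() pattern.
__int128_t __divmodti4(__int128_t a, __int128_t b, __int128_t *c);

void div_and_mod(__int128_t v, __int128_t w,
                 __int128_t *quotient, __int128_t *remainder) {
    *quotient = __divmodti4(v, w, remainder); // one call yields both results
}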
Ouch: instead of generating the trivial code for overflow-trapping addition, negation and subtraction inline, GCC generates calls of the library functions __addv{s,d,t}i3(), __mulv{s,d,t}i3(), __negv{s,d,t}i2() and __subv{s,d,t}i3(), increasing the execution time by about an order of magnitude!
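The trivial expansion is a two-instruction affair; GCC already performs it for the overflow builtins, as case 18 above demonstrates for subtraction. A hedged sketch of what -ftrapv could emit inline for addition:
// Hedged sketch: overflow-trapping addition without a library call –
// this compiles to an ADD followed by a jump-on-overflow to UD2.
int addv_inline(int augend, int addend) {
    int sum;
    if (__builtin_add_overflow(augend, addend, &sum))
        __builtin_trap();
    return sum;
}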
Note: exploration of the equally bad code generated for overflow-trapping integer arithmetic of 64-bit integers on the i386 platform is left as an exercise to the reader.
gcc -ftrapv -m32 -o- -Os -S case21.c
[…] _si3: pushl %esi pushl %ebx subl $52, %esp movl 64(%esp), %ebx movl 68(%esp), %esi movl %ebx, (%esp) movl %esi, 4(%esp) call ___mulvsi3 movl %esi, 4(%esp) movl %eax, 24(%esp) movl %ebx, %eax cltd movl %ebx, (%esp) idivl %esi movl %eax, 28(%esp) movl %edx, 32(%esp) call ___addvsi3 movl %esi, 4(%esp) movl %ebx, (%esp) movl %eax, 36(%esp) call ___subvsi3 movl %ebx, (%esp) movl %eax, 40(%esp) call ___negvsi2 movl %eax, 44(%esp) movl %ebx, %eax sarl $31, %eax xorl %eax, %ebx movl %eax, 4(%esp) movl %ebx, (%esp) call ___subvsi3 addl $52, %esp popl %ebx popl %esi ret […] _di3: pushl %ebp pushl %edi pushl %esi pushl %ebx subl $76, %esp movl 100(%esp), %ebx movl 96(%esp), %ebp movl 104(%esp), %esi movl 108(%esp), %edi movl %ebp, (%esp) movl %esi, 8(%esp) movl %edi, 12(%esp) movl %ebx, 4(%esp) call ___mulvdi3 movl %esi, 8(%esp) movl %edi, 12(%esp) movl %ebp, (%esp) movl %ebx, 4(%esp) movl %eax, 16(%esp) movl %edx, 20(%esp) call ___divdi3 movl %esi, 8(%esp) movl %edi, 12(%esp) movl %ebp, (%esp) movl %ebx, 4(%esp) movl %eax, 24(%esp) movl %edx, 28(%esp) call ___moddi3 movl %esi, 8(%esp) movl %edi, 12(%esp) movl %ebp, (%esp) movl %ebx, 4(%esp) movl %eax, 32(%esp) movl %edx, 36(%esp) call ___addvdi3 movl %esi, 8(%esp) movl %edi, 12(%esp) movl %ebp, (%esp) movl %ebx, 4(%esp) movl %eax, 40(%esp) movl %edx, 44(%esp) call ___subvdi3 movl %ebp, (%esp) movl %ebx, 4(%esp) movl %eax, 48(%esp) movl %edx, 52(%esp) call ___negvdi2 movl %eax, 56(%esp) movl %ebx, %eax sarl $31, %eax movl %edx, 60(%esp) xorl %eax, %ebp xorl %eax, %ebx movl %eax, 8(%esp) movl %ebp, (%esp) movl %ebx, 4(%esp) movl %eax, 12(%esp) call ___subvdi3 addl $76, %esp popl %ebx popl %esi popl %edi popl %ebp ret […] .ident "GCC: (GNU […]) 10.2.0"
Shell Game, or
__mulvti3()
function and the code generated for it:
/* More subroutines needed by GCC output code on some machines. */
/* Compile this one with gcc. */
/* Copyright (C) 1989-2024 Free Software Foundation, Inc.
[…]
typedef union {
DWtype ll;
struct {
Wtype low, high;
} s;
} DWunion;
[…]
DWtype
__mulvDI3 (DWtype u, DWtype v)
{
/* The unchecked multiplication needs 3 Wtype x Wtype multiplications,
but the checked multiplication needs only two. */
const DWunion uu = {.ll = u};
const DWunion vv = {.ll = v};
if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1))
{
/* u fits in a single Wtype. */
if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), 1))
{
/* v fits in a single Wtype as well. */
/* A single multiplication. No overflow risk. */
return (DWtype) uu.s.low * (DWtype) vv.s.low;
}
[…]
[…]
0000000000000000 <__mulvti3>:
0: 41 55 push %r13
2: 49 89 cb mov %rcx,%r11
5: 48 89 d0 mov %rdx,%rax
8: 49 89 d2 mov %rdx,%r10
b: 41 54 push %r12
d: 49 89 fc mov %rdi,%r12
10: 48 89 d1 mov %rdx,%rcx
13: 49 89 f0 mov %rsi,%r8
16: 4c 89 e2 mov %r12,%rdx
19: 49 89 f5 mov %rsi,%r13
1c: 53 push %rbx
1d: 48 89 fe mov %rdi,%rsi
20: 48 c1 fa 3f sar $0x3f,%rdx
24: 48 c1 f8 3f sar $0x3f,%rax
28: 4c 89 df mov %r11,%rdi
2b: 4c 39 c2 cmp %r8,%rdx
2e: 75 18 jne 48 <__mulvti3+0x48>
30: 4c 39 d8 cmp %r11,%rax
33: 75 6b jne a0 <__mulvti3+0xa0>
35: 4c 89 e0 mov %r12,%rax
38: 49 f7 ea imul %r10
3b: 5b pop %rbx
3c: 41 5c pop %r12
3e: 41 5d pop %r13
40: c3 retq
[…]
.ident "GCC: (GNU […]) 10.2.0"
OUCH: 25 instructions in 65 bytes, including 8
superfluous MOV
instructions shuffling registers, clobbering the non-volatile
registers RBX
, R12
and R13
without necessity, and performing 6 superfluous memory accesses!
Properly written and straightforward code uses only 11 instructions in just 31 bytes, clobbers no non-volatile registers, and performs no memory accesses at all:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "mulvti3.s"
.arch generic64
.code64
.intel_syntax noprefix
.global __mulvti3
.type __mulvti3, @function
.text
# rsi:rdi = multiplicand
# rcx:rdx = multiplier
__mulvti3:
mov r8, rdi
mov r9, rdx
sar r8, 63
sar r9, 63
cmp r8, rsi
jne __mulvti3+0x48
cmp r9, rcx
jne __mulvti3+0xa0
mov rax, rdi
imul rdx
ret
[…]
Undefined Behaviour in __bswapsi2() Function
The function __bswapsi2() defined in libgcc2.c exhibits undefined behaviour in its last line:
/* More subroutines needed by GCC output code on some machines. */
/* Compile this one with gcc. */
/* Copyright (C) 1989-2024 Free Software Foundation, Inc.
[…]
typedef int SItype __attribute__ ((mode (SI)));
[…]
SItype
__bswapsi2 (SItype u)
{
return ((((u) & 0xff000000) >> 24)
| (((u) & 0x00ff0000) >> 8)
| (((u) & 0x0000ff00) << 8)
| (((u) & 0x000000ff) << 24));
}
[…]
3.4.3 Undefined behavior
[…]
EXAMPLE An example of undefined behavior is the behavior on integer overflow.
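The last line shifts a signed int value of up to 0xFF into the sign bit. A conforming rewrite is trivial – a hedged sketch that performs all shifts on an unsigned type, so that no shift can overflow:
/* Hedged sketch: the same byte swap, free of signed-overflow UB; the
   conversion back to SItype is implementation-defined, but GCC defines it. */
SItype
__bswapsi2 (SItype u)
{
  unsigned int v = (unsigned int) u;
  return (SItype) ((v >> 24)
                 | ((v >> 8) & 0x0000ff00u)
                 | ((v << 8) & 0x00ff0000u)
                 | (v << 24));
}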
Undefined Behaviour or Optimiser Failure?
Create the text file case24.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int __absusi2(int value) {
if (value < 0)
value = -value;
if (value < 0)
__builtin_trap();
return value;
}
int __absvsi2(int value) {
const int sign = value >> 31;
value += sign;
value ^= sign;
if (value < 0)
__builtin_trap();
return value;
}
int __abswsi2(int value) {
const int sign = 0 - (value < 0);
value ^= sign;
value -= sign;
if (value < 0)
__builtin_trap();
return value;
}
long long __absudi2(long long value) {
if (value < 0)
value = -value;
if (value < 0)
__builtin_trap();
return value;
}
long long __absvdi2(long long value) {
const long long sign = value >> 63;
value += sign;
value ^= sign;
if (value < 0)
__builtin_trap();
return value;
}
long long __abswdi2(long long value) {
const long long sign = 0 - (value < 0);
value ^= sign;
value -= sign;
if (value < 0)
__builtin_trap();
return value;
}
#ifdef __amd64__
__int128_t __absuti2(__int128_t value) {
if (value < 0)
value = -value;
if (value < 0)
__builtin_trap();
return value;
}
__int128_t __absvti2(__int128_t value) {
const __int128_t sign = value >> 127;
value += sign;
value ^= sign;
if (value < 0)
__builtin_trap();
return value;
}
__int128_t __abswti2(__int128_t value) {
const __int128_t sign = 0 - (value < 0);
value ^= sign;
value -= sign;
if (value < 0)
__builtin_trap();
return value;
}
#endif // __amd64__
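A worked example shows why the two branch-free idioms are equivalent (a hedged illustration; the arithmetic right shift of a negative value is implementation-defined, but GCC defines it as arithmetic):
#include <assert.h>
int main(void) {
    int value = -5, sign = value >> 31;   // sign = -1 (all bits set)
    assert(((value + sign) ^ sign) == 5); // (-6) ^ -1 = ~(-6) = 5
    assert(((value ^ sign) - sign) == 5); // ~(-5) - (-1) = 4 + 1 = 5
    return 0;
}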
Compile the source file case24.c
with
GCC, engaging its
optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -o- -Os -S -Wall -Wextra case24.c
[…] __absusi2: movl %edi, %eax cltd xorl %edx, %eax subl %edx, %eax ret […] __absvsi2: movl %edi, %edx sarl $31, %edx leal (%rdi,%rdx), %eax xorl %edx, %eax jns .L2 ud2 .L2: ret […] __abswsi2: movl %edi, %eax movl %edi, %edx shrl $31, %eax movl %eax, %edi negl %edi xorl %edx, %edi addl %edi, %eax jns .L5 ud2 .L5: ret […] __absudi2: movq %rdi, %rax cqto xorq %rdx, %rax subq %rdx, %rax ret […] __absvdi2: movq %rdi, %rdx sarq $63, %rdx leaq (%rdi,%rdx), %rax xorq %rdx, %rax jns .L8 ud2 .L8: ret […] __abswdi2: movq %rdi, %rax cqto movslq %edx, %rdx xorq %rdx, %rax subq %rdx, %rax jns .L10 ud2 .L10: ret […] __absuti2: movq %rsi, %rax movq %rdi, %r8 movq %rsi, %rcx sarq $63, %rax movq %rax, %rsi xorq %rax, %r8 xorq %rsi, %rcx movq %r8, %rax movq %rcx, %rdx subq %rsi, %rax sbbq %rsi, %rdx ret […] __absvti2: movq %rdi, %rax movq %rsi, %rdi movq %rsi, %rdx sarq $63, %rdi addq %rdi, %rax adcq %rdi, %rdx xorq %rdi, %rax xorq %rdi, %rdx jns .L13 ud2 .L13: ret […] __abswti2: movq %rsi, %rax movq %rdi, %r8 movq %rsi, %rcx sarq $63, %rax cltq movq %rax, %rsi sarq $63, %rax movq %rax, %rdi xorq %rsi, %r8 xorq %rdi, %rcx movq %r8, %rax movq %rcx, %rdx subq %rsi, %rax sbbq %rdi, %rdx testq %rdx, %rdx jns .L15 ud2 .L15: ret […] .ident "GCC: (GNU […]) 10.2.0"Ouch: the
optimiser exploits the undefined behaviour of signed integer overflow and removes the check if (value < 0) __builtin_trap() from the __absusi2(), __absudi2() and __absuti2() functions – without warning, despite the command line options -Wall -Wextra given.
Oops: the optimiser
fails to recognise the
common and well-known alternative implementations used for the
__absvsi2()
, __absvdi2()
and
__absvti2()
plus the __abswsi2()
,
__abswdi2()
and __abswti2()
functions, and
generates unoptimised and awful code especially for the latter.
Note: exploration of the equally bad code generated for the i386 platform is left as an exercise to the reader.
Create the text file case25.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef __amd64__
inline
unsigned long long __rotldi3(unsigned long long m, unsigned n) {
return (m << n) | (m >> (64 - n));
}
inline
unsigned long long __rotrdi3(unsigned long long m, unsigned n) {
return (m >> n) | (m << (64 - n));
}
unsigned long long right63(unsigned long long o) {
return __rotrdi3(o, 63);
}
unsigned long long right33(unsigned long long o) {
return __rotrdi3(o, 33);
}
unsigned long long right31(unsigned long long o) {
return __rotrdi3(o, 31);
}
unsigned long long right1(unsigned long long o) {
return __rotrdi3(o, 1);
}
#else
inline
__uint128_t __rotlti3(__uint128_t m, unsigned n) {
return (m << n) | (m >> (128 - n));
}
inline
__uint128_t __rotrti3(__uint128_t m, unsigned n) {
return (m >> n) | (m << (128 - n));
}
__uint128_t left127(__uint128_t o) {
return __rotlti3(o, 127);
}
__uint128_t left65(__uint128_t o) {
return __rotlti3(o, 65);
}
__uint128_t left63(__uint128_t o) {
return __rotlti3(o, 63);
}
__uint128_t left1(__uint128_t o) {
return __rotlti3(o, 1);
}
#endif // __amd64__
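As a reminder, a rotation is just two shifts merged with OR; a hedged spot check for a 64-bit circular right shift by 1:
#include <assert.h>
int main(void) {
    unsigned long long m = 0x8000000000000001ull;
    // bit 0 moves to bit 63, all other bits move one position right
    assert(((m >> 1) | (m << 63)) == 0xC000000000000000ull);
    return 0;
}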
Compile the source file case25.c with GCC, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case25.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.
[…]
right63:
movl 4(%esp), %eax # mov eax, [esp+4]
movl 8(%esp), %edx # mov edx, [esp+8]
movl %eax, %ecx # bt eax, 31
shldl $1, %edx, %eax # adc edx, edx
shldl $1, %ecx, %edx # adc eax, eax
ret # ret
[…]
right33:
movl 4(%esp), %eax # mov edx, [esp+4]
movl 8(%esp), %edx # mov eax, [esp+8]
movl %eax, %ecx # bt edx, 0
shldl $31, %edx, %eax # rcr eax, 1
shldl $31, %ecx, %edx # rcr edx, 1
ret # ret
[…]
right31:
movl 4(%esp), %eax # mov edx, [esp+4]
movl 8(%esp), %edx # mov eax, [esp+8]
movl %eax, %ecx # bt edx, 31
shrdl $31, %edx, %eax # adc eax, eax
shrdl $31, %ecx, %edx # adc edx, edx
ret # ret
[…]
right1:
movl 4(%esp), %eax # mov eax, [esp+4]
movl 8(%esp), %edx # mov edx, [esp+8]
movl %eax, %ecx # bt eax, 0
shrdl $1, %edx, %eax # rcr edx, 1
shrdl $1, %ecx, %edx # rcr eax, 1
ret # ret
[…]
.ident "GCC: (GNU […]) 10.2.0"
OOPS: while GCC replaces the circular right shift by 63 with the equivalent circular left shift by 1, it fails to replace the circular right shift by 33, i.e. register size plus 1, with the equivalent circular right shift by 1 of the exchanged registers, and also fails to replace the circular right shift by 31, i.e. register size − 1, with the equivalent circular left shift by 1 of the exchanged registers!
OUCH: additionally GCC clobbers register ECX in all 4 examples without necessity, and (ab)uses slow SHLD and SHRD instructions instead of fast RCL and RCR instructions or even faster ADC instructions!
Compile the source file case25.c
a second time with
GCC, now
targeting the AMD64 platform, and display the
generated assembly code:
gcc -m64 -o- -O3 -S case25.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.
[…] left127: movq %rdi, %rax # mov rax, rdi movq %rsi, %rdx # mov rdx, rsi # shr esi, 1 shrdq $1, %rsi, %rax # rcr rax, 1 shrdq $1, %rdi, %rdx # rcr rdx, 1 ret # ret […] left65: movq %rdi, %rax # mov rax, rsi movq %rsi, %rdx # mov rdx, rdi # add rdi, rdi shrdq $63, %rsi, %rax # adc rax, rax shrdq $63, %rdi, %rdx # adc rdx, rdx ret # ret […] left63: movq %rdi, %rax # mov rax, rsi movq %rsi, %rdx # mov rdx, rdi # shr edi, 1 shldq $63, %rsi, %rax # rcr rax, 1 shldq $63, %rdi, %rdx # rcr rdx, 1 ret # ret […] left1: movq %rdi, %rax # mov rax, rdi movq %rsi, %rdx # mov rdx, rsi # add rsi, rsi shldq $1, %rsi, %rax # adc rax, rax shldq $1, %rdi, %rdx # adc rdx, rdx ret # ret […] .ident "GCC: (GNU […]) 10.2.0"OUCH: like above, GCC (ab)uses slow
SHLD and SHRD instructions instead of fast RCL and RCR instructions or even faster ADC instructions!
Note: unlike above, the second SHLD or SHRD instruction does not depend on the first one, so processors with 2 execution ports for slow shift instructions can execute them in parallel.
Create the text file case26.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int sign32(int value) {
return 0 - (value < 0);
}
long long sign64(long long value) {
#ifdef ALTERNATE
return 0LL - (value < 0);
#else
return 0 - (value < 0);
#endif
}
#ifdef __amd64__
__int128_t sign128(__int128_t value) {
#ifdef ALTERNATE
return (__int128_t) 0 - (value < 0);
#else
return 0 - (value < 0);
#endif
}
#endif // __amd64__
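The expression exploits that a comparison yields 0 or 1, so the subtraction produces either 0 or -1, the sign mask; a hedged illustration:
#include <assert.h>
int main(void) {
    // (value < 0) yields 0 or 1, so 0 - (value < 0) yields 0 or -1,
    // i.e. all bits set in two's complement
    assert((0 - (-5 < 0)) == -1);
    assert((0 - (+7 < 0)) == 0);
    return 0;
}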
Compile the source file case26.c
with
GCC, engaging its
optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case26.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.
[…] _sign32: movl 4(%esp), %eax sarl $31, %eax ret […] _sign64: movl 8(%esp), %eax # mov eax, [esp+8] sarl $31, %eax # cdq cltd # mov eax, edx ret # ret […] .ident "GCC: (GNU […]) 10.2.0"Compile the source file
case26.c
a second time with
GCC, now
targeting the AMD64 platform, and display the
generated assembly code:
gcc -m64 -o- -O3 -S case26.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.
[…] sign32: movl %edi, %eax sarl $31, %eax ret […] sign64: sarq $63, %rdi # mov rax, rdi movslq %edi, %rax # sar rax, 63 ret # ret […] sign128: sarq $63, %rsi # mov rax, rsi movslq %esi, %rax # cqo cqto # mov rax, rdx ret # ret […] .ident "GCC: (GNU […]) 10.2.0"Compile the source file
case26.c
a third time with
GCC, again
targeting the AMD64 platform, now with the
preprocessor macro ALTERNATE
defined, and display the
generated assembly code:
gcc -DALTERNATE -m64 -o- -O3 -S case26.c
[…]
sign32:
movl %edi, %eax
sarl $31, %eax
ret
[…]
sign64:
sarq $63, %rdi
movslq %edi, %rax
ret
[…]
sign128:
shrq $63, %rsi
xorl %edx, %edx
movq %rsi, %rax
negq %rax
adcq $0, %rdx
negq %rdx
ret
[…]
.ident "GCC: (GNU […]) 10.2.0"
OUCH: that’s what I call a complete failure!
Create the text file case27.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int __clzti2(__uint128_t value) {
unsigned long long high = value >> 64, low = value, mask = -(high == 0);
return (64 & (int) mask) + __builtin_clzll((high & ~mask) | (low & mask));
}
int __ctzti2(__uint128_t value) {
unsigned long long high = value >> 64, low = value, mask = -(low == 0);
return (64 & (int) mask) + __builtin_ctzll((high & mask) | (low & ~mask));
}
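A hedged pair of spot checks for the branch-free selection (main() is an illustrative addition, not part of case27.c):
#include <assert.h>
int __clzti2(__uint128_t);
int __ctzti2(__uint128_t);

int main(void) {
    assert(__clzti2(1) == 127);                      // high half 0: 64 + clzll(low)
    assert(__ctzti2((__uint128_t) 1 << 100) == 100); // low half 0: 64 + ctzll(high)
    return 0;
}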
Compile the source file case27.c
with
GCC, engaging its
optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -o- -O3 -S case27.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.
[…] __clzti2: xorl %edx, %edx # xor edx, edx testq %rsi, %rsi # test rsi, rsi sete %dl # setz dl movl %edx, %ecx # subl $1, %edx # negl %ecx # movslq %edx, %rax # movslq %ecx, %rdx # andq %rsi, %rax # andl $64, %ecx # andq %rdi, %rdx # cmovnz rdi, rsi orq %rdx, %rax # shl edx, 6 bsrq %rax, %rax # bsr rax, rdi xorq $63, %rax # xor eax, 63 addl %ecx, %eax # add eax, edx ret # ret […] __ctzti2: xorl %eax, %eax # xor edx, edx testq %rdi, %rdi # test rdi, rdi sete %al # setz dl movl %eax, %ecx # subl $1, %eax # negl %ecx # cltq # movslq %ecx, %rdx # andq %rdi, %rax # andl $64, %ecx # andq %rsi, %rdx # cmovnz rsi, rdi orq %rdx, %rax # shl edx, 6 rep bsfq %rax, %rax # bsf rax, rsi addl %ecx, %eax # add eax, edx ret # ret […] .ident "GCC: (GNU […]) 10.2.0"OUCH: that’s what I call a complete failure!
Create the text file case28.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
// Stein's algorithm: greatest common divisor
unsigned long long __gcddi3(unsigned long long p, unsigned long long q) {
unsigned r, s = 0;
unsigned long long t;
if (p == 0)
return q;
if (q == 0)
return p;
if (p == q)
return p;
#ifndef CTZ
while (((p | q) & 1) == 0)
p >>= 1, q >>= 1, s++;
while ((p & 1) == 0)
p >>= 1;
do {
while ((q & 1) == 0)
q >>= 1;
#elif CTZ != 32
s = __builtin_ctzll(p | q), p >>= s, q >>= s;
r = __builtin_ctzll(p), p >>= r;
do {
r = __builtin_ctzll(q), q >>= r;
#else
if (((unsigned long) p | (unsigned long) q) == 0)
p >>= 32, q >>= 32, s = 32;
r = __builtin_ctzl((unsigned long) p | (unsigned long) q), p >>= r, q >>= r, s += r;
if ((unsigned long) p == 0)
p >>= 32;
r = __builtin_ctzl(p), p >>= r;
do
{
if ((unsigned long) q == 0)
q >>= 32;
r = __builtin_ctzl(q), q >>= r;
#endif
if (p < q)
t = q, q = p, p = t;
} while (q -= p);
return p << s;
}
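A hedged pair of spot checks for __gcddi3() (main() is an illustrative addition, not part of case28.c):
#include <assert.h>
unsigned long long __gcddi3(unsigned long long, unsigned long long);

int main(void) {
    assert(__gcddi3(48, 18) == 6);                          // gcd(2**4 * 3, 2 * 3**2)
    assert(__gcddi3(1ull << 40, 1ull << 20) == 1ull << 20); // common divisor 2**20
    return 0;
}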
Compile the source file case28.c
with
GCC, engaging its
optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -DCTZ -m32 -mtune=i386 -o- -O3 -S case28.c
[…] __gcddi3: pushl %ebp pushl %edi pushl %esi pushl %ebx subl $44, %esp movl 64(%esp), %ecx movl 68(%esp), %ebx movl 72(%esp), %esi movl 76(%esp), %edi movl %ebx, %ebp movl %ecx, 24(%esp) orl %ecx, %ebp movl %ebx, 28(%esp) movl %esi, %eax movl %edi, %edx je L1 movl %ecx, %edx movl %ebx, %eax xorl %esi, %edx xorl %edi, %eax orl %eax, %edx je L8 movl %esi, %eax orl %edi, %eax je L8 movl %ecx, %edx orl %edi, %ebx orl %esi, %edx movl %ebx, 4(%esp) xorl %ebx, %ebx movl %edx, (%esp) call ___ctzdi2 movl 28(%esp), %edx movl %eax, %ebp movl 24(%esp), %eax movl %ebp, %ecx shrdl %cl, %edx, %eax shrl %cl, %edx testb $32, %cl cmovne %edx, %eax cmovne %ebx, %edx shrdl %cl, %edi, %esi xorl %ebx, %ebx shrl %cl, %edi testb $32, %cl movl %edx, 4(%esp) cmovne %edi, %esi cmovne %ebx, %edi movl %eax, (%esp) movl %edx, 28(%esp) movl %eax, 24(%esp) call ___ctzdi2 movl 28(%esp), %edx movl %eax, %ebx movl 24(%esp), %eax movl %ebx, %ecx xorl %ebx, %ebx shrdl %cl, %edx, %eax shrl %cl, %edx andl $32, %ecx cmovne %edx, %eax cmovne %ebx, %edx movl %eax, 24(%esp) movl %edx, 28(%esp) L6: movl %esi, (%esp) movl %edi, 4(%esp) call ___ctzdi2 movl 28(%esp), %edx movl %eax, %ecx xorl %eax, %eax shrdl %cl, %edi, %esi shrl %cl, %edi testb $32, %cl cmovne %edi, %esi cmovne %eax, %edi movl 24(%esp), %eax cmpl %eax, %esi movl %edi, %ebx movl %esi, %ecx sbbl %edx, %edi jc L3 subl %eax, %esi movl %ebx, %edi sbbl %edx, %edi movl %edi, %eax orl %esi, %eax jne L6 movl 24(%esp), %eax movl 28(%esp), %edx movl %ebp, %ecx xorl %ebx, %ebx shldl %cl, %eax, %edx sall %cl, %eax andl $32, %ecx cmovne %eax, %edx cmovne %ebx, %eax L1: addl $44, %esp popl %ebx popl %esi popl %edi popl %ebp ret L8: movl 24(%esp), %eax movl 28(%esp), %edx addl $44, %esp popl %ebx popl %esi popl %edi popl %ebp ret L3: movl %eax, %esi movl %edx, %edi movl %ecx, 24(%esp) subl %ecx, %esi movl %ebx, 28(%esp) sbbl %ebx, %edi jmp L6 […] .ident "GCC: (GNU […]) 10.2.0"OUCH: the
CMOVcc
instruction is
not available on the i386 processor!
The code generated by
GCC shows 117
instructions in 325 bytes, with 3 calls of the library function
__ctzdi2()
instead of inlined code for the
__builtin_ctzll()
builtin, while properly written code
uses either only 70 instructions in just 146 bytes or (with the
symbol SPEED
defined) 75 instructions in 157 bytes:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "gcddi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of second argument
# [esp+12] = low dword of second argument
# [esp+8] = high dword of first argument
# [esp+4] = low dword of first argument
__gcddi3:
push edi
push esi
mov eax, [esp+12]
mov edx, [esp+16] # edx:eax = first argument
mov edi, [esp+20]
mov esi, [esp+24] # esi:edi = second argument
mov ecx, eax
or ecx, edx # ecx = low dword of first argument
# | high dword of first argument
jz .Lsecond # first argument = 0?
mov ecx, edi
or ecx, esi # ecx = low dword of second argument
# | high dword of second argument
jz .Lfirst # second argument = 0?
push ebx
mov ecx, eax
mov ebx, edx # ebx:ecx = first argument
xor ecx, edi
xor ebx, esi # ebx:ecx = first argument ^ second argument
or ebx, ecx
jz .Lequal # first argument = second argument?
xor ebx, ebx # ebx = exponent of common divisor 2 = 0
mov ecx, eax
or ecx, edi # ecx = low dword of first argument
# | low dword of second argument
jnz .Lcommon # first argument % 2**32 <> 0?
# second argument % 2**32 <> 0?
.ifnotdef SPEED
xchg eax, edx # edx:eax = first argument / 2**32
xchg edi, esi # esi:edi = second argument / 2**32
.else
mov eax, edx
mov edx, ecx # edx:eax = first argument / 2**32
mov edi, esi
mov esi, ecx # esi:edi = second argument / 2**32
.endif
add ebx, 32 # ebx = exponent of common divisor 2 = 32
mov ecx, eax
or ecx, edi # ecx = high dword of first argument
# | high dword of second argument
.Lcommon:
bsf ecx, ecx # ecx = index of least significant '1' bit
add ebx, ecx # ebx = exponent of common divisor 2
shrd eax, edx, cl
shr edx, cl # edx:eax = first argument / 2**exponent
# = first argument'
shrd edi, esi, cl
shr esi, cl # esi:edi = second argument / 2**exponent
# = second argument'
test eax, eax
jnz 1f # first argument' % 2**32 <> 0?
.ifnotdef SPEED
xchg eax, edx # edx:eax = first argument' / 2**32
.else
mov eax, edx
xor edx, edx # edx:eax = first argument' / 2**32
.endif
1:
bsf ecx, eax # ecx = index of least significant '1' bit
shrd eax, edx, cl
shr edx, cl # edx:eax = first argument' / 2**index
# = first argument"
.Lloop:
test edi, edi
jnz 2f # second argument' % 2**32 <> 0?
.ifnotdef SPEED
xchg edi, esi # esi:edi = second argument' / 2**32
.else
mov edi, esi
xor esi, esi # esi:edi = second argument' / 2**32
.endif
2:
bsf ecx, edi # ecx = index of least significant '1' bit
shrd edi, esi, cl
shr esi, cl # esi:edi = second argument' / 2**index
# = second argument"
.Lsubtract:
sub edi, eax
sbb esi, edx # esi:edi = second argument" - first argument"
jnb .Lcontinue # second argument" >= first argument"?
.Lswap:
add eax, edi
adc edx, esi # edx:eax = first argument"
# + second argument" - first argument"
# = second argument"
.ifnotdef SPEED
neg esi
neg edi
sbb esi, 0 # esi:edi = 0
# - (second argument" - first argument")
# = first argument" - second argument"
.else
mov ecx, esi
xor esi, esi
neg edi
sbb esi, ecx # esi:edi = 0
# - (second argument" - first argument")
# = first argument" - second argument"
.endif
jmp .Lloop
.Lcontinue:
mov ecx, edi
or ecx, esi
jnz .Lloop # second argument" <> first argument"?
mov ecx, ebx # ecx = exponent of common divisor 2
shld edx, eax, cl
shl eax, cl # edx:eax = first argument" * 2**exponent
# = greatest common divisor
.Lequal:
pop ebx
.Lfirst:
pop esi
pop edi
ret
.Lsecond:
mov eax, edi
mov edx, esi # edx:eax = second argument
pop esi
pop edi
ret
.size __gcddi3, .-__gcddi3
.type __gcddi3, @function
.global __gcddi3
.end
Compile the source file case28.c
a second time with
GCC, now with the
preprocessor macro CTZ=32
defined, and display the
generated assembly code:
gcc -DCTZ=32 -m32 -mtune=i386 -o- -O3 -S case28.c
[…] pushl %edi pushl %esi pushl %ebx subl $8, %esp movl 24(%esp), %esi movl 28(%esp), %edi movl 32(%esp), %eax movl 36(%esp), %edx movl %edi, %ebx orl %esi, %ebx je L1 movl %edx, %ebx orl %eax, %ebx je L11 movl %edi, %ebx xorl %edx, %ebx movl %esi, %ecx xorl %eax, %ecx orl %ecx, %ebx je L1 movl %esi, %ecx orl %eax, %ecx je L17 xorl %ebx, %ebx L3: bsfl %ecx, %ecx shrdl %cl, %edi, %esi shrl %cl, %edi testb $32, %cl je L19 movl %edi, %esi xorl %edi, %edi L19: shrdl %cl, %edx, %eax shrl %cl, %edx testb $32, %cl je L20 movl %edx, %eax xorl %edx, %edx L20: addl %ecx, %ebx movl %esi, %ecx testl %esi, %esi jne L4 movl %edi, %esi xorl %edi, %edi movl %esi, %ecx L4: bsfl %ecx, %ecx shrdl %cl, %edi, %esi shrl %cl, %edi testb $32, %cl je L18 movl %edi, %esi xorl %edi, %edi L18: movl %esi, (%esp) movl %edi, 4(%esp) L9: movl %eax, %ecx testl %eax, %eax jne L5 movl %edx, %eax xorl %edx, %edx movl %eax, %ecx L5: bsfl %ecx, %ecx shrdl %cl, %edx, %eax shrl %cl, %edx xorl %esi, %esi testb $32, %cl cmovne %edx, %eax cmovne %esi, %edx movl %eax, %esi movl %edx, %edi movl (%esp), %edx movl 4(%esp), %ecx cmpl %edx, %eax movl %edi, %eax sbbl %ecx, %eax jc L6 subl %edx, %esi sbbl %ecx, %edi movl %esi, %eax movl %edi, %edx orl %esi, %edi jne L9 movl (%esp), %eax movl 4(%esp), %edx movb %bl, %cl shldl %cl, %eax, %edx sall %cl, %eax xorl %ebx, %ebx andl $32, %ecx cmovne %eax, %edx cmovne %ebx, %eax L1: addl $8, %esp popl %ebx popl %esi popl %edi ret L11: movl %esi, %eax movl %edi, %edx addl $8, %esp popl %ebx popl %esi popl %edi ret L17: movl %edi, %esi xorl %edi, %edi movl %edx, %eax xorl %edx, %edx movl %esi, %ecx orl %eax, %ecx movl $32, %ebx jmp L3 L6: movl %edx, %eax movl %ecx, %edx subl %esi, %eax sbbl %edi, %edx movl %esi, (%esp) movl %edi, 4(%esp) jmp L9 […] .ident "GCC: (GNU […]) 10.2.0"OOPS: the
BSF instruction can't set bit 2⁵, i.e. 32 (its result is at most 31), so the 15 highlighted instructions are superfluous!
Now the code generated by GCC shows 116 instructions in 282 bytes, without calls of a library function – better than before, but still outright bad!
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
// Stein's algorithm: greatest common divisor
__uint128_t __gcdti3(__uint128_t p, __uint128_t q) {
unsigned r, s = 0;
__uint128_t t;
if (p == 0)
return q;
if (q == 0)
return p;
if (p == q)
return p;
if (((unsigned long long) p | (unsigned long long) q) == 0)
p >>= 64, q >>= 64, s = 64;
r = __builtin_ctzll((unsigned long long) p | (unsigned long long) q), p >>= r, q >>= r, s += r;
if ((unsigned long long) p == 0)
p >>= 64;
r = __builtin_ctzll(p), p >>= r;
do
{
if ((unsigned long long) q == 0)
q >>= 64;
r = __builtin_ctzll(q), q >>= r;
if (p < q)
t = q, q = p, p = t;
} while (q -= p);
return p << s;
}
Compile the 128-bit variant of the source file case28.c with GCC, now targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -mtune=core2 -o- -O3 -S case28.c
[…] __gcdti3: movq %rsi, %r9 movq %rdi, %r8 movq %rdx, %rsi movq %r9, %rax movq %rcx, %rdi orq %r8, %rax je .L10 movq %r8, %rdx movq %r9, %rax xorq %rsi, %rdx xorq %rcx, %rax orq %rax, %rdx je .L11 movq %rsi, %rax orq %rcx, %rax je .L11 movq %r8, %rcx xorl %r10d, %r10d orq %rsi, %rcx jne .L3 movq %r9, %r8 movq %rdi, %rsi xorl %r9d, %r9d movq %r8, %rcx xorl %edi, %edi movl $64, %r10d orq %rsi, %rcx .L3: bsfq %rcx, %rcx movq %r8, %rax movq %r9, %rdx shrdq %cl, %r9, %rax shrq %cl, %rdx xorl %r8d, %r8d testb $64, %cl cmovne %rdx, %rax cmovne %r8, %rdx shrdq %cl, %rdi, %rsi xorl %r9d, %r9d shrq %cl, %rdi testb $64, %cl cmovne %rdi, %rsi cmovne %r9, %rdi addl %ecx, %r10d testq %rax, %rax movq %rax, %rcx je .L15 .L4: bsfq %rcx, %rcx xorl %r11d, %r11d shrdq %cl, %rdx, %rax shrq %cl, %rdx andl $64, %ecx cmovne %rdx, %rax cmovne %r11, %rdx .L9: testq %rsi, %rsi movq %rsi, %rcx jne .L5 movq %rdi, %rsi xorl %edi, %edi movq %rsi, %rcx .L5: bsfq %rcx, %rcx xorl %r8d, %r8d shrdq %cl, %rdi, %rsi shrq %cl, %rdi testb $64, %cl cmovne %rdi, %rsi cmovne %r8, %rdi cmpq %rax, %rsi movq %rsi, %r8 movq %rdi, %r9 sbbq %rdx, %rdi jc .L6 subq %rax, %rsi movq %r9, %rdi sbbq %rdx, %rdi movq %rdi, %rcx orq %rsi, %rcx jne .L9 movl %r10d, %ecx xorl %esi, %esi shldq %cl, %rax, %rdx salq %cl, %rax andl $64, %ecx cmovne %rax, %rdx cmovne %rsi, %rax ret .L11: movq %r8, %rax movq %r9, %rdx ret .L6: subq %rsi, %rax sbbq %r9, %rdx movq %rax, %rsi movq %r8, %rax movq %rdx, %rdi movq %r9, %rdx jmp .L9 .L10: movq %rdx, %rax movq %rcx, %rdx ret .L15: movq %rdx, %rax xorl %edx, %edx movq %rax, %rcx jmp .L4 […] .ident "GCC: (GNU […]) 10.2.0"
OOPS: the BSF instruction can't set bit 2⁶, i.e. 64 (its result is at most 63), so the 16 highlighted instructions are superfluous!
The code generated by GCC shows 102 instructions in 329 bytes, while properly written code uses only 64 instructions in just 213 bytes:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "gcdti3.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = first argument
# rcx:rdx = second argument
__gcdti3:
mov rax, rdi
or rax, rsi # rax = low qword of first argument
# | high qword of first argument
jz .Lsecond # first argument = 0?
mov rax, rdx
or rax, rcx # rax = low qword of second argument
# | high qword of second argument
jz .Lfirst # second argument = 0?
mov r10, rdx
mov r11, rcx # r11:r10 = second argument
xor rdx, rdi
xor rcx, rsi # rcx:rdx = second argument ^ first argument
or rcx, rdx
jz .Lequal # first argument = second argument?
mov eax, 64 # rax = 64
xor edx, edx # rdx = exponent of common divisor 2 = 0
mov rcx, rdi
or rcx, r10 # rcx = low qword of first argument
# | low qword of second argument,
# ZF = (first argument % 2**64 = 0)
# & (second argument % 2**64 = 0)
cmovnz eax, edx # rax = ZF ? 64 : 0
cmovz rdi, rsi
cmovz esi, edx # rsi:rdi = ZF ? first argument / 2**64
cmovz r10, r11
cmovz r11, rdx # r11:r10 = ZF ? second argument / 2**64
mov rcx, rdi
or rcx, r10 # rcx = high qword of first argument
# | high qword of second argument
.Lcommon:
bsf rcx, rcx # rcx = index of least significant '1' bit
add eax, ecx # rax = exponent of common divisor 2
shrd rdi, rsi, cl
shr rsi, cl # rsi:rdi = first argument / 2**exponent
# = first argument'
shrd r10, r11, cl
shr r11, cl # r11:r10 = second argument / 2**exponent
# = second argument'
test rdi, rdi # ZF = (first argument' % 2**64 = 0)
cmovz rdi, rsi
cmovz esi, edx # rsi:rdi = ZF ? first argument' / 2**64
bsf rcx, rdi # rcx = index of least significant '1' bit
shrd rdi, rsi, cl
shr rsi, cl # rsi:rdi = first argument' / 2**index
# = first argument"
.Lloop:
test r10, r10 # ZF = (second argument' % 2**64 = 0)
cmovz r10, r11
cmovz r11, rdx # r11:r10 = ZF ? second argument' / 2**64
bsf rcx, r10 # rcx = index of least significant '1' bit
shrd r10, r11, cl
shr r11, cl # r11:r10 = second argument' / 2**index
# = second argument"
.Lcompare:
cmp r10, rdi
mov rcx, r11
sbb rcx, rsi # CF = (second argument" < first argument")
.Lswap:
mov rcx, r10
cmovb r10, rdi
cmovb rdi, rcx
mov rcx, r11
cmovb r11, rsi # r11:r10 = CF ? first argument" : second argument"
cmovb rsi, rcx # rsi:rdi = CF ? second argument" : first argument"
.Lsubtract:
sub r10, rdi
sbb r11, rsi # r11:r10 = second argument" - first argument"
mov rcx, r10
or rcx, r11
jnz .Lloop # second argument" <> first argument"?
mov ecx, eax # rcx = exponent of common divisor 2
shld rsi, rdi, cl
shl rdi, cl # rsi:rdi = first argument" * 2**exponent
# = greatest common divisor
.Lequal:
.Lfirst:
mov rdx, rsi
mov rax, rdi
ret
.Lsecond:
mov rax, rdx
mov rdx, rcx # rdx:rax = second argument
ret
.size __gcdti3, .-__gcdti3
.type __gcdti3, @function
.global __gcdti3
.end
Create the text file case29.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
// Widening carry-less multiplication
#ifndef __amd64__
unsigned long long clmul64(unsigned long p, unsigned long q) {
unsigned long long r = 0;
unsigned long s = 1UL << 31;
do {
r <<= 1;
if (q & s)
r ^= p; // no need for promotion or conversion of p here!
} while (s >>= 1);
return r;
}
#else // __amd64__
__uint128_t clmul128(unsigned long long p, unsigned long long q) {
__uint128_t r = 0;
unsigned long long s = 1UL << 63;
do {
r <<= 1;
if (q & s)
r ^= p; // no need for promotion or conversion of p here!
} while (s >>= 1);
return r;
}
#endif // __amd64__
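Carry-less multiplication is multiplication of polynomials over GF(2); a hedged spot check (main() is an illustrative addition, not part of case29.c):
#include <assert.h>
__uint128_t clmul128(unsigned long long, unsigned long long);

int main(void) {
    assert(clmul128(3, 3) == 5);  // (x + 1)*(x + 1) = x**2 + 1 over GF(2)
    assert(clmul128(6, 5) == 30); // (x**2 + x)*(x**2 + 1) = x**4 + x**3 + x**2 + x
    return 0;
}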
Compile the source file case29.c
with
GCC, engaging its
optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case29.c
[…] clmul64: pushl %ebp movl $-2147483648, %ecx xorl %eax, %eax xorl %edx, %edx pushl %edi xorl %edi, %edi pushl %esi pushl %ebx movl 24(%esp), %ebp movl $32, %ebx movl 20(%esp), %esi L3: shldl $1, %eax, %edx addl %eax, %eax testl %ecx, %ebp je L2 xorl %esi, %eax xorl %edi, %edx L2: shrl $1, %ecx subl $1, %ebx jne L3 popl %ebx popl %esi popl %edi popl %ebp ret […] .ident "GCC: (GNU […]) 10.2.0"OUCH: 8 (highlighted) superfluous instructions, including 4 extraneous synthesized instructions which use register
EBX
, out of the total 25 instructions!
Compile the source file case29.c
a second time with
GCC, now
targeting the AMD64 platform, and display the
generated assembly code:
gcc -m64 -o- -O3 -S case29.c
[…] clmul128: movq %rdi, %r10 xorl %eax, %eax movq %rsi, %rdi xorl %edx, %edx movabsq $-9223372036854775808, %rcx movl $64, %esi .L3: shldq $1, %rax, %rdx addq %rax, %rax testq %rcx, %rdi je .L2 movq %rax, %r9 # xorq %r10, %r9 # xorq %rsi, %rax movq %r9, %rax # .L2: shrq $1, %rcx subl $1, %esi jne .L3 ret […] .ident "GCC: (GNU […]) 10.2.0"OUCH: 7 (highlighted) superfluous instructions (minus the single instruction added as comment) out of the total 17 instructions!
Create the text file case30.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef __amd64__
int ispowerof2(unsigned long long argument) {
#else
int ispowerof2(__uint128_t argument) {
#endif
return (argument != 0) && ((argument & argument - 1) == 0);
}
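The expression relies on the classic identity that x & (x - 1) clears the lowest set bit, so the result is 0 exactly for powers of two (the && excludes 0); a hedged illustration:
#include <assert.h>
int main(void) {
    // x & (x - 1) clears the lowest set bit:
    assert((0x10 & (0x10 - 1)) == 0);    // 0b10000 & 0b01111: power of two
    assert((0x12 & (0x12 - 1)) == 0x10); // 0b10010 & 0b10001: two bits set
    return 0;
}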
Compile the source file case30.c
with
GCC, engaging its
optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case30.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.
[…] ispowerof2: pushl %esi # pushl %ebx # movl 12(%esp), %ecx # mov ecx, [esp+4] movl 16(%esp), %ebx # mov edx, [esp+8] movl %ecx, %eax # mov eax, ecx orl %ebx, %eax # or eax, edx je .L5 # jz .L5 movl %ecx, %eax # movl %ebx, %edx # xor eax, eax addl $-1, %eax # add ecx, -1 adcl $-1, %edx # adc edx, -1 andl %ecx, %eax # and ecx, [esp+4] andl %ebx, %edx # and edx, [esp+8] popl %ebx # orl %edx, %eax # or ecx, edx sete %al # setz al movzbl %al, %eax # movl %eax, %esi # movl %esi, %eax # popl %esi # ret # .L5: xorl %esi, %esi # popl %ebx # movl %esi, %eax # popl %esi # ret # ret […] .ident "GCC: (GNU […]) 13.1.0"
[…] ispowerof2: pushl %edi # pushl %esi # pushl %ebx # movl 20(%esp), %edi # mov ecx, [esp+4] movl 16(%esp), %esi # mov edx, [esp+8] movl %edi, %eax # mov eax, ecx orl %esi, %eax # or eax, edx je .L5 # jz .L5 movl %esi, %eax # movl %edi, %edx # addl $-1, %eax # add ecx, -1 adcl $-1, %edx # adc edx, -1 movl %eax, %ecx # movl %esi, %eax # movl %edx, %ebx # movl %edi, %edx # andl %ecx, %eax # and ecx, [esp+4] xorl %ecx, %ecx # xor eax, eax andl %ebx, %edx # and edx, [esp+8] popl %ebx # popl %esi # orl %edx, %eax # or ecx, edx popl %edi # sete %cl # setz al movl %ecx, %eax # ret # .L5: xorl %ecx, %ecx # popl %ebx # popl %esi # movl %ecx, %eax # popl %edi # ret # ret […] .ident "GCC: (GNU […]) 12.3.0"OUCH: 19 (highlighted) superfluous instructions (which abuse registers
EDI
and
ESI
to play a
Shell Game and perform 3 superfluous memory writes) out of the total 32 instructions in 60 bytes, plus an epic failure of the optimiser which doesn't recognise that register EAX is already 0 at label .L5!
Properly written straightforward code uses only 13 instructions in 36 bytes and doesn't clobber any non-volatile register!
Compile the source file case30.c
a second time with
GCC, again
targeting the i386 platform, now with explicit support
for the SSE4.1
alias Penryn New Instruction Set, and display the
generated assembly code:
gcc -m32 -msse4.1 -o- -O3 -S case30.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.
[…] ispowerof2: movq 4(%esp), %xmm1 # movq xmm0, qword ptr [esp+4] xorl %eax, %eax # xor eax, eax movdqa %xmm1, %xmm0 # punpcklqdq %xmm1, %xmm0 # ptest %xmm0, %xmm0 # ptest xmm0, xmm0 je .L1 # je .L1 pcmpeqd %xmm0, %xmm0 # pcmpeqq xmm1, xmm1 xorl %eax, %eax # paddq %xmm1, %xmm0 # paddq xmm1, xmm0 pand %xmm1, %xmm0 # punpcklqdq %xmm0, %xmm0 # ptest %xmm0, %xmm0 # ptest xmm1, xmm0 sete %al # sete al .L1: ret # ret […] .ident "GCC: (GNU […]) 12.3.0"OUCH: 5 (highlighted) superfluous instructions out of the total 14 instructions in 50 bytes, plus 2 epic failures of the
optimiserwhich doesn’t recognise that the
MOVQ
instruction clears the upper
lane of the
SSE register
XMM1
and that register EAX
is already 0!
Properly written straightforward code uses only 9 instructions in 32 bytes!
Compile the source file case30.c
a third time with
GCC, again
targeting the i386 platform, now with explicit support
for the AVX2
alias Haswell New Instruction Set or the
AVX
instruction set, and display the generated assembly code:
gcc -m32 -mavx -o- -O3 -S case30.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.
[…] ispowerof2: vmovq 4(%esp), %xmm0 # movq xmm0, [esp+4] xorl %eax, %eax # xor eax, eax vpunpcklqdq %xmm0, %xmm0, %xmm0 # vptest %xmm0, %xmm0 # ptest xmm0, xmm0 je .L1 # jz .L1 vpcmpeqd %xmm1, %xmm1, %xmm1 # pcmpeqq xmm1, xmm1 xorl %eax, %eax # vpaddq %xmm1, %xmm0, %xmm1 # paddq xmm1, xmm0 vpand %xmm0, %xmm1, %xmm0 # vpunpcklqdq %xmm0, %xmm0, %xmm0 # vptest %xmm0, %xmm0 # ptest xmm1, xmm0 sete %al # setz al .L1: ret # ret […] .ident "GCC: (GNU […]) 12.3.0"OUCH: 4 (highlighted) superfluous instructions out of the total 13 instructions in 46 bytes, plus 2 epic failures of the
optimiser which doesn't recognise that the MOVQ instruction clears the upper lane of the SSE register XMM1 and that register EAX is already 0!
OUCH: additionally GCC generates AVX instructions without need where SSE4.1 instructions are sufficient!
Compile the source file case30.c
a fourth time with
GCC, now
targeting the AMD64 platform, and display the
generated assembly code:
gcc -m64 -o- -O3 -S case30.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.
[…] ispowerof2: movq %rsi, %rax # mov rax, rdi xorl %ecx, %ecx # movq %rdi, %r8 # movq %rsi, %r9 # orq %rdi, %rax # or rax, rsi je .L1 # jz .L1 movq %rdi, %rax # mov rcx, rdi movq %rsi, %rdx # mov rdx, rsi addq $-1, %rax # add rcx, -1 adcq $-1, %rdx # add rdx, -1 movq %rax, %rsi # movq %r8, %rax # xorl %ecx, %ecx # xor eax, eax movq %rdx, %rdi # movq %r9, %rdx # andq %rsi, %rax # and rcx, rdi andq %rdi, %rdx # and rdx, rsi orq %rdx, %rax # or rcx, rdx sete %cl # setz al .L1: movl %ecx, %eax # ret # ret […] .ident "GCC: (GNU […]) 12.3.0"OUCH: 8 (highlighted) superfluous instructions (which abuse registers
R8
and
R9
to play a
Shell Game) out of the total 21 instructions in 59 bytes, plus an epic failure of the optimiser which doesn't recognise that register ECX is already 0!
Properly written straightforward code uses only 13 instructions in 37 bytes!
In the erratum POPCNT Instruction May Take Longer to Execute Than Expected, Intel documents a bug in their Haswell, Broadwell, Skylake and Coffee Lake families of CPUs: the POPCNT instruction has a false dependency on its destination register.
Note: a XOR
instruction that zeroes the destination register before the
POPCNT
instruction breaks this false dependency.
Intel processors from the Sandy Bridge, Ivy Bridge and Coffee Lake Refresh families suffer from this bug too; processors of the Alder Lake, Cannon Lake, Cascade Lake, Comet Lake, Ice Lake, Raptor Lake, Rocket Lake and Tiger Lake families are not affected.
Create the text file case31.c
with the following
content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int ispowerof2(unsigned long long argument) {
return __builtin_popcountll(argument) == 1;
}
Compile the source file case31.c
with
GCC, engaging its
optimiser, targeting one of the unaffected processor families, and display the generated assembly code:
gcc -m32 -march=cannonlake -o- -O3 -S case31.c
[…]
ispowerof2:
xorl %eax, %eax
xorl %edx, %edx
popcntl 4(%esp), %eax
popcntl 8(%esp), %edx
addl %edx, %eax
cmpl $1, %eax
sete %al
movzbl %al, %eax
ret
[…]
.ident "GCC: (GNU […]) 12.3.0"
OOPS:
GCC generates
superfluous
XOR
instructions!
Compile the source file case31.c
with
GCC, again
targeting one of the unaffected processor families, but now
optimising
for size, and display the generated assembly code:
gcc -m32 -march=cannonlake -o- -Os -S case31.c
[…]
ispowerof2:
pushl %ebp
movl %esp, %ebp
subl $16, %esp
pushl 12(%ebp)
pushl 8(%ebp)
call __popcountdi2
addl $16, %esp
leave
decl %eax
sete %al
movzbl %al, %eax
ret
[…]
.ident "GCC: (GNU […]) 12.3.0"
OUCH: optimising for speed, GCC generates 9 instructions in 24 bytes; optimising for size, it generates 12 instructions in 29 bytes, including a superfluous (highlighted) ADD instruction, in addition
to the uncounted instructions and bytes of the
__popcountdi2()
function!
Create the text file case32.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int ispowerof2(unsigned long long argument) {
return (argument != 0) && (__builtin_clzll(argument) + __builtin_ctzll(argument) == 63);
}
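For a power of two exactly one bit is set, so the leading-zero and trailing-zero counts sum to 63; a hedged illustration:
#include <assert.h>
int main(void) {
    unsigned long long x = 1ull << 20;
    assert(__builtin_clzll(x) == 43); // 43 leading zeroes
    assert(__builtin_ctzll(x) == 20); // plus 20 trailing zeroes: 43 + 20 == 63
    return 0;
}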
Compile the source file case32.c
with
GCC, engaging its
optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case32.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.
[…] ispowerof2: pushl %ebx # subl $8, %esp # movq 16(%esp), %xmm1 # movdqa %xmm1, %xmm0 # movd %xmm1, %edx # mov ecx, [esp+4] psrlq $32, %xmm0 # movd %xmm0, %eax # mov edx, [esp+8] movl %eax, %ebx # mov eax, ecx orl %edx, %ebx # or eax, edx je .L7 # jz .L0 testl %eax, %eax # bsr eax, edx je .L3 # jnz .L1 bsrl %eax, %ebx # bsr eax, ecx xorl $31, %ebx # xor eax, 32 # L1: # xor eax, 31 .L4: subl $16, %esp # bsf ecx, ecx movq %xmm1, (%esp) # jnz .L2 call __ctzdi2 # bsf ecx, edx addl $16, %esp # xor ecx, 32 xorl %ecx, %ecx # L2: addl %eax, %ebx # add eax, ecx cmpl $63, %ebx # cmp eax, 63 sete %cl # sete al addl $8, %esp # movl %ecx, %eax # popl %ebx # L0: ret # ret .L7: xorl %ecx, %ecx # addl $8, %esp # movl %ecx, %eax # popl %ebx # ret # .L3: bsrl %edx, %edx # xorl $31, %edx # leal 32(%edx), %ebx # jmp .L4 # […] .ident "GCC: (GNU […]) 12.3.0"OUCH: GCC generates 35 instructions in 96 bytes which clobber register
EBX, perform 2 superfluous memory writes plus a superfluous memory read, and call the __ctzdi2() function instead of using 2 BSF instructions; straightforward code uses only 18 instructions in 48 bytes without calling a library function!
OUCH: the highlighted
SSE2
instructions (ab)used to load the 64-bit
argument into the 32-bit registers EAX
and
EDX
are completely braindead!
Compile the source file case32.c
a second time with
GCC, engaging its
optimiser
, targeting the i386 platform, now
explicitly enabling the
LZCNT
and
TZCNT
instructions, and display the generated assembly code:
gcc -m32 -mabm -mbmi -mlzcnt -o- -O3 -S case32.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.
[…] ispowerof2: pushl %ebx # subl $8, %esp # movq 16(%esp), %xmm1 # movdqa %xmm1, %xmm0 # movd %xmm1, %edx # mov ecx, [esp+4] psrlq $32, %xmm0 # movd %xmm0, %eax # mov edx, [esp+8] movl %eax, %ebx # mov eax, ecx orl %edx, %ebx # or eax, edx je .L7 # jz .L0 xorl %ebx, %ebx # lzcntl %eax, %ebx # lzcnt eax, edx testl %eax, %eax # je .L10 # jnc .L4 # lzcnt eax, ecx # add eax, 32 .L4: subl $16, %esp # tzcnt ecx, ecx movq %xmm1, (%esp) # jnc .L1 call __ctzdi2 # tzcnt edx, edx addl $16, %esp # add ecx, edx xorl %ecx, %ecx # L1: addl %eax, %ebx # add eax, ecx cmpl $63, %ebx # cmp eax, 63 sete %cl # sete al addl $8, %esp # movl %ecx, %eax # popl %ebx # L0: ret # ret .L7: xorl %ecx, %ecx # addl $8, %esp # movl %ecx, %eax # popl %ebx # ret # .L10: lzcntl %edx, %edx # leal 32(%edx), %ebx # jmp .L4 # […] .ident "GCC: (GNU […]) 12.3.0"OUCH: GCC generates 34 instructions in 94 bytes which clobber register
EBX, perform 2 superfluous memory writes plus a superfluous memory read, and call the __ctzdi2() function instead of using 2 TZCNT instructions, despite the -mabm and -mbmi options; straightforward code uses only 17 instructions in 48 bytes without calling a library function!
OUCH: the highlighted
SSE2
instructions (ab)used to load the 64-bit
argument into the 32-bit registers EAX
and
EDX
are completely braindead!
Compile the source file case32.c
a third time with
GCC, again
targeting the i386 platform, but now optimising
for size, and display the generated assembly code:
gcc -m32 -mabm -mbmi -mlzcnt -o- -Os -S case32.c
[…] ispowerof2: pushl %ebp movl %esp, %ebp pushl %edi pushl %esi pushl %ebx subl $12, %esp movl 12(%ebp), %edi movl 8(%ebp), %esi movl %edi, %edx orl %esi, %edx je .L7 lzcntl %edi, %ebx testl %edi, %edi jne .L4 lzcntl %esi, %ebx addl $32, %ebx .L4: pushl %eax pushl %eax pushl %edi pushl %esi call __ctzdi2 addl $16, %esp addl %eax, %ebx cmpl $63, %ebx sete %al jmp .L2 .L7: xorl %eax, %eax .L2: leal -12(%ebp), %esp movzbl %al, %eax popl %ebx popl %esi popl %edi popl %ebp ret […] .ident "GCC: (GNU […]) 12.3.0"OUCH: again GCC generates 34 instructions, now in 71 bytes, with a call of the
__ctzdi2() function, instead of the straightforward code shown above – this is as bad as code might get!
Create the text file case33.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#include <arpa/inet.h>
unsigned long long htonll(unsigned long long h)
{
#ifndef ALTERNATE
return htonl(1) == 1 ? h : htonl((unsigned long) (h >> 32))
| htonl((unsigned long) h) * 0x100000000u;
#else
return ((unsigned char *) &h)[0] * 0x0100000000000000u
| ((unsigned char *) &h)[1] * 0x0001000000000000u
| ((unsigned char *) &h)[2] * 0x0000010000000000u
| ((unsigned char *) &h)[3] * 0x0000000100000000u
| ((unsigned char *) &h)[4] * 0x0000000001000000u
| ((unsigned char *) &h)[5] * 0x0000000000010000u
| ((unsigned char *) &h)[6] * 0x0000000000000100u
| ((unsigned char *) &h)[7];
#endif
}
unsigned long long ntohll(unsigned long long n)
{
#ifdef ALTERNATE
return ntohl(1) == 1 ? n : ntohl((unsigned long) (n >> 32))
| ntohl((unsigned long) n) * 0x100000000u;
#else
return ((unsigned char *) &n)[0] * 0x0100000000000000u
| ((unsigned char *) &n)[1] * 0x0001000000000000u
| ((unsigned char *) &n)[2] * 0x0000010000000000u
| ((unsigned char *) &n)[3] * 0x0000000100000000u
| ((unsigned char *) &n)[4] * 0x0000000001000000u
| ((unsigned char *) &n)[5] * 0x0000000000010000u
| ((unsigned char *) &n)[6] * 0x0000000000000100u
| ((unsigned char *) &n)[7];
#endif
}
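Note: GCC also provides the __builtin_bswap64() builtin, which expresses the byte swap directly; a hedged alternative formulation in the context of case33.c (the function name is illustrative):
// Hedged sketch: same semantics as htonll() above, expressed with the
// byte-swap builtin instead of shifts and masks.
unsigned long long htonll_bswap(unsigned long long h) {
    return htonl(1) == 1 ? h : __builtin_bswap64(h);
}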
Compile the source file case33.c
with
GCC, engaging its
optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case33.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.
[…] htonll: movl 8(%esp), %edx # mov eax, [esp+8] movl 4(%esp), %eax # mov edx, [esp+4] bswapl %edx # bswap eax bswapl %eax # bswap edx xchgl %edx, %eax # ret # ret […] ntohll: pushl %edi # xorl %ecx, %ecx # pushl %esi # pushl %ebx # movl 16(%esp), %edx # mov edx, [esp+4] movl 20(%esp), %eax # mov eax, [esp+8] movl %edx, %edi # bswap eax movzbl %dh, %esi # bswap edx sall $24, %edi # movl %edi, %ebx # movl %esi, %edi # xorl %esi, %esi # orl %ecx, %esi # movl %eax, %ecx # sall $16, %edi # shrl $24, %ecx # orl %ebx, %edi # xorl %ebx, %ebx # orl %ecx, %esi # movl %edx, %ecx # orl %ebx, %edi # shrl $24, %edx # shrl $16, %ecx # movzbl %cl, %ecx # movl %ecx, %ebx # xorl %ecx, %ecx # orl %ecx, %esi # xorl %ecx, %ecx # sall $8, %ebx # orl %esi, %ecx # movl %eax, %esi # orl %ebx, %edi # movl %edx, %ebx # sall $24, %esi # orl %edi, %ebx # orl %esi, %ecx # movl %esi, %edi # movzbl %ah, %esi # shrl $8, %eax # sall $16, %esi # sarl $31, %edi # andl $65280, %eax # orl %edi, %ebx # movl %esi, %edi # orl %esi, %ecx # cltd # sarl $31, %edi # orl %ecx, %eax # orl %edi, %ebx # orl %ebx, %edx # popl %ebx # popl %esi # popl %edi # ret # ret […] .ident "GCC: (GNU […]) 13.2.0"OUCH: while the code generated for the first function is almost perfect, it is desastrous for the second function!
Compile the source file case33.c
a second time with
GCC, now
targeting the AMD64 platform, and display the
generated assembly code:
gcc -m64 -o- -O3 -S case33.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.
[…] htonll: movq %rdi, %rdx # movl %edi, %eax # mov rax, rdi shrq $32, %rdx # bswapl %eax # bswap rax salq $32, %rax # bswapl %edx # movl %edx, %edx # orq %rdx, %rax # ret # ret ntohll: movq %rdi, %rdx # movq %rdi, %rax # mov rax, rdi movzbl %dh, %ecx # bswap rax salq $56, %rax # salq $48, %rcx # shrq $40, %rdx # orq %rcx, %rax # movq %rdi, %rcx # andl $65280, %edx # shrq $56, %rcx # orq %rcx, %rax # movq %rdi, %rcx # shrq $16, %rcx # movzbl %cl, %ecx # salq $40, %rcx # orq %rcx, %rax # movq %rdi, %rcx # shrq $24, %rcx # movzbl %cl, %ecx # salq $32, %rcx # orq %rcx, %rax # movq %rdi, %rcx # shrq $32, %rcx # sall $24, %ecx # movslq %ecx, %rcx # orq %rcx, %rax # movq %rdi, %rcx # shrq $24, %rcx # andl $16711680, %ecx # orq %rcx, %rax # orq %rdx, %rax # ret # ret […] .ident "GCC: (GNU […]) 13.2.0"OUCH: now the code generated for both functions is desastrous!