True Lies – or
What LLVM Claims, but Fails to Deliver
undefined behaviour
__bswapsi2(), __bswapdi2() and __builtin_bswap*()
__absvti2()
__ashldi3(), __ashrdi3() and __lshrdi3()
__cmp?i2() and __ucmp?i2()
__mulo?i4()
__parity?i2()
__builtin_*()
__builtin_parity()
__builtin_rotateleft*() and __builtin_rotateright*()
__builtin_mul_overflow()
__builtin_add_overflow() and __builtin_sub_overflow()
__builtin_copysign()
optimiser fails – take 2
optimiser fails – take 3
optimiser fails – take 4
optimiser fails – take 5
optimiser fails – take 6
optimiser fails – take 7
optimiser fails – take 8
optimiser fails – take 9
optimiser fails – take 10
optimiser fails – take 11
optimiser fails – take 12
optimiser fails – take 13
optimiser fails – take 14
optimiser fails – take 15
[…] compiler-rt runtime libraries to careless wasting of resources on Microsoft® Windows™.
For its compiler-rt runtime libraries, LLVM claims

The compiler-rt project provides highly tuned implementations of the low-level code generator support routines like "__fixunsdfdi" and other calls generated when a target doesn't have a short sequence of native instructions to implement a core IR operation.

and

The builtins library provides optimized implementations of this and other low-level routines, either in target-independent C form, or as a heavily-optimized assembly.

but dares to ship unoptimised and untuned __cmpdi2(), __cmpti2(), __ucmpdi2(), __ucmpti2(), __udivmoddi4(), __udivmodti4(), __mulodi4(), __muloti4() or __popcountti2(), which even show pessimised code that keeps their own Clang compiler and its optimiser from generating proper and performant machine code suitable for current super-scalar processors!
Note: for the synopsis of the library’s
contents see
README.txt
.
As proved below, especially in the sections "miserable performance of integer division", "unoptimised library functions" and "utterly devastating performance of overflow-checking 128×128-bit (and 64×64-bit) integer multiplication", the statements cited above are blatant lies!
Note: I recommend reading Randall Hyde’s ACM article, especially the following part:
Observation #6: Software engineers have been led to believe that their time is more valuable than CPU time; therefore, wasting CPU cycles in order to reduce development time is always a win. They've forgotten, however, that the application users' time is more valuable than their time.
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
[…]
// This file implements __ucmpdi2 for the compiler_rt library.
[…]
// Returns: if (a < b) returns 0
// if (a == b) returns 1
// if (a > b) returns 2
[…]
int __ucmpdi2(uint64_t a, uint64_t b) {
    union {
        uint64_t all;
        struct {
            uint32_t low, high;
        } s;
    } x, y;
    x.all = a;
    y.all = b;
    if (x.s.high < y.s.high)
        return 0;
    if (x.s.high > y.s.high)
        return 2;
    if (x.s.low < y.s.low)
        return 0;
    if (x.s.low > y.s.low)
        return 2;
    return 1;
}

#ifdef __ARM_EABI__
// Returns: if (a < b) returns -1
//          if (a == b) returns 0
//          if (a > b) returns 1
int __aeabi_ulcmp(int64_t a, int64_t b) {
    return __ucmpdi2(a, b) - 1;
}
#endif
Note: spotting the most obvious bug is left as an
exercise to the reader.
Properly written code which lets the optimiser do its job follows:
// Copyleft © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
typedef unsigned long long uint64_t;
int __ucmpdi2(uint64_t a, uint64_t b) {
    return (a > b) - (a < b) + 1;
}

#ifdef __ARM_EABI__
int __aeabi_ulcmp(uint64_t a, uint64_t b) {
    return (a > b) - (a < b);
}
#endif
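To convince oneself that the branch-free expression yields exactly the same results as compiler-rt's chain of comparisons, a small self-test can be used; this is my own sketch (the helper names are made up), with the 32-bit splitting of the original omitted since it does not affect the result:

// Sketch: cross-check the branch-free __ucmpdi2() against the original logic.
#include <assert.h>
#include <stdint.h>

static int ucmp_original(uint64_t a, uint64_t b) {   // compiler-rt's comparison chain
    if (a < b) return 0;
    if (a > b) return 2;
    return 1;
}

static int ucmp_branchfree(uint64_t a, uint64_t b) {  // the rewrite shown above
    return (a > b) - (a < b) + 1;
}

int main(void) {
    const uint64_t samples[] = { 0, 1, 0xFFFFFFFFu, 0x100000000u, ~(uint64_t) 0 };
    for (unsigned i = 0; i < 5; ++i)
        for (unsigned j = 0; j < 5; ++j)
            assert(ucmp_original(samples[i], samples[j])
                   == ucmp_branchfree(samples[i], samples[j]));
    return 0;
}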
The __popcountti2() function from the compiler-rt library, for example, builds its magic 128-bit constants as follows:
//===-- popcountti2.c - Implement __popcountti2
[…]
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
[…]
int __popcountti2(__int128_t a) {
    __uint128_t x3 = (__uint128_t)a;
    x3 = x3 - ((x3 >> 1) &
               (((__uint128_t)0x5555555555555555uLL << 64) | 0x5555555555555555uLL));
    // Every 2 bits holds the sum of every pair of bits (64)
    x3 = ((x3 >> 2) &
          (((__uint128_t)0x3333333333333333uLL << 64) | 0x3333333333333333uLL)) +
         (x3 & (((__uint128_t)0x3333333333333333uLL << 64) | 0x3333333333333333uLL));
    // Every 4 bits holds the sum of every 4-set of bits (3 significant bits) (32)
    x3 = (x3 + (x3 >> 4)) &
         (((__uint128_t)0x0F0F0F0F0F0F0F0FuLL << 64) | 0x0F0F0F0F0F0F0F0FuLL);
[…]
Properly written code looks as follows:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int __popcountti2(__uint128_t value) {
    value -= (value >> 1) & (~(__uint128_t) 0 / 3u);   // 0x55...55
    value = ((value >> 2) & (~(__uint128_t) 0 / 5u))   // 0x33...33
          + (value & (~(__uint128_t) 0 / 5u));
    value += value >> 4;
    value &= ~(__uint128_t) 0 / 17u;                   // 0x0F...0F
    value *= ~(__uint128_t) 0 / 255u;                  // 0x01...01
    return value >> 120;
}
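The quotients of the all-ones value used above are just a compact, width-independent way to write the usual bit-twiddling masks. The following compile-time checks – my own sketch, not part of either library – show the equivalences for 64 bits; they hold for 128 bits as well:

// Sketch: the "magic" masks expressed as quotients of the all-ones value.
#include <stdint.h>

_Static_assert(~(uint64_t) 0 / 3u   == 0x5555555555555555u, "every other bit");
_Static_assert(~(uint64_t) 0 / 5u   == 0x3333333333333333u, "pairs of bits");
_Static_assert(~(uint64_t) 0 / 17u  == 0x0F0F0F0F0F0F0F0Fu, "nibbles");
_Static_assert(~(uint64_t) 0 / 255u == 0x0101010101010101u, "one bit per byte");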
Microsoft’s documentation of the x64 calling convention states:

There's a strict one-to-one correspondence between a function call's arguments and the registers used for those arguments. Any argument that doesn't fit in 8 bytes, or isn't 1, 2, 4, or 8 bytes, must be passed by reference. A single argument is never spread across multiple registers.
[…]
To return a user-defined type by value in RAX, it must have a length of 1, 2, 4, 8, 16, 32, or 64 bits. […] Otherwise, the caller must allocate memory for the return value and pass a pointer to it as the first argument. The remaining arguments are then shifted one argument to the right. The same pointer must be returned by the callee in RAX.

The LLVM Project Blog post Clang is now used to build Chrome for Windows states:

[…] 16-byte arguments are passed by reference. […]
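Applied to a function that takes three 128-bit arguments and returns a 128-bit value, these rules mean that a conformant compiler lowers the signature to pure pointer passing. The following declaration is my own illustration of that lowering, not code from any header:

// Sketch: conceptual lowering of
//     __uint128_t func128(__uint128_t a, __uint128_t b, __uint128_t *r);
// under the Microsoft x64 calling convention quoted above.
// 128-bit values violate the 1/2/4/8-byte rule, so the arguments and the
// return value are passed by reference:
//   RCX = hidden pointer to the return value (to be returned in RAX)
//   RDX = pointer to a
//   R8  = pointer to b
//   R9  = r (already a pointer)
__uint128_t *func128_lowered(__uint128_t *retval,
                             const __uint128_t *a,
                             const __uint128_t *b,
                             __uint128_t *r);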
The same blog post furthermore claims:

Clang is the first-ever open-source C++ compiler that’s ABI-compatible with Microsoft Visual C++ (MSVC) – meaning you can build some parts of your program (for example, system libraries) with the MSVC compiler (“cl.exe”), other parts with Clang, and when linked together (either by MSVC’s linker, “link.exe“, or LLD, the LLVM project’s linker – see below) the parts will form a working program.

To falsify LLVM’s claim, create the text file case0.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef __clang__
typedef struct {
    unsigned __int64 low, high;
} __uint128_t;
#else
__attribute__ ((ms_abi))
#endif
__uint128_t __ubugti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder) {
    if (remainder != 0)
#ifndef ALTERNATE
        *remainder = divisor;
    return dividend;
#else
        *remainder = dividend % divisor;
    return dividend / divisor;
#endif
}
Note: the actual definition or name of the __uint128_t type for the Visual C compiler is irrelevant; for binary compatibility, which is defined by the respective Application Binary Interface alias ABI, only its matching size and memory layout matter!
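A compile-time check of exactly this property might look as follows; this is a sketch of mine using C11's _Static_assert, not part of case0.c:

// Sketch: only size and layout of the surrogate type matter for the ABI.
#include <stddef.h>

typedef struct {
    unsigned long long low, high;   // little-endian layout of a 128-bit integer
} uint128_surrogate;

_Static_assert(sizeof (uint128_surrogate) == 16, "same size as __uint128_t");
_Static_assert(offsetof(uint128_surrogate, high) == 8, "high half in the upper 8 bytes");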
Compile the source file case0.c
with
Clang, engaging its optimiser, and display the
generated assembly code:
clang -o- -O3 -S -target amd64-pc-windows case0.c
[…] __ubugti4: # @__ubugti4 # %bb.0: movaps (%rcx), %xmm0 testq %r8, %r8 je .LBB0_2 # %bb.1: movaps (%rdx), %xmm1 movaps %xmm1, (%r8) .LBB0_2: retq […]OOPS: the code generated by Clang expects the addresses of its three (128-bit) arguments in the registers
RCX, RDX and R8 instead of RDX, R8 and R9, while it places the (128-bit) return value in the SSE register XMM0 instead of the memory pointed to by RCX, without returning this pointer in RAX – thus violating the Microsoft ABI twice!
Compile the source file case0.c
a second time with
Clang, now without
SSE support,
and display the generated assembly code:
clang -mno-sse -o- -O3 -S -target amd64-pc-windows case0.c
[…] __ubugti4: # @__ubugti4 # %bb.0: movq %rdx, %r9 movq (%rcx), %rax movq 8(%rcx), %rdx testq %r8, %r8 je .LBB0_2 # %bb.1: movq (%r9), %r10 movq 8(%r9), %rcx movq %rcx, 8(%r8) movq %r10, (%r8) .LBB0_2: retq […]OUCH: the code generated by Clang still expects the addresses of its three (128-bit) arguments in the registers
RCX, RDX and R8 instead of RDX, R8 and R9, while it places the (128-bit) return value in the register pair RDX:RAX instead of the memory pointed to by RCX, without returning this pointer in RAX – again violating the Microsoft ABI twice!
Now compile the source file case0.c
with the reference
compiler for the Microsoft
ABI,
Visual C, and display the generated assembly code:
CL.EXE /c /FaCON: /FoNUL: /nologo /Ox /W4 case0.c
case0.c ; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01 […] __ubugti4 PROC ; Line 11 test r9, r9 je SHORT $LN1@ubugti4 ; Line 13 mov rax, QWORD PTR [r8] mov QWORD PTR [r9], rax mov rax, QWORD PTR [r8+8] mov QWORD PTR [r9+8], rax $LN1@ubugti4: ; Line 14 mov rax, QWORD PTR [rdx] mov QWORD PTR [rcx], rax mov rax, QWORD PTR [rdx+8] mov QWORD PTR [rcx+8], rax mov rax, rcx ; Line 19 ret 0 __ubugti4 ENDP […]This code complies (of course) with the Microsoft ABI:
the caller passes the address for the 128-bit return value in RCX; the addresses of the three 128-bit arguments follow in RDX, R8 and R9; and the callee returns the pointer to the return value in RAX.
Compile the source file case0.c a third time with Clang, now with the preprocessor macro ALTERNATE defined, and display the generated assembly code:
clang -DALTERNATE -o- -O3 -S -target amd64-pc-windows case0.c
[…] __ubugti4: # @__ubugti4 […] testq %r8, %r8 je .LBB0_2 # %bb.1: movq %r8, %r15 […] leaq 80(%rsp), %rcx leaq 64(%rsp), %rdx callq __umodti3 movaps %xmm0, (%r15) .LBB0_2: […] leaq 48(%rsp), %rcx leaq 32(%rsp), %rdx callq __udivti3 nop […]

Oops: instead of generating a single call of the library function __udivmodti4(), which returns both quotient and remainder, Clang generates – even with the command line option -O3 specified – separate calls of the library functions __umodti3() and __udivti3(), which is especially funny since both call the library function __udivmodti4() in turn!
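For reference, the two wrappers boil down to the following; this is a paraphrased sketch of their logic, not a verbatim copy of the compiler-rt sources:

// Sketch: both wrappers funnel into __udivmodti4().
__uint128_t __udivmodti4(__uint128_t a, __uint128_t b, __uint128_t *rem);

__uint128_t __udivti3(__uint128_t a, __uint128_t b) {
    return __udivmodti4(a, b, 0);       // remainder is not wanted
}

__uint128_t __umodti3(__uint128_t a, __uint128_t b) {
    __uint128_t r;
    __udivmodti4(a, b, &r);             // quotient is thrown away
    return r;
}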
Note: as demonstrated below, this code runs about 3 to 34 (in words: thirty-four) times slower than code generated by GCC!
Additionally count the number of instructions and the code size of
the __udivmodti4()
function provided in the
clang_rt.builtins-x86_64.lib
library: it shows 234
instructions in 765 bytes, while properly written
(assembly) code
shows only 80 instructions in just 219 bytes – and runs up to
17 (in words: seventeen) times faster!
Note: exploration of the equally entertaining call chain for signed 128÷128-bit integer division is left as an exercise to the reader.
Create the text file case2.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef ALTERNATE
typedef struct {
    __int64 quotient, remainder;
} lldiv_t;

__attribute__ ((ms_abi))
lldiv_t lldiv(__int64 dividend, __int64 divisor) {
    lldiv_t lldiv = {dividend / divisor, dividend % divisor};
    return lldiv;
}
#else
unsigned long __udivmoddi4(unsigned long dividend, unsigned long divisor, unsigned long *remainder) {
    if (remainder != 0)
        *remainder = dividend % divisor;
    return dividend / divisor;
}
#endif // ALTERNATE
Compile the source file case2.c
with
Clang, engaging its optimiser, targeting the
i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-windows case2.c

Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] _lldiv: # @lldiv # %bb.0: pushl %ebp # push ebp pushl %ebx # mov ebp, [esp+8] pushl %edi # push ebx pushl %esi # movl 36(%esp), %esi # movl 20(%esp), %ebp # movl 24(%esp), %edi # movl 28(%esp), %ebx # pushl %esi # push [esp+28] pushl 36(%esp) # push [esp+28] pushl %ebx # push [esp+28] pushl %edi # push [esp+28] calll __alldiv # call __alldvrm movl %edx, %ecx # movl %eax, (%ebp) # mov [ebp], eax imull %eax, %esi # mull 32(%esp) # movl %ecx, 4(%ebp) # mov [ebp+4], edx imull 32(%esp), %ecx # addl %esi, %edx # addl %edx, %ecx # subl %eax, %edi # movl %ebp, %eax # sbbl %ecx, %ebx # movl %edi, 8(%ebp) # mov [ebp+8], ecx movl %ebx, 12(%ebp) # mov [ebp+12], ebx popl %esi # popl %edi # pop ebx popl %ebx # mov eax, ebp popl %ebp # pop ebp retl # ret […]Ouch: instead to generate a single call of the compiler helper function
__alldvrm(), which returns both quotient and remainder in the register pairs EDX:EAX and EBX:ECX, Clang – even with the command line option -O3 specified – generates code to compute the remainder itself, wasting 15 of the total 31 instructions, 34 of the total 74 bytes, and about 9 processor clock cycles per function call!
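Expressed in C, the superfluous instructions recompute the remainder from the quotient; the following sketch (with a made-up type name) shows what the generated code effectively does:

// Sketch of the code Clang emits: the quotient comes from __alldiv(), the
// remainder is rebuilt with a 64x64-bit multiply and subtract – although
// __alldvrm() would have delivered both results in one call.
typedef struct { long long quotient, remainder; } lldiv_sketch_t;

lldiv_sketch_t lldiv_as_generated(long long dividend, long long divisor) {
    lldiv_sketch_t result;
    result.quotient  = dividend / divisor;                    // call of __alldiv()
    result.remainder = dividend - result.quotient * divisor;  // redundant mul/imul/sub/sbb
    return result;
}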
Note: the code of the
__divmoddi4()
function provided in the clang_rt.builtins-i386.lib
library is equally bad:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
.arch generic32
.code32
.intel_syntax noprefix
.global ___divmoddi4
.type ___divmoddi4, @function
.text
___divmoddi4:
push ebp
push ebx
push edi
push esi
push eax
mov edi, [esp+36]
mov ebx, [esp+24]
mov ebp, [esp+28]
mov esi, [esp+32]
push edi
push esi
push ebp
push ebx
call ___divdi3
add esp, 16
mov ecx, eax
mov [esp], edx
imul edi, eax
mul eax, esi
add edx, edi
mov edi, [esp]
imul esi, edi
add esi, edx
sub ebx, eax
mov eax, [esp+40]
mov edx, edi
sbb ebp, esi
mov [eax], ebx
mov [eax+4], ebp
mov eax, ecx
add esp, 4
pop esi
pop edi
pop ebx
pop ebp
ret
.ident "clang version 10.0.0 "
.end
Additionally count the number of instructions and the code size of
the
__udivmoddi4()
function provided in the
clang_rt.builtins-i386.lib
library: it shows 256
instructions in 759 bytes, while properly written
(assembly) code
shows only 96 instructions in just 240 bytes – and runs up to
11 (in words: eleven) times faster!
Compile the source file case2.c
a second time with
Clang, now with the preprocessor macro
ALTERNATE
defined, targeting the AMD64
platform, and display the generated assembly code:
clang -DALTERNATE -o- -O3 -S -target amd64-pc-linux case2.c
[…] __udivmoddi4: # @__udivmoddi4 # %bb.0: testq %rdx, %rdx je .LBB0_2 # %bb.1: movq %rdx, %rcx movq %rdi, %rax xorl %edx, %edx divq %rsi movq %rdx, (%rcx) .LBB0_2: movq %rdi, %rax xorl %edx, %edx divq %rsi retq […] .ident "clang version 10.0.0 " […]Oops: despite the command line option
-O3
given, the generated code performs two
expensive divisions instead of just one!
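Since the DIV instruction delivers quotient and remainder at once, a single division suffices; the following is my own sketch of such code for the System V AMD64 calling convention, in the Intel syntax used elsewhere on this page:

# Sketch: __udivmoddi4() with a single division.
# rdi = dividend, rsi = divisor, rdx = address of remainder (may be 0)
__udivmoddi4:
	mov	rcx, rdx	# rcx = address of remainder
	mov	rax, rdi
	xor	edx, edx	# rdx:rax = dividend
	div	rsi		# rax = quotient, rdx = remainder
	test	rcx, rcx
	jz	0f		# address of remainder = 0?
	mov	[rcx], rdx	# store remainder
0:
	ret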
Compile the source file case2.c a third time with Clang, now targeting the AMD64 platform, and display the generated assembly code:
clang -DALTERNATE -o- -O3 -S -target amd64-pc-windows case2.c
[…] lldiv: # @lldiv # %bb.0: movq %rdx, %rax cqto idivq %r8 movq %rax, (%rcx) movq %rdx, 8(%rcx) movq %rcx, %rax retq […]
Create the text file case3.c with the following source code, refactored from __divdi3() and __moddi3(), in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
unsigned long long __udivmoddi4(unsigned long long dividend,
                                unsigned long long divisor,
                                unsigned long long *remainder);

#ifndef ALTERNATE
long long __divdi3(long long dividend, long long divisor) {
    long long r = divisor >> 63;    // r = divisor < 0 ? -1 : 0
    long long s = dividend >> 63;   // s = dividend < 0 ? -1 : 0
    divisor = (divisor ^ r) - r;    // negate if divisor < 0
    dividend = (dividend ^ s) - s;  // negate if dividend < 0
    s ^= r;                         // sign of quotient
                                    // negate if quotient < 0
    return (__udivmoddi4(dividend, divisor, 0) ^ s) - s;
}
#else
long long __moddi3(long long dividend, long long divisor) {
    long long r = divisor >> 63;    // r = divisor < 0 ? -1 : 0
    long long s = dividend >> 63;   // s = dividend < 0 ? -1 : 0
    divisor = (divisor ^ r) - r;    // negate if divisor < 0
    dividend = (dividend ^ s) - s;  // negate if dividend < 0
    __udivmoddi4(dividend, divisor, (unsigned long long *) &r);
    return (r ^ s) - s;             // negate if dividend < 0
}
#endif // ALTERNATE
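The (x ^ s) - s idiom above is a branch-free conditional negation: with s = 0 it leaves x unchanged, with s = -1 (all bits set) it computes ~x + 1, i.e. -x. A minimal check of the identity (my own sketch):

// Sketch: branch-free conditional negation as used in __divdi3()/__moddi3().
#include <assert.h>

int main(void) {
    long long x = 0x123456789abLL;
    long long s;
    s = 0;  assert(((x ^ s) - s) ==  x);    // "positive" mask: identity
    s = -1; assert(((x ^ s) - s) == -x);    // "negative" mask: negation
    return 0;
}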
Compile the source file case3.c
with
Clang, engaging its optimiser, targeting the
i386 platform, and display the generated, not properly
optimised assembly code:
clang -o- -O3 -S -target i386-pc-windows case3.c

Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] ___divdi3: # @__divdi3 # %bb.0: pushl %ebp # movl %esp, %ebp # pushl %ebx # push ebx pushl %edi # pushl %esi # movl 20(%ebp), %ecx # mov eax, [esp+20] movl 12(%ebp), %eax # movl 16(%ebp), %edi # mov ecx, [esp+16] movl 8(%ebp), %ebx # movl %ecx, %edx # movl %eax, %esi # sarl $31, %edx # sarl $31, %esi # cdq xorl %edx, %edi # xor ecx, edx xorl %edx, %ecx # xor eax, edx subl %edx, %edi # sub ecx, edx sbbl %edx, %ecx # sbb eax, edx xorl %esi, %ebx # xorl %esi, %eax # subl %esi, %ebx # sbbl %esi, %eax # xorl %edx, %esi # mov ebx, edx pushl $0 # push 0 pushl %ecx # push eax pushl %edi # push ecx # mov eax, [esp+24] # mov ecx, [esp+20] # cdq # xor ecx, edx # xor eax, edx # sub ecx, edx # sbb eax, edx # xor ebx, edx pushl %eax # push eax pushl %ebx # push ecx calll ___udivmoddi4 # call ___udivmoddi4 addl $20, %esp # add esp, 20 xorl %esi, %eax # xor eax, ebx xorl %esi, %edx # xor edx, ebx subl %esi, %eax # sub eax, ebx sbbl %esi, %edx # sbb edx, ebx popl %esi # popl %edi # popl %ebx # pop ebx popl %ebp # retl # ret […]Oops: despite the command line option
-O3
given, the generated code shows 38 instructions and
clobbers the registers EBP
, EDI
and
ESI
without necessity; the properly optimised code
shows just 30 instructions and avoids 6 superfluous memory accesses.
Compile the source file case3.c
a second time with
Clang, now with the preprocessor macro
ALTERNATE
defined, again targeting the
i386 platform, and display the generated assembly code:
clang -DALTERNATE -o- -O3 -S -target i386-pc-windows case3.c

Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] ___moddi3: # @__moddi3 # %bb.0: pushl %ebp # movl %esp, %ebp # pushl %ebx # push ebx pushl %edi # pushl %esi # andl $-8, %esp # subl $16, %esp # sub esp, 8 movl 20(%ebp), %ecx # mov eax, [esp+28] movl 12(%ebp), %eax # movl 16(%ebp), %edi # mov ecx, [esp+24] movl %esp, %ebx # push esp movl %ecx, %edx # movl %eax, %esi # sarl $31, %edx # cdq sarl $31, %esi # xorl %edx, %edi # xor ecx, edx xorl %edx, %ecx # xor eax, edx movl %edx, 4(%esp) # movl %edx, (%esp) # subl %edx, %edi # sub ecx, edx sbbl %edx, %ecx # sbb eax, edx # push eax # push ecx # mov eax, [esp+32] movl 8(%ebp), %edx # mov ecx, [esp+28] # cdq # mov ebx, edx xorl %esi, %eax # xor ecx, edx xorl %esi, %edx # xor eax, edx subl %esi, %edx # sub ecx, edx sbbl %esi, %eax # sbb eax, edx pushl %ebx # pushl %ecx # pushl %edi # pushl %eax # push eax pushl %edx # push ecx calll ___udivmoddi4 # call ___udivmoddi4 addl $20, %esp # add esp, 20 movl (%esp), %eax # pop eax movl 4(%esp), %edx # pop edx xorl %esi, %eax # xor eax, ebx xorl %esi, %edx # xor edx, ebx subl %esi, %eax # sub eax, ebx sbbl %esi, %edx # sbb edx, ebx leal -12(%ebp), %esp # popl %esi # popl %edi # popl %ebx # pop ebx popl %ebp # retl # ret […]OUCH: despite the command line option
-O3
given, the generated code shows 45 instructions and
clobbers the registers EBP
, EDI
and
ESI
without necessity; the properly optimised code
shows just 32 instructions and avoids 8 superfluous memory accesses.
Now compare the code shown above with the code from the
clang_rt.builtins-i386.lib
library, which is even
worse and shows 51 instructions:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
.arch generic32
.code32
.intel_syntax noprefix
.global ___moddi3
.type ___moddi3, @function
.text
___moddi3:
push ebp
mov ebp, esp
push ebx
push edi
push esi
and esp, -8
sub esp, 16
mov edx, [___security_cookie]
mov ecx, [ebp+20]
mov esi, [ebp+16]
mov eax, [ebp+12]
mov edi, esp
xor edx, ebp
mov ebx, eax
mov [esp+8], edx
mov edx, ecx
sar edx, 31
add esi, edx
adc ecx, edx
xor esi, edx
sar ebx, 31
xor ecx, edx
mov edx, [ebp+8]
xor eax, ebx
xor edx, ebx
sub edx, ebx
sbb eax, ebx
push edi
push ecx
push esi
push eax
push edx
call ___udivmoddi4
add esp, 20
mov edi, [esp]
mov esi, [esp+4]
mov ecx, [esp+8]
xor edi, ebx
xor esi, ebx
sub edi, ebx
sbb esi, ebx
xor ecx, ebp
call @__security_check_cookie@4
mov eax, edi
mov edx, esi
lea esp, [ebp-12]
pop esi
pop edi
pop ebx
pop ebp
ret
.ident "clang version 10.0.0 "
.end
Create the text file case4llvm.c with the following source code, refactored from __udivmodti4(), in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// Effects: if modulus != 0, *modulus = numerator % denominator
// Returns: numerator / denominator
// Translated from Figure 3-40 of The PowerPC Compiler Writer's Guide
__uint128_t __udivmodti4(__uint128_t numerator, __uint128_t denominator, __uint128_t *modulus) {
union {
__uint128_t all;
struct {
unsigned long long low, high;
};
} dividend, divisor, quotient, remainder;
dividend.all = numerator;
divisor.all = denominator;
unsigned carry, shift;
// special cases, X is unknown, K != 0
if (dividend.high == 0) {
if (divisor.high == 0) {
// 0 X
// ---
// 0 X
if (modulus != 0)
*modulus = dividend.low % divisor.low;
return dividend.low / divisor.low;
}
// 0 X
// ---
// K X
if (modulus != 0)
*modulus = dividend.low;
return 0;
}
// dividend.high != 0
if (divisor.low == 0) {
if (divisor.high == 0) {
// K X
// ---
// 0 0
if (modulus != 0)
*modulus = dividend.high % divisor.low;
return dividend.high / divisor.low;
}
// divisor.high != 0
if (dividend.low == 0) {
// K 0
// ---
// K 0
if (modulus != 0) {
remainder.high = dividend.high % divisor.high;
remainder.low = 0;
*modulus = remainder.all;
}
return dividend.high / divisor.high;
}
// K K
// ---
// K 0
if ((divisor.high & (divisor.high - 1)) == 0) {
// divisor is a power of 2
if (modulus != 0) {
remainder.low = dividend.low;
remainder.high = dividend.high & (divisor.high - 1);
*modulus = remainder.all;
}
return dividend.high >> __builtin_ctzll(divisor.high);
}
// K K
// ---
// K 0
shift = __builtin_clzll(divisor.high)
- __builtin_clzll(dividend.high);
// 0 <= shift < 63 or shift large
if (shift > 62) {
if (modulus != 0)
*modulus = dividend.all;
return 0;
}
++shift;
// 0 < shift < 64
#ifdef OPTIMIZE
remainder.all = dividend.all >> shift;
#else
remainder.high = dividend.high >> shift;
remainder.low = (dividend.low >> shift)
| (dividend.high << (64 - shift));
#endif
quotient.high = dividend.low << (64 - shift);
quotient.low = 0;
} else {
// divisor.low != 0
if (divisor.high == 0) {
// K X
// ---
// 0 K
if ((divisor.low & (divisor.low - 1)) == 0) {
// divisor is a power of 2
if (modulus != 0)
*modulus = dividend.low & (divisor.low - 1);
if (divisor.low == 1)
return dividend.all;
shift = __builtin_ctzll(divisor.low);
#ifdef OPTIMIZE
quotient.all = dividend.all >> shift;
#else
quotient.high = dividend.high >> shift;
quotient.low = (dividend.low >> shift)
| (dividend.high << (64 - shift));
#endif
return quotient.all;
}
// K X
// ---
// 0 K
shift = __builtin_clzll(divisor.low)
- __builtin_clzll(dividend.high)
+ 65;
// 1 < shift < 128
#ifdef OPTIMIZE
remainder.all = dividend.all >> shift;
quotient.all = dividend.all << (128 - shift);
#else
if (shift == 64) {
remainder.high = 0;
remainder.low = dividend.high;
quotient.high = dividend.low;
quotient.low = 0;
} else if (shift < 64) {
// 1 < shift < 64
remainder.high = dividend.high >> shift;
remainder.low = (dividend.low >> shift)
| (dividend.high << (64 - shift));
quotient.high = dividend.low << (64 - shift);
quotient.low = 0;
} else {
// 64 < shift < 128
remainder.high = 0;
remainder.low = dividend.high >> (shift - 64);
quotient.high = (dividend.low >> (shift - 64))
| (dividend.high << (128 - shift));
quotient.low = dividend.low << (128 - shift);
}
#endif
} else {
// K X
// ---
// K K
shift = __builtin_clzll(divisor.high)
- __builtin_clzll(dividend.high);
// 0 <= shift < 64 or shift large
if (shift > 63) {
if (modulus != 0)
*modulus = dividend.all;
return 0;
}
++shift;
// 0 < shift <= 64
#ifdef OPTIMIZE
remainder.all = dividend.all >> shift;
quotient.high = dividend.low << (64 - shift);
#else
if (shift == 64) {
remainder.high = 0;
remainder.low = dividend.high;
quotient.high = dividend.low;
} else {
remainder.high = dividend.high >> shift;
remainder.low = (dividend.low >> shift)
| (dividend.high << (64 - shift));
quotient.high = dividend.low << (64 - shift);
}
#endif
quotient.low = 0;
}
}
// Not a special case
// q and r are initialized with:
// remainder.all = dividend.all >> shift;
// quotient.all = dividend.all << (128 - shift);
// 0 < shift < 128
for (carry = 0; shift > 0; --shift) {
// remainder:quotient = ((remainder:quotient) << 1) | carry
#ifdef OPTIMIZE
remainder.all = (remainder.all << 1) | (quotient.all >> 127);
quotient.all = (quotient.all << 1) | carry;
#else
remainder.high = (remainder.high << 1) | (remainder.low >> 63);
remainder.low = (remainder.low << 1) | (quotient.high >> 63);
quotient.high = (quotient.high << 1) | (quotient.low >> 63);
quotient.low = (quotient.low << 1) | carry;
#endif
#if 0
carry = 0;
if (remainder.all < divisor.all)
continue;
carry = 1;
remainder.all -= divisor.all;
#elif 0
carry = remainder.all >= divisor.all;
if (carry != 0)
remainder.all -= divisor.all;
#else
const __int128_t sign = (__int128_t) (divisor.all - remainder.all - 1) >> 127;
carry = sign & 1;
remainder.all -= divisor.all & sign;
#endif
}
#ifdef OPTIMIZE
quotient.all = (quotient.all << 1) | carry;
#else
quotient.high = (quotient.high << 1) | (quotient.low >> 63);
quotient.low = (quotient.low << 1) | carry;
#endif
if (modulus != 0)
*modulus = remainder.all;
return quotient.all;
}
Note: with the preprocessor macro OPTIMIZE defined, this function gains about 10 % to 25 % performance when compiled with Clang, but loses about the same amount when compiled with GCC!
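The third, branch-free variant of the loop body deserves a remark: interpreted as a signed value, (divisor.all - remainder.all - 1) >> 127 is -1 (all bits set) exactly when remainder >= divisor, so ANDing the divisor with it turns the subtraction into a no-op otherwise. A 64-bit sketch of the same trick (my own code, assuming both operands fit in 63 bits so the sign of the difference is meaningful):

// Sketch: branch-free "compare and conditionally subtract" step of the
// restoring-division loop, shown for 64-bit operands.
#include <assert.h>
#include <stdint.h>

static unsigned step(uint64_t *remainder, uint64_t divisor) {
    uint64_t mask = (uint64_t) ((int64_t) (divisor - *remainder - 1) >> 63);
    *remainder -= divisor & mask;   // subtract only if *remainder >= divisor
    return (unsigned) (mask & 1);   // carry: 1 if the subtraction happened
}

int main(void) {
    uint64_t r = 10;
    assert(step(&r, 7) == 1 && r == 3);     // 10 >= 7: subtracted
    assert(step(&r, 7) == 0 && r == 3);     // 3 < 7: unchanged
    return 0;
}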
Create the text file case4gcc.c
with the following
source code, refactored from
libgcc2.c
,
in the same directory as case4llvm.c
:
// Copyleft © 2015-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
// Copyright © 1989-2024 Free Software Foundation, Inc.
typedef unsigned long long uint64_t;
static inline
uint64_t DIV(uint64_t dividend_hi, uint64_t dividend_lo, uint64_t divisor, uint64_t *remainder) {
uint64_t quotient;
__asm__("divq\t%[divisor]"
:"=a" (quotient),
"=d" (*remainder)
:"a" (dividend_lo),
"d" (dividend_hi),
[divisor] "rm" (divisor)
:"cc");
return quotient;
}
__uint128_t __udivmodti4(__uint128_t numerator, __uint128_t denominator, __uint128_t *modulus) {
union {
__uint128_t all;
struct {
uint64_t low, high;
};
} dividend, divisor, quotient, product;
unsigned bn, bm;
dividend.all = numerator;
divisor.all = denominator;
if (divisor.high == 0) {
if (divisor.low > dividend.high) // 0:q = n:n / 0:d
quotient.high = 0;
else // q:q = n:n / 0:d
quotient.high = DIV(0, dividend.high, divisor.low, ÷nd.high);
quotient.low = DIV(dividend.high, dividend.low, divisor.low, ÷nd.low);
// remainder in dividend.low
if (modulus != 0) {
dividend.high = 0;
*modulus = dividend.all;
}
} else {
if (divisor.high > dividend.high) { // 0:0 = n:n / d:d
quotient.all = 0;
// remainder in dividend.all
if (modulus != 0)
*modulus = dividend.all;
} else { // 0:q = n:n / d:d
bm = __builtin_clzll(divisor.high);
if (bm == 0) {
// from most significant bit of divisor.high is set
// and dividend.high >= divisor.high
// follows most significant bit of dividend.high is set
// as well as quotient.low is either 0 or 1
//
// this special case is necessary, not an optimization!
// the condition on the next line takes advantage that
// (due to program flow) dividend.high >= divisor.high
if (dividend.high > divisor.high || dividend.low >= divisor.low) {
dividend.all -= divisor.all;
quotient.low = 1;
} else
quotient.low = 0;
quotient.high = 0;
if (modulus != 0)
*modulus = dividend.all;
} else { // normalize
bn = 64 - bm;
quotient.high = dividend.high >> bn;
#ifdef OPTIMIZE
dividend.all <<= bm;
divisor.all <<= bm;
#else
dividend.high <<= bm;
dividend.high |= dividend.low >> bn;
dividend.low <<= bm;
divisor.high <<= bm;
divisor.high |= divisor.low >> bn;
divisor.low <<= bm;
#endif
quotient.low = DIV(quotient.high, dividend.high, divisor.high, ÷nd.high);
quotient.high = 0;
product.all = quotient.all * divisor.low;
if (product.all > dividend.all) {
product.all -= divisor.all;
quotient.low -= 1;
}
// remainder is (dividend - product) >> bm
if (modulus != 0) {
dividend.all -= product.all;
#ifdef OPTIMIZE
dividend.all >>= bm;
#else
dividend.low >>= bm;
dividend.low |= dividend.high << bn;
dividend.high >>= bm;
#endif
*modulus = dividend.all;
}
}
}
}
return quotient.all;
}
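The DIV helper above wraps the divq instruction, which divides the 128-bit value in RDX:RAX by a 64-bit operand, leaving the quotient in RAX and the remainder in RDX; the caller must ensure that the quotient fits into 64 bits (i.e. the high part of the dividend is smaller than the divisor), otherwise the processor raises a division exception. A trivial usage sketch, assuming the definitions from case4gcc.c above:

// Sketch: (2**64 + 0) / 3 via the DIV helper; 1 < 3, so the quotient fits.
static uint64_t div_example(void) {
    uint64_t remainder;
    uint64_t quotient = DIV(1, 0, 3, &remainder);
    // quotient == 0x5555555555555555, remainder == 1
    return quotient + remainder;
}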
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
# * The software is provided "as is" without any warranty, neither express
# nor implied.
# * In no event will the author be held liable for any damage(s) arising
# from the use of the software.
# * Redistribution of the software is allowed only in unmodified form.
# * Permission is granted to use the software solely for personal private
# and non-commercial purposes.
# * An individuals use of the software in his or her capacity or function
# as an agent, (independent) contractor, employee, member or officer of
# a business, corporation or organization (commercial or non-commercial)
# does not qualify as personal private and non-commercial purpose.
# * Without written approval from the author the software must not be used
# for a business, for commercial, corporate, governmental, military or
# organizational purposes of any kind, or in a commercial, corporate,
# governmental, military or organizational environment of any kind.
# Unix System V calling convention for AMD64 platform:
# - first 6 integer or pointer arguments (from left to right) are passed
# in registers RDI/R7, RSI/R6, RDX/R2, RCX/R1, R8 and R9
# (R10 is used as static chain pointer in case of nested functions);
# - surplus arguments are pushed on stack in reverse order (from right to
# left), 8-byte aligned;
# - 128-bit integer arguments are passed as pair of 64-bit integer arguments,
# low part before/below high part;
# - 128-bit integer result is returned in registers RAX/R0 (low part) and
# RDX/R2 (high part);
# - 64-bit integer or pointer result is returned in register RAX/R0;
# - 32-bit integer result is returned in register EAX;
# - registers RBX/R3, RSP/R4, RBP/R5, R12, R13, R14 and R15 must be
# preserved;
# - registers RAX/R0, RCX/R1, RDX/R2, RSI/R6, RDI/R7, R8, R9, R10 (in
# case of normal functions) and R11 are volatile and can be clobbered;
# - stack is 16-byte aligned: callee must decrement RSP by 8+n*16 bytes
# before calling other functions (CALL instruction pushes 8 bytes);
# - a "red zone" of 128 bytes below the stack pointer can be clobbered.
# NOTE: raises "division exception" when divisor is 0!
.file "umodti3.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = dividend
# rcx:rdx = divisor
__umodti3:
sub rsp, 24
mov r8, rsp # r8 = oword ptr remainder
call __udivmodti4
pop rax
pop rdx # rdx:rax = remainder
pop rcx
ret
# rsi:rdi = dividend
# rcx:rdx = divisor
__udivti3:
xor r8, r8
# rsi:rdi = dividend
# rcx:rdx = divisor
# r8 = oword ptr remainder
__udivmodti4:
cmp rdi, rdx
mov rax, rsi
sbb rax, rcx
jb .trivial # dividend < divisor?
mov r10, rdx
mov r11, rcx # r11:r10 = divisor
bsr rcx, rcx # rcx = index of most significant '1' bit
# in high qword of divisor
jnz .extended # high qword of divisor <> 0?
# divisor < 2**64
cmp rsi, rdx
jae .long # high qword of dividend >= (low qword of) divisor?
# dividend < divisor * 2**64: quotient < 2**64
# perform normal division
.normal:
mov rdx, rsi
mov rax, rdi # rdx:rax = dividend
div r10 # rax = (low qword of) quotient,
# rdx = (low qword of) remainder
test r8, r8
jz 0f # address of remainder = 0?
mov [r8], rdx
mov [r8+8], r11 # high qword of remainder = 0
0:
mov rdx, r11 # rdx:rax = quotient
ret
# dividend >= divisor * 2**64: quotient >= 2**64
# perform "long" alias "schoolbook" division
.long:
mov rdx, r11 # rdx = 0
mov rax, rsi # rdx:rax = high qword of dividend
div r10 # rax = high qword of quotient,
# rdx = high qword of remainder'
mov rcx, rax # rcx = high qword of quotient
mov rax, rdi # rax = low qword of dividend
div r10 # rax = low qword of quotient,
# rdx = (low qword of) remainder
test r8, r8
jz 1f # address of remainder = 0?
mov [r8], rdx
mov [r8+8], r11 # high qword of remainder = 0
1:
mov rdx, rcx # rdx:rax = quotient
ret
# dividend < divisor: quotient = 0, remainder = dividend
.trivial:
test r8, r8
jz 2f # address of remainder = 0?
mov [r8], rdi
mov [r8+8], rsi # remainder = dividend
2:
xor eax, eax
xor edx, edx # rdx:rax = quotient = 0
ret
# dividend >= divisor >= 2**64: quotient < 2**64
.extended:
xor ecx, 63 # ecx = number of leading '0' bits
# in (high qword of) divisor
jz .special # divisor >= 2**127?
# perform "extended & adjusted" division
mov r9, r11 # r9 = high qword of divisor
shld r9, r10, cl # r9 = divisor / 2**(index + 1)
# = divisor'
# shl r10, cl
mov rax, rdi
mov rdx, rsi # rdx:rax = dividend
push rbx
.ifnotdef JCCLESS
xor ebx, ebx # rbx = high qword of quotient' = 0
cmp rdx, r9
jb 3f # high qword of dividend < divisor'?
# high qword of dividend >= divisor':
# subtract divisor' from high qword of dividend to prevent possible
# quotient overflow and set most significant bit of quotient"
sub rdx, r9 # rdx = high qword of dividend - divisor'
# = high qword of dividend'
inc ebx # rbx = high qword of quotient' = 1
3:
.elseif 0
sub rdx, r9 # rdx = high qword of dividend - divisor'
sbb rbx, rbx # rbx = (high qword of dividend < divisor') ? -1 : 0
and rbx, r9 # rbx = (high qword of dividend < divisor') ? divisor' : 0
add rdx, rbx # rdx = high qword of dividend
# - (high qword of dividend < divisor') ? 0 : divisor'
# = high qword of dividend'
neg rbx # CF = (high qword of dividend < divisor')
sbb ebx, ebx # ebx = (high qword of dividend < divisor') ? -1 : 0
inc ebx # rbx = (high qword of dividend < divisor') ? 0 : 1
# = high qword of quotient'
.elseif 0
sub rdx, r9 # rdx = high qword of dividend - divisor'
cmovb rdx, rsi # = high qword of dividend'
sbb ebx, ebx # ebx = (high qword of dividend < divisor') ? -1 : 0
inc ebx # rbx = (high qword of dividend < divisor') ? 0 : 1
# = high qword of quotient'
.else # JCCLESS
xor ebx, ebx # rbx = high qword of quotient' = 0
sub rdx, r9 # rdx = high qword of dividend - divisor'
cmovb rdx, rsi # = high qword of dividend'
sbb ebx, -1 # rbx = (high qword of dividend < divisor') ? 0 : 1
# = high qword of quotient'
.endif # JCCLESS
div r9 # rax = dividend' / divisor'
# = low qword of quotient',
# rdx = remainder'
shld rbx, rax, cl # rbx = quotient' / 2**(index + 1)
# = dividend / divisor'
# = quotient"
# shl rax, cl
mov rax, r10 # rax = low qword of divisor
mov r9, r11 # r9 = high qword of divisor
imul r9, rbx # r9 = high qword of divisor * quotient"
mul rbx # rdx:rax = low qword of divisor * quotient"
.ifnotdef JCCLESS
add rdx, r9 # rdx:rax = divisor * quotient"
jnc 4f # divisor * quotient" < 2**64?
.if 0
sbb rbx, 0 # rbx = quotient" - 1
.else
dec rbx # rbx = quotient" - 1
.endif
sub rax, r10
sbb rdx, r11 # rdx:rax = divisor * (quotient" - 1)
4:
sub rdi, rax
sbb rsi, rdx # rsi:rdi = dividend - divisor * quotient"
# = remainder"
.else # JCCLESS
sub rdi, rax
sbb rsi, rdx # rsi:rdi = dividend
# - low qword of divisor * quotient"
sub rsi, r9 # rsi:rdi = dividend - divisor * quotient"
# = remainder"
.endif # JCCLESS
jnb 5f # remainder" >= 0?
# (with borrow, it is off by divisor,
# and quotient" is off by 1)
.if 0
sbb rbx, 0 # rbx = quotient" - 1
# = quotient
.else
dec rbx # rbx = quotient" - 1
# = quotient
.endif
add rdi, r10
adc rsi, r11 # rsi:rdi = remainder" + divisor
# = remainder
5:
test r8, r8
jz 6f # address of remainder = 0?
mov [r8], rdi
mov [r8+8], rsi # remainder = rsi:rdi
6:
mov rax, rbx # rax = (low qword of) quotient
xor edx, edx # rdx:rax = quotient
pop rbx
ret
# dividend >= divisor >= 2**127:
# quotient = 1, remainder = dividend - divisor
.special:
test r8, r8
jz 7f # address of remainder = 0?
sub rdi, r10
sbb rsi, r11 # rsi:rdi = dividend - divisor
# = remainder
mov [r8], rdi
mov [r8+8], rsi
7:
xor eax, eax
xor edx, edx
inc eax # rdx:rax = quotient = 1
ret
.size __udivmodti4, .-__udivmodti4
.type __udivmodti4, @function
.global __udivmodti4
.size __udivti3, .-__udivti3
.type __udivti3, @function
.global __udivti3
.size __umodti3, .-__umodti3
.type __umodti3, @function
.global __umodti3
.end
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
# Common "cdecl" calling and naming convention for i386 platform:
# - arguments are pushed on stack in reverse order (from right to left),
# 4-byte aligned;
# - 64-bit integer arguments are passed as pair of 32-bit integer arguments,
# low part below high part;
# - 64-bit integer result is returned in registers EAX (low part) and
# EDX (high part);
# - 32-bit integer or pointer result is returned in register EAX;
# - registers EAX, ECX and EDX are volatile and can be clobbered;
# - registers EBX, ESP, EBP, ESI and EDI must be preserved;
# - function names are prefixed with an underscore.
# NOTE: raises "division exception" when divisor is 0!
.file "udivmoddi4.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+20] = (optional) qword ptr remainder
# [esp+16] = high dword of divisor
# [esp+12] = low dword of divisor
# [esp+8] = high dword of dividend
# [esp+4] = low dword of dividend
__udivmoddi4:
mov ecx, [esp+8] # ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] # edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb .trivial # dividend < divisor?
bsr ecx, edx # ecx = index of most significant '1' bit
# in high dword of divisor
jnz .extended # high dword of divisor <> 0?
# remainder < divisor < 2**32
mov ecx, eax # ecx = (low dword of) divisor
mov eax, [esp+8] # eax = high dword of dividend
cmp eax, ecx
jae .long # high dword of dividend >= divisor?
# perform normal division
.normal:
mov edx, eax # edx = high dword of dividend
mov eax, [esp+4] # edx:eax = dividend
div ecx # eax = (low dword of) quotient,
# edx = (low dword of) remainder
mov ecx, [esp+20] # ecx = address of remainder
test ecx, ecx
jz 0f # address of remainder = 0?
mov [ecx], edx # [ecx] = (low dword of) remainder
xor edx, edx
mov [ecx+4], edx
0:
xor edx, edx # edx:eax = quotient
ret
# perform "long" alias "schoolbook" division
.long:
# xor edx, edx # edx:eax = high dword of dividend
div ecx # eax = high dword of quotient,
# edx = high dword of remainder'
push eax # [esp] = high dword of quotient
mov eax, [esp+8] # eax = low dword of dividend
div ecx # eax = low dword of quotient,
# edx = (low dword of) remainder
mov ecx, [esp+24] # ecx = address of remainder
test ecx, ecx
jz 1f # address of remainder = 0?
mov [ecx], edx # [ecx] = (low dword of) remainder
xor edx, edx
mov [ecx+4], edx
1:
pop edx # edx:eax = quotient
ret
# dividend < divisor: quotient = 0, remainder = dividend
.trivial:
mov ecx, [esp+20] # ecx = address of remainder
test ecx, ecx
jz 2f # address of remainder = 0?
mov eax, [esp+4]
mov edx, [esp+8] # edx:eax = dividend
mov [ecx], eax
mov [ecx+4], edx # [ecx] = remainder = dividend
2:
xor eax, eax
xor edx, edx # edx:eax = quotient = 0
ret
# dividend >= divisor >= 2**32: quotient < 2**32
.extended:
xor ecx, 31 # ecx = number of leading '0' bits
# in (high dword of) divisor
jz .special # divisor >= 2**63?
# perform "extended & adjusted" division
shld edx, eax, cl # edx = divisor / 2**(index + 1)
# = divisor'
# shl eax, cl
push ebx
mov ebx, edx # ebx = divisor'
.ifnotdef JCCLESS
xor eax, eax # eax = high dword of quotient' = 0
mov edx, [esp+12] # edx = high dword of dividend
cmp edx, ebx
jb 3f # high dword of dividend < divisor'?
# high dword of dividend >= divisor':
# subtract divisor' from high dword of dividend to prevent possible
# quotient overflow and set most significant bit of quotient"
sub edx, ebx # edx = high dword of dividend - divisor'
# = high dword of dividend'
inc eax # eax = high dword of quotient' = 1
3:
push eax # [esp] = high dword of quotient'
.else # JCCLESS
mov edx, [esp+12] # edx = high dword of dividend
cmp edx, ebx # CF = (high dword of dividend < divisor')
sbb eax, eax # eax = (high dword of dividend < divisor') ? -1 : 0
inc eax # eax = (high dword of dividend < divisor') ? 0 : 1
# = high dword of quotient'
push eax # [esp] = high dword of quotient'
.if 0
neg eax # eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
imul eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
sub edx, eax # edx = high dword of dividend
# - (high dword of dividend < divisor') ? 0 : divisor'
# = high dword of dividend'
.endif # JCCLESS
mov eax, [esp+12] # eax = low dword of dividend
# = low dword of dividend'
div ebx # eax = dividend' / divisor'
# = low dword of quotient',
# edx = remainder'
pop ebx # ebx = high dword of quotient'
shld ebx, eax, cl # ebx = quotient' / 2**(index + 1)
# = dividend / divisor'
# = quotient"
# shl eax, cl
push ebx # [esp] = quotient"
mov eax, [esp+20] # eax = low dword of divisor
mul ebx # edx:eax = low dword of divisor * quotient"
imul ebx, [esp+24] # ebx = high dword of divisor * quotient"
mov ecx, [esp+16] # ecx = high dword of dividend
sub ecx, ebx # ecx = dividend
# - low dword of divisor * quotient"
mov ebx, [esp+12] # ebx = low dword of dividend
sub ebx, eax
sbb ecx, edx # ecx:ebx = dividend - divisor * quotient"
# = remainder"
.ifnotdef JCCLESS
pop eax # eax = quotient"
jnb 4f # remainder" >= 0?
# (with borrow, it is off by divisor,
# and quotient" is off by 1)
add ebx, [esp+16]
adc ecx, [esp+20] # ecx:ebx = remainder" + divisor
# = remainder
dec eax # eax = quotient" - 1
# = (low dword of) quotient
4:
.else # JCCLESS
sbb eax, eax # eax = (remainder" < 0) ? -1 : 0
cdq # edx = (remainder" < 0) ? -1 : 0
add [esp], eax # [esp] = quotient" - (remainder" < 0)
# = (low dword of) quotient
and eax, [esp+20]
and edx, [esp+24] # edx:eax = (remainder" < 0) ? divisor : 0
add ebx, eax
adc ecx, edx # ecx:ebx = remainder" + divisor
# = remainder
pop eax # eax = (low dword of) quotient
.endif # JCCLESS
mov edx, [esp+24] # edx = address of remainder
test edx, edx
jz 5f # address of remainder = 0?
mov [edx], ebx
mov [edx+4], ecx # [edx] = remainder
5:
xor edx, edx # edx:eax = quotient
pop ebx
ret
# dividend >= divisor >= 2**63:
# quotient = 1, remainder = dividend - divisor
.special:
or ecx, [esp+20] # ecx = address of remainder
jz 6f # address of remainder = 0?
.if 0
neg edx
neg eax
sbb edx, 0 # edx:eax = -divisor
add eax, [esp+4]
adc edx, [esp+8] # edx:eax = dividend - divisor
# = remainder
.else
mov eax, [esp+4]
mov edx, [esp+8] # edx:eax = dividend
sub eax, [esp+12]
sbb edx, [esp+16] # edx:eax = dividend - divisor
# = remainder
.endif
mov [ecx], eax
mov [ecx+4], edx # [ecx] = remainder
6:
xor eax, eax
xor edx, edx
inc eax # edx:eax = quotient = 1
ret
.size __udivmoddi4, .-__udivmoddi4
.type __udivmoddi4, @function
.global __udivmoddi4
.end
Create the text file case4.c
with the following content
in the same directory as case4gcc.c
and
case4llvm.c
:
// Copyright © 2015-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
extern
__uint128_t __udivmodti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder);
__attribute__ ((noinline))
static
__uint128_t __unopti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder)
{
if (remainder != NULL)
*remainder = divisor;
return dividend;
}
__attribute__ ((always_inline))
static
__uint128_t lfsr128(void)
{
// 128-bit linear feedback shift register (Galois form) using
// primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D,
// initialised with 2**128 / golden ratio
static __uint128_t lfsr = (__uint128_t) 0x9E3779B97F4A7C15 << 64 | 0xF39CC0605CEDC834;
const __uint128_t poly = (__uint128_t) 0x5DB2B62B0C5F8E1B << 64 | 0xD8CCE715FCB2726D;
const __uint128_t sign = (__int128_t) lfsr >> 127;
return lfsr = (lfsr << 1) ^ (poly & sign);
}
__attribute__ ((always_inline))
static
__uint128_t lfsr64(void)
{
// 64-bit linear feedback shift register (Galois form) using
// primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
// initialised with 2**64 / golden ratio
static uint64_t lfsr = 0x9E3779B97F4A7C15;
const uint64_t sign = (int64_t) lfsr >> 63;
return lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
}
__attribute__ ((always_inline))
static
__uint128_t lfsr32(void)
{
// 32-bit linear feedback shift register (Galois form) using
// primitive polynomial 0xDB710641 (CRC-32 IEEE),
// initialised with 2**32 / golden ratio
static uint32_t lfsr = 0x9E3779B9;
const uint32_t sign = (int32_t) lfsr >> 31;
return lfsr = (lfsr << 1) ^ (0xDB710641 & sign);
}
int main(void)
{
clock_t t0, t1, t2, tt;
uint32_t n;
__uint128_t dividend, divisor = ~0, remainder;
volatile __uint128_t quotient;
t0 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr128();
quotient = __unopti4(dividend, divisor, NULL);
divisor = lfsr64();
quotient = __unopti4(dividend, divisor, &remainder);
}
t1 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr128();
quotient = __udivmodti4(dividend, divisor, NULL);
divisor = lfsr64();
quotient = __udivmodti4(dividend, divisor, &remainder);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__unopti4() %4lu.%06lu 0\n"
"__udivmodti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu nano-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
t0 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr128();
quotient = __unopti4(dividend, divisor, NULL);
divisor = lfsr32();
quotient = __unopti4(dividend, divisor, &remainder);
}
t1 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr128();
quotient = __udivmodti4(dividend, divisor, NULL);
divisor = lfsr32();
quotient = __udivmodti4(dividend, divisor, &remainder);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__unopti4() %4lu.%06lu 0\n"
"__udivmodti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu nano-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
t0 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr64();
quotient = __unopti4(dividend, divisor, NULL);
divisor = lfsr32();
quotient = __unopti4(dividend, divisor, &remainder);
}
t1 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr64();
quotient = __udivmodti4(dividend, divisor, NULL);
divisor = lfsr32();
quotient = __udivmodti4(dividend, divisor, &remainder);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__unopti4() %4lu.%06lu 0\n"
"__udivmodti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu nano-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
t0 = clock();
for (n = 500000u; n > 0u; n--)
{
quotient = __unopti4(~0, 3, NULL);
quotient = __unopti4(~0, 3, &remainder);
}
t1 = clock();
for (n = 500000u; n > 0u; n--)
{
quotient = __udivmodti4(~0, 3, NULL);
quotient = __udivmodti4(~0, 3, &remainder);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__unopti4() %4lu.%06lu 0\n"
"__udivmodti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu micro-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
// Copyright © 2015-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
extern
__uint128_t __udivmodti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder);
__attribute__ ((noinline))
static
__uint128_t __unopti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder)
{
if (remainder != NULL)
*remainder = divisor;
return dividend;
}
__attribute__ ((always_inline))
static
__uint128_t lfsr128l(void)
{
// 128-bit linear feedback shift register (Galois form) using
// primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D,
// initialised with 2**128 / golden ratio
static __uint128_t lfsr = (__uint128_t) 0x9E3779B97F4A7C15 << 64 | 0xF39CC0605CEDC834;
const __uint128_t poly = (__uint128_t) 0x5DB2B62B0C5F8E1B << 64 | 0xD8CCE715FCB2726D;
const __uint128_t sign = (__int128_t) lfsr >> 127;
return lfsr = (lfsr << 1) ^ (poly & sign);
}
__attribute__ ((always_inline))
static
__uint128_t lfsr128r(void)
{
// 128-bit linear feedback shift register (Galois form) using
// primitive polynomial 0xB64E4D3FA8E7331B:D871FA30D46D4DBA,
// initialised with bit-vector of prime numbers: 2**n == prime
static __uint128_t lfsr = (__uint128_t) 0x800228A202088288 << 64 | 0x28208A20A08A28AC;
const __uint128_t poly = (__uint128_t) 0xB64E4D3FA8E7331B << 64 | 0xD871FA30D46D4DBA;
const __uint128_t mask = 0 - (lfsr & 1);
return lfsr = (lfsr >> 1) ^ (poly & mask);
}
int main(void)
{
clock_t t0, t1, t2, tt;
uint32_t m, n;
__uint128_t dividend, divisor = ~0, remainder;
volatile __uint128_t quotient;
for (m = 0u; m < 64u; m += m + 1u)
{
t0 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr128l();
dividend >>= dividend & m;
quotient = __unopti4(dividend, divisor, NULL);
divisor = lfsr128r();
divisor >>= divisor & m;
quotient = __unopti4(dividend, divisor, &remainder);
}
t1 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr128l();
dividend >>= dividend & m;
quotient = __udivmodti4(dividend, divisor, NULL);
divisor = lfsr128r();
divisor >>= divisor & m;
quotient = __udivmodti4(dividend, divisor, &remainder);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__unopti4() %4lu.%06lu 0\n"
"__udivmodti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu nano-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
t0 = clock();
for (n = 500000u; n > 0u; n--)
{
quotient = __unopti4(~0, 3, NULL);
quotient = __unopti4(~0, 3, &remainder);
}
t1 = clock();
for (n = 500000u; n > 0u; n--)
{
quotient = __udivmodti4(~0, 3, NULL);
quotient = __udivmodti4(~0, 3, &remainder);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__unopti4() %4lu.%06lu 0\n"
"__udivmodti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu micro-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
// Copyright © 2015-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
extern
uint64_t __udivmoddi4(uint64_t dividend, uint64_t divisor, uint64_t *remainder);
__attribute__ ((noinline))
static
uint64_t __unopdi4(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
if (remainder != NULL)
*remainder = divisor;
return dividend;
}
__attribute__ ((always_inline))
static
uint64_t lfsr64l(void)
{
// 64-bit linear feedback shift register (Galois form) using
// primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
// initialised with 2**64 / golden ratio
static uint64_t lfsr = 0x9E3779B97F4A7C15;
const uint64_t sign = (int64_t) lfsr >> 63;
return lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
}
__attribute__ ((always_inline))
static
uint64_t lfsr64r(void)
{
// 64-bit linear feedback shift register (Galois form) using
// primitive polynomial 0x2B5926535897936B (CRC-64 "Jones"),
// initialised with bit-vector of prime numbers: 2**n == prime
static uint64_t lfsr = 0x28208A20A08A28AC;
const uint64_t mask = 0 - (lfsr & 1);
return lfsr = (lfsr >> 1) ^ (0x2B5926535897936B & mask);
}
int main(void)
{
clock_t t0, t1, t2, tt;
uint32_t m, n;
uint64_t dividend, divisor = ~0, remainder;
volatile uint64_t quotient;
for (m = 0u; m < 32u; m += m + 1u)
{
t0 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr64l();
dividend >>= dividend & m;
quotient = __unopdi4(dividend, divisor, NULL);
divisor = lfsr64r();
divisor >>= divisor & m;
quotient = __unopdi4(dividend, divisor, &remainder);
}
t1 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr64l();
dividend >>= dividend & m;
quotient = __udivmoddi4(dividend, divisor, NULL);
divisor = lfsr64r();
divisor >>= divisor & m;
quotient = __udivmoddi4(dividend, divisor, &remainder);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__unopdi4() %4lu.%06lu 0\n"
"__udivmoddi4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu nano-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
t0 = clock();
for (n = 500000u; n > 0u; n--)
{
quotient = __unopdi4(~0, 3, NULL);
quotient = __unopdi4(~0, 3, &remainder);
}
t1 = clock();
for (n = 500000u; n > 0u; n--)
{
quotient = __udivmoddi4(~0, 3, NULL);
quotient = __udivmoddi4(~0, 3, &remainder);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__unopdi4() %4lu.%06lu 0\n"
"__udivmoddi4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu micro-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
Note: modification of the source files
case4llvm.c
, case4gcc.c
and
case4.c
to demonstrate the equally disastrous
(mis)performance of the 64÷64-bit integer division on 32-bit
processors is left as an exercise to the reader.
On your favourite machine with an AMD64 processor, where both Clang and the GNU Compiler Collection are installed, run the following command lines to compile, link and run the benchmark program:
lscpu
gcc -mno-sse -O3 case4.c case4gcc.c
echo 'GCC/libgcc'
./a.out
clang -mno-sse -O3 case4.c case4llvm.c
echo 'LLVM/clang/compiler-rt'
./a.out

Note: if you prefer to use the library function __udivmodti4() provided by the respective compiler, run the following command lines instead:
lscpu
gcc -mno-sse -O3 case4.c
echo 'GCC/libgcc'
./a.out
clang -mno-sse -O3 -rtlib=compiler-rt case4.c
echo 'LLVM/clang/compiler-rt'
./a.out
Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 43 bits physical, 48 bits virtual CPU(s): 12 On-line CPU(s) list: 0-11 Thread(s) per core: 2 Core(s) per socket: 6 Socket(s): 1 NUMA node(s): 1 Vendor ID: AuthenticAMD CPU family: 23 Model: 113 Model name: AMD Ryzen 5 3600 6-Core Processor Stepping: 0 Frequency boost: enabled CPU MHz: 2195.688 CPU max MHz: 3600,0000 CPU min MHz: 2200,0000 BogoMIPS: 7186.29 Virtualization: AMD-V L1d cache: 192 KiB L1i cache: 192 KiB L2 cache: 3 MiB L3 cache: 32 MiB NUMA node0 CPU(s): 0-11 […] GCC/libgcc LLVM/clang/compiler-rt __unopti4() 3.308552 0 3.411228 0 __udivmodti4() 8.207382 4.898830 12.463517 9.052289 11.515934 nano-seconds 15.874745 nano-seconds __unopti4() 3.291094 0 3.384842 0 __udivmodti4() 8.461054 5.169960 13.061772 9.676930 11.752148 nano-seconds 16.446614 nano-seconds __unopti4() 3.290164 0 3.379618 0 __udivmodti4() 8.141390 4.851226 13.752975 10.373357 11.431554 nano-seconds 17.132593 nano-seconds __unopti4() 3.342998 0 3.388291 0 __udivmodti4() 8.466321 5.123323 15.481519 12.093228 11.809319 nano-seconds 18.869810 nano-seconds __unopti4() 3.356793 0 3.422627 0 __udivmodti4() 8.335018 4.978225 18.226895 14.804268 11.691811 nano-seconds 21.649522 nano-seconds __unopti4() 3.333240 0 3.383521 0 __udivmodti4() 8.313948 4.980708 24.646997 21.263476 11.647188 nano-seconds 28.030518 nano-seconds __unopti4() 3.329830 0 3.435418 0 __udivmodti4() 8.615062 5.285232 37.806325 34.370907 11.944892 nano-seconds 41.241743 nano-seconds __unopti4() 1.668976 0 1.695685 0 __udivmodti4() 12.617999 10.949023 167.891651 166.195966 14.286975 nano-seconds 169.587336 nano-seconds __unopti4() 1.760362 0 1.708451 0 __udivmodti4() 18.046460 16.286098 246.065291 244.356840 19.806822 nano-seconds 247.773742 nano-seconds __unopti4() 1.248846 0 1.463868 0 __udivmodti4() 7.155582 5.906736 10.833658 9.369790 8.404428 nano-seconds 12.297526 nano-seconds
Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 43 bits physical, 48 bits virtual CPU(s): 12 On-line CPU(s) list: 0-11 Thread(s) per core: 2 Core(s) per socket: 6 Socket(s): 1 NUMA node(s): 1 Vendor ID: AuthenticAMD CPU family: 23 Model: 113 Model name: AMD Ryzen 5 3600 6-Core Processor Stepping: 0 Frequency boost: enabled CPU MHz: 2195.688 CPU max MHz: 3600,0000 CPU min MHz: 2200,0000 BogoMIPS: 7186.29 Virtualization: AMD-V L1d cache: 192 KiB L1i cache: 192 KiB L2 cache: 3 MiB L3 cache: 32 MiB NUMA node0 CPU(s): 0-11 […] GCC/libgcc LLVM/clang/compiler-rt __unopti4() 1.668976 0 1.695685 0 __udivmodti4() 12.617999 10.949023 167.891651 166.195966 14.286975 nano-seconds 169.587336 nano-seconds __unopti4() 1.760362 0 1.708451 0 __udivmodti4() 18.046460 16.286098 246.065291 244.356840 19.806822 nano-seconds 247.773742 nano-seconds __unopti4() 1.248846 0 1.463868 0 __udivmodti4() 7.155582 5.906736 10.833658 9.369790 8.404428 nano-seconds 12.297526 nano-secondsOUCH: in the
best case, the executable generated by Clang performs the 128÷128-bit integer division 1.6 times slower than the executable generated by GCC, while in the worst case it is 15 (in words: fifteen) times slower – an utterly disastrous result!
Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 1 NUMA node(s): 1 Vendor ID: AuthenticAMD CPU family: 23 Model: 49 Model name: AMD EPYC 7262 8-Core Processor Stepping: 0 CPU MHz: 3193.726 BogoMIPS: 6387.45 Virtualization: AMD-V L1d cache: 32K L1i cache: 32K L2 cache: 512K L3 cache: 16384K NUMA node0 CPU(s): 0-15 […] GCC/libgcc LLVM/clang/compiler-rt __unopti4() 4.280000 0 3.980000 0 __udivmodti4() 10.230000 5.950000 15.810000 11.830000 14.510000 nano-seconds 19.790000 nano-seconds __unopti4() 4.290000 0 3.990000 0 __udivmodti4() 10.540000 6.250000 16.700000 12.710000 14.830000 nano-seconds 20.690000 nano-seconds __unopti4() 4.280000 0 3.990000 0 __udivmodti4() 10.090000 5.810000 17.980000 13.990000 14.370000 nano-seconds 21.970000 nano-seconds __unopti4() 4.280000 0 3.990000 0 __udivmodti4() 10.380000 6.100000 20.280000 16.290000 14.660000 nano-seconds 24.270000 nano-seconds __unopti4() 4.280000 0 3.990000 0 __udivmodti4() 10.340000 6.060000 24.560000 20.570000 14.620000 nano-seconds 28.550000 nano-seconds __unopti4() 4.280000 0 3.990000 0 __udivmodti4() 10.300000 6.020000 33.150000 29.160000 14.580000 nano-seconds 37.140000 nano-seconds __unopti4() 4.290000 0 3.990000 0 __udivmodti4() 10.580000 6.290000 51.200000 47.210000 14.870000 nano-seconds 55.190000 nano-seconds __unopti4() 1.920000 0 2.250000 0 __udivmodti4() 15.420000 13.500000 230.230000 227.980000 17.340000 nano-seconds 232.480000 nano-seconds __unopti4() 2.000000 0 2.280000 0 __udivmodti4() 22.120000 20.120000 340.400000 338.120000 24.120000 nano-seconds 342.680000 nano-seconds __unopti4() 1.620000 0 1.780000 0 __udivmodti4() 8.810000 7.190000 13.200000 11.420000 10.430000 nano-seconds 14.980000 nano-seconds
Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 1 NUMA node(s): 1 Vendor ID: AuthenticAMD CPU family: 23 Model: 49 Model name: AMD EPYC 7262 8-Core Processor Stepping: 0 CPU MHz: 3193.726 BogoMIPS: 6387.45 Virtualization: AMD-V L1d cache: 32K L1i cache: 32K L2 cache: 512K L3 cache: 16384K NUMA node0 CPU(s): 0-15 […] GCC/libgcc LLVM/clang/compiler-rt __unopti4() 1.920000 0 2.250000 0 __udivmodti4() 15.420000 13.500000 230.230000 227.980000 17.340000 nano-seconds 232.480000 nano-seconds __unopti4() 2.000000 0 2.280000 0 __udivmodti4() 22.120000 20.120000 340.400000 338.120000 24.120000 nano-seconds 342.680000 nano-seconds __unopti4() 1.620000 0 1.780000 0 __udivmodti4() 8.810000 7.190000 13.200000 11.420000 10.430000 nano-seconds 14.980000 nano-secondsOUCH: now the worst case is 17 (in words: seventeen) times slower!
Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 39 bits physical, 48 bits virtual CPU(s): 4 On-line CPU(s) list: 0-3 Thread(s) per core: 1 Core(s) per socket: 4 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 94 Model name: Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz Stepping: 3 CPU MHz: 3311.998 BogoMIPS: 6623.99 Hypervisor vendor: KVM Virtualization type: full L1d cache: 128 KiB L1i cache: 128 KiB L2 cache: 1 MiB L3 cache: 24 MiB NUMA node0 CPU(s): 0-3 […] GCC/libgcc LLVM/clang/compiler-rt __unopti4() 5.075482 0 5.077103 0 __udivmodti4() 34.428038 29.352556 183.407210 178.330107 39.503520 nano-seconds 188.484313 nano-seconds __unopti4() 4.652310 0 4.653633 0 __udivmodti4() 36.139848 31.487538 280.501125 275.847492 40.792158 nano-seconds 285.154758 nano-seconds __unopti4() 4.938115 0 4.650043 0 __udivmodti4() 8.971484 4.033369 12.585962 7.935919 13.909599 nano-seconds 17.236005 nano-secondsOops: in the
best case, the executable generated by Clang performs the 128÷128-bit integer division at half the speed of the executable generated by GCC, while in the worst case it is 9 (in words: nine) times slower – another devastating result!
The executable installer LLVM-10.0.0-win64.exe dumps the following duplicate files in the directory C:\Program Files\LLVM\bin\, wasting nearly 500 MB of disk space, which is about a third of the disk space occupied by the whole package:
C:\>DIR "C:\Program Files\LLVM\bin" /O:-S […] 03/25/2020 0:15 PM 83,258,880 clang-cl.exe 03/25/2020 0:03 PM 83,258,880 clang.exe 03/25/2020 0:15 PM 83,258,880 clang++.exe 03/25/2020 0:15 PM 83,258,880 clang-cpp.exe […] 03/25/2020 0:15 PM 57,812,480 lld-link.exe 03/25/2020 0:05 PM 57,812,480 lld.exe 03/25/2020 0:15 PM 57,812,480 ld.lld.exe 03/25/2020 0:15 PM 57,812,480 ld64.lld.exe 03/25/2020 0:15 PM 57,812,480 wasm-ld.exe […] 03/25/2020 0:15 PM 18,182,144 llvm-ranlib.exe 03/25/2020 0:15 PM 18,182,144 llvm-lib.exe 03/25/2020 0:00 PM 18,182,144 llvm-ar.exe […] C:\>FC.EXE /B "C:\Program Files\LLVM\bin\clang.exe" "C:\Program Files\LLVM\bin\clang-cl.exe" Comparing files C:\Program Files\LLVM\bin\clang.exe and C:\Program Files\LLVM\bin\clang-cl.exe FC: no differences encountered […] C:\>FSUTIL.EXE Hardlink List "C:\Program Files\LLVM\bin\clang.exe" \Program Files\LLVM\bin\clang.exe C:\> […]Dito for the executable installer
LLVM-10.0.0-win32.exe
:
C:\>DIR "C:\Program Files (x86)\LLVM\bin" /O:-S […] 03/25/2020 11:21 AM 77,862,912 clang-cpp.exe 03/25/2020 11:21 AM 77,862,912 clang-cl.exe 03/25/2020 11:21 AM 77,862,912 clang++.exe 03/25/2020 11:13 AM 77,862,912 clang.exe […] 03/25/2020 11:22 AM 54,811,648 ld.lld.exe 03/25/2020 11:22 AM 54,811,648 ld64.lld.exe 03/25/2020 11:15 AM 54,811,648 lld.exe 03/25/2020 11:22 AM 54,811,648 lld-link.exe 03/25/2020 11:22 AM 54,811,648 wasm-ld.exe […] 03/25/2020 11:21 AM 17,346,560 llvm-ranlib.exe 03/25/2020 11:10 AM 17,346,560 llvm-ar.exe 03/25/2020 11:21 AM 17,346,560 llvm-lib.exe […] C:\>Ever heard of hardlinks?
It is possible to cross-compile LLVM itself. That is, you can create LLVM executables and libraries to be hosted on a platform different from the platform where they are built (a Canadian Cross build).
Clang/LLVM is natively a cross-compiler, meaning that one set of
programs can compile to all targets by setting the
-target
option.
Contrary to both statements, it’s not even possible to build
an executable for 32-bit processors with the 64-bit compiler and
linker on Windows, and vice versa, i.e. for another
architecture of the same platform!
To falsify LLVM’s statements and prove my claim,
create the text file case6.c
with the following content
in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
void case6(_Complex double a, _Complex double b) {
volatile _Complex double p = a * b, q = a / b;
}
Compile the source file case6.c
with Clang
from the 32-bit package, but targeting the AMD64 platform, and display the generated assembly code:
"C:\Program Files (x86)\LLVM\bin\clang.exe" -o- -O3 -S -target amd64-pc-windows case6.c
[…]
case6: # @case6
[…]
callq __muldc3
[…]
callq __divdc3
[…]
Note: the functions
__divdc3()
and
__muldc3()
for the AMD64 platform are provided in the library
clang_rt.builtins-x86_64.lib
.
"C:\Program Files (x86)\LLVM\bin\clang.exe" -O3 -target amd64-pc-windows case6.c
llvm-3d1c91.o : error LNK2019: unresolved external symbol __muldc3 referenced in function _main
llvm-3d1c91.o : error LNK2019: unresolved external symbol __divdc3 referenced in function _main
a.exe : fatal error LNK1120: 2 unresolved externals
clang: error: linker command failed with exit code 1120 (use -v to see invocation)
Compile the source file
case6.c
a second time, now with
Clang from the 64-bit package, but targeting the i386 platform, and display the generated assembly code:
"C:\Program Files\LLVM\bin\clang.exe" -o- -O3 -S -target i386-pc-windows case6.c
[…]
_case6: # @case6
[…]
calll ___muldc3
[…]
calll ___divdc3
[…]
Note: the functions
__divdc3()
and
__muldc3()
for the i386 platform are
provided in the library clang_rt.builtins-i386.lib
.
"C:\Program Files\LLVM\bin\clang.exe" -O3 -target i386-pc-windows case6.c
llvm-f4c2d1.o : error LNK2019: unresolved external symbol ___muldc3 referenced in function _main
llvm-f4c2d1.o : error LNK2019: unresolved external symbol ___divdc3 referenced in function _main
a.exe : fatal error LNK1120: 2 unresolved externals
clang: error: linker command failed with exit code 1120 (use -v to see invocation)
Although the compiler installed with both packages is able to produce 32-bit as well as 64-bit objects, and the linker is able to produce either 32-bit executables from 32-bit objects or 64-bit executables from 64-bit objects, 32-bit objects can’t be linked to produce 32-bit executables using the 64-bit package, and vice versa: each package contains only the
clang_rt.*.lib
for its own processor architecture!
C:\>DIR "C:\Program Files\LLVM\lib\clang\10.0.0\lib\windows\*-i386.lib" […] File Not Found […] C:\>DIR "C:\Program Files (x86)\LLVM\lib\clang\10.0.0\lib\windows\*-x86_64.lib" […] File Not Found […]Note: side-by-side installation of the 32-bit and 64-bit package needs 3 GB disk space, wasting 1 GB for duplicate files – that’s 100 times the disk space of the missing
clang_rt.*.lib
libraries!
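Note: a possible manual workaround (an untested sketch) is to copy the missing libraries between the two installations, for example:
C:\>COPY "C:\Program Files (x86)\LLVM\lib\clang\10.0.0\lib\windows\clang_rt.builtins-i386.lib" "C:\Program Files\LLVM\lib\clang\10.0.0\lib\windows\"
C:\>COPY "C:\Program Files\LLVM\lib\clang\10.0.0\lib\windows\clang_rt.builtins-x86_64.lib" "C:\Program Files (x86)\LLVM\lib\clang\10.0.0\lib\windows\"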
Note: the “old” version is but the current version for the other processor architecture!
Note: the “old” version and its uninstall information are overwritten – a really user-friendly move!
Also note the Denglisch gibberish: the title bar and the buttons are localised, but the message text isn’t.
undefined behaviour
Create the text file case8.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int __absusi2(int value) {
if (value < 0)
value = -value;
if (value < 0)
__builtin_trap();
return value;
}
int __absvsi2(int value) {
const int sign = value >> 31;
value += sign;
value ^= sign;
if (value < 0)
__builtin_trap();
return value;
}
int __abswsi2(int value) {
const int sign = 0 - (value < 0);
value ^= sign;
value -= sign;
if (value < 0)
__builtin_trap();
return value;
}
long __absudi2(long value) {
if (value < 0)
value = -value;
if (value < 0)
__builtin_trap();
return value;
}
long __absvdi2(long value) {
const long sign = value >> 63;
value += sign;
value ^= sign;
if (value < 0)
__builtin_trap();
return value;
}
long __abswdi2(long value) {
const long sign = 0 - (value < 0);
value ^= sign;
value -= sign;
if (value < 0)
__builtin_trap();
return value;
}
__int128_t __absuti2(__int128_t value) {
if (value < 0)
value = -value;
if (value < 0)
__builtin_trap();
return value;
}
__int128_t __absvti2(__int128_t value) {
const __int128_t sign = value >> 127;
value += sign;
value ^= sign;
if (value < 0)
__builtin_trap();
return value;
}
__int128_t __abswti2(__int128_t value) {
const __int128_t sign = 0 - (value < 0);
value ^= sign;
value -= sign;
if (value < 0)
__builtin_trap();
return value;
}
Compile the source file case8.c
with
Clang, engaging the optimiser, targetting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case8.c
[…] __absusi2: # @__absusi2 # %bb.0: movl %edi, %eax negl %eax cmovll %edi, %eax retq […] __absvsi2: # @__absvsi2 # %bb.0: movl %edi, %eax negl %eax cmovll %edi, %eax retq […] __abswsi2: # @__abswsi2 # %bb.0: movl %edi, %eax negl %eax cmovll %edi, %eax retq […] __absudi2: # @__absudi2 # %bb.0: movq %rdi, %rax negq %rax cmovlq %rdi, %rax retq […] __absvdi2: # @__absvdi2 # %bb.0: movq %rdi, %rax negq %rax cmovlq %rdi, %rax retq […] __abswdi2: # @__abswdi2 # %bb.0: movq %rdi, %rax negq %rax cmovlq %rdi, %rax retq […] __absuti2: # @__absuti2 # %bb.0: xorl %edx, %edx movq %rdi, %rax negq %rax sbbq %rsi, %rdx testq %rsi, %rsi cmovnsq %rdi, %rax cmovnsq %rsi, %rdx retq […] __absvti2: # @__absvti2 # %bb.0: movq %rsi, %rdx movq %rdi, %rax movq %rsi, %rcx sarq $63, %rcx addq %rcx, %rax adcq %rcx, %rdx xorq %rcx, %rdx js .LBB7_1 # %bb.2: xorq %rcx, %rax retq .LBB7_1: ud2 […] __abswti2: # @__abswti2 # %bb.0: movq %rsi, %rdx movq %rdi, %rax movq %rsi, %rcx sarq $63, %rcx xorq %rcx, %rdx xorq %rcx, %rax subq %rcx, %rax sbbq %rcx, %rdx js .LBB8_1 # %bb.2: retq .LBB8_1: ud2 […] .ident "clang version 10.0.0 " […]Oops: except for the
__absvti2()
and
__abswti2()
functions, the optimiser removes the conditional expression detecting integer overflow – or undefined behaviour – i.e. it fails to recognise both alternative implementations of abs() only for 128-bit integers!
Create the text file case9.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifdef __amd64__
__int128_t __absti2(__int128_t value) {
return value < 0 ? -value : value;
}
#else
long long __absdi2(long long value) {
#ifndef ALTERNATE
return value < 0 ? -value : value;
#else
return __builtin_llabs(value);
#endif
}
int __abssi2(int value) {
#ifndef ALTERNATE
return value < 0 ? -value : value;
#else
return __builtin_abs(value);
#endif
}
#endif
Compile the source file case9.c
with
Clang, engaging its optimiser, targetting the
AMD64 platform, and display the generated assembly
code:
clang -o- -O3 -S -target amd64-pc-linux case9.c
Note: the left column shows the generated code, while the right column shows shorter, but still not properly optimised code as comment.
[…] __absti2: # @__absti2 # %bb.0: xorl %edx, %edx # xor edx, edx movq %rdi, %rax # mov rax, rdi negq %rax # neg rax sbbq %rsi, %rdx # sbb rdx, rsi testq %rsi, %rsi # cmovnsq %rdi, %rax # cmovs rax, rdi cmovnsq %rsi, %rdx # cmovs rdx, rsi retq # ret […] .ident "clang version 10.0.0 " […]Oops: the use of
CMOVcc
instructions
introduces an explicit data dependency
without necessity and impairs performance!
Ouch: the TEST
instruction is superfluous!
The following code for the __absti2()
function avoids
the CMOVcc
instructions and is
3 bytes shorter:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "absti2.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
__absti2:
mov rax, rsi # rax = high qword of argument
cqo # rdx = (argument < 0) ? -1 : 0
mov rax, rdx # rdx:rax = (argument < 0) ? -1 : 0
add rdi, rdx
adc rsi, rdx # rsi:rdi = (argument < 0) ? argument - 1 : argument
xor rax, rdi
xor rdx, rsi # rdx:rax = (argument < 0) ? -argument : argument
# = |argument|
ret
.size __absti2, .-__absti2
.type __absti2, @function
.global __absti2
.end
Compile the source file case9.c
a second time with
Clang, now targetting the i386 platform,
and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case9.c
[…] __absdi2: # @__absdi2 # %bb.0: movl 8(%esp), %edx movl 4(%esp), %eax movl %edx, %ecx sarl $31, %ecx addl %ecx, %eax adcl %ecx, %edx xorl %ecx, %eax xorl %ecx, %edx retl […] __abssi2: # @__abssi2 # %bb.0: movl 4(%esp), %ecx movl %ecx, %eax negl %eax cmovll %ecx, %eax retl […] .ident "clang version 10.0.0 " […]Oops: the use of a
CMOVcc
instruction
introduces an explicit data dependency
without necessity and impairs performance!
The following code for the __absdi2()
and
__abssi2()
functions avoids the
CMOVcc
instruction and is
shorter too:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "absdi2.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
__absdi2:
mov eax, [esp+8]
mov ecx, [esp+4] # eax:ecx = argument
cdq # edx = (argument < 0) ? -1 : 0
add ecx, edx
adc eax, edx # eax:ecx = (argument < 0) ? argument - 1 : argument
xor ecx, edx
xor edx, eax # edx:ecx = (argument < 0) ? -argument : argument
# = |argument|
mov eax, ecx # edx:eax = |argument|
ret
.size __absdi2, .-__absdi2
.type __absdi2, @function
.global __absdi2
__abssi2:
mov eax, [esp+4] # eax = argument
cdq # edx = (argument < 0) ? -1 : 0
add eax, edx # eax = (argument < 0) ? argument - 1 : argument
xor eax, edx # eax = (argument < 0) ? -argument : argument
# = |argument|
ret
.size __abssi2, .-__abssi2
.type __abssi2, @function
.global __abssi2
.end
Note: exploration of the code generated with the
preprocessor macro ALTERNATE
defined is left as an
exercise to the reader.
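Note: to inspect that code, simply define the macro on the command line, i.e.
clang -DALTERNATE -o- -O3 -S -target i386-pc-linux case9.c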
__bswapsi2()
, __bswapdi2()
and __builtin_bswap*()
Create the text file case10.c with the following source code, partially copied from __bswapsi2() and __bswapdi2(), in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
unsigned int __aswapsi2(unsigned int u) {
return __builtin_bswap32(u);
}
unsigned int __bswapsi2(unsigned int u) {
return (((u & 0xFF000000U) >> 24)
| ((u & 0x00FF0000U) >> 8)
| ((u & 0x0000FF00U) << 8)
| ((u & 0x000000FFU) << 24));
}
unsigned long long __aswapdi2(unsigned long long u) {
return __builtin_bswap64(u);
}
unsigned long long __bswapdi2(unsigned long long u) {
return (((u & 0xFF00000000000000ULL) >> 56)
| ((u & 0x00FF000000000000ULL) >> 40)
| ((u & 0x0000FF0000000000ULL) >> 24)
| ((u & 0x000000FF00000000ULL) >> 8)
| ((u & 0x00000000FF000000ULL) << 8)
| ((u & 0x0000000000FF0000ULL) << 24)
| ((u & 0x000000000000FF00ULL) << 40)
| ((u & 0x00000000000000FFULL) << 56));
}
Compile the source file case10.c
with
Clang, engaging its optimiser, targetting the i386 processor, and display the generated assembly code:
clang -m32 -march=i386 -o- -O3 -S -target i386-pc-linux case10.c
Note: the left column shows the generated invalid code, while the right column shows properly optimised valid code as comment.
[…] __aswapsi2: # @__aswapsi2 # %bb.0: movl 4(%esp), %eax # mov eax, [esp+4] bswapl %eax # ror ax, 8 # ror eax, 16 # ror ax, 8 retl # ret […] __bswapsi2: # @__bswapsi2 # %bb.0: movl 4(%esp), %eax # mov eax, [esp+4] bswapl %eax # ror ax, 8 # ror eax, 16 # ror ax, 8 retl # ret […] __aswapdi2: # @__aswapdi2 # %bb.0: movl 4(%esp), %edx # mov eax, [esp+8] movl 8(%esp), %eax # mov edx, [esp+4] bswapl %eax # ror ax, 8 # ror dx, 8 # ror eax, 16 # ror edx, 16 # ror ax, 8 bswapl %edx # ror dx, 8 retl # ret […] __bswapdi2: # @__bswapdi2 # %bb.0: movl 4(%esp), %edx # mov eax, [esp+8] movl 8(%esp), %eax # mov edx, [esp+4] bswapl %eax # ror ax, 8 # ror dx, 8 # ror eax, 16 # ror edx, 16 # ror ax, 8 bswapl %edx # ror dx, 8 retl # ret […] .ident "clang version 10.0.0 " […]Ouch: the
BSWAP
instruction was introduced with the i486 processor; it is not supported on the i386 processor!
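Note: the emitted code would only be legitimate for an i486 or later target, i.e. when compiled for example with
clang -m32 -march=i486 -o- -O3 -S -target i386-pc-linux case10.c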
The following sections dissect some of the functions provided in the clang_rt.builtins-i386.lib and the libclang_rt.builtins-x86_64.a libraries.
__absvti2()
The __absvti2()
function provided in the libclang_rt.builtins-x86_64.a
library exhibits the code shown in the left column below; the right
column shows better, but still not properly optimised code, with
only 10 instructions in just 27 bytes instead of 18 instructions in
69 bytes:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
.arch generic64
.code64
.intel_syntax noprefix
.global __absvti2
.type __absvti2, @function
.text
__absvti2:
mov rcx, 0x8000000000000000 #
xor rcx, rsi #
or rcx, rdi #
jz .overflow #
mov rdx, rsi # mov rdx, rsi
mov rax, rdi # mov rax, rdi
mov rcx, rsi # sar rsi, 63
sar rcx, 63 # xor rax, rsi
xor rdx, rcx # xor rdx, rsi
xor rax, rcx # sub rax, rsi
sub rax, rcx # sbb rdx, rsi
sbb rdx, rcx # jo .overflow
ret # ret
.overflow:
push rax # ud2
lea rdi, … #
lea rdx, … #
mov esi, 24 #
call __abort #
.ident "clang version 10.0.0 "
.end
The following source coaxes Clang to generate this
better code – almost, with but one superfluous
MOV
instruction:
// Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
__int128_t __absvti2(__int128_t value) {
#ifndef ALTERNATE
const __int128_t sign = 0 - (value < 0);
value ^= sign;
value -= sign;
#else
const __int128_t sign = value >> 127;
value += sign;
value ^= sign;
#endif
if (value < 0)
__builtin_trap();
return value;
}
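A minimal test driver (a sketch; the driver is hypothetical and must be compiled together with the function above) demonstrates the only remaining overflow case, the most negative 128-bit value, hitting the trap:
__int128_t __absvti2(__int128_t value);

int main(void) {
    // conversion of 2**127 to __int128_t is implementation-defined, but wraps to -2**127 with Clang and GCC
    const __int128_t most_negative = (__int128_t) ((__uint128_t) 1 << 127);
    return (int) __absvti2(most_negative);   // -2**127 has no positive counterpart: __builtin_trap() fires
}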
__ashldi3()
, __ashrdi3()
and __lshrdi3()
The __ashldi3()
function provided in the clang_rt.builtins-i386.lib
library exhibits the braindead code shown in the left column below;
the right column shows properly optimised code, with only 12
instructions in just 30 bytes instead of 24 instructions in 51 bytes,
avoiding a superfluous conditional branch:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
.arch generic32
.code32
.intel_syntax noprefix
.global ___ashldi3
.type ___ashldi3, @function
.text
___ashldi3:
push esi #
mov ecx, [esp+16] # mov ecx, [esp+12]
mov edx, [esp+8] # mov eax, [esp+4]
test cl, 32 # test cl, 32
jnz 0f # jnz 0f
mov esi, [esp+12] # mov edx, [esp+8]
test ecx, ecx #
jz 2f #
mov eax, edx #
shl eax, cl #
shld esi, edx, cl # shld edx, eax, cl
xor ecx, ecx # shl eax, cl
mov edx, esi #
jmp 1f # ret
0:
shl edx, cl # shl eax, cl
xor eax, eax # mov edx, eax
xor ecx, ecx # xor eax, eax
1:
or edx, ecx #
pop esi #
ret #
2:
mov eax, edx #
mov edx, esi #
pop esi #
ret # ret
.ident "clang version 10.0.0 "
.end
Note: exploration of the equally bad code generated
for the
__ashrdi3()
and
__lshrdi3()
functions is left as an exercise to the reader.
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
.arch generic32
.code32
.intel_syntax noprefix
.global __ashrdi3
.type __ashrdi3, @function
.text
__ashrdi3:
push esi #
mov ecx, [esp+16] # mov ecx, [esp+12]
mov eax, [esp+12] # mov edx, [esp+8]
test cl, 32 # test cl, 32
jnz 0f # jnz 0f
mov esi, [esp+8] # mov eax, [esp+4]
test ecx, ecx #
jz 1f #
mov edx, eax #
sar edx, cl # shrd eax, edx, cl
shrd esi, eax, cl # sar edx, cl
jmp 2f # ret
0:
mov edx, eax # mov eax, edx
sar eax, cl # sar eax, cl
sar edx, 31 # sar edx, 31
pop esi #
ret #
1:
mov edx, eax #
2:
mov eax, esi #
pop esi #
ret # ret
.ident "clang version 10.0.0 "
.end
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
.arch generic32
.code32
.intel_syntax noprefix
.global __lshrdi3
.type __lshrdi3, @function
.text
__lshrdi3:
push esi #
mov ecx, [esp+16] # mov ecx, [esp+12]
mov esi, [esp+12] # mov edx, [esp+8]
test cl, 32 # test cl, 32
jnz 0f # jnz 0f
mov eax, [esp+8] # mov eax, [esp+4]
test ecx, ecx #
jz 2f #
mov edx, esi #
shr edx, cl #
shrd eax, esi, cl # shrd eax, edx, cl
mov esi, eax # shr edx, cl
xor eax, eax #
jmp 1f # ret
0:
shr esi, cl # shr edx, cl
xor eax, eax # mov eax, edx
xor edx, edx # xor edx, edx
1:
or eax, esi #
pop esi #
ret #
2:
mov edx, esi #
pop esi #
ret # ret
.ident "clang version 10.0.0 "
.end
__cmp?i2()
and __ucmp?i2()
The __cmpdi2() and __ucmpdi2() functions provided in the clang_rt.builtins-i386.lib
library
exhibit the insane code shown in the left column below; the right
column shows shorter (but still unoptimised) code,
with 12 instructions in 35 bytes instead of 15 instructions in 47
bytes for each function:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
.arch generic32
.code32
.intel_syntax noprefix
.global ___cmpdi2
.global ___ucmpdi2
.type ___cmpdi2, @function
.type ___ucmpdi2, @function
.text
___cmpdi2:
mov ecx, [esp+16] # mov ecx, [esp+16]
xor eax, eax # xor eax, eax
cmp [esp+8], ecx # cmp ecx, [esp+8]
jl 0f # jg 0f
mov eax, 2 # mov eax, 2
jg 0f # jl 0f
mov ecx, [esp+4] #
mov edx, [esp+12] # mov ecx, [esp+12]
mov eax, 0 # xor eax, eax
cmp ecx, edx # cmp ecx, [esp+4]
jb 0f # ja 0f
cmp edx, ecx #
mov eax, 1 #
adc eax, 0 # adc eax, 1
0: # 0:
ret # ret
___ucmpdi2:
mov ecx, [esp+16] # mov ecx, [esp+16]
xor eax, eax # xor eax, eax
cmp [esp+8], ecx # cmp ecx, [esp+8]
jb 0f # ja 0f
mov eax, 2 # mov eax, 2
ja 0f # jb 0f
mov ecx, [esp+4] #
mov edx, [esp+12] # mov ecx, [esp+12]
mov eax, 0 # xor eax, eax
cmp ecx, edx # cmp ecx, [esp+4]
jb 0f # ja 0f
cmp edx, ecx #
mov eax, 1 #
adc eax, 0 # adc eax, 1
0: # 0:
ret # ret
.ident "clang version 10.0.0 "
.end
Assembly programmers, however, write straightforward and branch-free code
instead, using only 11 instructions in just 31 bytes for the
__cmpdi2()
function, and only 8 instructions in just 25
bytes for the __ucmpdi2()
function:
# Copyright © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "cmpdi2.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
__cmpdi2:
mov eax, [esp+4]
mov edx, [esp+8] # edx:eax = left argument
sub eax, [esp+12]
sbb edx, [esp+16] # edx:eax = left argument - right argument
setg ah # ah = (left argument > right argument) ? 1 : 0
setl al # al = (left argument < right argument) ? 1 : 0
stc
sbb al, ah # al = (left argument < right argument)
# - (left argument > right argument)
# - 1
# = {0, -1, -2}
movsx eax, al
neg eax # eax = (left argument > right argument)
# - (left argument < right argument)
# + 1
# = {0, 1, 2}
ret
.size __cmpdi2, .-__cmpdi2
.type __cmpdi2, @function
.global __cmpdi2
__ucmpdi2:
xor eax, eax # eax = 0
mov ecx, [esp+4]
mov edx, [esp+8] # edx:ecx = left argument
sub ecx, [esp+12]
sbb edx, [esp+16] # edx:ecx = left argument - right argument
seta al # eax = (left argument > right argument)
sbb eax, -1 # eax = left argument > right argument
# - left argument < right argument
# + 1
# = {0, 1, 2}
ret
.size __ucmpdi2, .-__ucmpdi2
.type __ucmpdi2, @function
.global __ucmpdi2
.end
Note: exploration of the equally bad code generated
for the
__cmpti2()
and
__ucmpti2()
functions is left as an exercise to the reader.
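For reference, a minimal C sketch of the 128-bit counterparts, using the same branch-free three-way comparison; whether Clang turns it into decent machine code is part of the exercise:
int __cmpti2(__int128_t a, __int128_t b) {
    return (a > b) - (a < b) + 1;   // 0 if a < b, 1 if a == b, 2 if a > b
}

int __ucmpti2(__uint128_t a, __uint128_t b) {
    return (a > b) - (a < b) + 1;
}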
__mulo?i4()
The __mulosi4()
and
__mulodi4()
functions provided in the clang_rt.builtins-i386.a
library exhibit completely insane and horrible
code with 51 instructions in 130 bytes and 98 instructions in 266
bytes respectively, while the
__mulosi4()
,
__mulodi4()
and
__muloti4()
functions provided in the clang_rt.builtins-x86_64.a
library exhibit completely insane and horrible
code with 44 instructions in 131 bytes, 48 instructions in 155
bytes, and 94 instructions in 302 bytes respectively.
Note: completely insane and horrible code not shown here to preserve the mental health of the reader!
Note: Clang generates calls of these
monstrosities for the
__builtin_*mul*_overflow
builtins as well as the
-fsanitize=integer
, -fsanitize=undefined
and -ftrapv
command line options.
Create the text file case11.c
with the following
content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef __amd64__
#ifndef OPTIMIZE
long long mulodi4(long long multiplicand, long long multiplier, int *overflow) {
long long product;
*overflow = __builtin_smulll_overflow(multiplicand, multiplier, &product);
return product;
}
#else
long long __mulodi4(long long multiplicand, long long multiplier, int *overflow) {
unsigned long long product, sign = 0 - (multiplicand < 0), tmp = 0 - (multiplier < 0);
*overflow = __builtin_umulll_overflow((multiplicand ^ sign) - sign,
(multiplier ^ tmp) - tmp,
&product);
sign ^= tmp;
product = (product ^ sign) - sign;
*overflow |= (long long) (product ^ sign) < 0;
return product;
}
#endif
#else // __amd64__
#ifndef OPTIMIZE
__int128_t muloti4(__int128_t multiplicand, __int128_t multiplier, int *overflow) {
__int128_t product;
*overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
return product;
}
#else
__int128_t __muloti4(__int128_t multiplicand, __int128_t multiplier, int *overflow) {
__uint128_t product, sign = 0 - (multiplicand < 0), tmp = 0 - (multiplier < 0);
*overflow = __builtin_mul_overflow((multiplicand ^ sign) - sign,
(multiplier ^ tmp) - tmp,
&product);
sign ^= tmp;
product = (product ^ sign) - sign;
*overflow |= (__int128_t) (product ^ sign) < 0;
return product;
}
#endif
#endif // __amd64__
Compile the source file case11.c
with
Clang, engaging the optimiser, targetting the
AMD64 platform, and display the generated assembly
code:
clang -o- -O3 -S -target amd64-pc-linux case11.c
Note: the left column shows the generated code, while the right column shows shorter, but still not optimised code as comment.
[…] muloti4: # @muloti4 # %bb.0: pushq %rbx # push r8 subq $16, %rsp # xor eax, eax movq %r8, %rbx # push rax movq $0, 8(%rsp) # push rax leaq 8(%rsp), %r8 # mov r8, rsp callq __muloti4 # call __muloti4 xorl %ecx, %ecx # pop r8 cmpq $0, 8(%rsp) # pop rcx setne %cl # cmp rcx, r8 movl %ecx, (%rbx) # setne cl addq $16, %rsp # pop r8 popq %rbx # mov [r8], ecx retq # ret […] .ident "clang version 10.0.0 " […]OOPS: 13 instructions in just 30 bytes instead of 13 instructions in 46 bytes – but not counting the additional 343 instructions in 1109 bytes of the called
__muloti4()
, __divti3()
,
__udivti3()
and __udivmodti4()
functions!
Compile the source file case11.c
a second time with
Clang, now with the preprocessor macro
OPTIMIZE
defined, and display the generated assembly
code:
clang -DOPTIMIZE -o- -O3 -S -target amd64-pc-linux case11.c
[…] __muloti4: # @__muloti4 # %bb.0: pushq %rbp pushq %r15 pushq %r14 pushq %rbx movq %rcx, %r10 movq %rsi, %rax movq %rsi, %r11 sarq $63, %r11 sarq $63, %rcx xorq %r11, %rax xorq %r11, %rdi subq %r11, %rdi sbbq %r11, %rax movq %rdx, %rsi xorq %rcx, %r10 xorq %rcx, %rsi subq %rcx, %rsi sbbq %rcx, %r10 setne %dl testq %rax, %rax setne %r14b andb %dl, %r14b mulq %rsi movq %rax, %r9 seto %bpl movq %r10, %rax mulq %rdi movq %rax, %r10 seto %r15b orb %bpl, %r15b orb %r14b, %r15b addq %r9, %r10 movq %rdi, %rax mulq %rsi addq %r10, %rdx setb %bl orb %r15b, %bl movzbl %bl, %esi xorq %r11, %rcx xorq %rcx, %rdx xorq %rcx, %rax subq %rcx, %rax sbbq %rcx, %rdx xorq %rdx, %rcx shrq $63, %rcx orl %esi, %ecx movl %ecx, (%r8) popq %rbx popq %r14 popq %r15 popq %rbp retq […] .ident "clang version 10.0.0 " […]Oops: the proper implementation of the
__muloti4()
function coaxes Clang to
generate 52 instructions in 147 bytes instead of 94 instructions in
302 bytes – and without calls of the
__divti3()
and __udivti3()
functions!
Assembly programmers, however, use only 44 instructions in just 121 bytes,
don’t clobber the non-volatile registers RBP
,
RBX
, R14
and R15
, and avoid
the 8 memory accesses as well as the 9 partial register writes which
impair performance:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "muloti4.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = multiplicand
# rcx:rdx = multiplier
# r8 = dword ptr flag
__muloti4:
mov r10, rdx # r10 = low qword of multiplier
mov r11, rcx # r11 = high qword of multiplier
mov rax, rcx # rax = high qword of multiplier
cqo # rdx = (multiplier < 0) ? -1 : 0
mov r9, rdx # r9 = (multiplier < 0) ? -1 : 0
xor r10, rdx
xor r11, rdx # r11:r10 = (multiplier < 0) ? ~multiplier : multiplier
sub r10, rdx
sbb r11, rdx # r11:r10 = (multiplier < 0) ? -multiplier : multiplier
# = |multiplier|
mov rax, rsi # rax = high qword of multiplicand
cqo # rdx = (multiplicand < 0) ? -1 : 0
xor r9, rdx # r9 = (multiplicand < 0) <> (multiplier < 0) ? -1 : 0
# = (product < 0) ? -1 : 0
xor rdi, rdx
xor rsi, rdx # rsi:rdi = (multiplicand < 0) ? ~multiplicand : multiplicand
sub rdi, rdx
sbb rsi, rdx # rsi:rdi = (multiplicand < 0) ? -multiplicand : multiplicand
# = |multiplicand|
xor ecx, ecx
mov [r8], ecx # flag = 0
cmp rcx, rsi
sbb edx, edx # edx = (high qword of |multiplicand| = 0) ? 0 : -1
# = (|multiplicand| < 2**64) ? 0 : -1
cmp rcx, r11
sbb ecx, ecx # ecx = (high qword of |multiplier| = 0) ? 0 : -1
# = (|multiplicand| < 2**64) ? 0 : -1
and ecx, edx # ecx = (|multiplicand| < 2**64)
# & (|multiplier| < 2**64) ? 0 : -1
# = (|product| < 2**128) ? 0 : -1
mov rax, rsi
mul r10 # rdx:rax = high qword of |multiplicand|
# * low qword of |multiplier|
adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
mov rsi, rax
mov rax, rdi
mul r11 # rdx:rax = low qword of |multiplicand|
# * high qword of |multiplier|
adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
add rsi, rax # rsi = high qword of |multiplicand|
# * low qword of |multiplier|
# + low qword of |multiplicand|
# * high qword of |multiplier|
# adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
mov rax, rdi
mul r10 # rdx:rax = low qword of |multiplicand|
# * low qword of |multiplier|
add rdx, rsi # rdx:rax = |product % 2**128|
# = |product| % 2**128
adc ecx, ecx # ecx = (|product| < 2**128) ? 0 : *
.if 0
xor rax, r9
xor rdx, r9 # rdx:rax = (product < 0)
# ? product % 2**128 - 1 : product % 2**128
sub rax, r9
sbb rdx, r9 # rdx:rax = product % 2**128
xor r9, rdx # r9 = (product % 2**128 < -2**127)
# | (product % 2**128 >= 2**127)
# ? negative : positive
add r9, r9
.else
add rax, r9
adc rdx, r9 # rdx:rax = (product < 0)
# ? ~product % 2**128 : product % 2**128
mov rsi, rdx # rsi = (product % 2**128 < -2**127)
# | (product % 2**128 >= 2**127)
# ? negative : positive
xor rax, r9
xor rdx, r9 # rdx:rax = product % 2**128
add rsi, rsi
.endif
adc ecx, ecx # ecx = (-2**127 <= product < 2**127) ? 0 : *
setnz [r8] # flag = (-2**127 <= product < 2**127) ? 0 : 1
ret
.size __muloti4, .-__muloti4
.type __muloti4, @function
.global __muloti4
.end
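A minimal C test driver (hypothetical; it assumes the assembly above is assembled and linked alongside it) to check both the returned product and the overflow flag:
extern __int128_t __muloti4(__int128_t multiplicand, __int128_t multiplier, int *overflow);

int main(void) {
    int overflow;
    __int128_t product = __muloti4((__int128_t) 1 << 60, (__int128_t) 1 << 60, &overflow);
    if (overflow || product != (__int128_t) 1 << 120)      // 2**60 * 2**60 = 2**120 fits into 128 bits
        return 1;
    __muloti4((__int128_t) 1 << 64, (__int128_t) 1 << 64, &overflow);
    return !overflow;                                       // 2**64 * 2**64 = 2**128 does not fit: flag must be set
}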
Compile the source file case11.c
a third time with
Clang, now targetting the i386 platform,
and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case11.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] mulodi4: # @mulodi4 # %bb.0: pushl %ebx # jmp __mulodi4 pushl %edi # pushl %esi # subl $16, %esp # movl 48(%esp), %esi # movl 32(%esp), %eax # movl 36(%esp), %ecx # movl 40(%esp), %edx # movl 44(%esp), %edi # movl $0, 12(%esp) # subl $12, %esp # leal 24(%esp), %ebx # pushl %ebx # pushl %edi # pushl %edx # pushl %ecx # pushl %eax # calll __mulodi4 # addl $32, %esp # xorl %ecx, %ecx # cmpl $0, 12(%esp) # setne %cl # movl %ecx, (%esi) # addl $16, %esp # popl %esi # popl %edi # popl %ebx # retl # […] .ident "clang version 10.0.0 " […]OUCH: only 1 instruction in just 5 bytes instead of 28 instructions in 73 bytes – not counting the additional 434 instructions in 1087 bytes of the called
__mulodi4()
,
__divdi3()
, __udivdi3()
and
__udivmoddi4()
functions!
Compile the source file case11.c
a fourth time with
Clang, now with the preprocessor macro
OPTIMIZE
defined, and display the generated assembly
code:
clang -DOPTIMIZE -o- -O3 -S -target i386-pc-linux case11.c
[…] __mulodi4: # @__mulodi4 # %bb.0: pushl %ebp pushl %ebx pushl %edi pushl %esi subl $8, %esp movl 32(%esp), %eax movl 40(%esp), %esi movl 28(%esp), %ecx movl 36(%esp), %edi movl %eax, %ebp movl %esi, %ebx sarl $31, %ebp sarl $31, %ebx xorl %ebp, %ecx xorl %ebp, %eax subl %ebp, %ecx sbbl %ebp, %eax xorl %ebx, %edi xorl %ebx, %esi subl %ebx, %edi sbbl %ebx, %esi setne %dl testl %eax, %eax setne %dh andb %dl, %dh movb %dh, 3(%esp) # 1-byte Spill mull %edi movl %eax, 4(%esp) # 4-byte Spill movl %esi, %eax seto 2(%esp) # 1-byte Folded Spill mull %ecx movl %eax, %esi seto %al orb 2(%esp), %al # 1-byte Folded Reload addl 4(%esp), %esi # 4-byte Folded Reload movb %al, 4(%esp) # 1-byte Spill movl %ecx, %eax mull %edi addl %esi, %edx movl 44(%esp), %esi setb %cl xorl %ebp, %ebx orb 4(%esp), %cl # 1-byte Folded Reload xorl %ebx, %eax xorl %ebx, %edx orb 3(%esp), %cl # 1-byte Folded Reload subl %ebx, %eax sbbl %ebx, %edx xorl %edx, %ebx shrl $31, %ebx movzbl %cl, %ecx orl %ecx, %ebx movl %ebx, (%esi) addl $8, %esp popl %esi popl %edi popl %ebx popl %ebp retl […] .ident "clang version 10.0.0 " […]Ouch: the proper implementation of the
__mulodi4()
function coaxes Clang to
generate 59 instructions in 146 bytes instead of 98 instructions in
266 bytes – without calls of the __divdi3()
and
__udivdi3()
functions!
Assembly programmers, however, use only 57 instructions in just 134 bytes,
don’t clobber the non-volatile registers EBP
,
EDI
and ESI
, and avoid 8 partial register
writes which impair performance:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "mulodi4.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+20] = dword ptr flag
# [esp+16] = high dword of multiplier
# [esp+12] = low dword of multiplier
# [esp+8] = high dword of multiplicand
# [esp+4] = low dword of multiplicand
__mulodi4:
push ebx
mov eax, [esp+20] # eax = high dword of multiplier
mov ecx, [esp+16] # ecx = low dword of multiplier
cdq # edx = (multiplier < 0) ? -1 : 0
mov ebx, edx # ebx = (multiplier < 0) ? -1 : 0
xor ecx, edx
xor eax, edx # eax:ecx = (multiplier < 0) ? ~multiplier : multiplier
sub ecx, edx
sbb eax, edx # eax:ecx = (multiplier < 0) ? -multiplier : multiplier
# = |multiplier|
mov [esp+16], ecx
mov [esp+20], eax # multiplier = |multiplier|
mov eax, [esp+12] # eax = high dword of multiplicand
mov ecx, [esp+8] # ecx = low dword of multiplicand
cdq # edx = (multiplicand < 0) ? -1 : 0
xor ebx, edx # ebx = (multiplier < 0) ^ (multiplicand < 0) ? -1 : 0
# = (product < 0) ? -1 : 0
xor ecx, edx
xor eax, edx # eax:ecx = (multiplicand < 0) ? ~multiplicand : multiplicand
sub ecx, edx
sbb eax, edx # eax:ecx = (multiplicand < 0) ? -multiplicand : multiplicand
# = |multiplicand|
mov [esp+8], ecx
mov [esp+12], eax # multiplicand = |multiplicand|
push ebx # [esp] = (product < 0) ? -1 : 0
xor ebx, ebx # ebx = 0
# mov eax, [esp+16] # eax = high dword of |multiplicand|
cmp ebx, eax
sbb edx, edx # edx = (high dword of |multiplicand| = 0) ? 0 : -1
# = (|multiplicand| < 2**32) ? 0 : -1
mov ecx, [esp+24] # ecx = high dword of |multiplier|
cmp ebx, ecx
sbb ebx, ebx # ebx = (high dword of |multiplier| = 0) ? 0 : -1
# = (|multiplier| < 2**32) ? 0 : -1
and ebx, edx # ebx = (|multiplicand| < 2**32)
# & (|multiplier| < 2**32) ? 0 : -1
# = (|product| < 2**64) ? 0 : -1
mov edx, [esp+20] # edx = low dword of |multiplier|
mul edx # edx:eax = high dword of |multiplicand|
# * low dword of |multiplier|
adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : *
.if 0
xchg ecx, eax # ecx = high dword of |multiplicand|
# * low dword of |multiplier|,
# eax = high dword of |multiplier|
mov edx, [esp+12] # edx = low dword of |multiplicand|
.else
mov edx, ecx # edx = high dword of |multiplier|
mov ecx, eax # ecx = high dword of |multiplicand|
# * low dword of |multiplier|
mov eax, [esp+12] # eax = low dword of |multiplicand|
.endif
mul edx # edx:eax = high dword of |multiplier|
# * low dword of |multiplicand|
adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : *
add ecx, eax # ecx = high dword of |multiplicand|
# * low dword of |multiplier|
# + high dword of |multiplier|
# * low dword of |multiplicand|
# adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : *
mov eax, [esp+12] # eax = low dword of |multiplicand|
mov edx, [esp+20] # edx = low dword of |multiplier|
mul edx # edx:eax = low dword of |multiplicand|
# * low dword of |multiplier|
add edx, ecx # edx:eax = |product % 2**64|
# = |product| % 2**64
adc ebx, ebx # ebx = (|product| < 2**64) ? 0 : *
pop ecx # ecx = (product < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx # edx:eax = (product < 0) ? product % 2**64 - 1 : product % 2**64
sub eax, ecx
sbb edx, ecx # edx:eax = product % 2**64
xor ecx, edx # ecx = (multiplicand < 0)
# ^ (multiplier < 0)
# ^ (product < 0) ? negative : positive
add ecx, ecx
adc ebx, ebx # ebx = (-2**63 <= product < 2**63) ? 0 : *
neg ebx
sbb ebx, ebx # ebx = (-2**63 <= product < 2**63) ? 0 : -1
neg ebx # ebx = (-2**63 <= product < 2**63) ? 0 : 1
mov ecx, [esp+28] # ecx = address of flag
mov [ecx], ebx # flag = (-2**63 <= product < 2**63) ? 0 : 1
pop ebx
ret
.size __mulodi4, .-__mulodi4
.type __mulodi4, @function
.global __mulodi4
.end
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "mulodi4.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
__mulodi4:
push ebx
mov ebx, [esp+16] # ebx = low dword of multiplier
mov eax, [esp+8] # eax = low dword of multiplicand
mul ebx # edx:eax = low dword of multiplicand
# * low dword of multiplier
push eax
mov ecx, edx # ecx = low dword of "inner" product
mov eax, [esp+16] # eax = high dword of multiplicand
mul ebx # edx:eax = high dword of multiplicand
# * low dword of multiplier
xor ebx, ebx # ebx = high dword of "inner" product
add ecx, eax
adc ebx, edx # ebx:ecx = intermediate "inner" product
mov eax, [esp+12] # eax = low dword of multiplicand
mul dword ptr [esp+24]
# edx:eax = low dword of multiplicand
# * high dword of multiplier
add ecx, eax
adc ebx, edx # ebx:ecx = final "inner" product
push ecx
sbb ecx, ecx # ecx = 0 - carry from "inner" product
mov eax, [esp+20] # eax = high dword of multiplicand
mul dword ptr [esp+28]
# edx:eax = high dword of multiplicand
# * high dword of multiplier
neg ecx # ecx = carry from "inner" product
add eax, ebx
adc edx, ecx # edx:eax = high dword of multiplicand
# * high dword of multiplier
# + high dword of "inner" product
# + carry from "inner" product
# = high qword of (unsigned) product"
xor ebx, ebx # ebx = sign of multiplier = 0
cmp ebx, [esp+28]
jle 0f # (high dword of) multiplier >= 0?
not ebx # ebx = 0 - sign of multiplier = -1
sub eax, [esp+16]
sbb edx, [esp+20] # edx:eax = high qword of product"
# - multiplicand
# = high qword of product'
0:
xor ecx, ecx
cmp ecx, [esp+20]
jle 1f # (high dword of) multiplicand >= 0?
not ebx # ebx = (0 - sign of multiplier)
# ^ (0 - sign of multiplicand)
sub eax, [esp+24]
sbb edx, [esp+28] # edx:eax = high qword of product'
# - multiplier
# = high qword of (signed) product
1:
xor eax, ebx
xor edx, ebx # edx:eax = high qword of product
# ^ (0 - sign of multiplier)
# ^ (0 - sign of multiplicand)
or eax, edx # eax = high qword of product
# <> (0 - sign of multiplier)
# ^ (0 - sign of multiplicand)
pop edx # edx = high dword of (low qword of) product
shld ecx, edx, 1 # ecx = sign of product
add ebx, ecx # ebx = sign of product
# <> sign of multiplier
# ^ sign of multiplicand
or eax, ebx # eax = (-2**63 <= product < 2**63) ? 0 : *
setnz cl # ecx = (-2**63 <= product < 2**63) ? 0 : 1
mov eax, [esp+28] # eax = address of flag
mov [eax], ecx # flag = (-2**63 <= product < 2**63) ? 0 : 1
pop eax # edx:eax = (low qword of) product
pop ebx
ret
.size __mulodi4, .-__mulodi4
.type __mulodi4, @function
.global __mulodi4
.end
__parity?i2()
The __paritysi2()
and
__paritydi2()
functions provided in the clang_rt.builtins-i386.lib
library exhibit the insane code shown in the left column below; the
right column shows properly optimised code, with 15 instructions in
40 bytes instead of 21 instructions in 57 bytes for both functions
together, more than halving the number of instructions executed per
function call:
# Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
.arch generic32
.code32
.intel_syntax noprefix
.global ___paritysi2
.global ___paritydi2
.type ___paritysi2, @function
.type ___paritydi2, @function
.text
___paritysi2:
mov eax, [esp+4] # mov eax, [esp+4]
mov ecx, eax #
shr ecx, 16 # shld ecx, eax, 16
xor ecx, eax # xor ecx, eax
mov eax, ecx #
shr eax, 8 #
xor eax, ecx #
mov ecx, eax #
shr ecx, 4 #
xor ecx, eax #
mov eax, 0x6996 #
and cl, 15 # xor eax, eax
shr eax, cl # xor cl, ch
and eax, 1 # setpo al
ret # ret
___paritydi2:
mov eax, [esp+8] # mov eax, [esp+8]
xor eax, [esp+4] # xor eax, [esp+4]
push eax # shld ecx, eax, 16
call ___paritysi2 # xor ecx, eax
add esp, 4 # xor eax, eax
# xor cl, ch
# setpo al
ret # ret
.ident "clang version 10.0.0 "
.end
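For comparison, even portable C that folds the argument down to a nibble is shorter than the shipped code; plain C cannot use the processor's parity flag (SETPO), so the final step falls back to the 16-entry lookup constant 0x6996 (a sketch with illustrative names, not a drop-in replacement for the library functions):
int parity32(unsigned int value) {
    value ^= value >> 16;
    value ^= value >> 8;
    value ^= value >> 4;                     // the low nibble now holds the parity of all 32 bits
    return (0x6996 >> (value & 15)) & 1;     // 0x6996 = parity of the values 0…15, one bit each
}

int parity64(unsigned long long value) {
    return parity32((unsigned int) (value ^ (value >> 32)));   // fold 64 bits to 32 bits first
}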
__builtin_*()
For the __builtin_*()
functions, Clang emits
unoptimised code, which can even be worse than the
poor code provided in the clang_rt.builtins-i386.*
and
clang_rt.builtins-x86_64.*
libraries, and
fails optimisation!
__builtin_parity()
Create the text file case12a.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int __paritysi2(unsigned int value) {
return __builtin_parity(value);
}
int __paritydi2(unsigned long long value) {
return __builtin_parityll(value);
}
Compile the source file case12a.c
with
Clang, targetting the AMD64 platform,
and display the generated (unoptimised) assembly code:
clang -o- -S -target amd64-pc-linux case12a.c
[…] __paritysi2: # @__paritysi2 # %bb.0: pushq %rbp movq %rsp, %rbp movl %edi, -4(%rbp) movl -4(%rbp), %eax movl %eax, %ecx shrl %ecx andl $0x55555555, %ecx subl %ecx, %eax movl %eax, %ecx andl $0x33333333, %ecx shrl $2, %eax andl $0x33333333, %eax addl %eax, %ecx movl %ecx, %eax shrl $4, %eax addl %eax, %ecx andl $0xF0F0F0F, %ecx imull $0x1010101, %ecx, %eax shrl $24, %eax andl $1, %eax popq %rbp retq […] __paritydi2: # @__paritydi2 # %bb.0: pushq %rbp movq %rsp, %rbp movq %rdi, -8(%rbp) movq -8(%rbp), %rax movq %rax, %rcx shrq %rcx movabsq $0x5555555555555555, %rdx andq %rdx, %rcx subq %rcx, %rax movabsq $0x3333333333333333, %rcx movq %rax, %rdx andq %rcx, %rdx shrq $2, %rax andq %rcx, %rax addq %rax, %rdx movq %rdx, %rax shrq $4, %rax addq %rax, %rdx movabsq $0xF0F0F0F0F0F0F0F, %rax andq %rax, %rdx movabsq $0x101010101010101, %rax imulq %rax, %rdx shrq $56, %rdx andq $1, %rdx movl %edx, %eax popq %rbp retq […] .ident "clang version 10.0.0 " […]Compile the source file
case12a.c
a second time with
Clang, now engaging its optimiser, again targetting the
AMD64 platform, and display the generated
optimised assembly code:
clang -o- -O3 -S -target amd64-pc-linux case12a.c
[…] __paritysi2: # @__paritysi2 # %bb.0: movl %edi, %ecx shrl $16, %ecx xorl %edi, %ecx movl %ecx, %edx shrl $8, %edx xorl %eax, %eax xorb %cl, %dl setnp %al retq […] __paritydi2: # @__paritydi2 # %bb.0: movq %rdi, %rax shrq %rax movabsq $0x5555555555555555, %rcx andq %rax, %rcx subq %rcx, %rdi movabsq $0x3333333333333333, %rax movq %rdi, %rcx andq %rax, %rcx shrq $2, %rdi andq %rax, %rdi addq %rcx, %rdi movq %rdi, %rax shrq $4, %rax addq %rdi, %rax movabsq $0x10F0F0F0F0F0F0F, %rcx andq %rax, %rcx movabsq $0x101010101010101, %rax imulq %rcx, %rax shrq $56, %rax andl $1, %eax retq […] .ident "clang version 10.0.0 " […]OUCH: the
optimised code generated for
__builtin_parityll()
is a perfect and
very convincing declaration of bankruptcy!
Compile the source file case12a.c
a third time with
Clang, again engaging its optimiser, now targetting the
i386 platform, and display the generated
optimised
assembly code:
clang -o- -O3 -S -target i386-pc-linux case12a.c
[…] __paritysi2: # @__paritysi2 # %bb.0: movl 4(%esp), %eax movl %eax, %ecx shrl $16, %ecx xorl %eax, %ecx xorl %eax, %eax xorb %ch, %cl setnp %al retl […] .LCPI1_0: .zero 16,85 .LCPI1_1: .zero 16,51 .LCPI1_2: .zero 16,15 […] __paritydi2: # @__paritydi2 # %bb.0: movq 4(%esp), %xmm0 # xmm0 = mem[0],zero movdqa %xmm0, %xmm1 psrlw $1, %xmm1 pand .LCPI1_0, %xmm1 psubb %xmm1, %xmm0 movdqa .LCPI1_1, %xmm1 # xmm1 = [51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51] movdqa %xmm0, %xmm2 psrlw $2, %xmm0 pand %xmm1, %xmm2 pand %xmm1, %xmm0 paddb %xmm2, %xmm0 movdqa %xmm0, %xmm1 psrlw $4, %xmm1 paddb %xmm0, %xmm1 pxor %xmm0, %xmm0 pand .LCPI1_2, %xmm1 psadbw %xmm1, %xmm0 movd %xmm0, %eax andl $1, %eax retl […] .ident "clang version 10.0.0 " […]OUCH: again the
optimised code generated for
__builtin_parityll()
is a perfect and
very convincing declaration of bankruptcy!
Compile the source file case12a.c
a fourth time with
Clang, again targetting the i386 platform,
now without
SSE support,
and display the generated optimised
assembly code:
clang -mno-sse -o- -O3 -S -target i386-pc-linux case12a.c
[…] __paritysi2: # @__paritysi2 # %bb.0: movl 4(%esp), %eax movl %eax, %ecx shrl $16, %ecx xorl %eax, %ecx xorl %eax, %eax xorb %ch, %cl setnp %al retl […] __paritydi2: # @__paritydi2 # %bb.0: movl 8(%esp), %ecx movl 4(%esp), %eax movl %ecx, %edx shrl %edx andl $0x55555555, %edx subl %edx, %ecx movl %ecx, %edx shrl $2, %ecx andl $0x33333333, %edx andl $0x33333333, %ecx addl %edx, %ecx movl %ecx, %edx shrl $4, %edx addl %ecx, %edx andl $0x10F0F0F, %edx imull $0x1010101, %edx, %ecx movl %eax, %edx shrl %edx shrl $24, %ecx andl $0x55555555, %edx subl %edx, %eax movl %eax, %edx shrl $2, %eax andl $0x33333333, %edx andl $0x33333333, %eax addl %edx, %eax movl %eax, %edx shrl $4, %edx addl %eax, %edx andl $0x10F0F0F, %edx imull $0x1010101, %edx, %eax shrl $24, %eax addl %ecx, %eax andl $1, %eax retl […] .ident "clang version 10.0.0 " […]OUCH: once more the
optimised code generated for
__builtin_parityll()
is a
perfect and very convincing
declaration of bankruptcy!
__builtin_rotateleft*()
and __builtin_rotateright*()
Create the text file case12b.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
long long __rotldi3(long long value, long long count) {
return __builtin_rotateleft64(value, count);
}
long long __rotrdi3(long long value, long long count) {
return __builtin_rotateright64(value, count);
}
Compile the source file case12b.c
with
Clang, engaging the optimiser, targetting the
i386 platform, and display the generated assembly
code:
clang -o- -O3 -S -target i386-pc-linux case12b.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] __rotldi3: # @__rotldi3 # %bb.0: pushl %ebp # pushl %ebx # push ebx pushl %edi # pushl %esi # movb 28(%esp), %ch # mov ecx, [esp+16] movl 20(%esp), %esi # mov edx, [esp+12] movl 24(%esp), %edx # mov eax, [esp+8] xorl %ebp, %ebp # mov ebx, edx movb %ch, %cl # test cl, 32 movl %edx, %edi # cmovnz ebx, eax movl %esi, %ebx # cmovnz eax, edx movl %esi, %eax # mov edx, ebx negb %cl # shrl %cl, %edi # shld edx, eax, cl shrdl %cl, %edx, %ebx # shld eax, ebx, cl testb $32, %cl # movb %ch, %cl # cmovnel %edi, %ebx # cmovnel %ebp, %edi # shll %cl, %eax # shldl %cl, %esi, %edx # testb $32, %ch # cmovnel %eax, %edx # cmovnel %ebp, %eax # orl %ebx, %eax # orl %edi, %edx # popl %esi # popl %edi # popl %ebx # pop ebx popl %ebp # retl # ret […] __rotrdi3: # @__rotrdi3 # %bb.0: pushl %ebp # pushl %ebx # push ebx pushl %edi # pushl %esi # movl 20(%esp), %esi # mov eax, [esp+8] movl 24(%esp), %edx # mov edx, [esp+12] movb 28(%esp), %cl # mov ecx, [esp+16] xorl %ebp, %ebp # mov ebx, eax # test cl, 32 movl %edx, %edi # cmovnz ebx, edx movl %esi, %ebx # cmovnz edx, eax movl %esi, %eax # mov eax, ebx shrl %cl, %edi # shrd eax, edx, cl shrdl %cl, %edx, %ebx # shrd edx, ebx, cl testb $32, %cl # cmovnel %edi, %ebx # cmovnel %ebp, %edi # negb %cl # shll %cl, %eax # shldl %cl, %esi, %edx # testb $32, %cl # cmovnel %eax, %edx # cmovnel %ebp, %eax # orl %ebx, %eax # orl %edi, %edx # popl %esi # popl %edi # popl %ebx # pop ebx popl %ebp # retl # ret […] .ident "clang version 10.0.0 " […]OUCH: 31 instructions in 67 bytes for the
__rotldi3()
function, and 29 instructions in 63 bytes
for the __rotrdi3()
function, instead of only 13
instructions in 35 bytes for each of these functions – the
optimised code generated for
__builtin_rotate*()
is yet another perfect and very
convincing declaration of bankruptcy!
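For comparison, the portable rotate idiom, which optimising compilers commonly pattern-match to a rotate instruction, is (a sketch; the count is reduced modulo 64):
unsigned long long rotl64(unsigned long long value, unsigned count) {
    count &= 63;
    return (value << count) | (value >> (-count & 63));   // well-defined even for count == 0
}

unsigned long long rotr64(unsigned long long value, unsigned count) {
    count &= 63;
    return (value >> count) | (value << (-count & 63));
}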
__builtin_mul_overflow()
Create the text file case12c.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
unsigned int __umulvsi3(unsigned int multiplicand, unsigned int multiplier) {
unsigned int product;
if (__builtin_mul_overflow(multiplicand, multiplier, &product))
__builtin_trap();
return product;
}
unsigned long long __umulvdi3(unsigned long long multiplicand, unsigned long long multiplier) {
unsigned long long product;
if (__builtin_mul_overflow(multiplicand, multiplier, &product))
__builtin_trap();
return product;
}
#ifdef __amd64__
__uint128_t __umulvti3(__uint128_t multiplicand, __uint128_t multiplier) {
__uint128_t product;
if (__builtin_mul_overflow(multiplicand, multiplier, &product))
__builtin_trap();
return product;
}
#endif
Compile the source file case12c.c
with
Clang, engaging the optimiser, targetting the
AMD64 platform, and display the generated assembly
code:
clang -o- -O3 -S -target amd64-pc-linux case12c.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] __umulvsi3: # @__umulvsi3 # %bb.0: movl %edi, %eax mull %esi jo .LBB0_1 # %bb.2: retq .LBB0_1: ud2 […] __umulvdi3: # @__umulvdi3 # %bb.0: movq %rdi, %rax mulq %rsi jo .LBB1_1 # %bb.2: retq .LBB1_1: ud2 […] __umulvti3: # @__umulvti3 # %bb.0: pushq %rbx # movq %rdx, %r8 # mov r8, rdx movq %rsi, %rax # mov r9, rcx testq %rcx, %rcx # neg rcx setne %dl # sbb ecx, ecx testq %rsi, %rsi # xor eax, eax setne %r9b # cmp rax, rsi andb %dl, %r9b # sbb eax, eax mulq %r8 # and ecx, eax movq %rax, %r11 # seto %r10b # mov rax, rdi movq %rcx, %rax # mul r9 mulq %rdi # adc ecx, ecx movq %rax, %rsi # mov r9, rax seto %bl # orb %r10b, %bl # mov rax, rsi orb %r9b, %bl # mul r8 addq %r11, %rsi # adc ecx, ecx movq %rdi, %rax # add r9, rax mulq %r8 # addq %rsi, %rdx # mov rax, rdi setb %cl # mul r8 orb %bl, %cl # add rdx, r9 cmpb $1, %cl # adc ecx, ecx je .LBB2_1 # jnz .LBB2_1 # %bb.2: popq %rbx # retq # ret .LBB2_1: ud2 # ud2 […] .ident "clang version 10.0.0 " […]Oops: the
__umulvti3()
function shows
28 instructions in 77 bytes instead of only 23 instructions in just
58 bytes, clobbers registers RBX
and R10
without necessity, and performs 2 superfluous memory accesses plus 9
partial register writes which impair performance!
Compile the source file case12c.c
a second time with
Clang, now targetting the i386 platform,
and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case12c.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] __umulvsi3: # @__umulvsi3 # %bb.0: movl 4(%esp), %eax mull 8(%esp) jo .LBB0_1 # %bb.2: retl .LBB0_1: ud2 […] __umulvdi3: # @__umulvdi3 # %bb.0: pushl %ebp # pushl %ebx # push ebx pushl %edi # pushl %esi # movl 32(%esp), %esi # xor ebx, ebx movl 24(%esp), %eax # mov ecx, [esp+12] movl 20(%esp), %ebp # cmp ebx, ecx testl %esi, %esi # sbb edx, edx setne %dl # mov eax, [esp+20] testl %eax, %eax # cmp ebx, eax setne %cl # sbb ebx, ebx andb %dl, %cl # and ebx, edx mull 28(%esp) # mul [esp+8] movl %eax, %edi # movl %esi, %eax # xchg ecx, eax seto %bl # adc ebx, ebx mull %ebp # mul [esp+16] movl %eax, %esi # movl %ebp, %eax # seto %ch # adc ebx, ebx mull 28(%esp) # add ecx, eax addl %edi, %esi # mov eax, [esp+8] orb %bl, %ch # mul [esp+16] addl %esi, %edx # add edx, ecx setb %bl # adc ebx, ebx orb %ch, %bl # orb %cl, %bl # cmpb $1, %bl # je .LBB1_1 # jnz .LBB1_1 # %bb.2: popl %esi # popl %edi # popl %ebx # pop ebx popl %ebp # retl # ret .LBB1_1: ud2 # ud2 […] .ident "clang version 10.0.0 " […]Ouch: the
__umulvdi3()
function shows
35 instructions in 77 bytes instead of only 23 instructions in just
54 bytes, clobbers registers EBP
, EDI
and
ESI
without necessity, and performs 5 excess memory
accesses plus 9 partial register writes which impair performance!
__builtin_add_overflow()
and __builtin_sub_overflow()
Create the text file case12d.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
// Saturating signed addition and subtraction
int clamp_add_32(int augend, int addend) {
#ifdef UNDEFINED_BEHAVIOUR
int sum, clamp = 2147483647 + (augend < 0);
#else
int sum, clamp = 2147483647 ^ (augend >> 31);
#endif
return __builtin_add_overflow(augend, addend, &sum) ? clamp : sum;
}
int clamp_sub_32(int minuend, int subtrahend) {
#ifdef UNDEFINED_BEHAVIOUR
int difference, clamp = 2147483647 + (minuend < 0);
#else
int difference, clamp = 2147483647 ^ (minuend >> 31);
#endif
return __builtin_sub_overflow(minuend, subtrahend, &difference) ? clamp : difference;
}
long long clamp_add_64(long long augend, long long addend) {
#ifdef UNDEFINED_BEHAVIOUR
long long sum, clamp = 9223372036854775807LL + (augend < 0);
#else
long long sum, clamp = 9223372036854775807LL ^ (augend >> 63);
#endif
return __builtin_add_overflow(augend, addend, &sum) ? clamp : sum;
}
long long clamp_sub_64(long long minuend, long long subtrahend) {
#ifdef UNDEFINED_BEHAVIOUR
long long difference, clamp = 9223372036854775807LL + (minuend < 0);
#else
long long difference, clamp = 9223372036854775807LL ^ (minuend >> 63);
#endif
return __builtin_sub_overflow(minuend, subtrahend, &difference) ? clamp : difference;
}
Compile the source file case12d.c
with
Clang, engaging the optimiser, targetting the
i386 platform, and display the generated assembly code:
clang -o- -Os -S -target i386-pc-linux case12d.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] clamp_add_32: # @clamp_add_32 # %bb.0: movl 4(%esp), %eax # mov eax, [esp+4] movl %eax, %ecx # sarl $31, %ecx # cdq xorl $2147483647, %ecx # xor edx, 0x7FFFFFFF addl 8(%esp), %eax # add eax, [esp+8] cmovol %ecx, %eax # cmovo eax, edx retl # ret […] clamp_sub_32: # @clamp_sub_32 # %bb.0: movl 4(%esp), %eax # mov eax, [esp+4] movl %eax, %ecx # sarl $31, %ecx # cdq xorl $2147483647, %ecx # xor edx, 0x7FFFFFFF subl 8(%esp), %eax # sub eax, [esp+8] cmovol %ecx, %eax # cmovo eax, edx retl # ret […] clamp_add_64: # @clamp_add_64 # %bb.0: pushl %ebp # pushl %ebx # push ebx pushl %edi # pushl %esi # pushl %eax # movl 28(%esp), %esi # mov edx, [esp+12] movl 24(%esp), %eax # mov eax, [esp+8] movl 36(%esp), %ebp # movl %esi, %edi # movl %esi, %edx # mov ecx, edx sarl $31, %edi # sar ecx, 31 movl %edi, %ebx # mov ebx, 0x7FFFFFFF xorl $2147483647, %edi # xor ebx, ecx addl 32(%esp), %eax # add eax, [esp+16] notl %ebx # adcl %ebp, %edx # adc edx, [esp+20] setns %cl # testl %esi, %esi # setns %ch # cmpb %cl, %ch # setne 3(%esp) # testl %ebp, %ebp # setns %cl # cmpb %cl, %ch # sete %cl # testb %cl, 3(%esp) # cmovnel %edi, %edx # cmovo edx, ebx cmovnel %edi, %eax # cmovo eax, ecx addl $4, %esp # popl %esi # popl %edi # popl %ebx # pop ebx popl %ebp # retl # ret […] clamp_sub_64: # @clamp_sub_64 # %bb.0: pushl %ebp # pushl %ebx # push ebx pushl %edi # pushl %esi # pushl %eax # movl 28(%esp), %esi # mov edx, [esp+12] movl 24(%esp), %eax # mov eax, [esp+8] movl 36(%esp), %ebp # movl %esi, %edi # movl %esi, %edx # mov ecx, edx sarl $31, %edi # sar ecx, 31 movl %edi, %ebx # mov ebx, 0x7FFFFFFF xorl $2147483647, %edi # xor ebx, ecx subl 32(%esp), %eax # sub eax, [esp+16] notl %ebx # sbbl %ebp, %edx # sbb eax, [esp+20] setns %cl # testl %esi, %esi # setns %ch # cmpb %cl, %ch # setne 3(%esp) # testl %ebp, %ebp # setns %cl # cmpb %cl, %ch # setne %cl # testb %cl, 3(%esp) # cmovnel %edi, %edx # cmovo edx, ebx cmovnel %ebx, %eax # cmovo eax, ecx addl $4, %esp # popl %esi # popl %edi # popl %ebx # pop ebx popl %ebp # retl # ret […] .ident "clang version 10.0.0 " […]OUCH: for each of the functions
clamp_add_64
and clamp_sub_64
, this
optimising compiler generates 34 instructions in 83 bytes, clobbering registers
EBP
, EDI
and
ESI
without necessity, wasting 10 (in words:
ten) superfluous instructions to
determine overflow via twiddling with sign bits, thereby performing
4 partial register writes which impair performance, instead of only
13 instructions in just 37 bytes and using the overflow flag –
a real shame!
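For readers who want to verify the saturation semantics, here is a tiny hypothetical test driver (my own sketch, not part of case12d.c; the prototypes merely repeat the functions shown above) that can be compiled and linked against case12d.c:

// hypothetical driver for case12d.c – checks the clamping behaviour
#include <assert.h>
#include <limits.h>

int clamp_add_32(int augend, int addend);
int clamp_sub_32(int minuend, int subtrahend);

int main(void) {
    assert(clamp_add_32(INT_MAX, 1) == INT_MAX);  // overflow clamps at INT_MAX
    assert(clamp_sub_32(INT_MIN, 1) == INT_MIN);  // overflow clamps at INT_MIN
    assert(clamp_add_32(1, 2) == 3);              // no overflow: plain sum
    return 0;
}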
__builtin_copysign()
Create the source file case12e.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
double copysign(double destination, double source) {
return __builtin_copysign(destination, source);
}
Compile the source file case12e.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -Os -S -target i386-pc-linux case12e.c
[…] .LCPI0_0: .quad -9223372036854775808 # double -0 .quad -9223372036854775808 # double -0 .LCPI0_1: .quad 9223372036854775807 # double NaN .quad 9223372036854775807 # double NaN .text copysign: # @copysign # %bb.0: subl $12, %esp movsd 24(%esp), %xmm0 # xmm0 = mem[0],zero movsd 16(%esp), %xmm1 # xmm1 = mem[0],zero andps .LCPI0_0, %xmm0 andps .LCPI0_1, %xmm1 orps %xmm0, %xmm1 movlps %xmm1, (%esp) fldl (%esp) addl $12, %esp retl […] .ident "clang version 10.0.0 " […]Ouch: 10 instructions in 43 bytes, plus 32 bytes for 2 constants, whose access yields a page fault in the worst case.
Compile the source file case12e.c
a second time with
Clang, now without
SSE support,
and display the generated assembly code:
clang -mno-sse -o- -Os -S -target i386-pc-linux case12e.c
[…] copysign: # @copysign # %bb.0: subl $12, %esp fldl 16(%esp) fldl 24(%esp) fstpl (%esp) fabs fld %st(0) fchs testb $-128, 7(%esp) fxch %st(1) fcmovne %st(1), %st fstp %st(1) addl $12, %esp retl […] .ident "clang version 10.0.0 " […]Oops: 13 instructions in 33 bytes, wasting 3 instructions by writing the second argument without necessity on the stack.
Proper code needs only 5 instructions in just 19 bytes:
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "copysign.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
copysign:
mov eax, [esp+16]
shld dword ptr [esp+8], eax, 1
ror dword ptr [esp+8], 1
fld qword ptr [esp+4]
ret
.size copysign, .-copysign
.type copysign, @function
.global copysign
.end
# Copyright © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "copysign.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
copysign:
.ifnotdef OPTIMIZE
mov rax, 0x7FFFFFFFFFFFFFFF
movd xmm2, rax
andpd xmm0, xmm2
andnpd xmm2, xmm1
orpd xmm0, xmm2
.else
movd rcx, xmm0
movd rdx, xmm1
shld rcx, rdx, 1
ror rcx, 1
movd xmm0, rcx
.endif
ret
.size copysign, .-copysign
.type copysign, @function
.global copysign
.end
Note: exploration of the equally bad code generated
for the copysignf()
and copysignl()
functions is left as an exercise to the reader.
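For comparison, a minimal portable C sketch of the same sign-transfer idea (my own illustration, assuming IEEE-754 binary64 and using memcpy() for the type pun; it is not taken from the compiler-rt sources):

// hedged sketch: copy the sign bit of 'source' into 'destination'
#include <stdint.h>
#include <string.h>

double copysign_sketch(double destination, double source) {
    uint64_t d, s;
    memcpy(&d, &destination, sizeof d);            // reinterpret without undefined behaviour
    memcpy(&s, &source, sizeof s);
    d = (d & ~(1ULL << 63)) | (s & (1ULL << 63));  // keep magnitude, transfer sign bit
    memcpy(&destination, &d, sizeof destination);
    return destination;
}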
Create the source file case13.c with the following content in an arbitrary, preferably empty directory:
// Copyright © 2015-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#ifdef OPTIMIZE
__attribute__ ((noinline))
static
__int128_t __muloti4(__int128_t multiplicand, __int128_t multiplier, int *overflow)
{
__uint128_t product, sign = 0 - (multiplicand < 0), tmp = 0 - (multiplier < 0);
*overflow = __builtin_mul_overflow((multiplicand ^ sign) - sign,
(multiplier ^ tmp) - tmp,
&product);
sign ^= tmp;
product = (product ^ sign) - sign;
*overflow |= (__int128_t) (product ^ sign) < 0;
return product;
}
#endif
__attribute__ ((always_inline))
static
__uint128_t lfsr128(void)
{
// 128-bit linear feedback shift register (Galois form) using
// primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D,
// initialised with 2**128 / golden ratio
static __uint128_t lfsr = (__uint128_t) 0x9E3779B97F4A7C15 << 64 | 0xF39CC0605CEDC834;
const __uint128_t poly = (__uint128_t) 0x5DB2B62B0C5F8E1B << 64 | 0xD8CCE715FCB2726D;
const __uint128_t sign = 0 - __builtin_add_overflow(lfsr, lfsr, &lfsr);
return lfsr ^= poly & sign;
}
__attribute__ ((always_inline))
static
__uint128_t lfsr64(void)
{
// 64-bit linear feedback shift register (Galois form) using
// primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
// initialised with 2**64 / golden ratio
static uint64_t lfsr = 0x9E3779B97F4A7C15;
const uint64_t sign = 0 - __builtin_add_overflow(lfsr, lfsr, &lfsr);
return lfsr ^= 0xAD93D23594C935A9 & sign;
}
__attribute__ ((always_inline))
static
__uint128_t lfsr32(void)
{
// 32-bit linear feedback shift register (Galois form) using
// primitive polynomial 0xDB710641 (CRC-32 IEEE),
// initialised with 2**32 / golden ratio
static uint32_t lfsr = 0x9E3779B9;
const uint32_t sign = 0 - __builtin_add_overflow(lfsr, lfsr, &lfsr);
return lfsr ^= 0xDB710641 & sign;
}
int main(void)
{
clock_t t0, t1, t2, tt;
uint32_t n;
__int128_t multiplicand, multiplier = ~0;
volatile __int128_t product;
volatile int overflow;
t0 = clock();
for (n = 500000000u; n > 0u; n--)
{
multiplicand = lfsr128();
product = multiplicand * multiplier;
multiplier = lfsr64();
product = multiplicand * multiplier;
}
t1 = clock();
for (n = 500000000u; n > 0u; n--)
{
multiplicand = lfsr128();
overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
multiplier = lfsr64();
overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__multi3() %4lu.%06lu 0\n"
"__muloti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu nano-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
t0 = clock();
for (n = 500000000u; n > 0u; n--)
{
multiplicand = lfsr128();
product = multiplicand * multiplier;
multiplier = lfsr32();
product = multiplicand * multiplier;
}
t1 = clock();
for (n = 500000000u; n > 0u; n--)
{
multiplicand = lfsr128();
overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
multiplier = lfsr32();
overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__multi3() %4lu.%06lu 0\n"
"__muloti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu nano-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
t0 = clock();
for (n = 500000000u; n > 0u; n--)
{
multiplicand = lfsr64();
product = multiplicand * multiplier;
multiplier = lfsr64();
product = multiplicand * multiplier;
}
t1 = clock();
for (n = 500000000u; n > 0u; n--)
{
multiplicand = lfsr64();
overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
multiplier = lfsr64();
overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__multi3() %4lu.%06lu 0\n"
"__muloti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu nano-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
t0 = clock();
for (n = 500000000u; n > 0u; n--)
{
multiplicand = lfsr32();
product = multiplicand * multiplier;
multiplier = lfsr32();
product = multiplicand * multiplier;
}
t1 = clock();
for (n = 500000000u; n > 0u; n--)
{
multiplicand = lfsr32();
overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
multiplier = lfsr32();
overflow = __builtin_mul_overflow(multiplicand, multiplier, &product);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__multi3() %4lu.%06lu 0\n"
"__muloti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu nano-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
Note: modification of the source file
case13.c
to demonstrate the equally devastating
(mis)performance of the overflow-checking 64×64-bit integer
multiplication on 32-bit processors is left as an exercise to the
reader.
Run the following command lines to compile, link and run the benchmark program:
lscpu
clang -O3 -rtlib=compiler-rt case13.c
echo 'LLVM/clang/compiler-rt'
./a.out
clang -DOPTIMIZE -O3 case13.c
echo 'LLVM/clang'
./a.out
Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 1 NUMA node(s): 1 Vendor ID: AuthenticAMD CPU family: 23 Model: 49 Model name: AMD EPYC 7262 8-Core Processor Stepping: 0 CPU MHz: 3193.726 BogoMIPS: 6387.45 Virtualization: AMD-V L1d cache: 32K L1i cache: 32K L2 cache: 512K L3 cache: 16384K NUMA node0 CPU(s): 0-15 […] LLVM/clang/compiler-rt LLVM/clang __multi3() 4.190000 0 4.180000 0 __muloti4() 238.320000 234.130000 7.170000 2.990000 242.510000 nano-seconds 11.350000 nano-seconds __multi3() 4.050000 0 4.050000 0 __muloti4() 348.200000 344.150000 7.230000 3.180000 352.250000 nano-seconds 11.280000 nano-seconds __multi3() 3.900000 0 3.880000 0 __muloti4() 232.630000 228.730000 8.420000 4.540000 236.530000 nano-seconds 12.300000 nano-seconds __multi3() 3.850000 0 3.850000 0 __muloti4() 4.240000 0.390000 4.230000 0.380000 8.090000 nano-seconds 8.080000 nano-secondsNote: the overhead for the pseudo-random number generators is negligible here!
Oops: with the properly implemented
__muloti4()
function, which is called by
__builtin_mul_overflow()
for signed 128-bit integers,
overflow-checking multiplication runs about 2 times slower than
unchecked multiplication.
OUCH: with the __muloti4()
function
provided in the compiler-rt library, overflow-checking
multiplication runs about 2 orders of magnitude
slower than unchecked multiplication – an utterly devastating
result, proving the miserable implementation of Clang
and its runtime library again!
Create the source file case14.c with the following content, presented by Matt Godbolt as example (t) in his ACM Queue article Optimizations in C++ Compilers – A practical journey, in an arbitrary, preferably empty directory:
bool isWhitespace(char c)
{
return c == ' '
|| c == '\r'
|| c == '\n'
|| c == '\t';
}
Compile the source file case14.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case14.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] isWhitespace: # @isWhitespace # %bb.0: cmpb $32, %dil # ja .LBB0_2 # # %bb.1: movl $1, %eax # movzbl %dil, %ecx # mov ecx, edi movabsq $0x100002400, %rdx # mov rax, 0x100002600 btq %rcx, %rdx # shr rax, cl jae .LBB0_2 # # %bb.3: retq # .LBB0_2: # xorl %eax, %eax # cmp cl, 33 cmpb $9, %dil # sbb ecx, ecx # neg ecx sete %al # and eax, ecx retq # ret […] .ident "clang version 10.0.0 " […]OOPS: 12 instructions in 42 bytes, including 2 conditional branches which impair performance, plus register
RDX
clobbered, instead of only 8 instructions in
just 19 bytes, and without (superfluous) conditional branches.
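For reference, a hedged C sketch of the branch-free bitmask idiom shown in the right-hand column (my own illustration; the constant 0x100002600 has bits 9, 10, 13 and 32 set, i.e. '\t', '\n', '\r' and ' '):

#include <stdbool.h>

// hedged sketch: look the character up in a 64-bit bitmask instead of branching
bool isWhitespace_sketch(char c) {
    unsigned u = (unsigned char) c;
    return (u <= ' ') & ((0x100002600ULL >> (u & 63)) & 1);
}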
Compile the source file case14.c a second time with Clang, now targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case14.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] isWhitespace: # @isWhitespace # %bb.0: pushl %esi # movb 8(%esp), %cl # mov ecx, [esp+4] movl %ecx, %edx # addb $-10, %dl # cmpb $22, %dl # ja .LBB0_2 # # %bb.1: movzbl %dl, %edx # xor eax, eax movl $0x400009, %esi # cmp eax, ecx movl $1, %eax # adc eax, 0x2600 btl %edx, %esi # shr eax, cl jae .LBB0_2 # # %bb.3: popl %esi # retl # .LBB0_2: xorl %eax, %eax # xor edx, edx cmpb $9, %cl # cmp ecx, 33 sete %al # adc edx, edx popl %esi # and eax, edx retl # ret […] .ident "clang version 10.0.0 " […]OUCH: 18 instructions in 45 bytes, including 2 conditional branches which impair performance, plus register
ESI
clobbered, instead of only 10 instructions in just
25 bytes, again without (superfluous) conditional branches, and no
non-volatile register clobbered.
optimiser fails – take 2
Create the source file case15.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2018-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
long long __absvdi2(long long value) {
#ifdef ALTERNATE // undefined behaviour
value = __builtin_llabs(value);
if (value < 0)
__builtin_trap();
return value;
#else
long long sign = 0 - (value < 0);
if (__builtin_saddll_overflow(value, sign, &value))
__builtin_trap();
return value ^ sign;
#endif
}
long long __addvdi3(long long augend, long long addend) {
long long sum;
if (__builtin_saddll_overflow(augend, addend, &sum))
__builtin_trap();
return sum;
}
long long __mulvdi3(long long multiplicand, long long multiplier) {
long long product;
if (__builtin_smulll_overflow(multiplicand, multiplier, &product))
__builtin_trap();
return product;
}
long long __negvdi2(long long negend) {
long long negation;
if (__builtin_ssubll_overflow(0, negend, &negation))
__builtin_trap();
return negation;
}
long long __subvdi3(long long minuend, long long subtrahend) {
long long difference;
if (__builtin_ssubll_overflow(minuend, subtrahend, &difference))
__builtin_trap();
return difference;
}
Compile the source file case15.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case15.c
Note: the left column shows the generated insane code, while the right column shows properly optimised code as comment.
[…] __absvdi2: # @__absvdi2 # %bb.0: pushl %ebx # pushl %esi # pushl %eax # movl 20(%esp), %esi # mov eax, [esp+8] movl 16(%esp), %eax # mov ecx, [esp+4] movl %esi, %ecx # movl %esi, %edx # sarl $31, %ecx # cdq addl %ecx, %eax # add ecx, edx adcl %ecx, %edx # adc eax, edx setns %bl # testl %esi, %esi # setns %bh # cmpb %bl, %bh # setne 3(%esp) # testl %ecx, %ecx # setns %bl # cmpb %bl, %bh # sete %bl # andb 3(%esp), %bl # cmpb $1, %bl # je .LBB0_1 # into # %bb.2: xorl %ecx, %eax # xor ecx, edx xorl %ecx, %edx # xor edx, eax addl $4, %esp # mov eax, ecx popl %esi # popl %ebx # retl # ret .LBB0_1: ud2 # […] __addvdi3: # @__addvdi3 # %bb.0: pushl %ebx # pushl %esi # movl 12(%esp), %eax # mov eax, [esp+4] movl 16(%esp), %esi # mov edx, [esp+8] movl 24(%esp), %ecx # add eax, [esp+12] addl 20(%esp), %eax # adc edx, [esp+16] movl %esi, %edx # adcl %ecx, %edx # setns %bl # testl %esi, %esi # setns %bh # cmpb %bl, %bh # setne %bl # testl %ecx, %ecx # setns %cl # cmpb %cl, %bh # sete %cl # andb %bl, %cl # cmpb $1, %cl # je .LBB1_1 # into # %bb.2: popl %esi # popl %ebx # retl # ret .LBB1_1: ud2 # […] __mulvdi3: # @__mulvdi3 # %bb.0: pushl %edi # pushl %esi # pushl %eax # push ebx movl 16(%esp), %eax # mov ebx, [esp+16] movl 24(%esp), %edx # mov eax, [esp+8] movl 20(%esp), %ecx # mul ebx movl 28(%esp), %esi # push eax movl $0, (%esp) # mov ecx, edx subl $12, %esp # mov eax, [esp+16] leal 12(%esp), %edi # mul ebx pushl %edi # xor ebx, ebx pushl %esi # add ecx, eax pushl %edx # adc ebx, edx pushl %ecx # mov eax, [esp+12] pushl %eax # mul dword ptr [esp+24] calll __mulodi4 # add ecx, eax addl $32, %esp # adc ebx, edx cmpl $0, (%esp) # push ecx # sbb ecx, ecx # mov eax, [esp+20] # mul dword ptr [esp+28] # neg ecx # add eax, ebx # adc edx, ecx # xor ebx, ebx # cmp ebx, [esp+28] # jle 0f # not ebx # sub eax, [esp+16] # sbb edx, [esp+20] # 0: # xor ecx, ecx # cmp ecx, [esp+20] # jle 1f # not ebx # sub eax, [esp+24] # sbb edx, [esp+28] # 1: # xor eax, ebx # xor edx, ebx # or eax, edx # pop edx # shld ecx, edx, 1 # add ebx, ecx # or eax, ebx jne .LBB2_1 # jnz .LBB2_1 # pop eax # %bb.2: addl $4, %esp # pop ebx popl %esi # popl %edi # retl # ret .LBB2_1: ud2 # ud2 […] __negvdi2: # @__negvdi2 # %bb.0: xorl %eax, %eax # xor eax, eax xorl %edx, %edx # cdq movl 8(%esp), %ecx # subl 4(%esp), %eax # sub eax, [esp+4] sbbl %ecx, %edx # sbb edx, [esp+8] testl %edx, %ecx # js .LBB3_1 # into # %bb.2: retl # ret .LBB3_1: ud2 # […] __subvdi3: # @__subvdi3 # %bb.0: pushl %ebx # pushl %esi # movl 12(%esp), %eax # mov eax, [esp+4] movl 16(%esp), %esi # mov edx, [esp+8] movl 24(%esp), %ecx # subl 20(%esp), %eax # sub eax, [esp+12] movl %esi, %edx # sbbl %ecx, %edx # sbb edx, [esp+16] setns %bl # testl %esi, %esi # setns %bh # cmpb %bl, %bh # setne %bl # testl %ecx, %ecx # setns %cl # cmpb %cl, %bh # setne %cl # andb %bl, %cl # cmpb $1, %cl # je .LBB4_1 # into # %bb.2: popl %esi # popl %ebx # retl # ret .LBB4_1: ud2 # […] .ident "clang version 10.0.0 " […]OUCH: 29 instructions in 68 bytes instead of only 10 instructions in just 21 bytes for the
__absvdi2()
function, and 24 instructions in 47 bytes instead of only 6
instructions in just 18 bytes for each of the
__addvdi3()
and __subvdi3()
functions!
OOPS: 24 instructions in 60 bytes, plus 98
instructions in 266 bytes of the called __mulodi4()
function, instead of only 46 instructions in just 113 bytes for the
__mulvdi3()
function.
Note: exploration of the equally bad code generated
for the __absvsi2()
, __addvsi3()
,
__mulvsi3()
and __subvsi3()
functions as
well as the __absvti2()
, __addvti3()
,
__mulvti3()
and __subvti3()
functions is
left as an exercise to the reader.
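A hedged sketch of one of the 128-bit counterparts mentioned above, following exactly the same pattern (my own illustration; it assumes that __builtin_add_overflow() accepts __int128_t operands, which current Clang versions do):

__int128_t __addvti3(__int128_t augend, __int128_t addend) {
    __int128_t sum;
    if (__builtin_add_overflow(augend, addend, &sum))
        __builtin_trap();                  // abort on signed overflow
    return sum;
}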
optimiser fails – take 3
Create the source file case16.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
unsigned int reverse(unsigned int value) {
#ifndef ALTERNATE
// swap all adjacent bits
value = ((value & 0xAAAAAAAAU) >> 1)
| ((value & ~0xAAAAAAAAU) << 1);
// swap adjacent pairs of bits
value = ((value & 0xCCCCCCCCU) >> 2)
| ((value & ~0xCCCCCCCCU) << 2);
// swap adjacent nibbles
value = ((value & 0xF0F0F0F0U) >> 4)
| ((value & ~0xF0F0F0F0U) << 4);
// swap adjacent octets
value = ((value & 0xFF00FF00U) >> 8)
| ((value & ~0xFF00FF00U) << 8);
// swap high and low part
value = ((value & 0xFFFF0000U) >> 16)
| ((value & ~0xFFFF0000U) << 16);
#else
value = ((value << 1) & 0xAAAAAAAAU)
| ((value >> 1) & ~0xAAAAAAAAU);
value = ((value << 2) & 0xCCCCCCCCU)
| ((value >> 2) & ~0xCCCCCCCCU);
value = ((value << 4) & 0xF0F0F0F0U)
| ((value >> 4) & ~0xF0F0F0F0U);
value = ((value << 8) & 0xFF00FF00U)
| ((value >> 8) & ~0xFF00FF00U);
value = ((value << 16) & 0xFFFF0000U)
| ((value >> 16) & ~0xFFFF0000U);
#endif
return value;
}
Compile the source file case16.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case16.c
Note: the left column shows the generated code, while the right column shows only the replacement for properly optimised code as comment.
[…] reverse: # @reverse # %bb.0: movl 4(%esp), %eax movl %eax, %ecx andl $0xD5555555, %eax shrl %ecx andl $0x55555555, %ecx leal (%ecx,%eax,2), %eax movl %eax, %ecx andl $0xF3333333, %eax shrl $2, %ecx andl $0x33333333, %ecx leal (%ecx,%eax,4), %eax movl %eax, %ecx shll $4, %eax shrl $4, %ecx andl $0xF0F0F0F0, %eax andl $0x0F0F0F0F, %ecx orl %ecx, %eax movl %eax, %ecx # bswap %eax shll $8, %eax # shrl $8, %ecx # andl $0xFF00FF00, %eax # andl $0x00FF00FF, %ecx # orl %ecx, %eax # roll $16, %eax # retl […] .ident "clang version 10.0.0 " […]OUCH: the
optimiser fails to recognise this common and well-known idiom for endian conversion!
Note: exploration of the code generated with the
preprocessor macro ALTERNATE
defined is left as an
exercise to the reader.
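One way to make the idiom explicit – a hedged sketch of my own, not a statement about what the optimiser ought to infer by itself – is to spell the last stages as a byte swap, for which Clang does emit a single BSWAP instruction:

// bit reversal: swap bits, bit pairs and nibbles, then reverse the bytes
unsigned int reverse_sketch(unsigned int value) {
    value = ((value & 0xAAAAAAAAU) >> 1) | ((value & 0x55555555U) << 1);
    value = ((value & 0xCCCCCCCCU) >> 2) | ((value & 0x33333333U) << 2);
    value = ((value & 0xF0F0F0F0U) >> 4) | ((value & 0x0F0F0F0FU) << 4);
    return __builtin_bswap32(value);       // swaps octets and halves in one go
}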
optimiser fails – take 4
Create the source file case17.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int __ucmpti2(__uint128_t a, __uint128_t b) {
if (a < b)
return 0;
if (a > b)
return 2;
return 1;
}
Compile the source file case17.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case17.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] __ucmpti2: # @__ucmpti2 # %bb.0: cmpq %rdi, %rdx # cmp rdi, rdx movq %rcx, %rax # mov rax, rsi sbbq %rsi, %rax # sbb rax, rcx movl $1, %r8d # adcl $0, %r8d # xorl %eax, %eax # sbb eax, eax cmpq %rdx, %rdi # cmp rdx, rdi sbbq %rcx, %rsi # sbb rcx, rsi cmovael %r8d, %eax # adc eax, 1 retq # ret […] .ident "clang version 10.0.0 " […]Oops: 10 instructions in 32 bytes, including a superfluous conditional move which impairs performance, instead of 8 instructions in 21 bytes, without conditional move.
optimiser fails – take 5
Create the source file case18.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2004-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int __ucmpti2(__uint128_t a, __uint128_t b) {
if (a > b)
return 2;
if (a == b)
return 1;
return 0;
}
Compile the source file case18.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-windows case18.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] __ucmpti2: # @__ucmpti2 # %bb.0: movdqa (%rcx), %xmm0 # mov r8, [rcx] movq (%rdx), %r8 # mov r9, [rcx+8] pcmpeqb (%rdx), %xmm0 # mov rcx, [rdx] pmovmskb %xmm0, %eax # mov rdx, [rdx+8] xorl %r9d, %r9d # cmp r8, rcx cmpl $0xFFFF, %eax # mov rax, r9 sete %r9b # sbb rax, rdx cmpq (%rcx), %r8 # sbb eax, eax movq 8(%rdx), %rax # cmp rcx, r8 sbbq 8(%rcx), %rax # sbb rdx, r9 movl $2, %eax # adc eax, 1 cmovael %r9d, %eax # retq # ret […] .ident "clang version 10.0.0 " […]Oops: 13 instructions in 48 bytes, including a superfluous conditional move which impairs performance, using the SSE register
XMM0
without necessity, performing 6 memory accesses,
instead of 12 instructions in 35 bytes, without conditional move,
performing 4 memory accesses.
optimiser fails – take 6
Create the source file case19.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
int __clzti2(__uint128_t value) {
if ((value >> 64) != 0)
return __builtin_clzll(value >> 64);
if ((value & ~0ULL) != 0)
return __builtin_clzll(value) + 64;
return 128;
}
int __ctzti2(__uint128_t value) {
if ((value & ~0ULL) != 0)
return __builtin_ctzll(value);
if ((value >> 64) != 0)
return __builtin_ctzll(value >> 64) + 64;
return 128;
}
Compile the source file case19.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case19.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] __clzti2: # @__clzti2 # %bb.0: testq %rsi, %rsi # bsr rax, rsi je .LBB0_2 # jz .LBB0_2 # %bb.1: bsrq %rsi, %rax # xorl $63, %eax # xor eax, 63 retq # ret .LBB0_2: testq %rdi, %rdi # bsr rax, rdi je .LBB0_3 # jz .LBB0_3 # %bb.4: bsrq %rdi, %rax # xorl $63, %eax # xor eax, 127 orl $64, %eax # retq # ret .LBB0_3: movl $128, %eax # mov eax, 128 retq # ret […] __ctzti2: # @__ctzti2 # %bb.0: testq %rdi, %rdi # je .LBB1_2 # # %bb.1: bsfq %rdi, %rax # bsf rax, rdi retq # jnz .return .LBB1_2: testq %rsi, %rsi # je .LBB1_3 # # %bb.4: bsfq %rsi, %rax # bsf rax, rsi # jz .LBB1_3 orl $64, %eax # or eax, 64 # .return: retq # ret .LBB1_3: movl $128, %eax # mov eax, 128 retq # ret […] .ident "clang version 10.0.0 " […]Oops: 13 instructions in 35 bytes instead of 10 instructions in 26 bytes for the
__clzti2()
function,
and 11 instructions in 30 bytes instead of 8 instructions in 22
bytes for the __ctzti2()
function.
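To make the expected results concrete, a tiny hypothetical test driver (my own sketch, separate from case19.c) for the two functions above:

#include <assert.h>

int __clzti2(__uint128_t value);
int __ctzti2(__uint128_t value);

int main(void) {
    assert(__clzti2((__uint128_t) 1 << 100) == 27);    // 127 - 100 leading zeroes
    assert(__ctzti2((__uint128_t) 1 << 100) == 100);   // 100 trailing zeroes
    assert(__clzti2(0) == 128 && __ctzti2(0) == 128);  // both saturate at 128
    return 0;
}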
optimiser fails – take 7
Create the source file case20.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef __amd64__
long long __mindi3(long long a, long long b) {
return a < b ? a : b;
}
unsigned long long __umindi3(unsigned long long a, unsigned long long b) {
return a < b ? a : b;
}
#else
__int128_t __maxti3(__int128_t a, __int128_t b) {
return a > b ? a : b;
}
__uint128_t __umaxti3(__uint128_t a, __uint128_t b) {
return a > b ? a : b;
}
#endif
Compile the source file case20.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case20.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] __maxti3: # @__maxti3 # %bb.0: movq %rdx, %rax # mov rax, rdx cmpq %rdx, %rdi # cmp rdx, rdi movq %rsi, %rdx # mov rdx, rcx sbbq %rcx, %rdx # sbb rcx, rsi cmovgeq %rdi, %rax # cmovl rax, rdi cmovgeq %rsi, %rcx # cmovl rdx, rsi movq %rcx, %rdx # retq # ret […] __umaxti3: # @__umaxti3 # %bb.0: movq %rdx, %rax # mov rax, rdx cmpq %rdi, %rdx # cmp rdx, rdi movq %rcx, %rdx # mov rdx, rcx sbbq %rsi, %rdx # sbb rcx, rsi cmovbq %rdi, %rax # cmovb rax, rdi cmovbq %rsi, %rcx # cmovb rdx, rsi movq %rcx, %rdx # retq # ret […] .ident "clang version 10.0.0 " […]Oops: 8 instructions in 24 bytes instead of 7 instructions in 21 bytes for the
__maxti3()
and
__umaxti3()
functions.
Note: exploration of the code generated for the
__minti3()
and __uminti3()
functions on
the AMD64 platform is left as an exercise to the
reader.
Compile the source file case20.c a second time with Clang, now targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case20.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] __mindi3: # @__mindi3 # %bb.0: pushl %edi # pushl %esi # movl 24(%esp), %edx # mov edx, [esp+16] movl 12(%esp), %ecx # movl 20(%esp), %eax # mov eax, [esp+12] movl 16(%esp), %esi # cmpl %ecx, %eax # cmp eax, [esp+4] movl %edx, %edi # mov ecx, edx sbbl %esi, %edi # sbb ecx, [esp+8] cmovgel %ecx, %eax # cmovge eax, [esp+4] cmovgel %esi, %edx # cmovge edx, [esp+8] popl %esi # popl %edi # retl # ret […] __umindi3: # @__umindi3 # %bb.0: pushl %edi # pushl %esi # movl 16(%esp), %esi # movl 20(%esp), %eax # mov eax, [esp+4] movl 12(%esp), %ecx # movl 24(%esp), %edx # mov edx, [esp+8] cmpl %eax, %ecx # cmp eax, [esp+12] movl %esi, %edi # mov ecx, edx sbbl %edx, %edi # sbb ecx, [esp+16] cmovbl %ecx, %eax # cmovnb eax, [esp+12] cmovbl %esi, %edx # cmovnb edx, [esp+16] popl %esi # popl %edi # retl # ret […] .ident "clang version 10.0.0 " […]Ouch: 14 instructions in 33 bytes, clobbering the registers
EDI
and ESI
without necessity,
instead of only 8 instructions in 29 bytes for the
__mindi3()
and __umindi3()
functions.
Note: exploration of the equally bad code generated
for the __maxdi3()
and __umaxdi3()
functions on the i386 platform is left as an exercise
to the reader.
optimiser fails – take 8
Create the source file case21.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
unsigned int __rotlsi3(unsigned int value, int count) {
return (value << (31 & count))
| (value >> (31 & -count));
}
unsigned int __rotrsi3(unsigned int value, int count) {
return (value >> (31 & count))
| (value << (31 & -count));
}
unsigned long long __rotldi3(unsigned long long value, int count) {
return (value << (63 & count))
| (value >> (63 & -count));
}
unsigned long long __rotrdi3(unsigned long long value, int count) {
return (value >> (63 & count))
| (value << (63 & -count));
}
#ifdef __amd64__
__uint128_t __rotlti3(__uint128_t value, int count) {
return (value << (127 & count))
| (value >> (127 & -count));
}
__uint128_t __rotrti3(__uint128_t value, int count) {
#ifndef OPTIMIZE
return (value >> (127 & count))
| (value << (127 & -count));
#else
__asm__("movq\t%[low], %%rax\n\t"
"shrdq\t%%cl, %[high], %%rax\n\t"
"shrdq\t%%cl, %[low], %[high]\n\t"
"movq\t%[high], %%rdx\n\t"
"test\t$64, %%cl\n\t"
"cmovnz\t%%rax, %%rdx\n\t"
"cmovnz\t%[high], %%rax"
:"=A" (value)
:"c" (count),
[low] "r" ((unsigned long long) (value & ~0ULL)),
[high] "r" ((unsigned long long) (value >> 64))
:"cc");
return value;
#endif
}
#endif // __amd64__
Compile the source file case21.c with Clang, engaging the optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case21.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] __rotlsi3: # @__rotlsi3 # %bb.0: movl %esi, %ecx movl %edi, %eax roll %cl, %eax retq […] __rotrsi3: # @__rotrsi3 # %bb.0: movl %esi, %ecx movl %edi, %eax rorl %cl, %eax retq […] __rotldi3: # @__rotldi3 # %bb.0: movl %esi, %ecx movq %rdi, %rax rolq %cl, %rax retq […] __rotrdi3: # @__rotrdi3 # %bb.0: movl %esi, %ecx movq %rdi, %rax rorq %cl, %rax retq […] __rotlti3: # @__rotlti3 # %bb.0: movl %edx, %ecx # mov ecx, edx movq %rsi, %rdx # movq %rdi, %rax # mov rax, rdi shldq %cl, %rdi, %rsi # shld rax, rsi, cl shlq %cl, %rdi # shld rsi, rdi, cl xorl %r8d, %r8d # mov rdx, rsi testb $64, %cl # test cl, 64 cmovneq %rdi, %rsi # cmovnz rdx, rax cmovneq %r8, %rdi # cmovnz rax, rsi negb %cl # shrdq %cl, %rdx, %rax # shrq %cl, %rdx # testb $64, %cl # cmovneq %rdx, %rax # cmovneq %r8, %rdx # orq %rdi, %rax # orq %rsi, %rdx # retq # ret […] __rotrti3: # @__rotrti3 # %bb.0: movl %edx, %ecx # mov ecx, edx movq %rsi, %rdx # movq %rdi, %rax # mov rax, rdi movq %rdi, %rsi # shrdq %cl, %rdx, %rsi # shrd rax, rsi, cl movq %rdx, %rdi # shrq %cl, %rdi # shrd rsi, rdi, cl xorl %r8d, %r8d # mov rdx, rsi testb $64, %cl # test cl, 64 cmovneq %rdi, %rsi # cmovnz rdx, rax cmovneq %r8, %rdi # cmovnz rax, rsi negb %cl # shldq %cl, %rax, %rdx # shlq %cl, %rax # testb $64, %cl # cmovneq %rax, %rdx # cmovneq %r8, %rax # orq %rsi, %rax # orq %rdi, %rdx # retq # ret […] .ident "clang version 10.0.0 " […]Ouch: while the optimiser recognises the common and well-known idiom for rotation of 32-bit as well as 64-bit integers and generates proper code for the
__rotlsi3()
,
__rotrsi3()
, __rotldi3()
and
__rotrdi3()
functions, it fails rather badly at
128-bit integers and generates awful code using 18 instructions in
56 bytes and 20 instructions in 62 bytes respectively instead of
only 9 instructions in just 28 bytes for each of the
__rotlti3()
and __rotrti3()
functions!
Note: exploration of the equally bad code generated
for the __rotldi3()
and __rotrdi3()
functions on the i386 platform is left as an exercise
to the reader.
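For completeness, a hedged portable C sketch of a 128-bit rotation built from two 64-bit halves (my own illustration of the SHLD/CMOV approach shown in the right-hand column; it is not the code the compiler-rt library ships):

// rotate a 128-bit value left by 'count' (taken modulo 128)
__uint128_t rotlti_sketch(__uint128_t value, int count) {
    unsigned long long lo = (unsigned long long) value,
                       hi = (unsigned long long) (value >> 64);
    unsigned c = count & 63;
    unsigned long long nlo = c ? (lo << c) | (hi >> (64 - c)) : lo;
    unsigned long long nhi = c ? (hi << c) | (lo >> (64 - c)) : hi;
    if (count & 64) {                      // rotation by 64..127: swap the halves
        unsigned long long t = nlo; nlo = nhi; nhi = t;
    }
    return (__uint128_t) nhi << 64 | nlo;
}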
optimiser fails – take 9
Create the source file case22.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
unsigned int __rotlsi3(unsigned int value, int count) {
#ifdef OPTIMIZE
return __builtin_rotateleft32(value, count);
#else
return (value << (31 & count))
| (value >> (31 & -count));
#endif
}
unsigned int __rotrsi3(unsigned int value, int count) {
#ifdef OPTIMIZE
return __builtin_rotateright32(value, count);
#else
return (value >> (31 & count))
| (value << (31 & -count));
#endif
}
unsigned long long __rotldi3(unsigned long long value, int count) {
#ifdef OPTIMIZE
return __builtin_rotateleft64(value, count);
#else
return (value << (63 & count))
| (value >> (63 & -count));
#endif
}
unsigned long long __rotrdi3(unsigned long long value, int count) {
#ifdef OPTIMIZE
return __builtin_rotateright64(value, count);
#else
return (value >> (63 & count))
| (value << (63 & -count));
#endif
}
unsigned int __aswapsi2(unsigned int value) {
return (__rotlsi3(value, 8) & 0x00FF00FFU)
#ifdef ALTERNATE
| (__rotlsi3(value, 24) & 0xFF00FF00U);
#else
| (__rotrsi3(value, 8) & 0xFF00FF00U);
#endif
}
unsigned int __bswapsi2(unsigned int value) {
return __rotlsi3(value & 0xFF00FF00U, 8)
#ifdef ALTERNATE
| __rotlsi3(value & 0x00FF00FFU, 24);
#else
| __rotrsi3(value & 0x00FF00FFU, 8);
#endif
}
unsigned long long __aswapdi2(unsigned long long value) {
return (__rotldi3(value, 8) & 0x000000FF000000FFULL)
| (__rotldi3(value, 24) & 0x0000FF000000FF00ULL)
#ifdef ALTERNATE
| (__rotldi3(value, 40) & 0x00FF000000FF0000ULL)
| (__rotldi3(value, 56) & 0xFF000000FF000000ULL);
#else
| (__rotrdi3(value, 24) & 0x00FF000000FF0000ULL)
| (__rotrdi3(value, 8) & 0xFF000000FF000000ULL);
#endif
}
unsigned long long __bswapdi2(unsigned long long value) {
return __rotldi3(value & 0xFF000000FF000000ULL, 8)
| __rotldi3(value & 0x00FF000000FF0000ULL, 24)
#ifdef ALTERNATE
| __rotldi3(value & 0x0000FF000000FF00ULL, 40)
| __rotldi3(value & 0x000000FF000000FFULL, 56);
#else
| __rotrdi3(value & 0x0000FF000000FF00ULL, 24)
| __rotrdi3(value & 0x000000FF000000FFULL, 8);
#endif
}
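Before looking at the generated code: all of these variants compute the plain byte swap, which a tiny hypothetical driver (my own sketch, separate from case22.c) can confirm:

#include <assert.h>

unsigned int __aswapsi2(unsigned int value);
unsigned int __bswapsi2(unsigned int value);
unsigned long long __aswapdi2(unsigned long long value);
unsigned long long __bswapdi2(unsigned long long value);

int main(void) {
    assert(__aswapsi2(0x12345678U) == 0x78563412U);
    assert(__bswapsi2(0x12345678U) == 0x78563412U);
    assert(__aswapdi2(0x0123456789ABCDEFULL) == 0xEFCDAB8967452301ULL);
    assert(__bswapdi2(0x0123456789ABCDEFULL) == 0xEFCDAB8967452301ULL);
    return 0;
}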
Compile the source file case22.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case22.c
[…] __rotlsi3: # @__rotlsi3 # %bb.0: movb 8(%esp), %cl movl 4(%esp), %eax roll %cl, %eax retl […] __rotrsi3: # @__rotrsi3 # %bb.0: movb 8(%esp), %cl movl 4(%esp), %eax rorl %cl, %eax retl […] __rotldi3: # @__rotldi3 # %bb.0: pushl %ebp pushl %ebx pushl %edi pushl %esi movl 24(%esp), %esi movl 20(%esp), %eax movb 28(%esp), %cl xorl %ebp, %ebp movl %eax, %edi movl %esi, %ebx movl %esi, %edx shll %cl, %edi shldl %cl, %eax, %ebx testb $32, %cl cmovnel %edi, %ebx cmovnel %ebp, %edi negb %cl shrl %cl, %edx shrdl %cl, %esi, %eax testb $32, %cl cmovnel %edx, %eax cmovnel %ebp, %edx orl %edi, %eax orl %ebx, %edx popl %esi popl %edi popl %ebx popl %ebp retl […] __rotrdi3: # @__rotrdi3 # %bb.0: pushl %ebp pushl %ebx pushl %edi pushl %esi movl 20(%esp), %esi movl 24(%esp), %edx movb 28(%esp), %cl xorl %ebp, %ebp movl %edx, %edi movl %esi, %ebx movl %esi, %eax shrl %cl, %edi shrdl %cl, %edx, %ebx testb $32, %cl cmovnel %edi, %ebx cmovnel %ebp, %edi negb %cl shll %cl, %eax shldl %cl, %esi, %edx testb $32, %cl cmovnel %eax, %edx cmovnel %ebp, %eax orl %ebx, %eax orl %edi, %edx popl %esi popl %edi popl %ebx popl %ebp retl […] __aswapsi2: # @__aswapsi2 # %bb.0: movl 4(%esp), %eax movl %eax, %ecx roll $24, %eax roll $8, %ecx andl $0xFF00FF00, %eax andl $0xFF00FF, %ecx orl %ecx, %eax retl […] __bswapsi2: # @__bswapsi2 # %bb.0: movl 4(%esp), %ecx movl %ecx, %edx movl %ecx, %eax andl $0xFF00, %edx andl $0xFF0000, %eax shldl $8, %ecx, %edx shrdl $8, %ecx, %eax orl %edx, %eax retl […] __aswapdi2: # @__aswapdi2 # %bb.0: movl 4(%esp), %edx movl 8(%esp), %eax bswapl %eax bswapl %edx retl […] __bswapdi2: # @__bswapdi2 # %bb.0: movl 4(%esp), %edx movl 8(%esp), %eax bswapl %eax bswapl %edx retl […] .ident "clang version 10.0.0 " […]Oops: although the code generated for the
__rotldi3()
and __rotrdi3()
functions is
horrible, the __aswapdi2()
and
__bswapdi2()
functions are properly optimised.
OUCH: contrary to the __aswapdi2()
and
__bswapdi2()
functions, the __aswapsi2()
and __bswapsi2()
functions are not
optimised!
Compile the source file case22.c a second time with Clang, again targeting the i386 platform, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:
clang -DOPTIMIZE -o- -O3 -S -target i386-pc-linux case22.c
[…] […] __rotlsi3: # @__rotlsi3 # %bb.0: movb 8(%esp), %cl movl 4(%esp), %eax roll %cl, %eax retl […] __rotrsi3: # @__rotrsi3 # %bb.0: movb 8(%esp), %cl movl 4(%esp), %eax rorl %cl, %eax retl […] __rotldi3: # @__rotldi3 # %bb.0: pushl %ebp pushl %ebx pushl %edi pushl %esi movb 28(%esp), %ch movl 20(%esp), %esi movl 24(%esp), %edx xorl %ebp, %ebp movb %ch, %cl movl %edx, %edi movl %esi, %ebx movl %esi, %eax negb %cl shrl %cl, %edi shrdl %cl, %edx, %ebx testb $32, %cl movb %ch, %cl cmovnel %edi, %ebx cmovnel %ebp, %edi shll %cl, %eax shldl %cl, %esi, %edx testb $32, %ch cmovnel %eax, %edx cmovnel %ebp, %eax orl %ebx, %eax orl %edi, %edx popl %esi popl %edi popl %ebx popl %ebp retl […] __rotrdi3: # @__rotrdi3 # %bb.0: pushl %ebp pushl %ebx pushl %edi pushl %esi movl 20(%esp), %esi movl 24(%esp), %edx movb 28(%esp), %cl xorl %ebp, %ebp movl %edx, %edi movl %esi, %ebx movl %esi, %eax shrl %cl, %edi shrdl %cl, %edx, %ebx testb $32, %cl cmovnel %edi, %ebx cmovnel %ebp, %edi negb %cl shll %cl, %eax shldl %cl, %esi, %edx testb $32, %cl cmovnel %eax, %edx cmovnel %ebp, %eax orl %ebx, %eax orl %edi, %edx popl %esi popl %edi popl %ebx popl %ebp retl […] __aswapsi2: # @__aswapsi2 # %bb.0: movl 4(%esp), %eax movl %eax, %ecx roll $24, %eax roll $8, %ecx andl $0xFF00FF00, %eax andl $0xFF00FF, %ecx orl %ecx, %eax retl […] __bswapsi2: # @__bswapsi2 # %bb.0: movl 4(%esp), %ecx movl %ecx, %edx movl %ecx, %eax andl $0xFF00, %edx andl $0xFF0000, %eax shldl $8, %ecx, %edx shrdl $8, %ecx, %eax orl %edx, %eax retl […] .LCPI8_0: .byte 0 # 0x0 .byte 0 # 0x0 .byte 255 # 0xff .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 255 # 0xff .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 255 # 0xff .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 255 # 0xff .LCPI8_1: .byte 255 # 0xff .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 255 # 0xff .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 255 # 0xff .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 255 # 0xff .byte 0 # 0x0 .byte 0 # 0x0 […] __aswapdi2: # @__aswapdi2 # %bb.0: movq 4(%esp), %xmm0 # xmm0 = mem[0],zero pshufd $68, %xmm0, %xmm0 # xmm0 = xmm0[0,1,0,1] movdqa %xmm0, %xmm1 movdqa %xmm0, %xmm2 movdqa %xmm0, %xmm3 movdqa %xmm0, %xmm5 movdqa %xmm0, %xmm4 psrlq $56, %xmm1 psrlq $40, %xmm2 psllq $24, %xmm3 psrlq $8, %xmm5 psllq $40, %xmm4 movsd %xmm1, %xmm2 # xmm2 = xmm1[0],xmm2[1] movdqa %xmm0, %xmm1 psllq $8, %xmm1 movsd %xmm1, %xmm3 # xmm3 = xmm1[0],xmm3[1] movdqa %xmm0, %xmm1 psllq $56, %xmm0 psrlq $24, %xmm1 movsd %xmm4, %xmm0 # xmm0 = xmm4[0],xmm0[1] orpd %xmm2, %xmm3 movsd %xmm1, %xmm5 # xmm5 = xmm1[0],xmm5[1] andpd .LCPI8_1, %xmm3 orpd %xmm5, %xmm0 andpd .LCPI8_0, %xmm0 orpd %xmm0, %xmm3 pshufd $78, %xmm3, %xmm0 # xmm0 = xmm3[2,3,0,1] por %xmm3, %xmm0 movd %xmm0, %eax pshufd $229, %xmm0, %xmm0 # xmm0 = xmm0[1,1,2,3] movd %xmm0, %edx retl […] __bswapdi2: # @__bswapdi2 # %bb.0: pushl %ebx pushl %edi pushl %esi movl 16(%esp), %edx movl 20(%esp), %eax movl %edx, %edi movl %eax, %ebx movl %eax, %ecx movl %edx, %esi andl $0xFF0000, %edi andl $0xFF0000, %ebx shrl $24, %ecx shrl $24, %esi shrl $8, %ebx shrl $8, %edi orl %ecx, %ebx movl %eax, %ecx orl %esi, %edi movl %edx, %esi shll $24, %edx shll $24, %eax andl $0xFF00, %ecx andl $0xFF00, %esi shll $8, %esi shll $8, %ecx orl %edi, %esi orl %ebx, %ecx orl %esi, %edx orl %ecx, %eax popl %esi popl %edi popl %ebx retl […] .ident "clang version 10.0.0 " […]OUCH: the braindead implementation of the code generator for
__builtin_*()
strikes again!
Compile the source file case22.c a third time with Clang, now targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case22.c
[…] __rotlsi3: # @__rotlsi3 # %bb.0: movl %esi, %ecx movl %edi, %eax roll %cl, %eax retq […] __rotrsi3: # @__rotrsi3 # %bb.0: movl %esi, %ecx movl %edi, %eax rorl %cl, %eax retq […] __rotldi3: # @__rotldi3 # %bb.0: movl %esi, %ecx movq %rdi, %rax rolq %cl, %rax retq […] __rotrdi3: # @__rotrdi3 # %bb.0: movl %esi, %ecx movq %rdi, %rax rorq %cl, %rax retq […] __aswapsi2: # @__aswapsi2 # %bb.0: movl %edi, %eax roll $8, %eax andl $0xFF00FF, %eax roll $24, %edi andl $0xFF00FF00, %edi addl %edi, %eax retq […] __bswapsi2: # @__bswapsi2 # %bb.0: movl %edi, %ecx andl $0xFF00, %ecx shldl $8, %edi, %ecx movl %edi, %eax andl $0xFF0000, %eax shrdl $8, %edi, %eax orl %ecx, %eax retq […] __aswapdi2: # @__aswapdi2 # %bb.0: movq %rdi, %rax bswapq %rax retq […] __bswapdi2: # @__bswapdi2 # %bb.0: movq %rdi, %rax bswapq %rax retq […] .ident "clang version 10.0.0 " […]OUCH: while the
__aswapdi2()
and
__bswapdi2()
functions are optimised, the
__aswapsi2()
and __bswapsi2()
functions
are not optimised!
Compile the source file case22.c a fourth time with Clang, again targeting the AMD64 platform, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:
clang -DOPTIMIZE -o- -O3 -S -target amd64-pc-linux case22.c
[…] __rotlsi3: # @__rotlsi3 # %bb.0: movl %esi, %ecx movl %edi, %eax roll %cl, %eax retq […] __rotrsi3: # @__rotrsi3 # %bb.0: movl %esi, %ecx movl %edi, %eax rorl %cl, %eax retq […] __rotldi3: # @__rotldi3 # %bb.0: movl %esi, %ecx movq %rdi, %rax rolq %cl, %rax retq […] __rotrdi3: # @__rotrdi3 # %bb.0: movl %esi, %ecx movq %rdi, %rax rorq %cl, %rax retq […] __aswapsi2: # @__aswapsi2 # %bb.0: movl %edi, %eax roll $8, %eax andl $0xFF00FF, %eax roll $24, %edi andl $0xFF00FF00, %edi addl %edi, %eax retq […] __bswapsi2: # @__bswapsi2 # %bb.0: movl %edi, %ecx andl $0xFF00, %ecx shldl $8, %edi, %ecx movl %edi, %eax andl $0xFF0000, %eax shrdl $8, %edi, %eax orl %ecx, %eax retq […] .LCPI8_0: .byte 0 # 0x0 .byte 0 # 0x0 .byte 255 # 0xff .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 255 # 0xff .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 255 # 0xff .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 255 # 0xff .LCPI8_1: .byte 255 # 0xff .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 255 # 0xff .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 255 # 0xff .byte 0 # 0x0 .byte 0 # 0x0 .byte 0 # 0x0 .byte 255 # 0xff .byte 0 # 0x0 .byte 0 # 0x0 […] __aswapdi2: # @__aswapdi2 # %bb.0: movq %rdi, %xmm0 pshufd $68, %xmm0, %xmm0 # xmm0 = xmm0[0,1,0,1] movdqa %xmm0, %xmm1 psrlq $56, %xmm1 movdqa %xmm0, %xmm2 psrlq $40, %xmm2 movsd %xmm1, %xmm2 # xmm2 = xmm1[0],xmm2[1] movdqa %xmm0, %xmm1 psllq $8, %xmm1 movdqa %xmm0, %xmm3 psllq $24, %xmm3 movsd %xmm1, %xmm3 # xmm3 = xmm1[0],xmm3[1] orpd %xmm2, %xmm3 movdqa %xmm0, %xmm1 psrlq $24, %xmm1 movdqa %xmm0, %xmm2 psrlq $8, %xmm2 movsd %xmm1, %xmm2 # xmm2 = xmm1[0],xmm2[1] movdqa %xmm0, %xmm1 psllq $40, %xmm1 psllq $56, %xmm0 movsd %xmm1, %xmm0 # xmm0 = xmm1[0],xmm0[1] orpd %xmm2, %xmm0 andpd .LCPI8_0(%rip), %xmm0 andpd .LCPI8_1(%rip), %xmm3 orpd %xmm0, %xmm3 pshufd $78, %xmm3, %xmm0 # xmm0 = xmm3[2,3,0,1] por %xmm3, %xmm0 movq %xmm0, %rax retq […] __bswapdi2: # @__bswapdi2 # %bb.0: movl %edi, %eax andl $0xFF000000, %eax shldq $8, %rdi, %rax movabsq $0xFF000000FF0000, %rcx andq %rdi, %rcx rolq $24, %rcx orq %rax, %rcx movabsq $0xFF000000FF00, %rdx andq %rdi, %rdx rolq $40, %rdx movabsq $0xFF00000000, %rax andq %rdi, %rax shrdq $8, %rdi, %rax orq %rdx, %rax orq %rcx, %rax retq […] .ident "clang version 10.0.0 " […]OUCH: the braindead implementation of the code generator for
__builtin_*()
strikes again!
optimiser fails – take 10
Create the source file case23.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef OPTIMIZE
#define __builtin_rotateleft32(value, count) (((value) << (31 & (count))) | ((value) >> (31 & -(count))))
#define __builtin_rotateright32(value, count) (((value) >> (31 & (count))) | ((value) << (31 & -(count))))
#define __builtin_rotateleft64(value, count) (((value) << (63 & (count))) | ((value) >> (63 & -(count))))
#define __builtin_rotateright64(value, count) (((value) >> (63 & (count))) | ((value) << (63 & -(count))))
#endif
unsigned int __aswapsi2(unsigned int value) {
return (__builtin_rotateleft32(value, 8) & 0x00FF00FFU)
#ifdef ALTERNATE
| (__builtin_rotateleft32(value, 24) & 0xFF00FF00U);
#else
| (__builtin_rotateright32(value, 8) & 0xFF00FF00U);
#endif
}
unsigned int __bswapsi2(unsigned int value) {
return __builtin_rotateleft32(value & 0xFF00FF00U, 8)
#ifdef ALTERNATE
| __builtin_rotateleft32(value & 0x00FF00FFU, 24);
#else
| __builtin_rotateright32(value & 0x00FF00FFU, 8);
#endif
}
unsigned int __cswapsi2(unsigned int value) {
value = ((value << 8) & 0xFF00FF00U)
| ((value >> 8) & 0x00FF00FFU);
#ifndef ALTERNATE
value = ((value << 16) & 0xFFFF0000U)
| ((value >> 16) & 0x0000FFFFU);
#else
value = (value << 16) | (value >> 16);
#endif
return value;
}
unsigned int __dswapsi2(unsigned int value) {
value = ((value & 0x00FF00FFU) << 8)
| ((value & 0xFF00FF00U) >> 8);
#ifndef ALTERNATE
value = ((value & 0x0000FFFFU) << 16)
| ((value & 0xFFFF0000U) >> 16);
#else
value = (value << 16) | (value >> 16);
#endif
return value;
}
unsigned long long __aswapdi2(unsigned long long value) {
return (__builtin_rotateleft64(value, 8) & 0x000000FF000000FFULL)
| (__builtin_rotateleft64(value, 24) & 0x0000FF000000FF00ULL)
#ifdef ALTERNATE
| (__builtin_rotateleft64(value, 40) & 0x00FF000000FF0000ULL)
| (__builtin_rotateleft64(value, 56) & 0xFF000000FF000000ULL);
#else
| (__builtin_rotateright64(value, 24) & 0x00FF000000FF0000ULL)
| (__builtin_rotateright64(value, 8) & 0xFF000000FF000000ULL);
#endif
}
unsigned long long __bswapdi2(unsigned long long value) {
return __builtin_rotateleft64(value & 0xFF000000FF000000ULL, 8)
| __builtin_rotateleft64(value & 0x00FF000000FF0000ULL, 24)
#ifdef ALTERNATE
| __builtin_rotateleft64(value & 0x0000FF000000FF00ULL, 40)
| __builtin_rotateleft64(value & 0x000000FF000000FFULL, 56);
#else
| __builtin_rotateright64(value & 0x0000FF000000FF00ULL, 24)
| __builtin_rotateright64(value & 0x000000FF000000FFULL, 8);
#endif
}
unsigned long long __cswapdi2(unsigned long long value) {
value = ((value << 8) & 0xFF00FF00FF00FF00ULL)
| ((value >> 8) & 0x00FF00FF00FF00FFULL);
value = ((value << 16) & 0xFFFF0000FFFF0000ULL)
| ((value >> 16) & 0x0000FFFF0000FFFFULL);
#ifndef ALTERNATE
value = ((value << 32) & 0xFFFFFFFF00000000ULL)
| ((value >> 32) & 0x00000000FFFFFFFFULL);
#else
value = (value << 32) | (value >> 32);
#endif
return value;
}
unsigned long long __dswapdi2(unsigned long long value) {
value = ((value & 0x00FF00FF00FF00FFULL) << 8)
| ((value & 0xFF00FF00FF00FF00ULL) >> 8);
value = ((value & 0x0000FFFF0000FFFFULL) << 16)
| ((value & 0xFFFF0000FFFF0000ULL) >> 16);
#ifndef ALTERNATE
value = ((value & 0x00000000FFFFFFFFULL) << 32)
| ((value & 0xFFFFFFFF00000000ULL) >> 32);
#else
value = (value << 32) | (value >> 32);
#endif
return value;
}
Compile the source file case23.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -mno-sse -o- -O3 -S -target i386-pc-linux case23.c
[…] __aswapsi2: # @__aswapsi2 # %bb.0: movl 4(%esp), %eax bswapl %eax retl […] __bswapsi2: # @__bswapsi2 # %bb.0: movl 4(%esp), %eax bswapl %eax retl […] __cswapsi2: # @__cswapsi2 # %bb.0: movl 4(%esp), %eax bswapl %eax retl […] __dswapsi2: # @__dswapsi2 # %bb.0: movl 4(%esp), %eax bswapl %eax retl […] __aswapdi2: # @__aswapdi2 # %bb.0: movl 4(%esp), %edx movl 8(%esp), %eax bswapl %eax bswapl %edx retl […] __bswapdi2: # @__bswapdi2 # %bb.0: movl 4(%esp), %edx movl 8(%esp), %eax bswapl %eax bswapl %edx retl […] __cswapdi2: # @__cswapdi2 # %bb.0: movl 4(%esp), %edx movl 8(%esp), %eax bswapl %eax bswapl %edx retl […] __dswapdi2: # @__dswapdi2 # %bb.0: movl 4(%esp), %edx movl 8(%esp), %eax bswapl %eax bswapl %edx retl […] .ident "clang version 10.0.0 " […]Note: the generated code is properly optimised!
Compile the source file case23.c
a second time with
Clang, now with the preprocessor macro
OPTIMIZE
defined, and display the generated assembly
code:
clang -DOPTIMIZE -o- -O3 -S -target i386-pc-linux case23.c
[…] __aswapsi2: # @__aswapsi2 # %bb.0: movl 4(%esp), %eax movl %eax, %ecx roll $24, %eax roll $8, %ecx andl $0xFF00FF00, %eax andl $0xFF00FF, %ecx orl %ecx, %eax retl […] __bswapsi2: # @__bswapsi2 # %bb.0: movl 4(%esp), %ecx movl %ecx, %edx movl %ecx, %eax andl $0xFF00, %edx andl $0xFF0000, %eax shldl $8, %ecx, %edx shrdl $8, %ecx, %eax orl %edx, %eax retl […] __cswapsi2: # @__cswapsi2 # %bb.0: movl 4(%esp), %eax bswapl %eax retl […] __dswapsi2: # @__dswapsi2 # %bb.0: movl 4(%esp), %eax bswapl %eax retl […] __aswapdi2: # @__aswapdi2 # %bb.0: pushl %ebx pushl %edi pushl %esi movl 16(%esp), %esi movl 20(%esp), %eax movl %esi, %edx movl %esi, %ecx shldl $24, %eax, %edx shldl $8, %eax, %ecx movl %edx, %ebx movzbl %cl, %edi andl $0xFF0000, %ecx andl $0xFF000000, %edx andl $0xFF00, %ebx orl %edi, %ebx movl %esi, %edi shrl $24, %esi shrl $8, %edi andl $0xFF00, %edi orl %edi, %esi movl %eax, %edi shll $24, %eax shll $8, %edi orl %esi, %ecx andl $0xFF0000, %edi orl %ecx, %edx orl %ebx, %edi orl %edi, %eax popl %esi popl %edi popl %ebx retl […] __bswapdi2: # @__bswapdi2 # %bb.0: pushl %ebx pushl %edi pushl %esi movl 16(%esp), %edx movl 20(%esp), %eax movl %edx, %edi movl %eax, %ebx movl %eax, %ecx movl %edx, %esi andl $0xFF0000, %edi andl $0xFF0000, %ebx shrl $24, %ecx shrl $24, %esi shrl $8, %ebx shrl $8, %edi orl %ecx, %ebx movl %eax, %ecx orl %esi, %edi movl %edx, %esi shll $24, %edx shll $24, %eax andl $0xFF00, %ecx andl $0xFF00, %esi shll $8, %esi shll $8, %ecx orl %edi, %esi orl %ebx, %ecx orl %esi, %edx orl %ecx, %eax popl %esi popl %edi popl %ebx retl […] __cswapdi2: # @__cswapdi2 # %bb.0: movl 4(%esp), %edx movl 8(%esp), %eax bswapl %eax bswapl %edx retl […] __dswapdi2: # @__dswapdi2 # %bb.0: movl 4(%esp), %edx movl 8(%esp), %eax bswapl %eax bswapl %edx retl […] .ident "clang version 10.0.0 " […]OUCH: the braindead implementation of the code generator for
__builtin_*()
strikes again!
optimiser fails – take 11
Create the source file case24.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
long long quotient(long long numerator, long long denominator, long long *modulus) {
*modulus = numerator % denominator;
return numerator / denominator;
}
long long modulus(long long numerator, long long denominator, long long *quotient) {
*quotient = numerator / denominator;
return numerator % denominator;
}
Compile the source file case24.c with Clang, engaging the optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case24.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] quotient: # @quotient # %bb.0: pushl %ebp # jmp __divmoddi4 pushl %ebx # pushl %edi # pushl %esi # subl $12, %esp # movl 44(%esp), %edi # movl 32(%esp), %ebx # movl 36(%esp), %ebp # movl 40(%esp), %esi # pushl %edi # pushl %esi # pushl %ebp # pushl %ebx # calll __divdi3 # addl $16, %esp # movl %eax, %ecx # movl %edx, 8(%esp) # 4-byte Spill imull %eax, %edi # mull %esi # addl %edi, %edx # movl 8(%esp), %edi # 4-byte Reload imull %edi, %esi # addl %edx, %esi # subl %eax, %ebx # movl 48(%esp), %eax # movl %edi, %edx # sbbl %esi, %ebp # movl %ebx, (%eax) # movl %ebp, 4(%eax) # movl %ecx, %eax # addl $12, %esp # popl %esi # popl %edi # popl %ebx # popl %ebp # retl # […] modulus: # @modulus # %bb.0: pushl %ebp # pushl %ebx # pushl %edi # pushl %esi # subl $12, %esp # sub esp, 8 movl 44(%esp), %ebx # movl 32(%esp), %esi # movl 36(%esp), %edi # movl 40(%esp), %ebp # # push esp pushl %ebx # push [esp+28] pushl %ebp # push [esp+28] pushl %edi # push [esp+28] pushl %esi # push [esp+28] calll __divdi3 # call __divmoddi4 addl $16, %esp # add esp, 20 movl %edx, %ecx # movl 48(%esp), %edx # mov ecx, [esp+28] imull %eax, %ebx # movl %ecx, 4(%edx) # mov [ecx], eax movl %eax, (%edx) # mov [ecx+4], edx mull %ebp # imull %ebp, %ecx # addl %ebx, %edx # addl %edx, %ecx # subl %eax, %esi # sbbl %ecx, %edi # movl %esi, %eax # movl %edi, %edx # addl $12, %esp # pop edx # pop eax popl %esi # popl %edi # popl %ebx # popl %ebp # retl # ret […] .ident "clang version 10.0.0 " […]OUCH: 35 superfluous instructions in 77 bytes for the function
quotient()
, wasting about 18 processor
clock cycles per function call, instead of only 1 (in words:
one) instruction in just 5 bytes – this optimiser is a bad joke!
OOPS: 34 instructions in 74 bytes for the function
modulus()
, where just 14 instructions in only 40 bytes
yield the same result, but without clobbering 4 registers, without 8
superfluous memory accesses, without 3 superfluous multiplications,
without 5 superfluous additions or subtractions, and not wasting
about 10 processor clock cycles per function call.
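A portable way to make the quotient/remainder pairing explicit is the C99 function lldiv() from <stdlib.h>, which performs one division and hands back both results – a hedged sketch of my own, not the code Clang actually emits:

#include <stdlib.h>

long long quotient_sketch(long long numerator, long long denominator, long long *modulus) {
    lldiv_t qr = lldiv(numerator, denominator);  // one division yields quotient and remainder
    *modulus = qr.rem;
    return qr.quot;
}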
optimiser fails – take 12
Create the source file case25.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
unsigned long long multiply64(unsigned int multiplicand, unsigned int multiplier) {
unsigned long long product;
__asm__("mull\t%[multiplier]"
:"=A" (product)
:"%0" (multiplicand),
[multiplier] "rm" (multiplier)
:"cc");
return product;
}
#ifdef __amd64__
__uint128_t multiply128(unsigned long long multiplicand, unsigned long long multiplier) {
__uint128_t product;
__asm__("mulq\t%[multiplier]"
:"=A" (product)
:"%0" (multiplicand),
[multiplier] "rm" (multiplier)
:"cc");
return product;
}
#endif
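A tiny hypothetical driver (my own sketch, separate from case25.c) showing the expected widening products:

#include <assert.h>

unsigned long long multiply64(unsigned int multiplicand, unsigned int multiplier);

int main(void) {
    assert(multiply64(0xFFFFFFFFU, 0xFFFFFFFFU) == 0xFFFFFFFE00000001ULL);  // (2**32 - 1)**2
    assert(multiply64(65536U, 65536U) == 1ULL << 32);                       // 2**16 * 2**16
    return 0;
}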
Compile the source file case25.c with Clang, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case25.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…] multiply64: # @multiply64 # %bb.0: movl %edi, %eax # mov eax, edi movl %esi, -4(%rsp) # #APP mull -4(%rsp) # mul esi #NO_APP retq # ret […] multiply128: # @multiply128 # %bb.0: movq %rdi, %rax # mov rax, rdi movq %rsi, -8(%rsp) # #APP mulq -8(%rsp) # mul rsi #NO_APP retq # ret […] .ident "clang version 10.0.0 " […]Ouch: despite the contraint
rm, alias register or memory, given for the multiplier, it is written into the red zone on the stack and then read from there.
Compile the source file case25.c a second time with Clang, now with the option -mno-red-zone, and display the generated assembly code:
clang -mno-red-zone -o- -O3 -S -target amd64-pc-linux case25.c
Note: the left column shows the generated code, while the right column shows the expected unoptimised code as comment.
[…] multiply64: # @multiply64 # %bb.0: subq $4, %rsp # movl %edi, %eax # mov eax, edi movl %esi, (%rsp) # push rsi #APP mull (%rsp) # mul dword ptr [rsp] #NO_APP addq $4, %rsp # pop rsi retq # ret […] multiply128: # @multiply128 # %bb.0: pushq %rax # movq %rdi, %rax # mov rax, rdi movq %rsi, (%rsp) # push rsi #APP mulq (%rsp) # mul qword ptr [rsp] #NO_APP popq %rcx # pop rsi retq # ret […] .ident "clang version 10.0.0 " […]Oops: instead to push register
ESI
which holds the multiplier, onto the stack, the code generated for
the multiply64()
function stores the multiplier with a
MOV
instruction after decrementing the
stack pointer with a superfluous SUB
instruction, using 17 instead of 8 bytes.
Ouch: instead of pushing register RSI, which holds the multiplier, onto the stack, the code generated for
the multiply128()
function stores the multiplier with a
MOV
instruction after decrementing the
stack pointer with a superfluous
PUSH
instruction, using 14 instead of 10 bytes.
optimiser fails – take 13
Compile the source file case4gcc.c with Clang, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
clang -o- -O3 -S -target amd64-pc-linux case4gcc.c
Note: the left column shows the generated code, while the right column shows only the replacement for properly optimised code as comment.
[…] bsrq %r11, %rbx # bsrq %r11, %rcx xorq $63, %rbx # xorl $63, %ecx je .LBB1_11 # %bb.17: movq %rbx, %rcx # negq %rcx # movq %rsi, %rdx # xorl %edx, %edx shrq %cl, %rdx # shldq %cl, %rsi, %rdx movl %ebx, %ecx # shldq %cl, %rdi, %rsi shlq %cl, %rdi shldq %cl, %r9, %r11 shlq %cl, %r9 movq %r11, -8(%rsp) # movq %rsi, %rax divq -8(%rsp) # divq %r11 movq %rdx, %rsi movq %rax, %r10 mulq %r9 cmpq %rax, %rdi movq %rsi, %rcx # movq %rsi, %rbx sbbq %rdx, %rcx # sbbq %rdx, %rbx jae .LBB1_19 # %bb.18: subq %r9, %rax sbbq %r11, %rdx addq $-1, %r10 .LBB1_19: testq %r8, %r8 je .LBB1_21 # %bb.20: movl $64, %r9d # subq %rbx, %r9 # subq %rax, %rdi sbbq %rdx, %rsi movl %ebx, %ecx # shrq %cl, %rdi # shrdq %cl, %rdi, %rsi movq %rsi, %rax # movl %r9d, %ecx # shlq %cl, %rax # movl %ebx, %ecx # shrq %cl, %rsi # shrq %cl, %rdi orq %rdi, %rax # movq %rsi, 8(%r8) movq %rax, (%r8) # movq %rdi, (%r8) jmp .LBB1_21 […] .ident "clang version 10.0.0 " […]Oops: while the optimiser (partially) recognises the common and well-known idiom for multi-word left shifts and generates the code following line
# %bb.17:, it fails to recognise the same idiom for the multi-word right shift and generates the code following line # %bb.20:, using 12 (in words: twelve) superfluous instructions in a total of 42 instructions for the 2 hot basic blocks shown!
Compile the source file case4gcc.c a second time with Clang, now with the preprocessor macro OPTIMIZE defined, targeting the AMD64 platform, and display the generated assembly code:
clang -DOPTIMIZE -o- -O3 -S -target amd64-pc-linux case4gcc.c
[…]
.LBB1_7:
        bsrq    %r10, %rbx
        xorq    $63, %rbx
        je      .LBB1_13
# %bb.8:
        movl    %ebx, %ecx
        negb    %cl
        movq    %rsi, %rdx
        shrq    %cl, %rdx
        movl    %ebx, %ecx
        shldq   %cl, %rdi, %rsi
        shlq    %cl, %rdi
        shldq   %cl, %r9, %r10
        shlq    %cl, %r9
        xorl    %r11d, %r11d
        testb   $64, %bl
        cmovneq %rdi, %rsi
        cmovneq %r11, %rdi
        cmovneq %r9, %r10
        cmovneq %r11, %r9
        movq    %r10, -8(%rsp)
        movq    %rsi, %rax
        divq    -8(%rsp)
        movq    %rax, %r14
        movq    %rdx, %rcx
        movq    %r9, %rax
        andq    $-2, %rax
        mulq    %r14
        andq    $-2, %rdi
        cmpq    %rax, %rdi
        movq    %rcx, %rsi
        sbbq    %rdx, %rsi
        sbbq    $0, %r14
        testq   %r8, %r8
        je      .LBB1_19
# %bb.9:
        xorl    %r11d, %r11d
        subq    %rax, %rdi
        sbbq    %rdx, %rcx
        cmovaeq %r11, %r10
        cmovaeq %r11, %r9
        addq    %rdi, %r9
        adcq    %rcx, %r10
        movl    %ebx, %ecx
        shrdq   %cl, %r10, %r9
        shrq    %cl, %r10
        testb   $64, %bl
        cmovneq %r10, %r9
        cmovneq %r11, %r10
        movq    %r10, 8(%r8)
        movq    %r9, (%r8)
        jmp     .LBB1_19
[…]
        .ident  "clang version 10.0.0 "
[…]
Ouch: although the shift count is greater than 0 and can’t exceed 63 (a fact an optimiser worth its name can prove from the program flow), this optimiser generates 2 superfluous TEST instructions which always set the zero flag ZF, followed by 6 superfluous CMOVNZ instructions, i.e. NOP instructions, resulting in a total of 49 instead of the former 42 instructions.
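The value-range argument can be spelled out in a few lines of C (a hypothetical illustration, not part of case4gcc.c): the normalisation count computed from the leading zeros of a non-zero 64-bit word lies in the range 0…63, so its bit 6 is always clear and the TEST/CMOVNZ sequences can never have any effect:
// count is at most 63 for a non-zero argument, hence (count & 64) == 0
// and a TEST of bit 6 on it always sets ZF.
unsigned count_bit6(unsigned long long divisor_high) {
    unsigned count = (unsigned) __builtin_clzll(divisor_high); // 0...63 when divisor_high != 0
    return count & 64;                                         // always 0 for non-zero input
}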
optimiser fails – take 14
Create the text file case27.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
// signed division overflows (really: raises 'division exception')
// only for -2**31 / -1 = ±2**31!
int clamp_div_32(int dividend, int divisor)
{
return (dividend == ~2147483647) & (divisor == -1) ? 2147483647 : dividend / divisor;
}
int clamp_mod_32(int dividend, int divisor)
{
return (dividend == ~2147483647) & (divisor == -1) ? 0 : dividend % divisor;
}
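For context, a small hypothetical test driver (not part of case27.c, provided only to illustrate the corner case these helpers guard against): on x86, INT_MIN / −1 raises the division exception because the quotient 2**31 is not representable:
// Demonstrates the clamped result for the only overflowing input pair.
#include <limits.h>
#include <stdio.h>

int clamp_div_32(int dividend, int divisor);

int main(void) {
    printf("%d\n", clamp_div_32(INT_MIN, -1)); // prints 2147483647 instead of trapping
    printf("%d\n", clamp_div_32(-42, 7));      // prints -6
    return 0;
}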
Compile the source file case27.c with Clang, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case27.c
Note: the left column shows the generated code, while the right column shows properly optimised code as comment.
[…]
clamp_div_32:                           # @clamp_div_32
# %bb.0:
        pushl   %esi                    #
        movl    8(%esp), %eax           #       mov     eax, [esp+4]
        movl    12(%esp), %esi          #       mov     ecx, [esp+8]
        cmpl    $-2147483648, %eax      #       cmp     eax, -2147483648
        jne     LBB0_2                  #       jne     LBB0_2
# %bb.1:
        movl    $2147483647, %ecx
        cmpl    $-1, %esi               #       cmp     ecx, -1
        jne     LBB0_2                  #       jne     LBB0_2
# %bb.3:
        movl    %ecx, %eax              #       mov     eax, 2147483647
        popl    %esi                    #
        retl                            #       ret
LBB0_2:
        cltd                            #       cdq
        idivl   %esi                    #       idiv    ecx
        movl    %eax, %ecx              #
        movl    %ecx, %eax              #
        popl    %esi                    #
        retl                            #       ret
[…]
clamp_mod_32:                           # @clamp_mod_32
# %bb.0:
        movl    4(%esp), %eax           #       mov     eax, [esp+4]
        movl    8(%esp), %ecx           #       mov     ecx, [esp+8]
        cmpl    $-2147483648, %eax      #       cmp     eax, -2147483648
        jne     LBB1_2                  #       jne     LBB1_2
# %bb.1:
        xorl    %edx, %edx              #
        cmpl    $-1, %ecx               #       cmp     ecx, -1
        jne     LBB1_2                  #       jne     LBB1_2
# %bb.3:
        movl    %edx, %eax              #       xor     eax, eax
        retl                            #       ret
LBB1_2:
        cltd                            #       cdq
        idivl   %ecx                    #       idiv    ecx
        movl    %edx, %eax              #       mov     eax, edx
        retl                            #       ret
[…]
        .ident  "clang version 10.0.0 "
[…]
Ouch: the code generated for these almost identical functions shows significant differences (the first function clobbers register ESI without necessity) and a common deficiency (superfluous MOV instructions).
optimiser fails – take 15
Create the text file case28.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <stefan.kanthak@nexgo.de>
// Signed division overflows (really: raises "division exception")
// only for -2**63 / -1 = ±2**63!
long long clamp_div_64(long long dividend, long long divisor) {
return (divisor == -1LL) & (dividend == ~9223372036854775807LL) ? 9223372036854775807LL : dividend / divisor;
}
long long clamp_mod_64(long long dividend, long long divisor) {
return (divisor ^ -1LL) | (dividend ^ ~9223372036854775807LL) ? dividend % divisor : 0LL;
}
Compile the source file case28.c
with
Clang, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
clang -o- -O3 -S -target i386-pc-linux case28.c
Note: the left column shows the horrible generated code, while the right column shows properly optimised code as comment.
[…]
clamp_div_64:                           # @clamp_div_64
# %bb.0:
        pushl   %ebp                    #
        pushl   %ebx                    #
        pushl   %edi                    #
        pushl   %esi                    #
        subl    $12, %esp               #
        movl    36(%esp), %esi          #
        movl    $-1, %ebp               #
        movl    32(%esp), %ecx          #
        movl    44(%esp), %ebx          #
        movl    40(%esp), %edi          #
        movl    $2147483647, %edx       #
        movl    $-1, %eax
        addl    $1, %ebp                #
        movl    $2147483647, %ebp       #
        adcl    $0, %ebp                #
        xorl    %esi, %ebp              #
        orl     %ecx, %ebp              #
        jne     .LBB0_2                 #
# %bb.1:
        movl    %edi, %ebp              #       mov     eax, [esp+12]
        andl    %ebx, %ebp              #       and     eax, [esp+16]
        cmpl    $-1, %ebp               #       inc     eax
        je      .LBB0_3                 #
.LBB0_2:
        pushl   %ebx                    #
        pushl   %edi                    #
        pushl   %esi                    #
        pushl   %ecx                    #
        calll   __divdi3                #       jnz     __divdi3
        addl    $16, %esp               #       cdq
.LBB0_3:
        addl    $12, %esp               #       sub     eax, [esp+4]
        popl    %esi                    #       sbb     edx, [esp+8]
        popl    %edi                    #       jno     .LABEL
        popl    %ebx                    #       not     eax
        popl    %ebp                    #       not     edx
                                        # .LABEL:
        retl                            #       ret
[…]
clamp_mod_64:                           # @clamp_mod_64
# %bb.0:
        pushl   %ebx                    #
        pushl   %edi                    #
        pushl   %esi                    #
        movl    20(%esp), %esi          #
        movl    16(%esp), %ecx          #
        movl    28(%esp), %ebx          #
        movl    24(%esp), %edi          #
        movl    %esi, %eax              #
        xorl    $-2147483648, %eax      #
        orl     %ecx, %eax              #
        jne     .LBB1_2                 #
# %bb.1:
        movl    %edi, %edx              #       mov     eax, [esp+12]
        xorl    %eax, %eax              #
        andl    %ebx, %edx              #       and     eax, [esp+16]
        cmpl    $-1, %edx               #       inc     eax
        movl    $0, %edx                #       je      .LABEL
        jne     .LBB1_2                 #
# %bb.3:
        popl    %esi                    #
        popl    %edi                    #
        popl    %ebx                    #
        retl                            #
.LBB1_2:
        pushl   %ebx                    #
        pushl   %edi                    #
        pushl   %esi                    #
        pushl   %ecx                    #
        calll   __moddi3                #       jnz     __moddi3
        addl    $16, %esp               #       cdq
                                        # .LABEL:
        popl    %esi                    #
        popl    %edi                    #
        popl    %ebx                    #
        retl                            #       ret
[…]
        .ident  "clang version 10.0.0 "
[…]
Oops: basic knowledge of elementary integer arithmetic would help the compiler to optimise the second function even before code generation: the remainder of a division by −1 (or 1) is 0, independent of the dividend!
OUCH: 34 instructions in 84 bytes for the saturating signed division and 31 instructions in 66 bytes for the saturating signed modulus, clobbering registers EBP, EBX, EDI and ESI and copying the arguments without necessity, instead of only 11 instructions in just 27 bytes and only 6 instructions in just 13 bytes, respectively.
OUCH: the 4 instructions movl $-1, %ebp, addl $1, %ebp, movl $2147483647, %ebp and adcl $0, %ebp of the first function show completely confused code to load the constant −2147483648 into register EBP; additionally the already loaded constants 2147483647 and −1 are not (re)used!
OOPS: the 2 instructions cmpl $-1, %edx and movl $0, %edx of the second function can be replaced by a single INC instruction; the following conditional branch JNE can then be replaced by a JE to the tail of the function, saving the following 4 duplicate instructions; the preceding XOR instruction to clear register EAX is superfluous.
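The arithmetic fact noted in the Oops above can be expressed directly in C; the following sketch (not the author's code, merely an illustration of the transformation the optimiser misses) is behaviourally equivalent to clamp_mod_64() and never traps:
// x % -1 == 0 for every x, so testing the divisor alone suffices;
// the dividend == ~9223372036854775807LL check becomes unnecessary.
long long clamp_mod_64_simplified(long long dividend, long long divisor) {
    return divisor == -1LL ? 0LL : dividend % divisor;
}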