
Me, myself & IT

Deficiencies in GCC’s Code Generator and Optimiser

Purpose
Reason
Case 0
Case 1: __udivmod{d,t}i4() Functions
Case 2: __ashl{s,d,t}i3() Functions
Case 3: __ashr{s,d,t}i3() Functions
Case 4: __lshr{s,d,t}i3() Functions
Case 5: Missing __rotl{s,d,t}i3() and __rotr{s,d,t}i3() Functions
Case 6: __cmp{s,d,t}i2() Functions
Case 7: __ucmp{s,d,t}i2() Functions
Case 8: __mul{s,d,t}i3() Functions
Case 9: __neg{s,d,t}i2() Functions
Case 10: __absv{s,d,t}i2() Functions
Case 11: __addv{s,d,t}i3() Functions
Case 12: __subv{s,d,t}i3() Functions
Case 13: __mulv{s,d,t}i3() Functions
Case 14: __negv{s,d,t}i2() Functions
Case 15: __parity{s,d,t}i2() Functions
Case 16: __builtin_parity() Builtin
Case 17: __builtin_mul_overflow() Builtin
Case 18: __builtin_sub_overflow() Builtin
Case 19: __builtin_copysign() Builtin
Case 20: -fsanitize=signed-integer-overflow Command Line Option
Case 21: -ftrapv Command Line Option
Case 22: Shell Game, or Crazy Braindead Register Allocator
Case 23: Undefined Behaviour in __bswapsi2() Function
Case 24: Undefined Behaviour or Optimiser Failure?
Case 25: Optimiser Failures
Case 26: More Optimiser Failures
Case 27: Yet More Optimiser Failures
Case 28: Dead Code Generation
Case 29: Superfluous Extraneous Code Generation
Case 30: Epic Optimiser Failures
Case 31: Cargo Cult, or Ignorance is Bliss, plus Epic Optimiser Failure
Case 32: More Epic Optimiser Failures
Case 33: Yet More Epic Optimiser Failures

Purpose

Demonstrate multiple bugs, defects, deficiencies and flaws of GCC and its libgcc library, especially the rather poor code generator.

Reason

The code generated by GCC (not only for its libgcc library) is rather poor, up to an order of magnitude worse than properly written (optimised) code.

Note: I recommend reading Randall Hyde’s ACM article, especially the following part:

Observation #6: Software engineers have been led to believe that their time is more valuable than CPU time; therefore, wasting CPU cycles in order to reduce development time is always a win. They've forgotten, however, that the application users' time is more valuable than their time.

Case 0

In his ACM queue article Optimizations in C++ Compilers – A practical journey, Matt Godbolt presents the following function as an example:
bool isWhitespace(char c)
{
    return c == ' '
      || c == '\r'
      || c == '\n'
      || c == '\t';
}
He then shows the following code generated by GCC 9.1 for the AMD64 platform, which uses 7 instructions in 27 bytes:
isWhitespace(char):
  xor    eax, eax           ; result = false
  cmp    dil, 32            ; is c > 32
  ja     .L4                ; if so, exit with false
  movabs rax, 4294977024    ; rax = 0x100002600
  shrx   rax, rax, rdi      ; rax >>= c
  and    eax, 1             ; result = rax & 1
.L4:
  ret
This code is not optimal, though: the conditional branch impairs performance; it can and should be avoided, as demonstrated by the following equivalent code, which uses 7 instructions in 26 bytes:
; Copyleft © 2020-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

isWhitespace:
	movabs	rax, 100002600h
	shrx	rax, rax, rdi
	cmp	dil, 33
	sbb	edi, edi
	neg	edi
	and	eax, edi
	ret
Since the SHRX instruction is only available on newer CPUs, here is also branch-free code that runs on all AMD64 processors, with 8 instructions in just 25 bytes:
; Copyleft © 2020-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

isWhitespace:
	mov	ecx, edi
	movabs	rax, 100002600h
	shr	rax, cl
	cmp	cl, 33
	sbb	ecx, ecx
	neg	ecx
	and	eax, ecx
	ret
Finally, equivalent branch-free code that works on 35-year-old i386 processors, needing neither an AMD64 processor nor the SHRX instruction, with 10 instructions in 26 bytes when the preprocessor macro OPTIMIZE is not defined, else in just 25 bytes:
; Copyleft © 2020-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

isWhitespace:
	mov	ecx, [esp+4]
#ifndef OPTIMIZE
	mov	eax, 2600h	; eax = (1 << '\r') | (1 << '\n') | (1 << '\t')
	test	cl, cl
	setnz	al		; eax |= (c != '\0')
#else
	xor	eax, eax
	cmp	eax, ecx	; CF = (c != '\0')
	adc	eax, 2600h	; eax = (c != '\0') | (1 << '\r') | (1 << '\n') | (1 << '\t')
#endif
	shr	eax, cl		; eax >>= (c % ' ')
	xor	edx, edx
	cmp	ecx, 33		; CF = (c <= ' ')
	adc	edx, edx	; edx = (c <= ' ')
	and	eax, edx
	ret
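
For reference, the bit-mask trick can also be written directly in branch-free C; the following sketch (a hypothetical helper, not taken from Godbolt’s article) keeps the shift count within 0…63 so the shift is always well-defined:

#include <stdbool.h>

// Branch-free sketch: mask 0x100002600 has bits 9 ('\t'), 10 ('\n'),
// 13 ('\r') and 32 (' ') set; the 0/1 result of the comparison selects bit 0.
bool isWhitespace2(char c) {
    unsigned u = (unsigned char) c;
    return ((0x100002600ULL >> (u & 63)) & (u <= 32)) != 0;
}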

Case 1: __udivmod{d,t}i4() Functions

libgcc provides the __udivmoddi4() and __udivmodti4() functions for unsigned integer division:
Runtime Function: unsigned long __udivmoddi4 (unsigned long a, unsigned long b, unsigned long *c)
Runtime Function: unsigned long long __udivmodti4 (unsigned long long a, unsigned long long b, unsigned long long *c)

These functions calculate both the quotient and remainder of the unsigned division of a and b. The return value is the quotient, and the remainder is placed in the variable pointed to by c.
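
For illustration (an assumption about typical use, not part of libgcc’s documentation): on AMD64, GCC maps the unsigned __int128 type to these TImode helpers, so plain 128-bit division and modulo compile to calls of __udivti3() and __umodti3(), which in turn call __udivmodti4():

// Hypothetical example functions; the comments name the libgcc calls
// GCC emits for them on the AMD64 platform.
unsigned __int128 quotient128(unsigned __int128 a, unsigned __int128 b) {
    return a / b;   // compiled to a call of __udivti3()
}

unsigned __int128 remainder128(unsigned __int128 a, unsigned __int128 b) {
    return a % b;   // compiled to a call of __umodti3()
}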

The last 5 statements of the following part of their source perform two multi-word left shifts on the variable pair d1, d0 holding the divisor and the variable triple n2, n1, n0 holding the dividend:
/* More subroutines needed by GCC output code on some machines.  */
/* Compile this one with gcc.  */
/* Copyright (C) 1989-2024 Free Software Foundation, Inc.
[…]
UDWtype
__udivmoddi4 (UDWtype n, UDWtype d, UDWtype *rp)
{
[…]
  UWtype d0, d1, n0, n1, n2;
  UWtype b, bm;
[…]
	  count_leading_zeros (bm, d1);
	  if (bm == 0)
[…]
	  else
	    {
	      UWtype m1, m0;
	      /* Normalize.  */

	      b = W_TYPE_SIZE - bm;

	      d1 = (d1 << bm) | (d0 >> b);
	      d0 = d0 << bm;
	      n2 = n1 >> b;
	      n1 = (n1 << bm) | (n0 >> b);
	      n0 = n0 << bm;
[…]
For the AMD64 target platform, GCC compiles this sequence of shifts to the following unoptimised and rather awful code:
[…]
0000000000000000 <__udivmodti4>:
[…]
  b0: 4c 0f bd da           bsr    %rdx,%r11
  b4: 49 83 f3 3f           xor    $0x3f,%r11
  b8: 45 85 db              test   %r11d,%r11d
  bb: 75 33                 jne    f0 <__udivmodti4+0xf0>
[…]
  f0: 49 63 c3              movslq %r11d,%rax
  f3: bd 40 00 00 00        mov    $0x40,%ebp
  f8: 44 89 d9              mov    %r11d,%ecx
  fb: 4d 89 ec              mov    %r13,%r12
  fe: 48 29 c5              sub    %rax,%rbp
 101: 48 d3 e2              shl    %cl,%rdx
 104: 49 89 f2              mov    %rsi,%r10
 107: 48 89 f8              mov    %rdi,%rax
 10a: 89 e9                 mov    %ebp,%ecx
 10c: 44 89 db              mov    %r11d,%ebx
 10f: 49 d3 ec              shr    %cl,%r12
 112: 44 89 d9              mov    %r11d,%ecx
 115: 49 d3 e5              shl    %cl,%r13
 118: 89 e9                 mov    %ebp,%ecx
 11a: 49 09 d4              or     %rdx,%r12
 11d: 49 d3 ea              shr    %cl,%r10
 120: 44 89 d9              mov    %r11d,%ecx
 123: 48 d3 e6              shl    %cl,%rsi
 126: 89 e9                 mov    %ebp,%ecx
 128: 4c 89 d2              mov    %r10,%rdx
 12b: 48 d3 e8              shr    %cl,%rax
 12e: 44 89 d9              mov    %r11d,%ecx
 131: 48 09 c6              or     %rax,%rsi
 134: 48 d3 e7              shl    %cl,%rdi
 137: 48 89 f0              mov    %rsi,%rax
 13a: 49 89 f9              mov    %rdi,%r9
[…]
Oops: the TEST instruction at address b8 is superfluous.

Ouch: the optimiser fails to recognise the common and well-known idiom of complementary shifts, for which it should generate the following properly optimised and significantly shorter code instead:

[…]
                            mov    %r11d, %ecx
                            xor    n2, n2
                            shld   %cl, n1, n2
                            shld   %cl, n0, n1
                            shl    %cl, n0
                            shld   %cl, d0, d1
                            shl    %cl, d0
[…]
Note: the optimiser does recognise the following similar, equally common and well-known idiom used for rotation:
unsigned long long __rotldi3(unsigned long long value, int count) {
    return (value << (63 & count))
         | (value >> (63 & -count));
}
The code generated by GCC for the __udivmodti4() function shows 141 instructions in 439 bytes, while properly written and faster code uses only 85 instructions in just 232 bytes:
# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

# * The software is provided "as is" without any warranty, neither express
#   nor implied.
# * In no event will the author be held liable for any damage(s) arising
#   from the use of the software.
# * Redistribution of the software is allowed only in unmodified form.
# * Permission is granted to use the software solely for personal private
#   and non-commercial purposes.
# * An individual's use of the software in his or her capacity or function
#   as an agent, (independent) contractor, employee, member or officer of
#   a business, corporation or organization (commercial or non-commercial)
#   does not qualify as personal private and non-commercial purpose.
# * Without written approval from the author the software must not be used
#   for a business, for commercial, corporate, governmental, military or
#   organizational purposes of any kind, or in a commercial, corporate,
#   governmental, military or organizational environment of any kind.

# Unix System V calling convention for AMD64 platform:
# - first 6 integer or pointer arguments (from left to right) are passed
#   in registers RDI/R7, RSI/R6, RDX/R2, RCX/R1, R8 and R9
#   (R10 is used as static chain pointer in case of nested functions);
# - surplus arguments are pushed on stack in reverse order (from right to
#   left), 8-byte aligned;
# - 128-bit integer arguments are passed as pair of 64-bit integer arguments,
#   low part before/below high part;
# - 128-bit integer result is returned in registers RAX/R0 (low part) and
#   RDX/R2 (high part);
# - 64-bit integer or pointer result is returned in register RAX/R0;
# - 32-bit integer result is returned in register EAX;
# - registers RBX/R3, RSP/R4, RBP/R5, R12, R13, R14 and R15 must be
#   preserved;
# - registers RAX/R0, RCX/R1, RDX/R2, RSI/R6, RDI/R7, R8, R9, R10 (in
#   case of normal functions) and R11 are volatile and can be clobbered;
# - stack is 16-byte aligned: callee must decrement RSP by 8+n*16 bytes
#   before calling other functions (CALL instruction pushes 8 bytes);
# - a "red zone" of 128 bytes below the stack pointer can be clobbered.

# NOTE: raises "division exception" when divisor is 0!

.file	"udivmodti4.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = dividend
				# rcx:rdx = divisor
__umodti3:
	sub	rsp, 24
	mov	r8, rsp		# r8 = oword ptr remainder
	call	__udivmodti4
	pop	rax
	pop	rdx		# rdx:rax = remainder
	pop	rcx
	ret
				# rsi:rdi = dividend
				# rcx:rdx = divisor
__udivti3:
	xor	r8, r8
				# rsi:rdi = dividend
				# rcx:rdx = divisor
				# r8 = oword ptr remainder
__udivmodti4:
	cmp	rdi, rdx
	mov	rax, rsi
	sbb	rax, rcx
	jb	.trivial	# dividend < divisor?

	mov	r10, rdx
	mov	r11, rcx	# r11:r10 = divisor

	bsr	rcx, rcx	# rcx = index of most significant '1' bit
				#        in high qword of divisor
	jnz	.extended	# high qword of divisor <> 0?

	# remainder < divisor < 2**64

	cmp	rsi, rdx
	jae	.long		# high qword of dividend >= (low qword of) divisor?

	# dividend < divisor * 2**64: quotient < 2**64
	# perform normal division
.normal:
	mov	rdx, rsi
	mov	rax, rdi	# rdx:rax = dividend
	div	r10		# rax = (low qword of) quotient,
				# rdx = (low qword of) remainder
	test	r8, r8
	jz	0f		# address of remainder = 0?

	mov	[r8], rdx
	mov	[r8+8], r11	# high qword of remainder = 0
0:
	mov	rdx, r11	# rdx:rax = quotient
	ret

	# dividend >= divisor * 2**64: quotient >= 2**64
	# perform "long" alias "schoolbook" division
.long:
	mov	rdx, r11	# rdx = 0
	mov	rax, rsi	# rdx:rax = high qword of dividend
	div	r10		# rax = high qword of quotient,
				# rdx = high qword of remainder'
	mov	rcx, rax	# rcx = high qword of quotient
	mov	rax, rdi	# rax = low qword of dividend
	div	r10		# rax = low qword of quotient,
				# rdx = (low qword of) remainder
	test	r8, r8
	jz	1f		# address of remainder = 0?

	mov	[r8], rdx
	mov	[r8+8], r11	# high qword of remainder = 0
1:
	mov	rdx, rcx	# rdx:rax = quotient
	ret

	# dividend < divisor: quotient = 0, remainder = dividend
.trivial:
	test	r8, r8
	jz	2f		# address of remainder = 0?

	mov	[r8], rdi
	mov	[r8+8], rsi	# remainder = dividend
2:
	xor	eax, eax
	xor	edx, edx	# rdx:rax = quotient = 0
	ret

	# dividend >= divisor >= 2**64: quotient < 2**64
.extended:
	xor	ecx, 63		# ecx = number of leading '0' bits
				#        in (high qword of) divisor
	jz	.special	# divisor >= 2**127?

	# perform "extended & adjusted" division

	shld	r11, r10, cl
	shl	r10, cl		# r11:r10 = divisor * 2**(index + 1)
				#         = divisor'
	xor	edx, edx
	shld	rdx, rsi, cl
	shld	rsi, rdi, cl
	shl	rdi, cl		# rdx:rsi:rdi = dividend * 2**(index + 1)
				#             = dividend'
	mov	rax, rsi
	div	r11		# rax = quotient',
				# rdx = high qword of remainder'
	mov	rsi, rdx	# rsi:rdi = remainder'
	mov	r9, rax		# r9 = quotient'
	mul	r10		# rdx:rax = quotient' * low qword of divisor'
				#         = remainder"
.ifnotdef JCCLESS
	cmp	rdx, rsi
	ja	4f
	jb	3f
	cmp	rax, rdi
	ja	4f		# remainder" > remainder'?
.else
	push	rbx
	mov	rbx, rdx
	cmp	rax, rdi
	sbb	rbx, rsi
	pop	rbx
	ja	4f		# quotient' > quotient?
.endif
3:
	test	r8, r8
	jnz	5f		# address of remainder <> 0?

	xor	edx, edx
	mov	rax, r9		# rdx:rax = quotient
	ret
4:
	dec	r9		# r9 = quotient' - 1
				#    = (low qword of) quotient
	test	r8, r8
	jz	6f		# address of remainder = 0?

	sub	rax, r10
	sbb	rdx, r11	# rdx:rax = remainder" - divisor'
5:
	sub	rdi, rax
	sbb	rsi, rdx	# rsi:rdi = remainder * 2**(index + 1)
	shrd	rdi, rsi, cl
	shr	rsi, cl		# rsi:rdi = remainder

	mov	[r8], rdi
	mov	[r8+8], rsi
6:
	xor	edx, edx
	mov	rax, r9		# rdx:rax = quotient
	ret

	# dividend >= divisor >= 2**127:
	# quotient = 1, remainder = dividend - divisor
.special:
	test	r8, r8
	jz	7f		# address of remainder = 0?

	sub	rdi, r10
	sbb	rsi, r11	# rsi:rdi = dividend - divisor
				#         = remainder
	mov	[r8], rdi
	mov	[r8+8], rsi
7:
	xor	eax, eax
	xor	edx, edx
	inc	eax		# rdx:rax = quotient = 1
	ret

.size	__udivmodti4, .-__udivmodti4
.type	__udivmodti4, @function
.global	__udivmodti4
.size	__udivti3, .-__udivti3
.type	__udivti3, @function
.global	__udivti3
.size	__umodti3, .-__umodti3
.type	__umodti3, @function
.global	__umodti3
.end
Note: exploration of the same unoptimised and awful code in the __divmodti4(), __divti3(), __modti3(), __udivti3() and __umodti3() as well as the __divmoddi4(), __divdi3(), __moddi3(), __udivmoddi4(), __udivdi3() and __umoddi3() functions is left as an exercise to the reader.

Case 2: __ashl{s,d,t}i3() Functions

libgcc provides the __ashlsi3(), __ashldi3() and __ashlti3() functions for integer shift operations:
Runtime Function: int __ashlsi3 (int a, int b)
Runtime Function: long __ashldi3 (long a, int b)
Runtime Function: long long __ashlti3 (long long a, int b)

These functions return the result of shifting a left by b bits.

For the AMD64 target platform, GCC compiles __ashlti3() to the following unoptimised and awful code:
[…]
0000000000000000 <__ashlti3>:
   0:	48 85 d2             	test   %rdx,%rdx
   3:	74 3b                	je     40 <__ashlti3+0x40>
   5:	41 b8 40 00 00 00    	mov    $0x40,%r8d
   b:	49 29 d0             	sub    %rdx,%r8
   e:	4d 85 c0             	test   %r8,%r8
  11:	7e 1d                	jle    30 <__ashlti3+0x30>
  13:	89 d1                	mov    %edx,%ecx
  15:	48 89 f8             	mov    %rdi,%rax
  18:	48 d3 e6             	shl    %cl,%rsi
  1b:	48 d3 e0             	shl    %cl,%rax
  1e:	44 89 c1             	mov    %r8d,%ecx
  21:	48 89 f2             	mov    %rsi,%rdx
  24:	48 d3 ef             	shr    %cl,%rdi
  27:	48 09 fa             	or     %rdi,%rdx
  2a:	c3                   	retq   
  2b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  30:	44 89 c1             	mov    %r8d,%ecx
  33:	31 c0                	xor    %eax,%eax
  35:	f7 d9                	neg    %ecx
  37:	48 d3 e7             	shl    %cl,%rdi
  3a:	48 89 fa             	mov    %rdi,%rdx
  3d:	c3                   	retq   
  3e:	66 90                	xchg   %ax,%ax
  40:	48 89 f8             	mov    %rdi,%rax
  43:	48 89 f2             	mov    %rsi,%rdx
  46:	c3                   	retq   
[…]
Oops: the TEST instruction at address e is superfluous.

Properly written branch-free code uses only 11 instructions in just 32 bytes instead of 26 instructions in 71 bytes, and it also handles shift counts greater than 127:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"ashlti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = value
				# rdx = count
__ashlti3:
	mov	ecx, edx
	xor	edx, edx
	xor	eax, eax
	shld	rsi, rdi, cl
	shl	rdi, cl		# rsi:rdi = value << count % 64
	cmp	ecx, 127
	cmovbe	rdx, rdi
	cmp	ecx, 63
	cmovbe	rdx, rsi
	cmovbe	rax, rdi	# rdx:rax = value << count
	ret

.size	__ashlti3, .-__ashlti3
.type	__ashlti3, @function
.global	__ashlti3
.end
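
The selection logic of this branch-free code can be modelled in C as follows (a hypothetical helper operating on the two 64-bit halves, not the libgcc source; the guard avoids the undefined 64-bit shift by 64 when the count is a multiple of 64):

typedef unsigned long long u64;

// Shift the halves by count modulo 64, then pick the proper halves:
// counts 0…63 shift normally, 64…127 move the shifted low half up,
// counts of 128 and above yield zero.
void ashl128(u64 *hi, u64 *lo, unsigned count) {
    unsigned c = count & 63;
    u64 l = *lo << c;
    u64 h = (*hi << c) | (c ? *lo >> (64 - c) : 0);
    *hi = count > 127 ? 0 : (count > 63 ? l : h);
    *lo = count > 63 ? 0 : l;
}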
Note: exploration of the equally bad code generated for the __ashldi3() function on the i386 platform is left as an exercise to the reader.

Case 3: __ashr{s,d,t}i3() Functions

libgcc provides the __ashrsi3(), __ashrdi3() and __ashrti3() functions for integer shift operations:
Runtime Function: int __ashrsi3 (int a, int b)
Runtime Function: long __ashrdi3 (long a, int b)
Runtime Function: long long __ashrti3 (long long a, int b)

These functions return the result of arithmetically shifting a right by b bits.

For the AMD64 target platform, GCC compiles __ashrti3() to the following unoptimised and awful code:
[…]
0000000000000000 <__ashrti3>:
   0:	48 89 d1             	mov    %rdx,%rcx
   3:	48 85 d2             	test   %rdx,%rdx
   6:	74 38                	je     40 <__ashrti3+0x40>
   8:	41 b8 40 00 00 00    	mov    $0x40,%r8d
   e:	49 29 d0             	sub    %rdx,%r8
  11:	48 89 f2             	mov    %rsi,%rdx
  14:	4d 85 c0             	test   %r8,%r8
  17:	7e 17                	jle    30 <__ashrti3+0x30>
  19:	48 d3 ef             	shr    %cl,%rdi
  1c:	48 d3 fa             	sar    %cl,%rdx
  1f:	48 89 f0             	mov    %rsi,%rax
  22:	44 89 c1             	mov    %r8d,%ecx
  25:	48 d3 e0             	shl    %cl,%rax
  28:	48 09 f8             	or     %rdi,%rax
  2b:	c3                   	retq   
  2c:	0f 1f 40 00          	nopl   0x0(%rax)
  30:	44 89 c1             	mov    %r8d,%ecx
  33:	48 89 f0             	mov    %rsi,%rax
  36:	48 c1 fa 3f          	sar    $0x3f,%rdx
  3a:	f7 d9                	neg    %ecx
  3c:	48 d3 f8             	sar    %cl,%rax
  3f:	c3                   	retq   
  40:	48 89 f8             	mov    %rdi,%rax
  43:	48 89 f2             	mov    %rsi,%rdx
  46:	c3                   	retq   
[…]
Oops: the TEST instruction at address 14 is superfluous.

Properly written branch-free code uses only 12 instructions in just 36 bytes instead of 25 instructions in 71 bytes, and it also handles shift counts greater than 127:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"ashrti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = value
				# rdx = count
__ashrti3:
	mov	ecx, edx
	mov	rax, rsi
	cqo
	mov	rax, rdx
	shrd	rdi, rsi, cl
	sar	rsi, cl		# rsi:rdi = value >> count % 64
	cmp	ecx, 127
	cmovbe	rax, rsi
	cmp	ecx, 63
	cmovbe	rdx, rsi
	cmovbe	rax, rdi	# rdx:rax = value >> count
	ret

.size	__ashrti3, .-__ashrti3
.type	__ashrti3, @function
.global	__ashrti3
.end
Note: exploration of the equally bad code generated for the __ashrdi3() function on the i386 platform is left as an exercise to the reader.

Case 4: __lshr{s,d,t}i3() Functions

libgcc provides the __lshrsi3(), __lshrdi3() and __lshrti3() functions for integer shift operations:
Runtime Function: int __lshrsi3 (int a, int b)
Runtime Function: long __lshrdi3 (long a, int b)
Runtime Function: long long __lshrti3 (long long a, int b)

These functions return the result of logically shifting a right by b bits.

For the AMD64 target platform, GCC compiles __lshrti3() to the following unoptimised and awful code:
[…]
0000000000000000 <__lshrti3>:
   0:	48 89 d1             	mov    %rdx,%rcx
   3:	48 85 d2             	test   %rdx,%rdx
   6:	74 38                	je     40 <__lshrti3+0x40>
   8:	41 b8 40 00 00 00    	mov    $0x40,%r8d
   e:	49 29 d0             	sub    %rdx,%r8
  11:	4d 85 c0             	test   %r8,%r8
  14:	7e 1a                	jle    30 <__lshrti3+0x30>
  16:	48 89 f2             	mov    %rsi,%rdx
  19:	48 d3 ef             	shr    %cl,%rdi
  1c:	48 89 f0             	mov    %rsi,%rax
  1f:	48 d3 ea             	shr    %cl,%rdx
  22:	44 89 c1             	mov    %r8d,%ecx
  25:	48 d3 e0             	shl    %cl,%rax
  28:	48 09 f8             	or     %rdi,%rax
  2b:	c3                   	retq   
  2c:	0f 1f 40 00          	nopl   0x0(%rax)
  30:	44 89 c1             	mov    %r8d,%ecx
  33:	48 89 f0             	mov    %rsi,%rax
  36:	31 d2                	xor    %edx,%edx
  38:	f7 d9                	neg    %ecx
  3a:	48 d3 e8             	shr    %cl,%rax
  3d:	c3                   	retq   
  3e:	66 90                	xchg   %ax,%ax
  40:	48 89 f8             	mov    %rdi,%rax
  43:	48 89 f2             	mov    %rsi,%rdx
  46:	c3                   	retq   
[…]
Oops: the TEST instruction at address 11 is superfluous.

Properly written branch-free code uses only 11 instructions in just 32 bytes instead of 25 instructions in 71 bytes, and it also handles shift counts greater than 127:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"lshrti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = value
				# rdx = count
__lshrti3:
	mov	ecx, edx
	xor	edx, edx
	xor	eax, eax
	shrd	rdi, rsi, cl
	shr	rsi, cl		# rsi:rdi = value >> count % 64
	cmp	ecx, 127
	cmovbe	rax, rsi
	cmp	ecx, 63
	cmovbe	rdx, rsi
	cmovbe	rax, rdi	# rdx:rax = value >> count
	ret

.size	__lshrti3, .-__lshrti3
.type	__lshrti3, @function
.global	__lshrti3
.end
Note: exploration of the equally bad code generated for the __lshrdi3() function on the i386 platform is left as an exercise to the reader.

Note: the __lshldi3() and __lshlti3() functions are equal to the __ashldi3() and __ashlti3() functions presented above.

Case 5: Missing __rotl{s,d,t}i3() and __rotr{s,d,t}i3() Functions

libgcc fails to provide the __rotlsi3(), __rotldi3() and __rotlti3() as well as the __rotrsi3(), __rotrdi3() and __rotrti3() functions for (unsigned) integer rotation, also known as circular shift.

Create the text file case5.c with the following content in an arbitrary, preferably empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

unsigned int __rotlsi3(unsigned int value, int count) {
    return (value << (31 & count))
         | (value >> (31 & -count));
}

unsigned int __rotrsi3(unsigned int value, int count) {
    return (value >> (31 & count))
         | (value << (31 & -count));
}

unsigned long long __rotldi3(unsigned long long value, int count) {
    return (value << (63 & count))
         | (value >> (63 & -count));
}

unsigned long long __rotrdi3(unsigned long long value, int count) {
    return (value >> (63 & count))
         | (value << (63 & -count));
}

#ifdef __amd64__
__uint128_t __rotlti3(__uint128_t value, int count) {
    return (value << (127 & count))
         | (value >> (127 & -count));
}

__uint128_t __rotrti3(__uint128_t value, int count) {
#ifndef OPTIMIZE
    return (value >> (127 & count))
         | (value << (127 & -count));
#else
    __asm__("movq\t%[low], %%rax\n\t"
            "shrdq\t%%cl, %[high], %%rax\n\t"
            "shrdq\t%%cl, %[low], %[high]\n\t"
            "movq\t%[high], %%rdx\n\t"
            "test\t$64, %%cl\n\t"
            "cmovnz\t%%rax, %%rdx\n\t"
            "cmovnz\t%[high], %%rax"
           :"=A" (value)
           :"c" (count),
            [low] "r" ((unsigned long long) (value & ~0ULL)),
            [high] "r" ((unsigned long long) (value >> 64))
           :"cc");
    return value;
#endif // OPTIMIZE
}
#endif // __amd64__
Compile the source file case5.c with GCC, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -o- -Os -S case5.c
[…]
__rotlsi3:
	movl	%edi, %eax
	movl	%esi, %ecx
	roll	%cl, %eax
	ret
[…]
__rotrsi3:
	movl	%edi, %eax
	movl	%esi, %ecx
	rorl	%cl, %eax
	ret
[…]
__rotldi3:
	movq	%rdi, %rax
	movl	%esi, %ecx
	rolq	%cl, %rax
	ret
[…]
__rotrdi3:
	movq	%rdi, %rax
	movl	%esi, %ecx
	rorq	%cl, %rax
	ret
[…]
__rotlti3:
	movq	%rdi, %r9
	movq	%rsi, %r8
	andl	$127, %edx
	movq	%rdi, %rsi
	movl	%edx, %ecx
	movq	%r9, %rax
	movq	%r8, %rdx
	movq	%r8, %rdi
	shldq	%cl, %r9, %rdx
	salq	%cl, %rax
	xorl	%r8d, %r8d
	testb	$64, %cl
	cmovne	%rax, %rdx
	cmovne	%r8, %rax
	negl	%ecx
	xorl	%r9d, %r9d
	andl	$127, %ecx
	shrdq	%cl, %rdi, %rsi
	shrq	%cl, %rdi
	testb	$64, %cl
	cmovne	%rdi, %rsi
	cmovne	%r9, %rdi
	orq	%rsi, %rax
	orq	%rdi, %rdx
	ret
[…]
__rotrti3:
	movq	%rsi, %r8
	movq	%rdi, %r9
	andl	$127, %edx
	movq	%rdi, %rsi
	movl	%edx, %ecx
	movq	%r9, %rax
	movq	%r8, %rdx
	movq	%r8, %rdi
	shrdq	%cl, %r8, %rax
	shrq	%cl, %rdx
	xorl	%r8d, %r8d
	testb	$64, %cl
	cmovne	%rdx, %rax
	cmovne	%r8, %rdx
	negl	%ecx
	andl	$127, %ecx
	shldq	%cl, %r9, %rdi
	salq	%cl, %rsi
	xorl	%r9d, %r9d
	testb	$64, %cl
	cmovne	%rsi, %rdi
	cmovne	%r9, %rsi
	orq	%rdi, %rdx
	orq	%rsi, %rax
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Ouch: while the optimiser recognises the common and well-known idiom for rotation of 32-bit and 64-bit integers, it fails rather badly at 128-bit integers and generates awful code!

Properly written code uses only 9 instructions in just 28 bytes instead of 25 instructions in 77 bytes for each of the __rotlti3() and __rotrti3() functions:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"rolti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = value
				# rdx = count
__rotlti3:
	mov	ecx, edx
	mov	rax, rdi
	shld	rax, rsi, cl
	shld	rsi, rdi, cl
	mov	rdx, rsi
	test	cl, 64
	cmovnz	rdx, rax
	cmovnz	rax, rsi
	ret

.size	__rotlti3, .-__rotlti3
.type	__rotlti3, @function
.global	__rotlti3

__rotrti3:
	mov	ecx, edx
	mov	rax, rdi
	shrd	rax, rsi, cl
	shrd	rsi, rdi, cl
	mov	rdx, rsi
	test	cl, 64
	cmovnz	rdx, rax
	cmovnz	rax, rsi
	ret

.size	__rotrti3, .-__rotrti3
.type	__rotrti3, @function
.global	__rotrti3
.end
Note: exploration of the equally bad code generated for rotation of 64-bit integers on the i386 platform is left as an exercise to the reader.

Case 6: __cmp{s,d,t}i2() Functions

libgcc provides the __cmpsi2(), __cmpdi2() and __cmpti2() functions for signed integer comparison:
Runtime Function: int __cmpsi2 (int a, int b)
Runtime Function: int __cmpdi2 (long a, long b)
Runtime Function: int __cmpti2 (long long a, long long b)

These functions perform a signed comparison of a and b. If a is less than b, they return 0; if a is greater than b, they return 2; and if a and b are equal they return 1.
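
In C, the specified result is simply (a > b) - (a < b) + 1; the following hypothetical reference (not the libgcc source) states it directly – whether GCC turns it into branch-free code is a separate question, and the branch-free replacement further below computes exactly this value:

// Returns 0 if a < b, 1 if a == b, 2 if a > b.
int cmp128(__int128_t a, __int128_t b) {
    return (a > b) - (a < b) + 1;
}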

For the AMD64 target platform, GCC compiles __cmpti2() to the following unoptimised and awkward code using (superfluous) conditional branches which impair performance:
[…]
0000000000000000 <__cmpti2>:
   0:	31 c0                	xor    %eax,%eax
   2:	48 39 ce             	cmp    %rcx,%rsi
   5:	7c 1c                	jl     23 <__cmpti2+0x23>
   7:	b8 02 00 00 00       	mov    $0x2,%eax
   c:	7f 15                	jg     23 <__cmpti2+0x23>
   e:	31 c0                	xor    %eax,%eax
  10:	48 39 d7             	cmp    %rdx,%rdi
  13:	72 0e                	jb     23 <__cmpti2+0x23>
  15:	b8 01 00 00 00       	mov    $0x1,%eax
  1a:	ba 02 00 00 00       	mov    $0x2,%edx
  1f:	48 0f 47 c2          	cmova  %rdx,%rax
  23:	c3                   	retq   
[…]
Properly written branch-free code uses 9 instructions in 21 bytes instead of 12 instructions in 36 bytes:
# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"cmpti2.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = left argument
				# rcx:rdx = right argument
__cmpti2:
	sub	rdi, rdx
	sbb	rsi, rcx	# rsi:rdi = left argument - right argument
	setg	ah		# ah = (left argument > right argument) ? 1 : 0
	setl	al		# al = (left argument < right argument) ? 1 : 0
	stc
	sbb	al, ah		# al = (left argument < right argument)
				#    - (left argument > right argument)
				#    - 1
				#    = {0, -1, -2}
	movsx	eax, al
	neg	eax		# rax = (left argument > right argument)
				#     - (left argument < right argument)
				#     + 1
				#     = {0, 1, 2}
	ret

.size	__cmpti2, .-__cmpti2
.type	__cmpti2, @function
.global	__cmpti2
.end
Note: exploration of the equally bad code generated for the __cmpdi2() and __cmpsi2() functions is left as an exercise to the reader.

Case 7: __ucmp{s,d,t}i2() Functions

libgcc provides the __ucmpsi2(), __ucmpdi2() and __ucmpti2() functions for unsigned integer comparison:
Runtime Function: int __ucmpsi2 (unsigned int a, unsigned int b)
Runtime Function: int __ucmpdi2 (unsigned long a, unsigned long b)
Runtime Function: int __ucmpti2 (unsigned long long a, unsigned long long b)

These functions perform an unsigned comparison of a and b. If a is less than b, they return 0; if a is greater than b, they return 2; and if a and b are equal they return 1.

For the AMD64 target platform, GCC compiles __ucmpti2() to the following unoptimised and awkward code using (superfluous) conditional branches which impair performance:
[…]
0000000000000000 <__ucmpti2>:
   0:	31 c0                	xor    %eax,%eax
   2:	48 39 ce             	cmp    %rcx,%rsi
   5:	72 1c                	jb     23 <__ucmpti2+0x23>
   7:	b8 02 00 00 00       	mov    $0x2,%eax
   c:	77 15                	ja     23 <__ucmpti2+0x23>
   e:	31 c0                	xor    %eax,%eax
  10:	48 39 d7             	cmp    %rdx,%rdi
  13:	72 0e                	jb     23 <__ucmpti2+0x23>
  15:	b8 01 00 00 00       	mov    $0x1,%eax
  1a:	ba 02 00 00 00       	mov    $0x2,%edx
  1f:	48 0f 47 c2          	cmova  %rdx,%rax
  23:	c3                   	retq   
[…]
Properly written branch-free code uses only 6 instructions in just 15 bytes instead of 12 instructions in 36 bytes:
# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"ucmpti2.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = left argument
				# rcx:rdx = right argument
__ucmpti2:
	xor	eax, eax	# rax = 0
	sub	rdi, rdx
	sbb	rsi, rcx	# rsi:rdi = left argument - right argument
	seta	al		# rax = (left argument > right argument)
	sbb	eax, -1		# rax = (left argument > right argument)
				#     - (left argument < right argument)
				#     + 1
				#     = {0, 1, 2}
	ret

.size	__ucmpti2, .-__ucmpti2
.type	__ucmpti2, @function
.global	__ucmpti2
.end
Note: exploration of the equally bad code generated for the __ucmpdi2() and __ucmpsi2() functions is left as an exercise to the reader.

Case 8: __mul{s,d,t}i3() Functions

libgcc provides the __mulsi3(), __muldi3() and __multi3() functions for integer multiplication:
Runtime Function: int __mulsi3 (int a, int b)
Runtime Function: long __muldi3 (long a, long b)
Runtime Function: long long __multi3 (long long a, long long b)

These functions return the product of a and b.
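
The product modulo 2**128 needs only three 64-bit multiplications, since the high×high partial product contributes solely to bits 128 and above; a C sketch of this decomposition (hypothetical helper, mirroring the hand-written __multi3() further below):

typedef unsigned long long u64;

// p_hi:p_lo = (a_hi:a_lo * b_hi:b_lo) mod 2**128
void mul128(u64 a_lo, u64 a_hi, u64 b_lo, u64 b_hi, u64 *p_lo, u64 *p_hi) {
    unsigned __int128 p = (unsigned __int128) a_lo * b_lo;   // full low partial product
    *p_lo = (u64) p;
    *p_hi = (u64) (p >> 64) + a_hi * b_lo + a_lo * b_hi;     // cross terms wrap mod 2**64
}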

For the AMD64 target platform, GCC compiles __multi3() to the following code, which uses a LEA instruction instead of the appropriate (shorter) ADD instruction to compute the high part of the product, and two superfluous MOV instructions:
[…]
0000000000000000 <__multi3>:
   0:	49 89 f0             	mov    %rsi,%r8
   3:	48 89 f8             	mov    %rdi,%rax
   6:	48 89 d6             	mov    %rdx,%rsi
   9:	48 0f af f9          	imul   %rcx,%rdi
   d:	49 0f af f0          	imul   %r8,%rsi
  11:	48 f7 e2             	mul    %rdx
  14:	48 01 d7             	add    %rdx,%rdi
  17:	48 8d 14 37          	lea    (%rdi,%rsi,1),%rdx
  1b:	c3                   	retq   
[…]
Properly written code uses 7 instructions in 21 bytes instead of 9 instructions in 28 bytes:
# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"multi3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = multiplicand
				# rcx:rdx = multiplier
__multi3:
	mov	rax, rdi	# rax = low qword of multiplicand
	imul	rsi, rdx	# rsi = high qword of multiplicand
				#     * low qword of multiplier
	imul	rdi, rcx	# rdi = low qword of multiplicand
				#     * high qword of multiplier
	mul	rdx		# rdx:rax = low qword of multiplicand
				#         * low qword of multiplier
	add	rdi, rsi
	add	rdx, rdi	# rdx:rax = product % 2**128
	ret

.size	__multi3, .-__multi3
.type	__multi3, @function
.global	__multi3
.end
Note: exploration of the equally bad code generated for the __muldi3() function on the i386 platform is left as an exercise to the reader.

Case 9: __neg{s,d,t}i2() Functions

libgcc provides the __negsi2(), __negdi2() and __negti2() functions for signed integer negation:
Runtime Function: int __negsi2 (int a)
Runtime Function: long __negdi2 (long a)
Runtime Function: long long __negti2 (long long a)

These functions return the negation of a.

For the AMD64 target platform, GCC compiles __negti2() to the following unoptimised and awkward code:
[…]
0000000000000000 <__negti2>:
   0:	48 f7 de             	neg    %rsi
   3:	48 89 f8             	mov    %rdi,%rax
   6:	48 f7 d8             	neg    %rax
   9:	48 89 f2             	mov    %rsi,%rdx
   c:	48 83 ff 01          	cmp    $0x1,%rdi
  10:	48 83 d2 ff          	adc    $0xffffffffffffffff,%rdx
  14:	c3                   	retq   
[…]
Properly written code uses only 5 instructions in just 10 bytes instead of 7 instructions in 21 bytes:
# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"negti2.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = argument
__negti2:
	xor	eax, eax
	cdq			# rdx:rax = 0
	sub	rax, rdi
	sbb	rdx, rsi	# rdx:rax = -argument
	ret

.size	__negti2, .-__negti2
.type	__negti2, @function
.global	__negti2
.end
Note: exploration of the equally bad code generated for the __negdi2() function on the i386 platform is left as an exercise to the reader.

Case 10: __absv{s,d,t}i2() Functions

libgcc provides the __absvsi2(), __absvdi2() and __absvti2() functions for overflow-trapping signed integer absolute value:
Runtime Function: int __absvsi2 (int a)
Runtime Function: long __absvdi2 (long a)
Runtime Function: long long __absvti2 (long long a)

These functions return the absolute value of a.
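
The sign-mask trick used in the hand-written code further below can be sketched in C with GCC builtins (an assumption-laden sketch, not the libgcc source; GCC guarantees arithmetic right shift of signed values, and adding the mask overflows exactly for the most negative argument):

// mask = -1 for negative arguments, 0 otherwise; (a + mask) ^ mask = |a|.
__int128_t absv128(__int128_t a) {
    __int128_t mask = a >> 127, t;
    if (__builtin_add_overflow(a, mask, &t))
        __builtin_trap();               // |-2**127| is not representable
    return t ^ mask;
}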

For the AMD64 target platform, GCC compiles __absvti2() to the following unoptimised and awkward code:
[…]
0000000000000000 <__absvti2>:
   0:	48 89 f8             	mov    %rdi,%rax
   3:	48 89 f2             	mov    %rsi,%rdx
   6:	48 85 f6             	test   %rsi,%rsi
   9:	78 05                	js     10 <__absvti2+0x10>
   b:	c3                   	retq   
   c:	0f 1f 40 00          	nopl   0x0(%rax)
  10:	48 83 ec 08          	sub    $0x8,%rsp
  14:	48 f7 d8             	neg    %rax
  17:	48 83 d2 00          	adc    $0x0,%rdx
  1b:	48 f7 da             	neg    %rdx
  1e:	48 85 d2             	test   %rdx,%rdx
  21:	0f 88 00 00 00 00    	js     …
  27:	48 83 c4 08          	add    $0x8,%rsp
  2b:	c3                   	retq   
[…]
Oops: the TEST instruction at address 1e is superfluous.

Properly written (and almost branch-free) code uses only 11 instructions in just 26 bytes instead of 14 instructions in 44 bytes:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"absvti2.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = argument
__absvti2:
	mov	rax, rsi
	cqo			# rdx = (argument < 0) ? -1 : 0
	add	rdi, rdx
	adc	rsi, rdx	# rsi:rdi = (argument < 0) ? argument - 1 : argument
	jo	.overflow

	mov	rax, rdx
	xor	rax, rdi
	xor	rdx, rsi	# rdx:rax = (argument < 0) ? -argument : argument
				#         = |argument|
	ret
.overflow:
	int	4
	ret

.size	__absvti2, .-__absvti2
.type	__absvti2, @function
.global	__absvti2
.end
Note: exploration of the equally bad code generated for the __absvdi2() and __absvsi2() functions is left as an exercise to the reader.

Case 11: __addv{s,d,t}i3() Functions

libgcc provides the __addvsi3(), __addvdi3() and __addvti3() functions for overflow-trapping signed integer addition:
Runtime Function: int __addvsi3 (int a, int b)
Runtime Function: long __addvdi3 (long a, long b)
Runtime Function: long long __addvti3 (long long a, long long b)

These functions return the sum of a and b; that is a + b.
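
The whole job amounts to one 128-bit addition plus one overflow check; a C sketch using GCC's overflow builtin (hypothetical, not the libgcc source – it traps via UD2, whereas the hand-written replacement further below raises interrupt 4):

__int128_t addv128(__int128_t a, __int128_t b) {
    __int128_t sum;
    if (__builtin_add_overflow(a, b, &sum))   // signed overflow check
        __builtin_trap();
    return sum;
}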

For the AMD64 target platform, GCC compiles __addvti3() to the following unoptimised and awful code:
[…]
0000000000000000 <__addvti3>:
   0:	49 89 f9             	mov    %rdi,%r9
   3:	49 89 d0             	mov    %rdx,%r8
   6:	48 83 ec 08          	sub    $0x8,%rsp
   a:	48 89 f2             	mov    %rsi,%rdx
   d:	4c 89 c8             	mov    %r9,%rax
  10:	48 89 f7             	mov    %rsi,%rdi
  13:	4c 01 c0             	add    %r8,%rax
  16:	48 11 ca             	adc    %rcx,%rdx
  19:	48 85 c9             	test   %rcx,%rcx
  1c:	78 12                	js     30 <__addvti3+0x30>
  1e:	4c 39 c8             	cmp    %r9,%rax
  21:	48 89 d1             	mov    %rdx,%rcx
  24:	48 19 f1             	sbb    %rsi,%rcx
  27:	7c 18                	jl     41 <__addvti3+0x41>
  29:	48 83 c4 08          	add    $0x8,%rsp
  2d:	c3                   	retq   
  2e:	66 90                	xchg   %ax,%ax
  30:	49 39 c1             	cmp    %rax,%r9
  33:	48 19 d7             	sbb    %rdx,%rdi
  36:	0f 8c 00 00 00 00    	jl     …
  3c:	48 83 c4 08          	add    $0x8,%rsp
  40:	c3                   	retq   
  41:	e9 00 00 00 00       	jmpq   …
[…]
Properly written (and almost branch-free) code uses only 8 instructions in just 18 bytes instead of 23 instructions in 70 bytes:
# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"addvti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = augend
				# rcx:rdx = addend
__addvti3:
	add	rdi, rdx
	adc	rsi, rcx	# rsi:rdi = augend + addend
				#         = sum
	jo	.overflow

	mov	rax, rdi
	mov	rdx, rsi	# rdx:rax = sum
	ret
.overflow:
	int	4
	ret

.size	__addvti3, .-__addvti3
.type	__addvti3, @function
.global	__addvti3
.end
Note: exploration of the equally bad code generated for the __addvdi3() and __addvsi3() functions is left as an exercise to the reader.

Case 12: __subv{s,d,t}i3() Functions

libgcc provides the __subvsi3(), __subvdi3() and __subvti3() functions for overflow-trapping signed integer subtraction:
Runtime Function: int __subvsi3 (int a, int b)
Runtime Function: long __subvdi3 (long a, long b)
Runtime Function: long long __subvti3 (long long a, long long b)

These functions return the difference between a and b; that is a - b.

For the AMD64 target platform, GCC compiles __subvti3() to the following unoptimised and awful code:
[…]
0000000000000000 <__subvti3>:
   0:	49 89 f9             	mov    %rdi,%r9
   3:	49 89 d0             	mov    %rdx,%r8
   6:	48 83 ec 08          	sub    $0x8,%rsp
   a:	48 89 f2             	mov    %rsi,%rdx
   d:	4c 89 c8             	mov    %r9,%rax
  10:	4c 29 c0             	sub    %r8,%rax
  13:	48 19 ca             	sbb    %rcx,%rdx
  16:	48 85 c9             	test   %rcx,%rcx
  19:	78 15                	js     30 <__subvti3+0x30>
  1b:	49 39 c1             	cmp    %rax,%r9
  1e:	48 89 f1             	mov    %rsi,%rcx
  21:	48 19 d1             	sbb    %rdx,%rcx
  24:	7c 1e                	jl     44 <__subvti3+0x44>
  26:	48 83 c4 08          	add    $0x8,%rsp
  2a:	c3                   	retq   
  2b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  30:	4c 39 c8             	cmp    %r9,%rax
  33:	48 89 d1             	mov    %rdx,%rcx
  36:	48 19 f1             	sbb    %rsi,%rcx
  39:	0f 8c 00 00 00 00    	jl     …
  3f:	48 83 c4 08          	add    $0x8,%rsp
  43:	c3                   	retq   
  44:	e9 00 00 00 00       	jmpq   …
[…]
Properly written (and almost branch-free) code uses only 8 instructions in just 18 bytes instead of 23 instructions in 73 bytes:
# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"subvti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = minuend
				# rcx:rdx = subtrahend
__subvti3:
	sub	rdi, rdx
	sbb	rsi, rcx	# rsi:rdi = minuend - subtrahend
				#         = difference
	jo	.overflow

	mov	rax, rdi
	mov	rdx, rsi	# rdx:rax = difference
	ret
.overflow:
	int	4
	ret

.size	__subvti3, .-__subvti3
.type	__subvti3, @function
.global	__subvti3
.end
Note: exploration of the equally bad code generated for the __subvdi3() and __subvsi3() functions is left as an exercise to the reader.

Case 13: __mulv{s,d,t}i3() Functions

libgcc provides the __mulvsi3(), __mulvdi3() and __mulvti3() functions for overflow-trapping signed integer multiplication:
Runtime Function: int __mulvsi3 (int a, int b)
Runtime Function: long __mulvdi3 (long a, long b)
Runtime Function: long long __mulvti3 (long long a, long long b)

These functions return the product of a and b; that is a * b.

For the AMD64 target platform, GCC compiles __mulvti3() to the following horrible code:
[…]
0000000000000000 <__mulvti3>:
   0:	41 55                	push   %r13
   2:	49 89 cb             	mov    %rcx,%r11
   5:	48 89 d0             	mov    %rdx,%rax
   8:	49 89 d2             	mov    %rdx,%r10
   b:	41 54                	push   %r12
   d:	49 89 fc             	mov    %rdi,%r12
  10:	48 89 d1             	mov    %rdx,%rcx
  13:	49 89 f0             	mov    %rsi,%r8
  16:	4c 89 e2             	mov    %r12,%rdx
  19:	49 89 f5             	mov    %rsi,%r13
  1c:	53                   	push   %rbx
  1d:	48 89 fe             	mov    %rdi,%rsi
  20:	48 c1 fa 3f          	sar    $0x3f,%rdx
  24:	48 c1 f8 3f          	sar    $0x3f,%rax
  28:	4c 89 df             	mov    %r11,%rdi
  2b:	4c 39 c2             	cmp    %r8,%rdx
  2e:	75 18                	jne    48 <__mulvti3+0x48>
  30:	4c 39 d8             	cmp    %r11,%rax
  33:	75 6b                	jne    a0 <__mulvti3+0xa0>
  35:	4c 89 e0             	mov    %r12,%rax
  38:	49 f7 ea             	imul   %r10
  3b:	5b                   	pop    %rbx
  3c:	41 5c                	pop    %r12
  3e:	41 5d                	pop    %r13
  40:	c3                   	retq   
  41:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  48:	4c 39 d8             	cmp    %r11,%rax
  4b:	0f 85 a7 00 00 00    	jne    f8 <__mulvti3+0xf8>
  51:	4c 89 e0             	mov    %r12,%rax
  54:	49 f7 e2             	mul    %r10
  57:	49 89 c1             	mov    %rax,%r9
  5a:	4c 89 c0             	mov    %r8,%rax
  5d:	48 89 d7             	mov    %rdx,%rdi
  60:	49 f7 e2             	mul    %r10
  63:	4d 85 c0             	test   %r8,%r8
  66:	79 09                	jns    71 <__mulvti3+0x71>
  68:	49 89 d0             	mov    %rdx,%r8
  6b:	49 29 c8             	sub    %rcx,%r8
  6e:	4c 89 c2             	mov    %r8,%rdx
  71:	48 85 c9             	test   %rcx,%rcx
  74:	79 06                	jns    7c <__mulvti3+0x7c>
  76:	4c 29 e0             	sub    %r12,%rax
  79:	4c 19 ea             	sbb    %r13,%rdx
  7c:	49 89 fa             	mov    %rdi,%r10
  7f:	45 31 db             	xor    %r11d,%r11d
  82:	49 01 c2             	add    %rax,%r10
  85:	4c 89 d0             	mov    %r10,%rax
  88:	49 11 d3             	adc    %rdx,%r11
  8b:	48 c1 f8 3f          	sar    $0x3f,%rax
  8f:	4c 39 d8             	cmp    %r11,%rax
  92:	0f 85 00 00 00 00    	jne    …
  98:	4c 89 c8             	mov    %r9,%rax
  9b:	4c 89 d2             	mov    %r10,%rdx
  9e:	eb 9b                	jmp    3b <__mulvti3+0x3b>
  a0:	4c 89 d0             	mov    %r10,%rax
  a3:	49 f7 e4             	mul    %r12
  a6:	49 89 c0             	mov    %rax,%r8
  a9:	4c 89 d8             	mov    %r11,%rax
  ac:	48 89 d3             	mov    %rdx,%rbx
  af:	49 f7 e4             	mul    %r12
  b2:	4d 85 db             	test   %r11,%r11
  b5:	79 09                	jns    c0 <__mulvti3+0xc0>
  b7:	48 89 d7             	mov    %rdx,%rdi
  ba:	4c 29 e7             	sub    %r12,%rdi
  bd:	48 89 fa             	mov    %rdi,%rdx
  c0:	48 85 f6             	test   %rsi,%rsi
  c3:	79 06                	jns    cb <__mulvti3+0xcb>
  c5:	4c 29 d0             	sub    %r10,%rax
  c8:	4c 19 da             	sbb    %r11,%rdx
  cb:	48 89 de             	mov    %rbx,%rsi
  ce:	31 ff                	xor    %edi,%edi
  d0:	48 01 c6             	add    %rax,%rsi
  d3:	48 89 f0             	mov    %rsi,%rax
  d6:	48 11 d7             	adc    %rdx,%rdi
  d9:	48 c1 f8 3f          	sar    $0x3f,%rax
  dd:	48 39 f8             	cmp    %rdi,%rax
  e0:	0f 85 00 00 00 00    	jne    …
  e6:	4c 89 c0             	mov    %r8,%rax
  e9:	48 89 f2             	mov    %rsi,%rdx
  ec:	e9 4a ff ff ff       	jmpq   3b <__mulvti3+0x3b>
  f1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  f8:	4d 85 c0             	test   %r8,%r8
  fb:	78 23                	js     120 <__mulvti3+0x120>
  fd:	4d 85 db             	test   %r11,%r11
 100:	78 4d                	js     14f <__mulvti3+0x14f>
 102:	4c 09 c7             	or     %r8,%rdi
 105:	0f 85 00 00 00 00    	jne    …
 10b:	4c 89 e0             	mov    %r12,%rax
 10e:	49 f7 e2             	mul    %r10
 111:	48 85 d2             	test   %rdx,%rdx
 114:	0f 89 21 ff ff ff    	jns    3b <__mulvti3+0x3b>
 11a:	e9 00 00 00 00       	jmpq   …
 11f:	90                   	nop
 120:	4d 85 db             	test   %r11,%r11
 123:	78 57                	js     17c <__mulvti3+0x17c>
 125:	0f 85 00 00 00 00    	jne    …
 12b:	49 83 f8 ff          	cmp    $0xffffffffffffffff,%r8
 12f:	0f 85 00 00 00 00    	jne    …
 135:	4c 89 e0             	mov    %r12,%rax
 138:	49 f7 e2             	mul    %r10
 13b:	49 89 d0             	mov    %rdx,%r8
 13e:	4d 29 d0             	sub    %r10,%r8
 141:	0f 89 00 00 00 00    	jns    …
 147:	4c 89 c2             	mov    %r8,%rdx
 14a:	e9 ec fe ff ff       	jmpq   3b <__mulvti3+0x3b>
 14f:	49 83 fb ff          	cmp    $0xffffffffffffffff,%r11
 153:	0f 85 00 00 00 00    	jne    …
 159:	4d 85 c0             	test   %r8,%r8
 15c:	0f 85 00 00 00 00    	jne    …
 162:	4c 89 d0             	mov    %r10,%rax
 165:	49 f7 e4             	mul    %r12
 168:	48 89 d7             	mov    %rdx,%rdi
 16b:	4c 29 e7             	sub    %r12,%rdi
 16e:	0f 89 00 00 00 00    	jns    …
 174:	48 89 fa             	mov    %rdi,%rdx
 177:	e9 bf fe ff ff       	jmpq   3b <__mulvti3+0x3b>
 17c:	4c 21 c7             	and    %r8,%rdi
 17f:	48 83 ff ff          	cmp    $0xffffffffffffffff,%rdi
 183:	0f 85 00 00 00 00    	jne    …
 189:	4c 89 d0             	mov    %r10,%rax
 18c:	4c 09 e0             	or     %r12,%rax
 18f:	0f 84 00 00 00 00    	je     …
 195:	4c 89 e0             	mov    %r12,%rax
 198:	49 f7 e2             	mul    %r10
 19b:	4c 29 e2             	sub    %r12,%rdx
 19e:	49 89 c0             	mov    %rax,%r8
 1a1:	48 89 d6             	mov    %rdx,%rsi
 1a4:	4c 29 d6             	sub    %r10,%rsi
 1a7:	0f 88 00 00 00 00    	js     …
 1ad:	e9 34 ff ff ff       	jmpq   e6 <__mulvti3+0xe6>
[…]
Properly written (and almost branch-free) code uses only 45 instructions in just 119 bytes instead of 130 instructions in 434 bytes:
# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"mulvti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = multiplicand
				# rcx:rdx = multiplier
__mulvti3:
	mov	r10, rdx	# r10 = low qword of multiplier
	mov	r11, rcx	# r11 = high qword of multiplier

	mov	rax, rcx	# rax = high qword of multiplier
	cqo			# rdx = (multiplier < 0) ? -1 : 0
	mov	r9, rdx		# r9 = (multiplier < 0) ? -1 : 0
	xor	r10, rdx
	xor	r11, rdx	# r11:r10 = (multiplier < 0) ? ~multiplier : multiplier
	sub	r10, rdx
	sbb	r11, rdx	# r11:r10 = (multiplier < 0) ? -multiplier : multiplier
				#         = |multiplier|
	mov	rax, rsi	# rax = high qword of multiplicand
	cqo			# rdx = (multiplicand < 0) ? -1 : 0
	xor	r9, rdx		# r9 = (multiplicand < 0) <> (multiplier < 0) ? -1 : 0
				#    = (product < 0) ? -1 : 0
	xor	rdi, rdx
	xor	rsi, rdx	# rsi:rdi = (multiplicand < 0) ? ~multiplicand : multiplicand
	sub	rdi, rdx
	sbb	rsi, rdx	# rsi:rdi = (multiplicand < 0) ? -multiplicand : multiplicand
				#         = |multiplicand|
	xor	ecx, ecx
	cmp	rcx, rsi
	sbb	edx, edx	# edx = (high qword of |multiplicand| = 0) ? 0 : -1
				#     = (|multiplicand| < 2**64) ? 0 : -1
	cmp	rcx, r11
	sbb	ecx, ecx	# ecx = (high qword of |multiplier| = 0) ? 0 : -1
				#     = (|multiplier| < 2**64) ? 0 : -1
	and	ecx, edx	# ecx = (|multiplicand| < 2**64)
				#     & (|multiplier| < 2**64) ? 0 : -1
				#     = (|product| < 2**128) ? 0 : -1
	mov	rax, rsi
	mul	r10		# rdx:rax = high qword of |multiplicand|
				#         * low qword of |multiplier|
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	mov	rsi, rax
	mov	rax, rdi
	mul	r11		# rdx:rax = low qword of |multiplicand|
				#         * high qword of |multiplier|
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	add	rsi, rax	# rsi = high qword of |multiplicand|
				#     * low qword of |multiplier|
				#     + low qword of |multiplicand|
				#     * high qword of |multiplier|
#	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *

	mov	rax, rdi
	mul	r10		# rdx:rax = low qword of |multiplicand|
				#         * low qword of |multiplier|
	add	rdx, rsi	# rdx:rax = |product % 2**128|
				#         = |product| % 2**128
	adc	ecx, ecx	# ecx = (|product| < 2**128) ? 0 : *
.if 0
	xor	rax, r9
	xor	rdx, r9		# rdx:rax = (product < 0)
				#         ? product % 2**128 - 1 : product % 2**128
	sub	rax, r9
	sbb	rdx, r9		# rdx:rax = product % 2**128

	xor	r9, rdx		# r9 = (product % 2**128 < -2**127)
				#    | (product % 2**128 >= 2**127)
				#    ? negative : positive
	add	r9, r9
.else
	add	rax, r9
	adc	rdx, r9		# rdx:rax = (product < 0)
				#         ? ~product % 2**128 : product % 2**128
	mov	rsi, rdx	# rsi = (product % 2**128 < -2**127)
				#     | (product % 2**128 >= 2**127)
				#     ? negative : positive
	xor	rax, r9
	xor	rdx, r9		# rdx:rax = product % 2**128

	add	rsi, rsi
.endif
	adc	ecx, ecx	# ecx = (-2**127 <= product < 2**127) ? 0 : *
	jnz	.overflow	# product < -2**127?
				# product >= 2**127?
	ret
.overflow:
	int	4
	ret

.size	__mulvti3, .-__mulvti3
.type	__mulvti3, @function
.global	__mulvti3
.end
Note: exploration of the equally horrible code generated for the __mulvdi3() and __mulvsi3() functions is left as an exercise to the reader.

Case 14: __negv{s,d,t}i2() Functions

libgcc provides the __negvsi2(), __negvdi2() and __negvti2() functions for overflow-trapping signed integer negation:
Runtime Function: int __negvsi2 (int a)
Runtime Function: long __negvdi2 (long a)
Runtime Function: long long __negvti2 (long long a)

These functions return the negation of a; that is -a.
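
A C sketch of overflow-trapping negation (hypothetical, not the libgcc source): only the most negative value has no representable negation:

__int128_t negv128(__int128_t a) {
    __int128_t r;
    if (__builtin_sub_overflow((__int128_t) 0, a, &r))   // overflows only for -2**127
        __builtin_trap();
    return r;
}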

For the AMD64 target platform, GCC compiles __negvti2() to the following unoptimised and awful code:
[…]
0000000000000000 <__negvti2>:
   0:	48 89 f8             	mov    %rdi,%rax
   3:	48 83 ec 08          	sub    $0x8,%rsp
   7:	48 89 f2             	mov    %rsi,%rdx
   a:	48 f7 d8             	neg    %rax
   d:	48 83 d2 00          	adc    $0x0,%rdx
  11:	48 f7 da             	neg    %rdx
  14:	48 85 f6             	test   %rsi,%rsi
  17:	78 17                	js     30 <__negvti2+0x30>
  19:	31 c9                	xor    %ecx,%ecx
  1b:	48 39 c1             	cmp    %rax,%rcx
  1e:	48 19 d1             	sbb    %rdx,%rcx
  21:	7c 1b                	jl     3e <__negvti2+0x3e>
  23:	48 83 c4 08          	add    $0x8,%rsp
  27:	c3                   	retq   
  28:	0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
  2f:	00 
  30:	48 85 d2             	test   %rdx,%rdx
  33:	0f 88 00 00 00 00    	js     …
  39:	48 83 c4 08          	add    $0x8,%rsp
  3d:	c3                   	retq   
  3e:	e9 00 00 00 00       	jmpq   …
[…]
Properly written code uses only 8 instructions in just 15 bytes instead of 20 instructions in 67 bytes:
# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"negvti2.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = argument
__negvti2:
	xor	eax, eax
	cdq			# rdx:rax = 0
	sub	rax, rdi
	sbb	rdx, rsi	# rdx:rax = -argument
	jo	.overflow

	ret
.overflow:
	int	4
	ret

.size	__negvti2, .-__negvti2
.type	__negvti2, @function
.global	__negvti2
.end
Note: exploration of the equally bad code generated for the __negvdi2() and __negvsi2() functions is left as an exercise to the reader.

Case 15: __parity{s,d,t}i2() Functions

libgcc provides the __paritysi2(), __paritydi2() and __parityti2() functions:
Runtime Function: int __paritysi2 (unsigned int a)
Runtime Function: int __paritydi2 (unsigned long a)
Runtime Function: int __parityti2 (unsigned long long a)

These functions return the value zero if the number of bits set in a is even, and the value one otherwise.
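
The underlying idea is to XOR-fold the argument down to a single bit, since each fold preserves parity; a portable C sketch (hypothetical helper taking the two 64-bit halves – the hand-written code further below stops folding at 8 bits and lets the processor's parity flag finish the job):

int parity128(unsigned long long lo, unsigned long long hi) {
    unsigned long long x = lo ^ hi;   // parity of the 128-bit argument
    x ^= x >> 32;
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return (int) (x & 1);
}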

For the AMD64 target platform, GCC compiles __parityti2() to the following unoptimised and awful code:
[…]
0000000000000000 <__parityti2>:
   0:	48 31 f7             	xor    %rsi,%rdi
   3:	b8 96 69 00 00       	mov    $0x6996,%eax
   8:	48 89 fe             	mov    %rdi,%rsi
   b:	48 89 f9             	mov    %rdi,%rcx
   e:	48 c1 ee 20          	shr    $0x20,%rsi
  12:	48 31 f1             	xor    %rsi,%rcx
  15:	48 89 ce             	mov    %rcx,%rsi
  18:	48 c1 ee 10          	shr    $0x10,%rsi
  1c:	48 31 ce             	xor    %rcx,%rsi
  1f:	48 89 f1             	mov    %rsi,%rcx
  22:	48 c1 e9 08          	shr    $0x8,%rcx
  26:	48 31 ce             	xor    %rcx,%rsi
  29:	48 89 f1             	mov    %rsi,%rcx
  2c:	48 c1 e9 04          	shr    $0x4,%rcx
  30:	48 31 f1             	xor    %rsi,%rcx
  33:	83 e1 0f             	and    $0xf,%ecx
  36:	d3 f8                	sar    %cl,%eax
  38:	83 e0 01             	and    $0x1,%eax
  3b:	c3                   	retq   
[…]
Properly written code uses only 9 instructions in just 24 bytes instead of 19 instructions in 60 bytes:
# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"parityti2.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# rsi:rdi = argument
__parityti2:
	xor	rdi, rsi
	shld	rdx, rdi, 32
	xor	edx, edi
	shld	edx, edi, 16
	xor	edx, edi
	xor	eax, eax
	xor	dl, dh
	setpo	al
	ret

.size	__parityti2, .-__parityti2
.type	__parityti2, @function
.global	__parityti2
.end
Note: exploration of the equally bad code generated for the __paritydi2() and __paritysi2() functions is left as an exercise to the reader.

Case 16: __builtin_parity() Builtin

Create the text file case16.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __paritysi2(unsigned int value) {
    return __builtin_parity(value);
}

int __paritydi2(unsigned long long value) {
    return __builtin_parityll(value);
}
Compile the source file case16.c with GCC, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case16.c
[…]
__paritysi2:
	movl	4(%esp), %eax
	movl	%eax, %edx
	shrl	$16, %eax
	xorl	%eax, %edx
	xorl	%eax, %eax
	xorb	%dh, %dl
	setnp	%al
	ret
[…]
__paritydi2:
	pushl	%ebx
	movl	8(%esp), %eax
	movl	12(%esp), %edx
	movl	%edx, %ebx
	xorl	%eax, %ebx
	movl	%ebx, %edx
	shrl	$16, %ebx
	xorl	%ebx, %edx
	xorl	%eax, %eax
	xorb	%dh, %dl
	setnp	%al
	popl	%ebx
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Oops: the code for the __paritydi2() function clobbers the call-preserved register EBX instead of the volatile register ECX, thus wasting a PUSH and a POP instruction which perform 2 superfluous memory accesses.

Case 17: __builtin_mul_overflow() Builtin

Create the text file case17.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef OPTIMIZE
__int128_t __mulvti3(__int128_t multiplicand, __int128_t multiplier) {
    __int128_t product;

    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();

    return product;
}

__uint128_t __umulvti3(__uint128_t multiplicand, __uint128_t multiplier) {
    __uint128_t product;

    if (__builtin_mul_overflow(multiplicand, multiplier, &product))
        __builtin_trap();

    return product;
}
#else
__int128_t __mulvti3(__int128_t multiplicand, __int128_t multiplier) {
    __uint128_t product, sign = 0 - (multiplicand < 0), tmp = 0 - (multiplier < 0);

    if (__builtin_mul_overflow((multiplicand ^ sign) - sign,
                               (multiplier ^ tmp) - tmp,
                               &product))
        __builtin_trap();

    sign ^= tmp;
    product = (product ^ sign) - sign;

    if ((__int128_t) (product ^ sign) < 0)
        __builtin_trap();

    return product;
}

__uint128_t __umulvti3(__uint128_t multiplicand, __uint128_t multiplier) {
    union {
        __uint128_t all;
        struct {
            unsigned long long low, high;
        };
    } product;
    unsigned long long tmp;

    product.all = (multiplicand & ~0ULL) * (multiplier & ~0ULL);

    if ((((multiplicand >> 64) != 0) && ((multiplier >> 64) != 0))
     || __builtin_umulll_overflow(multiplicand & ~0ULL, multiplier >> 64, &tmp)
     || __builtin_uaddll_overflow(tmp, product.high, &product.high)
     || __builtin_umulll_overflow(multiplicand >> 64, multiplier & ~0ULL, &tmp)
     || __builtin_uaddll_overflow(tmp, product.high, &product.high))
        __builtin_trap();

    return product.all;
}
#endif // OPTIMIZE
Compile the source file case17.c with GCC, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -o- -Os -S case17.c
[…]
__mulvti3:
	pushq	%rbp
	movq	%rsi, %r10
	movq	%rdx, %rsi
	movq	%rdi, %rdx
	movq	%rsi, %rax
	sarq	$63, %rdx
	movq	%rcx, %r11
	sarq	$63, %rax
	movq	%rsp, %rbp
	pushq	%r15
	pushq	%r14
	xorl	%r14d, %r14d
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	cmpq	%r10, %rdx
	jne	.L4
	cmpq	%rcx, %rax
	jne	.L5
	movq	%rdi, %rax
	imulq	%rsi
	movq	%rax, %rbx
	movq	%rdx, %r8
	jmp	.L2
.L5:
	movq	%rsi, %rcx
	movq	%r11, %rbx
	movq	%r11, %r8
	movq	%rdi, %r15
	jmp	.L6
.L4:
	cmpq	%rcx, %rax
	jne	.L7
	movq	%rdi, %rcx
	movq	%r10, %rbx
	movq	%r10, %r8
	movq	%rsi, %r15
.L6:
	movq	%rdi, %rax
	mulq	%rsi
	movq	%rax, %r12
	movq	%r15, %rax
	movq	%rdx, %r13
	mulq	%r8
	testq	%r8, %r8
	jns	.L8
	xorl	%r8d, %r8d
	subq	%r8, %rax
	sbbq	%r15, %rdx
.L8:
	testq	%r15, %r15
	jns	.L9
	subq	%rcx, %rax
	sbbq	%rbx, %rdx
.L9:
	movq	%r13, %r8
	xorl	%r9d, %r9d
	movq	%r12, %rbx
	addq	%rax, %r8
	adcq	%rdx, %r9
	movq	%r8, %rdx
	sarq	$63, %rdx
	cmpq	%r9, %rdx
	je	.L2
	imulq	%rsi, %r10
	movq	%rdi, %rax
	imulq	%rdi, %r11
	mulq	%rsi
	leaq	(%r10,%r11), %r8
	addq	%rdx, %r8
	movq	%rax, %rbx
	jmp	.L3
.L7:
	movq	%r10, %r8
	movq	%rcx, %rdx
	movq	%rdi, %rax
	imulq	%rdi, %rdx
	imulq	%rsi, %r8
	addq	%rdx, %r8
	mulq	%rsi
	addq	%rdx, %r8
	leaq	1(%r10), %rdx
	movq	%rax, %rbx
	cmpq	$1, %rdx
	ja	.L3
	leaq	1(%rcx), %rdx
	cmpq	$1, %rdx
	ja	.L3
	cmpq	%rcx, %r10
	jne	.L11
	cmpq	%rbx, %r14
	movq	%r14, %rax
	sbbq	%r8, %rax
	jl	.L2
	jmp	.L3
.L11:
	testq	%r8, %r8
	js	.L2
.L3:
	movl	$1, %r14d
.L2:
	testq	%r14, %r14
	je	.L1
	ud2
.L1:
	movq	%rbx, %rax
	movq	%r8, %rdx
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	ret
[…]
__umulvti3:
	pushq	%rbp
	movq	%rsi, %r10
	movq	%rcx, %r11
	movq	%rdx, %rsi
	movq	%rsp, %rbp
	pushq	%r12
	xorl	%r12d, %r12d
	pushq	%rbx
	testq	%r10, %r10
	jne	.L18
	testq	%rcx, %rcx
	jne	.L19
	movq	%rdi, %rax
	mulq	%rdx
	movq	%rax, %rbx
	movq	%rdx, %r8
	jmp	.L16
.L19:
	movq	%rcx, %r8
	movq	%rdi, %r9
	jmp	.L20
.L18:
	testq	%rcx, %rcx
	jne	.L22
	movq	%r10, %r8
	movq	%rdx, %r9
.L20:
	movq	%rdi, %rax
	mulq	%rsi
	movq	%rax, %rcx
	movq	%r9, %rax
	movq	%rdx, %rbx
	xorl	%r9d, %r9d
	mulq	%r8
	movq	%rbx, %r8
	movq	%rcx, %rbx
	addq	%rax, %r8
	adcq	%rdx, %r9
	testq	%r9, %r9
	je	.L16
.L22:
	imulq	%rsi, %r10
	movq	%rdi, %rax
	movl	$1, %r12d
	imulq	%rdi, %r11
	mulq	%rsi
	leaq	(%r10,%r11), %r8
	addq	%rdx, %r8
	movq	%rax, %rbx
.L16:
	testq	%r12, %r12
	je	.L15
	ud2
.L15:
	movq	%rbx, %rax
	movq	%r8, %rdx
	popq	%rbx
	popq	%r12
	popq	%rbp
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: 104 instructions in 295 bytes for the __mulvti3() function – still horrible, but no longer as bad as the __mulvti3() function provided in libgcc!

Ouch: 54 instructions in 148 bytes for the __umulvti3() function – also rather terrible!

Compile the source file case17.c a second time with GCC, now with the preprocessor macro OPTIMIZE defined, and display the generated assembly code:

gcc -DOPTIMIZE -m64 -o- -Os -S case17.c
[…]
__mulvti3:
	pushq	%rbp
	movq	%rcx, %rax
	movq	%rdi, %r10
	movq	%rsi, %r11
	movq	%rax, %rdi
	sarq	$63, %rdi
	movq	%rsp, %rbp
	pushq	%r14
	movq	%rdi, %r14
	pushq	%r13
	pushq	%r12
	pushq	%rbx
	movq	%rsi, %rbx
	movslq	%edi, %rsi
	sarq	$63, %rbx
	movslq	%ebx, %rcx
	movq	%rbx, %r12
	movq	%rcx, %r8
	sarq	$63, %rcx
	xorq	%r8, %r10
	movq	%rcx, %r9
	movq	%rsi, %rcx
	sarq	$63, %rsi
	movq	%rsi, %rbx
	xorq	%r9, %r11
	movq	%r10, %rsi
	subq	%r8, %rsi
	movq	%r11, %rdi
	sbbq	%r9, %rdi
	xorq	%rcx, %rdx
	xorq	%rbx, %rax
	movq	%rdx, %r10
	movq	%rax, %r11
	subq	%rcx, %r10
	sbbq	%rbx, %r11
	xorl	%r13d, %r13d
	movq	%r11, %r8
	testq	%rdi, %rdi
	jne	.L4
	testq	%r11, %r11
	jne	.L5
	movq	%rsi, %rax
	mulq	%r10
	movq	%rax, %rcx
	movq	%rdx, %r8
	jmp	.L2
.L5:
	movq	%rsi, %r9
	jmp	.L6
.L4:
	testq	%r11, %r11
	jne	.L8
	movq	%rdi, %r8
	movq	%r10, %r9
.L6:
	movq	%rsi, %rax
	mulq	%r10
	movq	%rax, %rcx
	movq	%r9, %rax
	movq	%rdx, %rbx
	xorl	%r9d, %r9d
	mulq	%r8
	movq	%rbx, %r8
	addq	%rax, %r8
	adcq	%rdx, %r9
	testq	%r9, %r9
	je	.L2
.L8:
	movq	%rdi, %r8
	movq	%r11, %rax
	movl	$1, %r13d
	imulq	%rsi, %rax
	imulq	%r10, %r8
	addq	%rax, %r8
	movq	%rsi, %rax
	mulq	%r10
	addq	%rdx, %r8
	movq	%rax, %rcx
.L2:
	testq	%r13, %r13
	je	.L9
.L10:
	ud2
.L9:
	xorl	%r14d, %r12d
	movslq	%r12d, %r12
	movq	%r12, %rsi
	sarq	$63, %r12
	xorq	%rsi, %rcx
	xorq	%r12, %r8
	movq	%r12, %rbx
	movq	%rcx, %rax
	movq	%r8, %rdx
	subq	%rsi, %rax
	sbbq	%r12, %rdx
	xorq	%rdx, %rbx
	js	.L10
	popq	%rbx
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%rbp
	ret
[…]
__umulvti3:
	movq	%rdi, %rax
	movq	%rdx, %r8
	mulq	%rdx
	movq	%rdx, %r11
	movq	%rax, %r9
	testq	%rsi, %rsi
	je	.L14
	testq	%rcx, %rcx
	je	.L14
.L17:
	ud2
.L14:
	movq	%rdi, %rax
	mulq	%rcx
	movq	%rax, %rdi
	jo	.L17
	addq	%r11, %rdi
	jc	.L17
	movq	%rsi, %rax
	mulq	%r8
	movq	%rax, %rsi
	jo	.L17
	addq	%rdi, %rsi
	movq	%rsi, %rdx
	jc	.L17
	movq	%r9, %rax
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Oops: the optimised implementation coaxes GCC to generate 96 instructions in 273 bytes for the __mulvti3() function – a little shorter and faster than the code generated for __builtin_mul_overflow(), but more than twice the size of the proper assembly code shown in case 13 above, and less than half as fast!

OOPS: 25 instructions in 66 bytes for the __umulvti3() function – also not optimal, but much better than the rather terrible code generated for __builtin_mul_overflow().

Note: exploration of the equally bad code generated for checked multiplication of signed and unsigned 64-bit integers on the i386 platform is left as an exercise to the reader.
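
As a possible starting point for that exercise, here is a hedged 64-bit analogue of the optimised __umulvti3() shown above (my transcription, not verified on i386); __builtin_umul_overflow() and __builtin_uadd_overflow() are GCC's documented 32-bit overflow builtins:
// hedged sketch: checked unsigned 64-bit multiplication from 32-bit halves,
// assuming a little-endian layout for the union (true on i386)
unsigned long long __umulvdi3(unsigned long long multiplicand, unsigned long long multiplier) {
    union {
        unsigned long long all;
        struct {
            unsigned int low, high;
        };
    } product;
    unsigned int tmp;

    product.all = (unsigned long long) (unsigned int) multiplicand * (unsigned int) multiplier;

    if ((((multiplicand >> 32) != 0) && ((multiplier >> 32) != 0))
     || __builtin_umul_overflow((unsigned int) multiplicand, (unsigned int) (multiplier >> 32), &tmp)
     || __builtin_uadd_overflow(tmp, product.high, &product.high)
     || __builtin_umul_overflow((unsigned int) (multiplicand >> 32), (unsigned int) multiplier, &tmp)
     || __builtin_uadd_overflow(tmp, product.high, &product.high))
        __builtin_trap();

    return product.all;
}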

Case 18: __builtin_sub_overflow() Builtin

Create the text file case18.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

__int128_t __absvti2(__int128_t value) {
    __int128_t sign = 0 - (value < 0);

    if (__builtin_sub_overflow(value ^ sign, sign, &value))
        __builtin_trap();

    return value;
}

__int128_t __negvti2(__int128_t negend) {
    __int128_t negation;

    if (__builtin_sub_overflow(0, negend, &negation))
        __builtin_trap();

    return negation;
}

__int128_t __subvti3(__int128_t minuend, __int128_t subtrahend) {
    __int128_t difference;

    if (__builtin_sub_overflow(minuend, subtrahend, &difference))
        __builtin_trap();

    return difference;
}
Compile the source file case18.c with GCC, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -o- -Os -S case18.c
[…]
__absvti2:
	movq	%rsi, %rax
	movq	%rdi, %r8
	movq	%rsi, %rcx
	sarq	$63, %rax
	cltq
	movq	%rax, %rsi
	sarq	$63, %rax
	movq	%rax, %rdi
	xorq	%rsi, %r8
	xorq	%rdi, %rcx
	movq	%r8, %rax
	movq	%rcx, %rdx
	subq	%rsi, %rax
	sbbq	%rdi, %rdx
	jno	.L1
	ud2
.L1:
	ret
[…]
__negvti2:
	movq	%rdi, %r9
	movq	%rsi, %r8
	movq	%rdi, %rsi
	xorl	%edx, %edx
	movq	%r8, %rdi
	movq	%r9, %r8
	movl	$1, %eax
	negq	%r8
	movq	%rdi, %r9
	adcq	$0, %r9
	salq	$63, %rax
	negq	%r9
	cmpq	%rax, %rdi
	jne	.L7
	xorl	%edx, %edx
	testq	%rsi, %rsi
	sete	%dl
.L7:
	testq	%rdx, %rdx
	je	.L6
	ud2
.L6:
	movq	%r8, %rax
	movq	%r9, %rdx
	ret
[…]
__subvti3:
	movq	%rsi, %r8
	movq	%rdi, %rsi
	movq	%rsi, %rax
	movq	%r8, %rdi
	movq	%rdx, %r8
	subq	%r8, %rax
	movq	%rdi, %rdx
	sbbq	%rcx, %rdx
	jno	.L11
	ud2
.L11:
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: terrible code for the __absvti2() and __negvti2() functions!

OOPS: awful code with 4 superfluous MOV instructions in a total of 11 instructions for the __subvti3() function.
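
A hedged alternative for the negation (mine): trapping on the single overflowing input explicitly leaves a plain negation for the optimiser to translate; whether GCC actually produces better code for it is left unverified:
// hedged sketch: only -2**127 overflows on negation
__int128_t __negvti2_alt(__int128_t negend) {
    if ((__uint128_t) negend == (__uint128_t) 1 << 127)
        __builtin_trap();
    return -negend;
}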

Case 19: __builtin_copysign() Builtin

Create the text file case19.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

double copysign(double destination, double source) {
    return __builtin_copysign(destination, source);
}
Compile the source file case19.c with GCC, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case19.c
[…]
copysign:
	fldl	12(%esp)
	fxam
	fnstsw	%ax
	fstp	%st(0)
	fldl	4(%esp)
	fabs
	testb	$2, %ah
	je	L1
	fchs
L1:
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Ouch: 10 instructions in 24 bytes, including a conditional branch which impairs performance!

Properly written code uses but only 5 instructions in just 19 bytes, without conditional branch:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"copysign.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text

copysign:
	mov	eax, [esp+16]
	shld	[esp+8], eax, 1
	ror	[esp+8], 1
	fld	qword ptr [esp+4]
	ret

.size	copysign, .-copysign
.type	copysign, @function
.global	copysign
.end
Compile the source file case19.c a second time with GCC, now targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -o- -O3 -S case19.c
[…]
copysign:
	pushq	%rbp
	andpd	.LC1(%rip), %xmm0
	movapd	%xmm1, %xmm2
	andpd	.LC0(%rip), %xmm2
	movq	%rsp, %rbp
	orpd	%xmm2, %xmm0
	popq	%rbp
	ret
	.section .rdata,"dr"
	.align 16
.LC0:
	.long	0
	.long	-2147483648
	.long	0
	.long	0
	.align 16
.LC1:
	.long	-1
	.long	2147483647
	.long	0
	.long	0
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: 8 instructions in 30 bytes, plus 2 constants in 32 bytes, of which either 4 instructions or (at least) one of the 2 constants are superfluous!

Properly written code uses but only 6 instructions in either 28 or just 24 bytes, and performs no memory access:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"copysign.s"
.arch	generic64
.code64
.intel_syntax noprefix
.text
				# xmm0 = destination
				# xmm1 = source
copysign:
.ifnotdef OPTIMIZE
	mov	rax, 0x7FFFFFFFFFFFFFFF
	movd	xmm2, rax
	andpd	xmm0, xmm2
	andnpd	xmm2, xmm1
	orpd	xmm0, xmm2
.else
	movd	rcx, xmm0
	movd	rdx, xmm1
	shld	rcx, rdx, 1
	ror	rcx, 1
	movd	xmm0, rcx
.endif
	ret

.size	copysign, .-copysign
.type	copysign, @function
.global	copysign
.end
Note: exploration of the equally bad code generated for the copysignf() and copysignl() functions is left as an exercise to the reader.
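
For readers who prefer to stay in C, a hedged sketch (mine) that transfers the sign bit with plain integer operations; type punning through a union is documented GCC behaviour, but whether this coaxes better code out of the compiler than __builtin_copysign() is not verified here:
// hedged sketch: copy the sign bit of source into destination
double copysign_bits(double destination, double source) {
    union { double f; unsigned long long u; } d = { destination }, s = { source };
    d.u = (d.u & ~(1ULL << 63)) | (s.u & (1ULL << 63));
    return d.f;
}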

Case 20: -fsanitize=signed-integer-overflow Command Line Option

Create the text file case20.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int si2(int v) {
    return v << 31;
}

long long di2(long long v) {
    return v << 63;
}

#ifdef __amd64__
__int128_t ti2(__int128_t v) {
    return v << 127;
}
#endif
Compile the source file case20.c with GCC, enabling signed overflow detection, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -fsanitize=signed-integer-overflow -m64 -o- -Os -S case20.c
[…]
si2:
	movl	%edi, %eax
	sall	$31, %eax
	ret
[…]
di2:
	movq	%rdi, %rax
	salq	$63, %rax
	ret
[…]
ti2:
	movq	%rdi, %rdx
	xorl	%eax, %eax
	salq	$63, %rdx
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: GCC fails to generate code for integer overflow detection – which occurs here for all arguments not equal to 0!

Note: exploration of this bug on the i386 platform is left as an exercise to the reader.

Case 21: -ftrapv Command Line Option

Create the text file case21.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int si3(int v, int w) {
    volatile int p = v * w, q = v / w, r = v % w, s = v + w, t = v - w, u = -v;
    return v < 0 ? -v : v;
}

long long di3(long long v, long long w) {
    volatile long long p = v * w, q = v / w, r = v % w, s = v + w, t = v - w, u = -v;
    return v < 0 ? -v : v;
}

#ifdef __amd64__
__int128_t ti3(__int128_t v, __int128_t w) {
    volatile __int128_t p = v * w, q = v / w, r = v % w, s = v + w, t = v - w, u = -v;
    return v < 0 ? -v : v;
}
#endif
Compile the source file case21.c with GCC, enabling overflow-trapping, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -ftrapv -m64 -o- -Os -S case21.c
[…]
	.globl	__mulvsi3
	.globl	__addvsi3
	.globl	__subvsi3
	.globl	__negvsi2
	.globl	si3
	.type	si3, @function
si3:
	pushq	%rbp
	movl	%esi, %ebp
	pushq	%rbx
	movl	%edi, %ebx
	subq	$40, %rsp
	call	__mulvsi3
	movl	%ebp, %esi
	movl	%ebx, %edi
	movl	%eax, 8(%rsp)
	movl	%ebx, %eax
	cltd
	idivl	%ebp
	movl	%eax, 12(%rsp)
	movl	%edx, 16(%rsp)
	call	__addvsi3
	movl	%ebp, %esi
	movl	%ebx, %edi
	movl	%eax, 20(%rsp)
	call	__subvsi3
	movl	%ebx, %edi
	movl	%eax, 24(%rsp)
	call	__negvsi2
	movl	%ebx, %esi
	movl	%ebx, %edi
	sarl	$31, %esi
	movl	%eax, 28(%rsp)
	xorl	%esi, %edi
	call	__subvsi3
	addq	$40, %rsp
	popq	%rbx
	popq	%rbp
	ret
[…]
	.globl	__mulvdi3
	.globl	__addvdi3
	.globl	__subvdi3
	.globl	__negvdi2
	.globl	di3
	.type	di3, @function
di3:
	pushq	%rbp
	movq	%rsi, %rbp
	pushq	%rbx
	movq	%rdi, %rbx
	subq	$56, %rsp
	call	__mulvdi3
	movq	%rbp, %rsi
	movq	%rbx, %rdi
	movq	%rax, (%rsp)
	movq	%rbx, %rax
	cqto
	idivq	%rbp
	movq	%rax, 8(%rsp)
	movq	%rdx, 16(%rsp)
	call	__addvdi3
	movq	%rbp, %rsi
	movq	%rbx, %rdi
	movq	%rax, 24(%rsp)
	call	__subvdi3
	movq	%rbx, %rdi
	movq	%rax, 32(%rsp)
	call	__negvdi2
	movq	%rbx, %rsi
	movq	%rbx, %rdi
	sarq	$63, %rsi
	movq	%rax, 40(%rsp)
	xorq	%rsi, %rdi
	call	__subvdi3
	addq	$56, %rsp
	popq	%rbx
	popq	%rbp
	ret
[…]
	.globl	__mulvti3
	.globl	__divti3
	.globl	__modti3
	.globl	__addvti3
	.globl	__subvti3
	.globl	__negvti2
	.globl	ti3
	.type	ti3, @function
ti3:
	pushq	%r14
	movq	%rcx, %r14
	pushq	%r13
	movq	%rdx, %r13
	pushq	%rbp
	movq	%rdi, %rbp
	pushq	%rbx
	movq	%rsi, %rbx
	subq	$104, %rsp
	call	__mulvti3
	movq	%r14, %rcx
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, (%rsp)
	movq	%rdx, 8(%rsp)
	movq	%r13, %rdx
	call	__divti3
	movq	%r14, %rcx
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, 16(%rsp)
	movq	%rdx, 24(%rsp)
	movq	%r13, %rdx
	call	__modti3
	movq	%r14, %rcx
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, 32(%rsp)
	movq	%rdx, 40(%rsp)
	movq	%r13, %rdx
	call	__addvti3
	movq	%r14, %rcx
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, 48(%rsp)
	movq	%rdx, 56(%rsp)
	movq	%r13, %rdx
	call	__subvti3
	movq	%rbp, %rdi
	movq	%rbx, %rsi
	movq	%rax, 64(%rsp)
	movq	%rdx, 72(%rsp)
	call	__negvti2
	movq	%rbx, %rcx
	movq	%rbp, %rdi
	sarq	$63, %rcx
	movq	%rax, 80(%rsp)
	xorq	%rcx, %rbx
	movq	%rdx, 88(%rsp)
	xorq	%rcx, %rdi
	movq	%rcx, %rdx
	movq	%rbx, %rsi
	call	__subvti3
	addq	$104, %rsp
	popq	%rbx
	popq	%rbp
	popq	%r13
	popq	%r14
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Oops: instead of generating a single call of the library function __divmodti4(), which returns both quotient and remainder, GCC generates separate calls of the library functions __divti3() and __modti3(), doubling the execution time.

Ouch: instead of generating the trivial code for overflow-trapping addition, negation and subtraction inline, GCC generates calls of the library functions __addv{s,d,t}i3(), __mulv{s,d,t}i3(), __negv{s,d,t}i2() and __subv{s,d,t}i3(), increasing the execution time by about an order of magnitude!

Note: exploration of the equally bad code generated for overflow-trapping integer arithmetic of 64-bit integers on the i386 platform is left as an exercise to the reader.
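
The trivial inline code meant above can be written with GCC's documented overflow builtins; a minimal sketch (mine, not libgcc's implementation) for the addition, which typically compiles to an ADD, a conditional jump and a UD2:
// hedged sketch: overflow-trapping signed addition, meant to be inlined
static inline int addv_inline(int augend, int addend) {
    int sum;
    if (__builtin_add_overflow(augend, addend, &sum))   // true on signed overflow
        __builtin_trap();                               // UD2 on x86
    return sum;
}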

Case 22: Shell Game, or Crazy Braindead Register Allocator

Take a second look at (the beginning of) the __mulvti3() function and the code generated for it:
/* More subroutines needed by GCC output code on some machines.  */
/* Compile this one with gcc.  */
/* Copyright (C) 1989-2024 Free Software Foundation, Inc.
[…]
typedef union {
  DWtype ll;
  struct {
    Wtype low, high;
  } s;
} DWunion;
[…]
DWtype
__mulvDI3 (DWtype u, DWtype v)
{
  /* The unchecked multiplication needs 3 Wtype x Wtype multiplications,
     but the checked multiplication needs only two.  */
  const DWunion uu = {.ll = u};
  const DWunion vv = {.ll = v};

  if (__builtin_expect (uu.s.high == uu.s.low >> (W_TYPE_SIZE - 1), 1))
    {
      /* u fits in a single Wtype.  */
      if (__builtin_expect (vv.s.high == vv.s.low >> (W_TYPE_SIZE - 1), 1))
	{
	  /* v fits in a single Wtype as well.  */
	  /* A single multiplication.  No overflow risk.  */
	  return (DWtype) uu.s.low * (DWtype) vv.s.low;
	}
[…]
[…]
0000000000000000 <__mulvti3>:
   0:	41 55                	push   %r13
   2:	49 89 cb             	mov    %rcx,%r11
   5:	48 89 d0             	mov    %rdx,%rax
   8:	49 89 d2             	mov    %rdx,%r10
   b:	41 54                	push   %r12
   d:	49 89 fc             	mov    %rdi,%r12
  10:	48 89 d1             	mov    %rdx,%rcx
  13:	49 89 f0             	mov    %rsi,%r8
  16:	4c 89 e2             	mov    %r12,%rdx
  19:	49 89 f5             	mov    %rsi,%r13
  1c:	53                   	push   %rbx
  1d:	48 89 fe             	mov    %rdi,%rsi
  20:	48 c1 fa 3f          	sar    $0x3f,%rdx
  24:	48 c1 f8 3f          	sar    $0x3f,%rax
  28:	4c 89 df             	mov    %r11,%rdi
  2b:	4c 39 c2             	cmp    %r8,%rdx
  2e:	75 18                	jne    48 <__mulvti3+0x48>
  30:	4c 39 d8             	cmp    %r11,%rax
  33:	75 6b                	jne    a0 <__mulvti3+0xa0>
  35:	4c 89 e0             	mov    %r12,%rax
  38:	49 f7 ea             	imul   %r10
  3b:	5b                   	pop    %rbx
  3c:	41 5c                	pop    %r12
  3e:	41 5d                	pop    %r13
  40:	c3                   	retq   
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: 25 instructions in 65 bytes, including 8 superfluous MOV instructions shuffling registers, clobbering the non-volatile registers RBX, R12 and R13 without necessity, and performing 6 superfluous memory accesses!

Properly written and straightforward code uses but only 11 instructions in just 31 bytes, without clobbering non-volatile registers, and without memory accesses at all:

# Copyright © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"mulvti3.s"
.arch	generic64
.code64
.intel_syntax noprefix
.global	__mulvti3
.type	__mulvti3, @function
.text
				# rsi:rdi = multiplicand
				# rcx:rdx = multiplier
__mulvti3:
	mov	r8, rdi
	mov	r9, rdx
	sar	r8, 63
	sar	r9, 63
	cmp	r8, rsi
	jne	__mulvti3+0x48
	cmp	r9, rcx
	jne	__mulvti3+0xa0
	mov	rax, rdi
	imul	rdx
	ret
[…]

Case 23: Undefined Behaviour in __bswapsi2() Function

The function __bswapsi2() defined in libgcc2.c exhibits Undefined Behaviour in its last line:
/* More subroutines needed by GCC output code on some machines.  */
/* Compile this one with gcc.  */
/* Copyright (C) 1989-2024 Free Software Foundation, Inc.
[…]
typedef int SItype __attribute__ ((mode (SI)));
[…]
SItype
__bswapsi2 (SItype u)
{
  return ((((u) & 0xff000000) >> 24)
	  | (((u) & 0x00ff0000) >>  8)
	  | (((u) & 0x0000ff00) <<  8)
	  | (((u) & 0x000000ff) << 24));
}
[…]
3.4.3
Undefined behavior

[…]

EXAMPLE An example of undefined behavior is the behavior on integer overflow.
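
The offending operand is the last one: for arguments with bit 7 set, ((u) & 0x000000ff) << 24 shifts a 1 into the sign bit of the signed intermediate result and thus overflows it. A hedged fix (mine) performs the shifts on an unsigned value; the final conversion back to SItype is then merely implementation-defined, not undefined:
SItype
__bswapsi2 (SItype u)
{
  unsigned int x = (unsigned int) u;    /* shifts on the unsigned value cannot overflow */
  return (SItype) (((x & 0xff000000u) >> 24)
                 | ((x & 0x00ff0000u) >>  8)
                 | ((x & 0x0000ff00u) <<  8)
                 | ((x & 0x000000ffu) << 24));
}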

Case 24: Undefined Behaviour or Optimiser Failure?

Create the text file case24.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __absusi2(int value) {
    if (value < 0)
        value = -value;

    if (value < 0)
        __builtin_trap();

    return value;
}

int __absvsi2(int value) {
    const int sign = value >> 31;
    value += sign;
    value ^= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

int __abswsi2(int value) {
    const int sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

long long __absudi2(long long value) {
    if (value < 0)
        value = -value;

    if (value < 0)
        __builtin_trap();

    return value;
}

long long __absvdi2(long long value) {
    const long long sign = value >> 63;
    value += sign;
    value ^= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

long long __abswdi2(long long value) {
    const long long sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

#ifdef __amd64__
__int128_t __absuti2(__int128_t value) {
    if (value < 0)
        value = -value;

    if (value < 0)
        __builtin_trap();

    return value;
}

__int128_t __absvti2(__int128_t value) {
    const __int128_t sign = value >> 127;
    value += sign;
    value ^= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}

__int128_t __abswti2(__int128_t value) {
    const __int128_t sign = 0 - (value < 0);
    value ^= sign;
    value -= sign;

    if (value < 0)
        __builtin_trap();

    return value;
}
#endif // __amd64__
Compile the source file case24.c with GCC, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -o- -Os -S -Wall -Wextra case24.c
[…]
__absusi2:
	movl	%edi, %eax
	cltd
	xorl	%edx, %eax
	subl	%edx, %eax
	ret
[…]
__absvsi2:
	movl	%edi, %edx
	sarl	$31, %edx
	leal	(%rdi,%rdx), %eax
	xorl	%edx, %eax
	jns	.L2
	ud2
.L2:
	ret
[…]
__abswsi2:
	movl	%edi, %eax
	movl	%edi, %edx
	shrl	$31, %eax
	movl	%eax, %edi
	negl	%edi
	xorl	%edx, %edi
	addl	%edi, %eax
	jns	.L5
	ud2
.L5:
	ret
[…]
__absudi2:
	movq	%rdi, %rax
	cqto
	xorq	%rdx, %rax
	subq	%rdx, %rax
	ret
[…]
__absvdi2:
	movq	%rdi, %rdx
	sarq	$63, %rdx
	leaq	(%rdi,%rdx), %rax
	xorq	%rdx, %rax
	jns	.L8
	ud2
.L8:
	ret
[…]
__abswdi2:
	movq	%rdi, %rax
	cqto
	movslq	%edx, %rdx
	xorq	%rdx, %rax
	subq	%rdx, %rax
	jns	.L10
	ud2
.L10:
	ret
[…]
__absuti2:
	movq	%rsi, %rax
	movq	%rdi, %r8
	movq	%rsi, %rcx
	sarq	$63, %rax
	movq	%rax, %rsi
	xorq	%rax, %r8
	xorq	%rsi, %rcx
	movq	%r8, %rax
	movq	%rcx, %rdx
	subq	%rsi, %rax
	sbbq	%rsi, %rdx
	ret
[…]
__absvti2:
	movq	%rdi, %rax
	movq	%rsi, %rdi
	movq	%rsi, %rdx
	sarq	$63, %rdi
	addq	%rdi, %rax
	adcq	%rdi, %rdx
	xorq	%rdi, %rax
	xorq	%rdi, %rdx
	jns	.L13
	ud2
.L13:
	ret
[…]
__abswti2:
	movq	%rsi, %rax
	movq	%rdi, %r8
	movq	%rsi, %rcx
	sarq	$63, %rax
	cltq
	movq	%rax, %rsi
	sarq	$63, %rax
	movq	%rax, %rdi
	xorq	%rsi, %r8
	xorq	%rdi, %rcx
	movq	%r8, %rax
	movq	%rcx, %rdx
	subq	%rsi, %rax
	sbbq	%rdi, %rdx
	testq	%rdx, %rdx
	jns	.L15
	ud2
.L15:
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Ouch: the optimiser alias pessimiser removes the conditional expression detecting possible integer overflow of the negation, i.e. undefined behaviour, for the __absusi2(), __absudi2() and __absuti2() functions – without warning, despite the given command line options -Wall and -Wextra.
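
The removal is legal: after the first if, value can only still be negative when the negation overflowed, which GCC is entitled to assume never happens. A hedged variant (mine) keeps a check the optimiser cannot discard, because it tests the only problematic input before negating:
// hedged sketch: trap on INT_MIN, whose negation would overflow
int __abssi2_checked(int value) {
    if (value == -2147483647 - 1)
        __builtin_trap();
    return value < 0 ? -value : value;
}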

Oops: the optimiser fails to recognise the common and well-known alternative implementations used for the __absvsi2(), __absvdi2() and __absvti2() plus the __abswsi2(), __abswdi2() and __abswti2() functions, and generates unoptimised and awful code especially for the latter.

Note: exploration of the equally bad code generated for the i386 platform is left as an exercise to the reader.

Case 25: Optimiser Failures

Create the text file case25.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef __amd64__
inline
unsigned long long __rotldi3(unsigned long long m, unsigned n) {
    return (m << n) | (m >> (64 - n));
}

inline
unsigned long long __rotrdi3(unsigned long long m, unsigned n) {
    return (m >> n) | (m << (64 - n));
}

unsigned long long right63(unsigned long long o) {
    return __rotrdi3(o, 63);
}

unsigned long long right33(unsigned long long o) {
    return __rotrdi3(o, 33);
}

unsigned long long right31(unsigned long long o) {
    return __rotrdi3(o, 31);
}

unsigned long long right1(unsigned long long o) {
    return __rotrdi3(o, 1);
}
#else
inline
__uint128_t __rotlti3(__uint128_t m, unsigned n) {
    return (m << n) | (m >> (128 - n));
}

inline
__uint128_t __rotrti3(__uint128_t m, unsigned n) {
    return (m >> n) | (m << (128 - n));
}

__uint128_t left127(__uint128_t o) {
    return __rotlti3(o, 127);
}

__uint128_t left65(__uint128_t o) {
    return __rotlti3(o, 65);
}

__uint128_t left63(__uint128_t o) {
    return __rotlti3(o, 63);
}

__uint128_t left1(__uint128_t o) {
    return __rotlti3(o, 1);
}
#endif // __amd64__
Compile the source file case25.c with GCC, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case25.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.
[…]
right63:
	movl	4(%esp), %eax	#	mov	eax, [esp+4]
	movl	8(%esp), %edx	#	mov	edx, [esp+8]
	movl	%eax, %ecx	#	bt	eax, 31
	shldl	$1, %edx, %eax	#	adc	edx, edx
	shldl	$1, %ecx, %edx	#	adc	eax, eax
	ret			#	ret
[…]
right33:
	movl	4(%esp), %eax	#	mov	edx, [esp+4]
	movl	8(%esp), %edx	#	mov	eax, [esp+8]
	movl	%eax, %ecx	#	bt	edx, 1
	shldl	$31, %edx, %eax	#	rcr	eax, 1
	shldl	$31, %ecx, %edx	#	rcr	edx, 1
	ret			#	ret
[…]
right31:
	movl	4(%esp), %eax	#	mov	edx, [esp+4]
	movl	8(%esp), %edx	#	mov	eax, [esp+8]
	movl	%eax, %ecx	#	bt	edx, 31
	shrdl	$31, %edx, %eax	#	adc	eax, eax
	shrdl	$31, %ecx, %edx	#	adc	edx, edx
	ret			#	ret
[…]
right1:
	movl	4(%esp), %eax	#	mov	eax, [esp+4]
	movl	8(%esp), %edx	#	mov	edx, [esp+8]
	movl	%eax, %ecx	#	bt	eax, 1
	shrdl	$1, %edx, %eax	#	rcr	edx, 1
	shrdl	$1, %ecx, %edx	#	rcr	eax, 1
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OOPS: while GCC replaces the circular right shift by 63 with the equivalent circular left shift by 1, it but fails to replace the circular right shift by 33, i.e. register size plus 1, with the equivalent circular right shift by 1 of the exchanged registers, and also fails to replace the circular right shift by 31, i.e. register size − 1, with the equivalent circular left shift by 1 of the exchanged registers!

OUCH: additionally GCC clobbers register ECX in all 4 examples without necessity, and (ab)uses slow SHLD and SHRD instructions instead of fast RCL and RCR instructions or even faster ADC instructions!

Compile the source file case25.c a second time with GCC, now targeting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -O3 -S case25.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.
[…]
left127:
	movq	%rdi, %rax	#	mov	rax, rdi
	movq	%rsi, %rdx	#	mov	rdx, rsi
				#	shr	esi, 1
	shrdq	$1, %rsi, %rax	#	rcr	rax, 1
	shrdq	$1, %rdi, %rdx	#	rcr	rdx, 1
	ret			#	ret
[…]
left65:
	movq	%rdi, %rax	#	mov	rax, rsi
	movq	%rsi, %rdx	#	mov	rdx, rdi
				#	add	rdi, rdi
	shrdq	$63, %rsi, %rax	#	adc	rax, rax
	shrdq	$63, %rdi, %rdx	#	adc	rdx, rdx
	ret			#	ret
[…]
left63:
	movq	%rdi, %rax	#	mov	rax, rsi
	movq	%rsi, %rdx	#	mov	rdx, rdi
				#	shr	edi, 1
	shldq	$63, %rsi, %rax	#	rcr	rax, 1
	shldq	$63, %rdi, %rdx	#	rcr	rdx, 1
	ret			#	ret
[…]
left1:
	movq	%rdi, %rax	#	mov	rax, rdi
	movq	%rsi, %rdx	#	mov	rdx, rsi
				#	add	rsi, rsi
	shldq	$1, %rsi, %rax	#	adc	rax, rax
	shldq	$1, %rdi, %rdx	#	adc	rdx, rdx
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: like above, GCC (ab)uses slow SHLD and SHRD instructions instead of fast RCL and RCR instructions or even faster ADC instructions!

Note: unlike above, the second SHLD or SHRD instruction does not depend on the first one, so processors with 2 execution ports for slow shift instructions can execute them in parallel.

Case 26: More Optimiser Failures

Create the text file case26.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int sign32(int value) {
    return 0 - (value < 0);
}

long long sign64(long long value) {
#ifdef ALTERNATE
    return 0LL - (value < 0);
#else
    return 0 - (value < 0);
#endif
}

#ifdef __amd64__
__int128_t sign128(__int128_t value) {
#ifdef ALTERNATE
    return (__int128_t) 0 - (value < 0);
#else
    return 0 - (value < 0);
#endif
}
#endif // __amd64__
Compile the source file case26.c with GCC, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case26.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.
[…]
_sign32:
	movl	4(%esp), %eax
	sarl	$31, %eax
	ret
[…]
_sign64:
	movl	8(%esp), %eax	#	mov	eax, [esp+8]
	sarl	$31, %eax	#	cdq
	cltd			#	mov	eax, edx
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Compile the source file case26.c a second time with GCC, now targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -o- -O3 -S case26.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.
[…]
sign32:
	movl	%edi, %eax
	sarl	$31, %eax
	ret
[…]
sign64:
	sarq	$63, %rdi	#	mov	rax, rdi
	movslq	%edi, %rax	#	sar	rax, 63
	ret			#	ret
[…]
sign128:
	sarq	$63, %rsi	#	mov	rax, rsi
	movslq	%esi, %rax	#	cqo
	cqto			#	mov	rax, rdx
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
Compile the source file case26.c a third time with GCC, again targeting the AMD64 platform, now with the preprocessor macro ALTERNATE defined, and display the generated assembly code:
gcc -DALTERNATE -m64 -o- -O3 -S case26.c
[…]
sign32:
	movl	%edi, %eax
	sarl	$31, %eax
	ret
[…]
sign64:
	sarq	$63, %rdi
	movslq	%edi, %rax
	ret
[…]
sign128:
	shrq	$63, %rsi
	xorl	%edx, %edx
	movq	%rsi, %rax
	negq	%rax
	adcq	$0, %rdx
	negq	%rdx
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: that’s what I call a complete failure!
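
For comparison, a hedged alternative (mine) that expresses the sign mask as an arithmetic right shift; shifting a negative signed value right is implementation-defined in ISO C, but GCC documents it as arithmetic, so the whole function boils down to a shift of the high half:
// hedged sketch: 0 for non-negative, -1 for negative arguments
__int128_t sign128_shift(__int128_t value) {
    return value >> 127;
}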

Case 27: Yet More Optimiser Failures

Create the text file case27.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int __clzti2(__uint128_t value) {
    unsigned long long high = value >> 64, low = value, mask = -(high == 0);
    return (64 & (int) mask) + __builtin_clzll((high & ~mask) | (low & mask));
}

int __ctzti2(__uint128_t value) {
    unsigned long long high = value >> 64, low = value, mask = -(low == 0);
    return (64 & (int) mask) + __builtin_ctzll((high & mask) | (low & ~mask));
}
Compile the source file case27.c with GCC, engaging its optimiser, targeting the AMD64 platform, and display the generated assembly code:
gcc -m64 -o- -O3 -S case27.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.
[…]
__clzti2:
	xorl	%edx, %edx	#	xor	edx, edx
	testq	%rsi, %rsi	#	test	rsi, rsi
	sete	%dl		#	setz	dl
	movl	%edx, %ecx	#
	subl	$1, %edx	#
	negl	%ecx		#
	movslq	%edx, %rax	#
	movslq	%ecx, %rdx	#
	andq	%rsi, %rax	#
	andl	$64, %ecx	#
	andq	%rdi, %rdx	#	cmovnz	rdi, rsi
	orq	%rdx, %rax	#	shl	edx, 6
	bsrq	%rax, %rax	#	bsr	rax, rdi
	xorq	$63, %rax	#	xor	eax, 63
	addl	%ecx, %eax	#	add	eax, edx
	ret			#	ret
[…]
__ctzti2:
	xorl	%eax, %eax	#	xor	edx, edx
	testq	%rdi, %rdi	#	test	rdi, rdi
	sete	%al		#	setz	dl
	movl	%eax, %ecx	#
	subl	$1, %eax	#
	negl	%ecx		#
	cltq			#
	movslq	%ecx, %rdx	#
	andq	%rdi, %rax	#
	andl	$64, %ecx	#
	andq	%rsi, %rdx	#	cmovnz	rsi, rdi
	orq	%rdx, %rax	#	shl	edx, 6
	rep bsfq %rax, %rax	#	bsf	rax, rsi
	addl	%ecx, %eax	#	add	eax, edx
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: that’s what I call a complete failure!
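
For reference, a hedged branching variant (mine); like the masking version above it must not be called with a zero argument, since __builtin_clzll(0) is undefined:
// hedged sketch: count leading zeros of a 128-bit value, argument must be non-zero
int __clzti2_alt(__uint128_t value) {
    unsigned long long high = value >> 64;
    return high != 0 ? __builtin_clzll(high)
                     : 64 + __builtin_clzll((unsigned long long) value);
}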

Case 28: BDead Code Generation

Create the text file case28.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

// Stein's algorithm: greatest common divisor

unsigned long long __gcddi3(unsigned long long p, unsigned long long q) {
    unsigned           r, s = 0;
    unsigned long long t;

    if (p == 0)
        return q;

    if (q == 0)
        return p;

    if (p == q)
        return p;
#ifndef CTZ
    while (((p | q) & 1) == 0)
        p >>= 1, q >>= 1, s++;

    while ((p & 1) == 0)
        p >>= 1;

    do {
        while ((q & 1) == 0)
            q >>= 1;
#elif CTZ != 32
    s = __builtin_ctzll(p | q), p >>= s, q >>= s;
    r = __builtin_ctzll(p), p >>= r;

    do {
        r = __builtin_ctzll(q), q >>= r;
#else
    if (((unsigned long) p | (unsigned long) q) == 0)
        p >>= 32, q >>= 32, s = 32;

    r = __builtin_ctzl((unsigned long) p | (unsigned long) q), p >>= r, q >>= r, s += r;

    if ((unsigned long) p == 0)
        p >>= 32;

    r = __builtin_ctzl(p), p >>= r;

    do
    {
        if ((unsigned long) q == 0)
            q >>= 32;

        r = __builtin_ctzl(q), q >>= r;
#endif
        if (p < q)
            t = q, q = p, p = t;
    } while (q -= p);

    return p << s;
}
Compile the source file case28.c with GCC, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -DCTZ -m32 -mtune=i386 -o- -O3 -S case28.c
[…]
__gcddi3:
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$44, %esp
	movl	64(%esp), %ecx
	movl	68(%esp), %ebx
	movl	72(%esp), %esi
	movl	76(%esp), %edi
	movl	%ebx, %ebp
	movl	%ecx, 24(%esp)
	orl	%ecx, %ebp
	movl	%ebx, 28(%esp)
	movl	%esi, %eax
	movl	%edi, %edx
	je	L1
	movl	%ecx, %edx
	movl	%ebx, %eax
	xorl	%esi, %edx
	xorl	%edi, %eax
	orl	%eax, %edx
	je	L8
	movl	%esi, %eax
	orl	%edi, %eax
	je	L8
	movl	%ecx, %edx
	orl	%edi, %ebx
	orl	%esi, %edx
	movl	%ebx, 4(%esp)
	xorl	%ebx, %ebx
	movl	%edx, (%esp)
	call	___ctzdi2
	movl	28(%esp), %edx
	movl	%eax, %ebp
	movl	24(%esp), %eax
	movl	%ebp, %ecx
	shrdl	%cl, %edx, %eax
	shrl	%cl, %edx
	testb	$32, %cl
	cmovne	%edx, %eax
	cmovne	%ebx, %edx
	shrdl	%cl, %edi, %esi
	xorl	%ebx, %ebx
	shrl	%cl, %edi
	testb	$32, %cl
	movl	%edx, 4(%esp)
	cmovne	%edi, %esi
	cmovne	%ebx, %edi
	movl	%eax, (%esp)
	movl	%edx, 28(%esp)
	movl	%eax, 24(%esp)
	call	___ctzdi2
	movl	28(%esp), %edx
	movl	%eax, %ebx
	movl	24(%esp), %eax
	movl	%ebx, %ecx
	xorl	%ebx, %ebx
	shrdl	%cl, %edx, %eax
	shrl	%cl, %edx
	andl	$32, %ecx
	cmovne	%edx, %eax
	cmovne	%ebx, %edx
	movl	%eax, 24(%esp)
	movl	%edx, 28(%esp)
L6:
	movl	%esi, (%esp)
	movl	%edi, 4(%esp)
	call	___ctzdi2
	movl	28(%esp), %edx
	movl	%eax, %ecx
	xorl	%eax, %eax
	shrdl	%cl, %edi, %esi
	shrl	%cl, %edi
	testb	$32, %cl
	cmovne	%edi, %esi
	cmovne	%eax, %edi
	movl	24(%esp), %eax
	cmpl	%eax, %esi
	movl	%edi, %ebx
	movl	%esi, %ecx
	sbbl	%edx, %edi
	jc	L3
	subl	%eax, %esi
	movl	%ebx, %edi
	sbbl	%edx, %edi
	movl	%edi, %eax
	orl	%esi, %eax
	jne	L6
	movl	24(%esp), %eax
	movl	28(%esp), %edx
	movl	%ebp, %ecx
	xorl	%ebx, %ebx
	shldl	%cl, %eax, %edx
	sall	%cl, %eax
	andl	$32, %ecx
	cmovne	%eax, %edx
	cmovne	%ebx, %eax
L1:
	addl	$44, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
L8:
	movl	24(%esp), %eax
	movl	28(%esp), %edx
	addl	$44, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
L3:
	movl	%eax, %esi
	movl	%edx, %edi
	movl	%ecx, 24(%esp)
	subl	%ecx, %esi
	movl	%ebx, 28(%esp)
	sbbl	%ebx, %edi
	jmp	L6
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: the CMOVcc instruction is not available on the i386 processor!

The code generated by GCC shows 117 instructions in 325 bytes, with 3 calls of the library function __ctzdi2() instead of inlined code for the __builtin_ctzll() builtin, while properly written code uses either only 70 instructions in just 146 bytes or (with the symbol SPEED defined) 75 instructions in 157 bytes:

# Copyright © 2004-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

.file	"gcddi3.s"
.arch	generic32
.code32
.intel_syntax noprefix
.text
				# [esp+16] = high dword of second argument
				# [esp+12] = low dword of second argument
				# [esp+8] = high dword of first argument
				# [esp+4] = low dword of first argument
__gcddi3:
	push	edi
	push	esi
	mov	eax, [esp+12]
	mov	edx, [esp+16]	# edx:eax = first argument
	mov	edi, [esp+20]
	mov	esi, [esp+24]	# esi:edi = second argument
	mov	ecx, eax
	or	ecx, edx	# ecx = low qword of first argument
				#     | high qword of first argument
	jz	.Lsecond	# first argument = 0?

	mov	ecx, edi
	or	ecx, esi	# ecx = low qword of second argument
				#     | high qword of second argument
	jz	.Lfirst		# second argument = 0?

	push	ebx
	mov	ecx, eax
	mov	ebx, edx	# ebx:ecx = first argument
	xor	ecx, edi
	xor	ebx, esi	# ebx:ecx = first argument ^ second argument
	or	ebx, ecx
	jz	.Lequal		# first argument = second argument?

	xor	ebx, ebx	# ebx = exponent of common divisor 2 = 0
	mov	ecx, eax
	or	ecx, edi	# ecx = low dword of first argument
				#     | low dword of second argument
	jnz	.Lcommon	# first argument % 2**32 <> 0?
				# second argument % 2**32 <> 0?
.ifnotdef SPEED
	xchg	eax, edx	# edx:eax = first argument / 2**32
	xchg	edi, esi	# esi:edi = second argument / 2**32
.else
	mov	eax, edx
	mov	edx, ecx	# edx:eax = first argument / 2**32
	mov	edi, esi
	mov	esi, ecx	# esi:edi = second argument / 2**32
.endif
	add	ebx, 32		# ebx = exponent of common divisor 2 = 32
	mov	ecx, eax
	or	ecx, edi	# ecx = high dword of first argument
				#     | high dword of second argument
.Lcommon:
	bsf	ecx, ecx	# ecx = index of least significant '1' bit
	add	ebx, ecx	# ebx = exponent of common divisor 2
	shrd	eax, edx, cl
	shr	edx, cl		# edx:eax = first argument / 2**exponent
				#         = first argument'
	shrd	edi, esi, cl
	shr	esi, cl		# esi:edi = second argument / 2**exponent
				#         = second argument'
	test	eax, eax
	jnz	1f		# first argument' % 2**32 <> 0?
.ifnotdef SPEED
	xchg	eax, edx	# edx:eax = first argument' / 2**32
.else
	mov	eax, edx
	xor	edx, edx	# edx:eax = first argument' / 2**32
.endif
1:
	bsf	ecx, eax	# ecx = index of least significant '1' bit
	shrd	eax, edx, cl
	shr	edx, cl		# edx:eax = first argument' / 2**index
				#         = first argument"
.Lloop:
	test	edi, edi
	jnz	2f		# second argument' % 2**32 <> 0?
.ifnotdef SPEED
	xchg	edi, esi	# esi:edi = second argument' / 2**32
.else
	mov	edi, esi
	xor	esi, esi	# esi:edi = second argument' / 2**32
.endif
2:
	bsf	ecx, edi	# ecx = index of least significant '1' bit
	shrd	edi, esi, cl
	shr	esi, cl		# esi:edi = second argument' / 2**index
				#         = second argument"
.Lsubtract:
	sub	edi, eax
	sbb	esi, edx	# esi:edi = second argument" - first argument"
	jnb	.Lcontinue	# second argument" >= first argument"?
.Lswap:
	add	eax, edi
	adc	edx, esi	# edx:eax = first argument"
				#         + second argument" - first argument"
				#         = second argument"
.ifnotdef SPEED
	neg	esi
	neg	edi
	sbb	esi, 0		# esi:edi = 0
				#         - (second argument" - first argument")
				#         = first argument" - second argument"
.else
	mov	ecx, esi
	xor	esi, esi
	neg	edi
	sbb	esi, ecx	# esi:edi = 0
				#         - (second argument" - first argument")
				#         = first argument" - second argument"
.endif
	jmp	.Lloop
.Lcontinue:
	mov	ecx, edi
	or	ecx, esi
	jnz	.Lloop		# second argument" <> first argument"?

	mov	ecx, ebx	# ecx = exponent of common divisor 2
	shld	edx, eax, cl
	shl	eax, cl		# edx:eax = first argument" * 2**exponent
				#         = greatest common divisor
.Lequal:
	pop	ebx
.Lfirst:
	pop	esi
	pop	edi
	ret
.Lsecond:
	mov	eax, edi
	mov	edx, esi	# edx:eax = second argument
	pop	esi
	pop	edi
	ret

.size	__gcddi3, .-__gcddi3
.type	__gcddi3, @function
.global	__gcddi3
.end
Compile the source file case28.c a second time with GCC, now with the preprocessor macro CTZ=32 defined, and display the generated assembly code:
gcc -DCTZ=32 -m32 -mtune=i386 -o- -O3 -S case28.c
[…]
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$8, %esp
	movl	24(%esp), %esi
	movl	28(%esp), %edi
	movl	32(%esp), %eax
	movl	36(%esp), %edx
	movl	%edi, %ebx
	orl	%esi, %ebx
	je	L1
	movl	%edx, %ebx
	orl	%eax, %ebx
	je	L11
	movl	%edi, %ebx
	xorl	%edx, %ebx
	movl	%esi, %ecx
	xorl	%eax, %ecx
	orl	%ecx, %ebx
	je	L1
	movl	%esi, %ecx
	orl	%eax, %ecx
	je	L17
	xorl	%ebx, %ebx
L3:
	bsfl	%ecx, %ecx
	shrdl	%cl, %edi, %esi
	shrl	%cl, %edi
	testb	$32, %cl
	je	L19
	movl	%edi, %esi
	xorl	%edi, %edi
L19:
	shrdl	%cl, %edx, %eax
	shrl	%cl, %edx
	testb	$32, %cl
	je	L20
	movl	%edx, %eax
	xorl	%edx, %edx
L20:
	addl	%ecx, %ebx
	movl	%esi, %ecx
	testl	%esi, %esi
	jne	L4
	movl	%edi, %esi
	xorl	%edi, %edi
	movl	%esi, %ecx
L4:
	bsfl	%ecx, %ecx
	shrdl	%cl, %edi, %esi
	shrl	%cl, %edi
	testb	$32, %cl
	je	L18
	movl	%edi, %esi
	xorl	%edi, %edi
L18:
	movl	%esi, (%esp)
	movl	%edi, 4(%esp)
L9:
	movl	%eax, %ecx
	testl	%eax, %eax
	jne	L5
	movl	%edx, %eax
	xorl	%edx, %edx
	movl	%eax, %ecx
L5:
	bsfl	%ecx, %ecx
	shrdl	%cl, %edx, %eax
	shrl	%cl, %edx
	xorl	%esi, %esi
	testb	$32, %cl
	cmovne	%edx, %eax
	cmovne	%esi, %edx
	movl	%eax, %esi
	movl	%edx, %edi
	movl	(%esp), %edx
	movl	4(%esp), %ecx
	cmpl	%edx, %eax
	movl	%edi, %eax
	sbbl	%ecx, %eax
	jc	L6
	subl	%edx, %esi
	sbbl	%ecx, %edi
	movl	%esi, %eax
	movl	%edi, %edx
	orl	%esi, %edi
	jne	L9
	movl	(%esp), %eax
	movl	4(%esp), %edx
	movb	%bl, %cl
	shldl	%cl, %eax, %edx
	sall	%cl, %eax
	xorl	%ebx, %ebx
	andl	$32, %ecx
	cmovne	%eax, %edx
	cmovne	%ebx, %eax
L1:
	addl	$8, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	ret
L11:
	movl	%esi, %eax
	movl	%edi, %edx
	addl	$8, %esp
	popl	%ebx
	popl	%esi
	popl	%edi
	ret
L17:
	movl	%edi, %esi
	xorl	%edi, %edi
	movl	%edx, %eax
	xorl	%edx, %edx
	movl	%esi, %ecx
	orl	%eax, %ecx
	movl	$32, %ebx
	jmp	L3
L6:
	movl	%edx, %eax
	movl	%ecx, %edx
	subl	%esi, %eax
	sbbl	%edi, %edx
	movl	%esi, (%esp)
	movl	%edi, 4(%esp)
	jmp	L9
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OOPS: the BSF instruction can’t set bit 5 of its result, i.e. never returns a value of 32 or more; the 15 highlighted instructions are therefore superfluous!

Now the code generated by GCC shows 116 instructions in 282 bytes, without calls of a library function – better than before, but still outright bad!

Case 29: Superfluous Extraneous Code Generation

Create the text file case29.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

// Widening carry-less multiplication

#ifndef __amd64__
unsigned long long clmul64(unsigned long p, unsigned long q) {
    unsigned long long r = 0;
    unsigned long      s = 1UL << 31;

    do {
        r <<= 1;
        if (q & s)
            r ^= p; // no need for promotion or conversion of p here!
    } while (s >>= 1);

    return r;
}
#else // __amd64__
__uint128_t clmul128(unsigned long long p, unsigned long long q) {
    __uint128_t        r = 0;
    unsigned long long s = 1UL << 63;

    do {
        r <<= 1;
        if (q & s)
            r ^= p; // no need for promotion or conversion of p here!
    } while (s >>= 1);

    return r;
}
#endif // __amd64__
Compile the source file case29.c with GCC, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case29.c
[…]
clmul64:
	pushl	%ebp
	movl	$-2147483648, %ecx
	xorl	%eax, %eax
	xorl	%edx, %edx
	pushl	%edi
	xorl	%edi, %edi
	pushl	%esi
	pushl	%ebx
	movl	24(%esp), %ebp
	movl	$32, %ebx
	movl	20(%esp), %esi
L3:
	shldl	$1, %eax, %edx
	addl	%eax, %eax
	testl	%ecx, %ebp
	je	L2
	xorl	%esi, %eax
	xorl	%edi, %edx
L2:
	shrl	$1, %ecx
	subl	$1, %ebx
	jne	L3
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: 8 (highlighted) superfluous instructions, including 4 extraneous synthesized instructions which use register EBX, out of the total 25 instructions!

Compile the source file case29.c a second time with GCC, now targeting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -O3 -S case29.c
[…]
clmul128:
	movq	%rdi, %r10
	xorl	%eax, %eax
	movq	%rsi, %rdi
	xorl	%edx, %edx
	movabsq	$-9223372036854775808, %rcx
	movl	$64, %esi
.L3:
	shldq	$1, %rax, %rdx
	addq	%rax, %rax
	testq	%rcx, %rdi
	je	.L2
	movq	%rax, %r9	#
	xorq	%r10, %r9	#	xorq	%rsi, %rax
	movq	%r9, %rax	#
.L2:
	shrq	$1, %rcx
	subl	$1, %esi
	jne	.L3
	ret
[…]
	.ident	"GCC: (GNU […]) 10.2.0"
OUCH: 7 (highlighted) superfluous instructions (minus the single instruction added as comment) out of the total 17 instructions!
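
As an aside, a hedged sketch (mine) of the same widening carry-less multiplication with the PCLMULQDQ intrinsic; it needs -mpclmul and a 64-bit target and is shown only for comparison, not as a drop-in replacement for the portable loop:
// hedged sketch: one instruction instead of a 64-iteration loop
#include <wmmintrin.h>

__m128i clmul128_pclmul(unsigned long long p, unsigned long long q) {
    return _mm_clmulepi64_si128(_mm_cvtsi64_si128((long long) p),
                                _mm_cvtsi64_si128((long long) q), 0);
}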

Case 30: Epic Optimiser Failures

Create the text file case30.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#ifndef __amd64__
int ispowerof2(unsigned long long argument) {
#else
int ispowerof2(__uint128_t argument) {
#endif
    return (argument != 0) && ((argument & argument - 1) == 0);
}
Compile the source file case30.c with GCC, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case30.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.
[…]
ispowerof2:
	pushl   %edi		#
	pushl   %esi		#
	pushl   %ebx		#
	movl	20(%esp), %edi	#	mov	ecx, [esp+4]
	movl	16(%esp), %esi	#	mov	edx, [esp+8]
	movl	%edi, %eax	#	mov	eax, ecx
	orl	%esi, %eax	#	or	eax, edx
	je	.L5		#	jz	.L5
	movl	%esi, %eax	#
	movl	%edi, %edx	#
	addl	$-1, %eax	#	add	ecx, -1
	adcl	$-1, %edx	#	adc	edx, -1
	movl	%eax, %ecx	#
	movl	%esi, %eax	#
	movl	%edx, %ebx	#
	movl	%edi, %edx	#
	andl	%ecx, %eax	#	and	ecx, [esp+4]
	xorl	%ecx, %ecx	#	xor	eax, eax
	andl	%ebx, %edx	#	and	edx, [esp+8]
	popl	%ebx		#
	popl	%esi		#
	orl	%edx, %eax	#	or	ecx, edx
	popl	%edi		#
	sete	%cl		#	setz	al
	movl	%ecx, %eax	#
	ret			#
.L5:
	xorl	%ecx, %ecx	#
	popl	%ebx		#
	popl	%esi		#
	movl	%ecx, %eax	#
	popl	%edi		#
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 12.3.0"
OUCH: 19 (highlighted) superfluous instructions (which abuse registers EDI and ESI to play a Shell Game and perform 3 superfluous memory writes) out of the total 32 instructions in 60 bytes, plus an epic failure of the optimiser which doesn’t recognise that register EAX is already 0 at label .L5!

Properly written straightforward code uses but only 13 instructions in 36 bytes and doesn’t clobber any non-volatile register!

Compile the source file case30.c a second time with GCC, again targeting the i386 platform, now with explicit support for the SSE4.1 alias Penryn New Instruction Set, and display the generated assembly code:

gcc -m32 -msse4.1 -o- -O3 -S case30.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.
[…]
ispowerof2:
	movq	4(%esp), %xmm1	#	movq	xmm0, qword ptr [esp+4]
	xorl	%eax, %eax	#	xor	eax, eax
	movdqa	%xmm1, %xmm0	#
	punpcklqdq %xmm1, %xmm0	#
	ptest	%xmm0, %xmm0	#	ptest	xmm0, xmm0
	je	.L1		#	je	.L1
	pcmpeqd	%xmm0, %xmm0	#	pcmpeqq	xmm1, xmm1
	xorl	%eax, %eax	#
	paddq	%xmm1, %xmm0	#	paddq	xmm1, xmm0
	pand	%xmm1, %xmm0	#
	punpcklqdq %xmm0, %xmm0	#
	ptest	%xmm0, %xmm0	#	ptest	xmm1, xmm0
	sete	%al		#	sete	al
.L1:
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 12.3.0"
OUCH: 5 (highlighted) superfluous instructions out of the total 14 instructions in 50 bytes, plus 2 epic failures of the optimiser which doesn’t recognise that the MOVQ instruction clears the upper lane of the SSE register XMM1 and that register EAX is already 0!

Properly written straightforward code uses but only 9 instructions in 32 bytes!

Compile the source file case30.c a third time with GCC, again targeting the i386 platform, now with explicit support for the AVX2 alias Haswell New Instruction Set or the AVX instruction set, and display the generated assembly code:

gcc -m32 -mavx -o- -O3 -S case30.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.
[…]
ispowerof2:
	vmovq	4(%esp), %xmm0		#	movq	xmm0, [esp+4]
	xorl	%eax, %eax		#	xor	eax, eax
	vpunpcklqdq %xmm0, %xmm0, %xmm0	#
	vptest	%xmm0, %xmm0		#	ptest	xmm0, xmm0
	je	.L1			#	jz	.L1
	vpcmpeqd %xmm1, %xmm1, %xmm1	#	pcmpeqq	xmm1, xmm1
	xorl	%eax, %eax		#
	vpaddq	%xmm1, %xmm0, %xmm1	#	paddq	xmm1, xmm0
	vpand	%xmm0, %xmm1, %xmm0	#
	vpunpcklqdq %xmm0, %xmm0, %xmm0	#
	vptest	%xmm0, %xmm0		#	ptest	xmm1, xmm0
	sete	%al			#	setz	al
.L1:
	ret				#	ret
[…]
	.ident	"GCC: (GNU […]) 12.3.0"
OUCH: 4 (highlighted) superfluous instructions out of the total 13 instructions in 46 bytes, plus 2 epic failures of the optimiser which doesn’t recognise that the MOVQ instruction clears the upper lane of the SSE register XMM0 and that register EAX is already 0!

OUCH: additionally GCC generates AVX instructions without need where SSE4.1 instructions are sufficient!

Compile the source file case30.c a fourth time with GCC, now targeting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -O3 -S case30.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.
[…]
ispowerof2:
	movq	%rsi, %rax	#	mov	rax, rdi
	xorl	%ecx, %ecx	#
	movq	%rdi, %r8	#
	movq	%rsi, %r9	#
	orq	%rdi, %rax	#	or	rax, rsi
	je	.L1		#	jz	.L1
	movq	%rdi, %rax	#	mov	rcx, rdi
	movq	%rsi, %rdx	#	mov	rdx, rsi
	addq	$-1, %rax	#	add	rcx, -1
	adcq	$-1, %rdx	#	add	rdx, -1
	movq	%rax, %rsi	#
	movq	%r8, %rax	#
	xorl	%ecx, %ecx	#	xor	eax, eax
	movq	%rdx, %rdi	#
	movq	%r9, %rdx	#
	andq	%rsi, %rax	#	and	rcx, rdi
	andq	%rdi, %rdx	#	and	rdx, rsi
	orq	%rdx, %rax	#	or	rcx, rdx
	sete	%cl		#	setz	al
.L1:
	movl	%ecx, %eax	#
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 12.3.0"
OUCH: 8 (highlighted) superfluous instructions (which abuse registers R8 and R9 to play a Shell Game) out of the total 21 instructions in 59 bytes, plus an epic failure of the optimiser which doesn’t recognise that register ECX is already 0!

Properly written straightforward code uses but only 13 instructions in 37 bytes!

Case 31: Cargo Cult, or Ignorance is Bliss, plus Epic Optimiser Failure

In the Hardware Specification Errata HSD146, BDM85, SKL029 and 026, all titled POPCNT Instruction May Take Longer to Execute Than Expected, Intel documents a bug in their Haswell, Broadwell, Skylake and Coffee Lake families of CPUs: the POPCNT instruction has a false dependency on its destination register.

Note: a XOR instruction that zeroes the destination register before the POPCNT instruction breaks this false dependency.

Intel processors from the Sandy Bridge, Ivy Bridge and Coffee Lake Refresh families suffer from this bug too; processors of the Alder Lake, Cannon Lake, Cascade Lake, Comet Lake, Ice Lake, Raptor Lake, Rocket Lake and Tiger Lake families are but not affected.

Create the text file case31.c with the following content in an arbitrary, preferably empty directory:

// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int ispowerof2(unsigned long long argument) {
    return __builtin_popcountll(argument) == 1;
}
Compile the source file case31.c with GCC, engaging its optimiser, targeting one of the unaffected processor families, and display the generated assembly code:
gcc -m32 -march=cannonlake -o- -O3 -S case31.c
[…]
ispowerof2:
	xorl	%eax, %eax
	xorl	%edx, %edx
	popcntl	4(%esp), %eax
	popcntl	8(%esp), %edx
	addl	%edx, %eax
	cmpl	$1, %eax
	sete	%al
	movzbl	%al, %eax
	ret
[…]
	.ident	"GCC: (GNU […]) 12.3.0"
OOPS: GCC generates superfluous XOR instructions!

Compile the source file case31.c with GCC, again targeting one of the unaffected processor families, but now optimising for size, and display the generated assembly code:

gcc -m32 -march=cannonlake -o- -Os -S case31.c
[…]
ispowerof2:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$16, %esp
	pushl	12(%ebp)
	pushl	8(%ebp)
	call	__popcountdi2
	addl	$16, %esp
	leave
	decl	%eax
	sete	%al
	movzbl	%al, %eax
	ret
[…]
	.ident	"GCC: (GNU […]) 12.3.0"
OUCH: optimising for speed GCC generates 9 instructions in 24 bytes; optimising for size it but generates 12 instructions in 29 bytes, including a superfluous (highlighted) ADD instruction, in addition to the uncounted instructions and bytes of the __popcountdi2() function!

Case 32: More Epic Optimiser Failures

Create the text file case32.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

int ispowerof2(unsigned long long argument) {
    return (argument != 0) && (__builtin_clzll(argument) + __builtin_ctzll(argument) == 63);
}
Compile the source file case32.c with GCC, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case32.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.
[…]
ispowerof2:
	pushl	%ebx		#
	subl	$8, %esp	#
	movq	16(%esp), %xmm1	#
	movdqa	%xmm1, %xmm0	#
	movd	%xmm1, %edx	#	mov	ecx, [esp+4]
	psrlq	$32, %xmm0	#
	movd	%xmm0, %eax	#	mov	edx, [esp+8]
	movl	%eax, %ebx	#	mov	eax, ecx
	orl	%edx, %ebx	#	or	eax, edx
	je	.L7		#	jz	.L0
	testl	%eax, %eax	#	bsr	eax, edx
	je	.L3		#	jnz	.L1
	bsrl	%eax, %ebx	#	bsr	eax, ecx
	xorl	$31, %ebx	#	xor	eax, 32
				# L1:
				#	xor	eax, 31
.L4:
	subl	$16, %esp	#	bsf	ecx, ecx
	movq	%xmm1, (%esp)	#	jnz	.L2
	call	__ctzdi2	#	bsf	ecx, edx
	addl	$16, %esp	#	xor	ecx, 32
	xorl	%ecx, %ecx	# L2:
	addl	%eax, %ebx	#	add	eax, ecx
	cmpl	$63, %ebx	#	cmp	eax, 63
	sete	%cl		#	sete	al
	addl	$8, %esp	#
	movl	%ecx, %eax	#
	popl	%ebx		# L0:
	ret			#	ret
.L7:
	xorl	%ecx, %ecx	#
	addl	$8, %esp	#
	movl	%ecx, %eax	#
	popl	%ebx		#
	ret			#
.L3:
	bsrl	%edx, %edx	#
	xorl	$31, %edx	#
	leal	32(%edx), %ebx	#
	jmp	.L4		#
[…]
	.ident	"GCC: (GNU […]) 12.3.0"
OUCH: GCC generates 35 instructions in 96 bytes which clobber register EBX, perform 2 superfluous memory writes plus a superfluous memory read and call the __ctzdi2() function instead of using 2 BSF instructions; straightforward code uses but only 18 instructions in 48 bytes without calling a library function!

OUCH: the highlighted SSE2 instructions (ab)used to load the 64-bit argument into the 32-bit registers EAX and EDX are completely braindead!

Compile the source file case32.c a second time with GCC, engaging its optimiser, targeting the i386 platform, now explicitly enabling the LZCNT and TZCNT instructions, and display the generated assembly code:

gcc -m32 -mabm -mbmi -mlzcnt -o- -O3 -S case32.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised code in Intel syntax as comments.
[…]
ispowerof2:
	pushl	%ebx		#
	subl	$8, %esp	#
	movq	16(%esp), %xmm1	#
	movdqa	%xmm1, %xmm0	#
	movd	%xmm1, %edx	#	mov	ecx, [esp+4]
	psrlq	$32, %xmm0	#
	movd	%xmm0, %eax	#	mov	edx, [esp+8]
	movl	%eax, %ebx	#	mov	eax, ecx
	orl	%edx, %ebx	#	or	eax, edx
	je	.L7		#	jz	.L0
	xorl	%ebx, %ebx	#
	lzcntl	%eax, %ebx	#	lzcnt	eax, edx
	testl	%eax, %eax	#
	je	.L10		#	jnc	.L4
				#	lzcnt	eax, ecx
				#	add	eax, 32
.L4:
	subl	$16, %esp	#	tzcnt	ecx, ecx
	movq	%xmm1, (%esp)	#	jnc	.L1
	call	__ctzdi2	#	tzcnt	edx, edx
	addl	$16, %esp	#	add	ecx, edx
	xorl	%ecx, %ecx	# L1:
	addl	%eax, %ebx	#	add	eax, ecx
	cmpl	$63, %ebx	#	cmp	eax, 63
	sete	%cl		#	sete	al
	addl	$8, %esp	#
	movl	%ecx, %eax	#
	popl	%ebx		# L0:
	ret			#	ret
.L7:
	xorl	%ecx, %ecx	#
	addl	$8, %esp	#
	movl	%ecx, %eax	#
	popl	%ebx		#
	ret			#
.L10:
	lzcntl	%edx, %edx	#
	leal	32(%edx), %ebx	#
	jmp	.L4		#
[…]
	.ident	"GCC: (GNU […]) 12.3.0"
OUCH: GCC generates 34 instructions in 94 bytes which clobber register EBX, perform 2 superfluous memory writes plus a superfluous memory read and call the __ctzdi2() function instead of using 2 TZCNT instructions, despite the -mabm and -mbmi options; straightforward code uses but only 17 instructions in 48 bytes without calling a library function!

OUCH: the highlighted SSE2 instructions (ab)used to load the 64-bit argument into the 32-bit registers EAX and EDX are completely braindead!

Compile the source file case32.c a third time with GCC, again targeting the i386 platform, but now optimising for size, and display the generated assembly code:

gcc -m32 -mabm -mbmi -mlzcnt -o- -Os -S case32.c
[…]
ispowerof2:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%edi
	pushl	%esi
	pushl	%ebx
	subl	$12, %esp
	movl	12(%ebp), %edi
	movl	8(%ebp), %esi
	movl	%edi, %edx
	orl	%esi, %edx
	je	.L7
	lzcntl	%edi, %ebx
	testl	%edi, %edi
	jne	.L4
	lzcntl	%esi, %ebx
	addl	$32, %ebx
.L4:
	pushl	%eax
	pushl	%eax
	pushl	%edi
	pushl	%esi
	call	__ctzdi2
	addl	$16, %esp
	addl	%eax, %ebx
	cmpl	$63, %ebx
	sete	%al
	jmp	.L2
.L7:
	xorl	%eax, %eax
.L2:
	leal	-12(%ebp), %esp
	movzbl	%al, %eax
	popl	%ebx
	popl	%esi
	popl	%edi
	popl	%ebp
	ret
[…]
	.ident	"GCC: (GNU […]) 12.3.0"
OUCH: again GCC generates 34 instructions, now in 71 bytes, with a call of the __ctzdi2() function, instead of the straightforward code shown above – this is as bad as code might get!

Case 33: Yet More Epic Optimiser Failures

Create the text file case33.c with the following content in an arbitrary, preferably empty directory:
// Copyleft © 2014-2024, Stefan Kanthak <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>

#include <arpa/inet.h>

unsigned long long htonll(unsigned long long h)
{
#ifndef ALTERNATE
    return htonl(1) == 1 ? h : htonl((unsigned long) (h >> 32))
                             | htonl((unsigned long) h) * 0x100000000u;
#else
    return ((unsigned char *) &h)[0] * 0x0100000000000000u
         | ((unsigned char *) &h)[1] * 0x0001000000000000u
         | ((unsigned char *) &h)[2] * 0x0000010000000000u
         | ((unsigned char *) &h)[3] * 0x0000000100000000u
         | ((unsigned char *) &h)[4] * 0x0000000001000000u
         | ((unsigned char *) &h)[5] * 0x0000000000010000u
         | ((unsigned char *) &h)[6] * 0x0000000000000100u
         | ((unsigned char *) &h)[7];
#endif
}

unsigned long long ntohll(unsigned long long n)
{
#ifdef ALTERNATE
    return ntohl(1) == 1 ? n : ntohl((unsigned long) (n >> 32))
                             | ntohl((unsigned long) n) * 0x100000000u;
#else
    return ((unsigned char *) &n)[0] * 0x0100000000000000u
         | ((unsigned char *) &n)[1] * 0x0001000000000000u
         | ((unsigned char *) &n)[2] * 0x0000010000000000u
         | ((unsigned char *) &n)[3] * 0x0000000100000000u
         | ((unsigned char *) &n)[4] * 0x0000000001000000u
         | ((unsigned char *) &n)[5] * 0x0000000000010000u
         | ((unsigned char *) &n)[6] * 0x0000000000000100u
         | ((unsigned char *) &n)[7];
#endif
}
Compile the source file case33.c with GCC, engaging its optimiser, targeting the i386 platform, and display the generated assembly code:
gcc -m32 -o- -O3 -S case33.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.
[…]
htonll:
	movl	8(%esp), %edx	#	mov	eax, [esp+8]
	movl	4(%esp), %eax	#	mov	edx, [esp+4]
	bswapl	%edx		#	bswap	eax
	bswapl	%eax		#	bswap	edx
	xchgl	%edx, %eax	#
	ret			#	ret
[…]
ntohll:
	pushl	%edi		#
	xorl	%ecx, %ecx	#
	pushl	%esi		#
	pushl	%ebx		#
	movl	16(%esp), %edx	#	mov	edx, [esp+4]
	movl	20(%esp), %eax	#	mov	eax, [esp+8]
	movl	%edx, %edi	#	bswap	eax
	movzbl	%dh, %esi	#	bswap	edx
	sall	$24, %edi	#
	movl	%edi, %ebx	#
	movl	%esi, %edi	#
	xorl	%esi, %esi	#
	orl	%ecx, %esi	#
	movl	%eax, %ecx	#
	sall	$16, %edi	#
	shrl	$24, %ecx	#
	orl	%ebx, %edi	#
	xorl	%ebx, %ebx	#
	orl	%ecx, %esi	#
	movl	%edx, %ecx	#
	orl	%ebx, %edi	#
	shrl	$24, %edx	#
	shrl	$16, %ecx	#
	movzbl	%cl, %ecx	#
	movl	%ecx, %ebx	#
	xorl	%ecx, %ecx	#
	orl	%ecx, %esi	#
	xorl	%ecx, %ecx	#
	sall	$8, %ebx	#
	orl	%esi, %ecx	#
	movl	%eax, %esi	#
	orl	%ebx, %edi	#
	movl	%edx, %ebx	#
	sall	$24, %esi	#
	orl	%edi, %ebx	#
	orl	%esi, %ecx	#
	movl	%esi, %edi	#
	movzbl	%ah, %esi	#
	shrl	$8, %eax	#
	sall	$16, %esi	#
	sarl	$31, %edi	#
	andl	$65280, %eax	#
	orl	%edi, %ebx	#
	movl	%esi, %edi	#
	orl	%esi, %ecx	#
	cltd			#
	sarl	$31, %edi	#
	orl	%ecx, %eax	#
	orl	%edi, %ebx	#
	orl	%ebx, %edx	#
	popl	%ebx		#
	popl	%esi		#
	popl	%edi		#
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 13.2.0"
OUCH: while the code generated for the first function is almost perfect, the code generated for the second function is disastrous!
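The two BSWAP instructions in the Intel column correspond to byte-swapping the two 32-bit halves and exchanging them. A C counterpart is sketched below (the helper name is hypothetical, not part of case33.c); GCC provides the __builtin_bswap32() builtin, which typically maps to a single BSWAP:
// Illustrative sketch only, not part of case33.c: byte-swap the two
// 32-bit halves with __builtin_bswap32() and exchange them.
#include <stdint.h>

static uint64_t bswap64_sketch(uint64_t x)
{
    uint64_t lo = __builtin_bswap32((uint32_t) x);
    uint64_t hi = __builtin_bswap32((uint32_t) (x >> 32));

    return (lo << 32) | hi;
}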

Compile the source file case33.c a second time with GCC, now targeting the AMD64 platform, and display the generated assembly code:

gcc -m64 -o- -O3 -S case33.c
Note: the left column shows the generated code in AT&T syntax, while the right column shows properly optimised shorter code in Intel syntax as comments.
[…]
htonll:
	movq	%rdi, %rdx	#
	movl	%edi, %eax	#	mov	rax, rdi
	shrq	$32, %rdx	#
	bswapl	%eax		#	bswap	rax
	salq	$32, %rax	#
	bswapl	%edx		#
	movl	%edx, %edx	#
	orq	%rdx, %rax	#
	ret			#	ret
ntohll:
	movq	%rdi, %rdx	#
	movq	%rdi, %rax	#	mov	rax, rdi
	movzbl	%dh, %ecx	#	bswap	rax
	salq	$56, %rax	#
	salq	$48, %rcx	#
	shrq	$40, %rdx	#
	orq	%rcx, %rax	#
	movq	%rdi, %rcx	#
	andl	$65280, %edx	#
	shrq	$56, %rcx	#
	orq	%rcx, %rax	#
	movq	%rdi, %rcx	#
	shrq	$16, %rcx	#
	movzbl	%cl, %ecx	#
	salq	$40, %rcx	#
	orq	%rcx, %rax	#
	movq	%rdi, %rcx	#
	shrq	$24, %rcx	#
	movzbl	%cl, %ecx	#
	salq	$32, %rcx	#
	orq	%rcx, %rax	#
	movq	%rdi, %rcx	#
	shrq	$32, %rcx	#
	sall	$24, %ecx	#
	movslq	%ecx, %rcx	#
	orq	%rcx, %rax	#
	movq	%rdi, %rcx	#
	shrq	$24, %rcx	#
	andl	$16711680, %ecx	#
	orq	%rcx, %rax	#
	orq	%rdx, %rax	#
	ret			#	ret
[…]
	.ident	"GCC: (GNU […]) 13.2.0"
OUCH: now the code generated for both functions is disastrous!
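GCC also provides a 64-bit byte-swap builtin, __builtin_bswap64(). Written with it, the conversion becomes a single expression that typically compiles to one BSWAP instruction on AMD64 – a sketch under the assumption of a little-endian target such as i386 or AMD64, not the author's preferred form:
// Sketch only, assuming a little-endian target (i386/AMD64): both htonll()
// and ntohll() reduce to a 64-bit byte swap via __builtin_bswap64().
#include <stdint.h>

static uint64_t htonll_sketch(uint64_t h)
{
    return __builtin_bswap64(h);
}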

Contact and Feedback

If you miss anything here, have additions, comments, corrections, criticism or questions, want to give feedback, hints or tips, report broken links, bugs, deficiencies, errors, inaccuracies, misrepresentations, omissions, shortcomings, vulnerabilities or weaknesses, …: don’t hesitate to contact me and feel free to ask, comment, criticise, flame, notify or report!

Use the X.509 certificate to send S/MIME encrypted mail.

Note: email in weird format and without a proper sender name is likely to be discarded!

I dislike HTML (and even weirder formats too) in email, I prefer to receive plain text.
I also expect to see your full (real) name as sender, not your nickname.
I abhor top posts and expect inline quotes in replies.

Copyright © 1995–2024 • Stefan Kanthak • <‍stefan‍.‍kanthak‍@‍nexgo‍.‍de‍>