Shift & Subtract Division
__udivmodti4(), __udivti3() and __umodti3() plus the fast 128÷128-bit signed
integer division routines __divmodti4(), __divti3() and __modti3() for
AMD64 processors, the fast 64÷64-bit unsigned integer division routines
__udivmoddi4(), __udivdi3() and __umoddi3() plus the fast 64÷64-bit signed
integer division routines __divdi3() and __moddi3() for i386 processors,
as well as the fast compiler helper routines _alldiv(), _alldvrm(),
_allmul(), _allrem(), _allshl(), _allshr(), _aulldiv(), _aulldvrm(),
_aullrem() and _aullshr() for the Microsoft® Visual C compiler to perform
64-bit integer division and multiplication on i386 processors.
Note: the fast 128÷64-bit unsigned integer division routine _udiv128() for
i386 processors, implemented in ANSI C and Assembler, is presented in my
related article Donald Knuth’s Algorithm D …, and provided with my
NOMSVCRT.LIB runtime library; in the latter it is complemented by the
so-called widening 64×64-bit signed and unsigned integer multiplication
routines _mul128() and _umul128() which yield a 128-bit product.
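The idea behind a widening multiplication can be sketched in portable C by combining four 32×32-bit partial products. The name umul128_sketch() is hypothetical, and the code is only an illustration of the technique, not the routine shipped in NOMSVCRT.LIB:

```c
#include <stdint.h>

/* Hypothetical helper: returns the low 64 bits of a * b and stores the
   high 64 bits of the 128-bit product in *high. */
uint64_t umul128_sketch(uint64_t a, uint64_t b, uint64_t *high)
{
    uint64_t al = (uint32_t)a, ah = a >> 32;
    uint64_t bl = (uint32_t)b, bh = b >> 32;

    uint64_t ll = al * bl;      /* contributes to bits   0..63  */
    uint64_t lh = al * bh;      /* contributes to bits  32..95  */
    uint64_t hl = ah * bl;      /* contributes to bits  32..95  */
    uint64_t hh = ah * bh;      /* contributes to bits  64..127 */

    /* sum of the three terms that land in bits 32..63; it stays below
       2**34, so the carry into the upper half is simply mid >> 32 */
    uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

    *high = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
    return (mid << 32) | (uint32_t)ll;
}
```

On AMD64 a single MUL instruction delivers the same 128-bit result in RDX:RAX, which is why the assembler routines in this article need no such decomposition.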
Note: "double word" means "twice the register width" here!
DIV instruction, which performs a so-called narrowing 128÷64-bit division, producing a 64-bit quotient and a 64-bit remainder from a 128-bit dividend and a 64-bit divisor.
divide by 0 exception is raised.
AND of the divisor and the divisor−1 is 0, the quotient is equal to the dividend shifted right by the number of trailing '0' bits of the divisor, while the remainder is the result of the logical AND of the divisor−1 and the dividend.
DIV instruction yields their lower halves.
DIV instructions using the so-called "long" alias "schoolbook" division (and 64-bit numbers as digits) to avoid an overflow of the quotient.
normalised, i.e. shifted left until its most significant bit is set, which is equivalent to a division by 2^(64 − number of leading '0' bits), and its lower half discarded.
normalised divisor′ is subtracted from the upper half of the dividend where necessary to prevent an overflow, then used to produce the lower half of an intermediate approximate quotient′ with a single DIV instruction; if the normalised divisor′ was subtracted from the upper half of the dividend before, the upper half of the intermediate approximate quotient′ is 1, else 0.
normalised divisor′, giving the final approximate quotient″, which might be 1 too high due to the discarded lower half of the normalised divisor′ (the lower half of the final approximate quotient″ is 0 and discarded).
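The steps listed above can be tried out in portable C when scaled down by one word size, i.e. as a 64÷64-bit division built from a narrowing 64÷32-bit division. The following is a minimal sketch under that assumption: the names div_narrow() and udivmod64_sketch() are hypothetical, the divisor must not be 0, and the final correction is arranged so that no intermediate product can overflow, which differs slightly from the routines presented in this article.

```c
#include <stdint.h>

/* Stand-in for a narrowing DIV instruction: 64÷32-bit division with a
   32-bit quotient; the caller must guarantee high < divisor. */
static uint32_t div_narrow(uint32_t high, uint32_t low, uint32_t divisor, uint32_t *rem)
{
    uint64_t n = ((uint64_t)high << 32) | low;
    *rem = (uint32_t)(n % divisor);
    return (uint32_t)(n / divisor);
}

/* Hypothetical name; precondition: divisor != 0. */
uint64_t udivmod64_sketch(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
    uint32_t nh = (uint32_t)(dividend >> 32), nl = (uint32_t)dividend;
    uint32_t dh = (uint32_t)(divisor >> 32), dl = (uint32_t)divisor;
    uint32_t qh, ql, r;

    if (dividend < divisor) {               /* trivial: quotient = 0 */
        *remainder = dividend;
        return 0;
    }
    if (dh == 0) {                          /* "long" alias "schoolbook" division */
        qh = div_narrow(0, nh, dl, &r);     /* r < dl, so the next step cannot overflow */
        ql = div_narrow(r, nl, dl, &r);
        *remainder = r;
        return ((uint64_t)qh << 32) | ql;
    }
    /* normalise: count the leading '0' bits of the divisor's upper half */
    unsigned s = 0;
    while (((dh << s) & 0x80000000u) == 0)
        s++;
    uint32_t dn = (uint32_t)((divisor << s) >> 32);   /* upper half of divisor' */

    qh = 0;
    if (nh >= dn) {                         /* prevent quotient overflow */
        nh -= dn;
        qh = 1;
    }
    ql = div_narrow(nh, nl, dn, &r);        /* qh:ql = quotient' */

    /* quotient" is exact or 1 too high */
    uint64_t q = (((uint64_t)qh << 32) | ql) >> (32 - s);

    /* dividend >= divisor implies q >= 1; (q - 1) * divisor <= dividend */
    uint64_t rest = dividend - (q - 1) * divisor;
    if (rest >= divisor)
        rest -= divisor;                    /* quotient" was exact */
    else
        q--;                                /* quotient" was 1 too high */
    *remainder = rest;
    return q;
}
```

Computing the remainder from quotient″ − 1 instead of quotient″ keeps every intermediate value within 64 bits, at the cost of one comparison that is always executed.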
Shift & Subtract Division
The "shift & subtract" alias "binary long" division algorithm is almost trivial: it’s the "schoolbook" algorithm using bits as digits.
divide by 0 exception is raised.
AND of the divisor and the divisor−1 is 0, the quotient is equal to the dividend shifted right by the number of trailing '0' bits of the divisor, while the remainder is the result of the logical AND of the divisor−1 and the dividend.
"long" alias "schoolbook" division with the "binary long" alias "shift & subtract" division algorithm:
"shift & subtract" alias "binary long" division algorithm.
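The power-of-two special case named in the list above needs neither a DIV instruction nor a loop over all bits. A minimal sketch in portable C follows; the function name udivmod64_pow2() and its return convention are hypothetical and chosen only for illustration:

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 and fills *quotient and *remainder when
   divisor is a power of 2, else returns 0 and leaves both untouched. */
int udivmod64_pow2(uint64_t dividend, uint64_t divisor,
                   uint64_t *quotient, uint64_t *remainder)
{
    /* divisor & (divisor - 1) is also 0 for divisor == 0,
       so that case has to be excluded separately */
    if (divisor == 0 || (divisor & (divisor - 1)) != 0)
        return 0;

    /* count the trailing '0' bits of the divisor */
    unsigned shift = 0;
    while (((divisor >> shift) & 1) == 0)
        shift++;

    *quotient = dividend >> shift;
    *remainder = dividend & (divisor - 1);
    return 1;
}
```

For example, 1003 divided by 8 yields the quotient 1003 >> 3 = 125 and the remainder 1003 & 7 = 3.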
__udivmodti4(), __udivti3(), __umodti3(), __divmodti4(), __divti3() and
__modti3() functions:
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
uint128_t __udivmodti4(uint128_t dividend, uint128_t divisor, uint128_t *remainder);
uint128_t __udivti3(uint128_t dividend, uint128_t divisor);
uint128_t __umodti3(uint128_t dividend, uint128_t divisor);
int128_t __divmodti4(int128_t dividend, int128_t divisor, int128_t *remainder);
int128_t __divti3(int128_t dividend, int128_t divisor);
int128_t __modti3(int128_t dividend, int128_t divisor);
Prototype for the __udivmoddi4() function, and sample implementation of the
64÷64-bit "shift & subtract" division for the Microsoft Visual C compiler:
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef _MSC_VER
uint64_t __udivmoddi4(uint64_t dividend, uint64_t divisor, uint64_t *remainder);
#else
#pragma intrinsic(_BitScanReverse64)
typedef unsigned __int32 uint32_t;
typedef unsigned __int64 uint64_t;
uint64_t __udivmoddi4(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
    uint64_t quotient;
    uint32_t index1, index2;
    if (_BitScanReverse64(&index2, divisor))
        if (_BitScanReverse64(&index1, dividend))
#if 0
            if (index1 >= index2)
#else
            if (dividend >= divisor)
#endif
            {
                // dividend >= divisor > 0,
                // 64 > index1 >= index2 >= 0
                // (number of leading '0' bits = 63 - index)
                divisor <<= index1 - index2;
                quotient = 0;
                do
                {
                    quotient <<= 1;
                    if (dividend >= divisor)
                    {
                        dividend -= divisor;
                        quotient |= 1;
                    }
                    divisor >>= 1;
                } while (index1 >= ++index2);
                if (remainder != 0)
                    *remainder = dividend;
                return quotient;
            }
            else // divisor > dividend > 0:
                 // quotient = 0, remainder = dividend
            {
                if (remainder != 0)
                    *remainder = dividend;
                return 0;
            }
        else // divisor > dividend == 0:
             // quotient = 0, remainder = 0
        {
            if (remainder != 0)
                *remainder = 0;
            return 0;
        }
    else // divisor == 0
    {
        if (remainder != 0)
            *remainder = dividend % divisor;
        return dividend / divisor;
    }
}
#endif // _MSC_VER
The suffix di4 specifies the number of arguments plus return value and their
size: double integer denotes an 8-byte QWORD alias 64-bit uint64_t.
Prototype for the __udivmodti4() function, and sample implementation of the
128÷128-bit extended precision division as well as the "shift & subtract"
division for the Microsoft Visual C compiler:
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef _MSC_VER
uint128_t __udivmodti4(uint128_t dividend, uint128_t divisor, uint128_t *remainder);
#else
typedef unsigned __int32 uint32_t;
typedef unsigned __int64 uint64_t;
#if 0
typedef unsigned __int128 uint128_t;
#else
typedef struct
{
uint64_t low, high;
} uint128_t;
#endif
#if _MSC_VER >= 1920 // MSC 19.20 alias 2019
#pragma intrinsic(__shiftleft128, __shiftright128, _udiv128, _umul128, _BitScanReverse64)
uint128_t __udivmodti4(uint128_t dividend, uint128_t divisor, uint128_t *remainder)
{
    uint128_t quotient;
#ifndef HYBRID
    uint64_t high, low, tmp;
    uint32_t index;
    if (_BitScanReverse64(&index, divisor.high))
    {
        // high = upper half of normalised divisor
        high = __shiftleft128(divisor.low, divisor.high, 63 - index);
        if (high > dividend.high)
        {
            // tmp = lower half of quotient', low = remainder' (unused)
            tmp = _udiv128(dividend.high, dividend.low, high, &low);
            low = __shiftleft128(tmp, 0, 63 - index);
        }
        else // prevent overflow
        {
            tmp = _udiv128(dividend.high - high, dividend.low, high, &low);
            low = __shiftleft128(tmp, 1, 63 - index);
        }
        // low = quotient", exact or 1 too high
        quotient.high = 0;
        quotient.low = low;
        tmp = low * divisor.high;
        low = _umul128(low, divisor.low, &high);
        high += tmp;
        if ((high < tmp) // quotient * divisor >= 2**128 > dividend
         || (high > dividend.high) // quotient * divisor > dividend
         || ((high == dividend.high) && (low > dividend.low)))
        {
            quotient.low -= 1;
            low = _umul128(quotient.low, divisor.low, &high);
            high += quotient.low * divisor.high;
        }
        if (remainder != 0)
        {
            dividend.high -= high + (dividend.low < low);
            dividend.low -= low;
            *remainder = dividend;
        }
    }
#else // HYBRID
    uint64_t tmp;
    uint32_t index1, index2;
    if (_BitScanReverse64(&index2, divisor.high))
        if (_BitScanReverse64(&index1, dividend.high))
            if (index1 >= index2)
            {
                // dividend >= divisor >= 2**64,
                // 64 > index1 >= index2 >= 0
                // (number of leading '0' bits = 63 - index)
                divisor.high = __shiftleft128(divisor.low, divisor.high, index1 - index2);
                divisor.low <<= index1 - index2;
                quotient.high = quotient.low = 0;
                do
                {
                    quotient.low <<= 1;
                    if ((dividend.high > divisor.high)
                     || ((dividend.high == divisor.high) && (dividend.low >= divisor.low)))
                    {
                        dividend.high -= divisor.high + (dividend.low < divisor.low);
                        dividend.low -= divisor.low;
                        quotient.low |= 1;
                    }
                    divisor.low = __shiftright128(divisor.low, divisor.high, 1);
                    divisor.high >>= 1;
                } while (index1 >= ++index2);
                if (remainder != 0)
                    *remainder = dividend;
            }
            else // divisor > dividend >= 2**64:
                 // quotient = 0, remainder = dividend
            {
                quotient.high = quotient.low = 0;
                if (remainder != 0)
                    *remainder = dividend;
            }
        else // divisor >= 2**64 > dividend:
             // quotient = 0, remainder = dividend
        {
            quotient.high = quotient.low = 0;
            if (remainder != 0)
#if 0
            {
                remainder->high = 0;
                remainder->low = dividend.low;
            }
#else
                *remainder = dividend;
#endif
        }
#endif // HYBRID
    else // divisor < 2**64
    {
        if (dividend.high < divisor.low)
        {
            quotient.high = 0;
            quotient.low = _udiv128(dividend.high, dividend.low, divisor.low, &tmp);
        }
        else // "long" alias "schoolbook" division
        {
            quotient.high = _udiv128(0, dividend.high, divisor.low, &tmp);
            quotient.low = _udiv128(tmp, dividend.low, divisor.low, &tmp);
        }
        if (remainder != 0)
        {
            remainder->high = 0;
            remainder->low = tmp;
        }
    }
    return quotient;
}
#endif
#endif // _MSC_VER
The suffix ti4 specifies the number of arguments plus return value and their
size: tetra integer denotes a 16-byte OWORD alias 128-bit uint128_t.
Note: the Microsoft Visual C compiler does not provide a 128-bit integer data
type; the keyword __int128 is reserved, but unsupported, its use yields error
C4235.
Note: with the preprocessor macro HYBRID defined, the hybrid variant of the
division algorithm is used.
__udivmodti4(), __udivti3() and __umodti3() functions for AMD64 processors,
supporting the Unix® System V calling convention, using the extended precision
division algorithm:
# Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
# * The software is provided "as is" without any warranty, neither express
# nor implied.
# * In no event will the author be held liable for any damage(s) arising
# from the use of the software.
# * Redistribution of the software is allowed only in unmodified form.
# * Permission is granted to use the software solely for personal private
# and non-commercial purposes.
# * An individuals use of the software in his or her capacity or function
# as an agent, (independent) contractor, employee, member or officer of
# a business, corporation or organization (commercial or non-commercial)
# does not qualify as personal private and non-commercial purpose.
# * Without written approval from the author the software must not be used
# for a business, for commercial, corporate, governmental, military or
# organizational purposes of any kind, or in a commercial, corporate,
# governmental, military or organizational environment of any kind.
# Unix System V calling convention for AMD64 platform:
# - first 6 integer or pointer arguments (from left to right) are passed
# in registers RDI/R7, RSI/R6, RDX/R2, RCX/R1, R8 and R9
# (R10 is used as static chain pointer in case of nested functions);
# - surplus arguments are pushed on stack in reverse order (from right to
# left), 8-byte aligned;
# - 128-bit integer arguments are passed as pair of 64-bit integer arguments,
# low part before/below high part;
# - 128-bit integer result is returned in registers RAX/R0 (low part) and
# RDX/R2 (high part);
# - 64-bit integer or pointer result is returned in register RAX/R0;
# - 32-bit integer result is returned in register EAX;
# - registers RBX/R3, RSP/R4, RBP/R5, R12, R13, R14 and R15 must be
# preserved;
# - registers RAX/R0, RCX/R1, RDX/R2, RSI/R6, RDI/R7, R8, R9, R10 (in
# case of normal functions) and R11 are volatile and can be clobbered;
# - stack is 16-byte aligned: callee must decrement RSP by 8+n*16 bytes
# before calling other functions (CALL instruction pushes 8 bytes);
# - a "red zone" of 128 bytes below the stack pointer can be clobbered.
# NOTE: raises "division exception" when divisor is 0!
.file "udivmodti4.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = dividend
# rcx:rdx = divisor
__umodti3:
sub rsp, 24
mov r8, rsp # r8 = address of remainder
call __udivmodti4
pop rax
pop rdx # rdx:rax = remainder
pop rcx
ret
# rsi:rdi = dividend
# rcx:rdx = divisor
__udivti3:
xor r8, r8
# rsi:rdi = dividend
# rcx:rdx = divisor
# r8 = oword ptr remainder
__udivmodti4:
cmp rdi, rdx
mov rax, rsi
sbb rax, rcx
jb .trivial # dividend < divisor?
mov r11, rcx # r11 = high qword of divisor
mov r10, rdx # r10 = low qword of divisor
bsr rcx, rcx # rcx = index of most significant '1' bit
# in high qword of divisor
jnz .extended # high qword of divisor <> 0?
# remainder < divisor < 2**64
cmp rsi, rdx
jae .long # high qword of dividend >= (low qword of) divisor?
# dividend < divisor * 2**64: quotient < 2**64
# perform normal division
.normal:
mov rdx, rsi
mov rax, rdi # rdx:rax = dividend
div r10 # rax = (low qword of) quotient,
# rdx = (low qword of) remainder
test r8, r8
jz 0f # address of remainder = 0?
mov [r8], rdx
mov [r8+8], r11 # high qword of remainder = 0
0:
mov rdx, r11 # rdx:rax = quotient
ret
# dividend >= divisor * 2**64: quotient >= 2**64
# perform "long" alias "schoolbook" division
.long:
mov rdx, r11 # rdx = 0
mov rax, rsi # rdx:rax = high qword of dividend
div r10 # rax = high qword of quotient,
# rdx = high qword of remainder'
mov rcx, rax # rcx = high qword of quotient
mov rax, rdi # rax = low qword of dividend
div r10 # rax = low qword of quotient,
# rdx = (low qword of) remainder
test r8, r8
jz 1f # address of remainder = 0?
mov [r8], rdx
mov [r8+8], r11 # high qword of remainder = 0
1:
mov rdx, rcx # rdx:rax = quotient
ret
# dividend < divisor: quotient = 0, remainder = dividend
.trivial:
test r8, r8
jz 2f # address of remainder = 0?
mov [r8], rdi
mov [r8+8], rsi # remainder = dividend
2:
xor eax, eax
xor edx, edx # rdx:rax = quotient = 0
ret
# dividend >= divisor >= 2**64: quotient < 2**64
.extended:
xor ecx, 63 # ecx = number of leading '0' bits
# in (high qword of) divisor
jz .special # divisor >= 2**127?
# perform "extended & adjusted" division
mov r9, r11 # r9 = high qword of divisor
shld r9, r10, cl # r9 = divisor / 2**(index + 1)
# = divisor'
# shl r10, cl
mov rax, rdi
mov rdx, rsi # rdx:rax = dividend
push rbx
.ifnotdef JCCLESS
xor ebx, ebx # rbx = high qword of quotient' = 0
cmp rdx, r9
jb 3f # high qword of dividend < divisor'?
# high qword of dividend >= divisor':
# subtract divisor' from high qword of dividend to prevent possible
# quotient overflow and set most significant bit of quotient"
sub rdx, r9 # rdx = high qword of dividend - divisor'
# = high qword of dividend'
inc ebx # rbx = high qword of quotient' = 1
3:
.elseif 0
sub rdx, r9 # rdx = high qword of dividend - divisor'
sbb rbx, rbx # rbx = (high qword of dividend < divisor') ? -1 : 0
and rbx, r9 # rbx = (high qword of dividend < divisor') ? divisor' : 0
add rdx, rbx # rdx = high qword of dividend
# - (high qword of dividend < divisor') ? 0 : divisor'
# = high qword of dividend'
neg rbx # CF = (high qword of dividend < divisor')
sbb ebx, ebx # ebx = (high qword of dividend < divisor') ? -1 : 0
inc ebx # rbx = (high qword of dividend < divisor') ? 0 : 1
# = high qword of quotient'
.elseif 0
sub rdx, r9 # rdx = high qword of dividend - divisor'
cmovb rdx, rsi # = high qword of dividend'
sbb ebx, ebx # ebx = (high qword of dividend < divisor') ? -1 : 0
inc ebx # rbx = (high qword of dividend < divisor') ? 0 : 1
# = high qword of quotient'
.else # JCCLESS
xor ebx, ebx # rbx = high qword of quotient' = 0
sub rdx, r9 # rdx = high qword of dividend - divisor'
cmovb rdx, rsi # = high qword of dividend'
sbb ebx, -1 # rbx = (high qword of dividend < divisor') ? 0 : 1
# = high qword of quotient'
.endif # JCCLESS
# high qword of dividend' < divisor'
div r9 # rax = dividend' / divisor'
# = low qword of quotient',
# rdx = remainder'
shld rbx, rax, cl # rbx = quotient' / 2**(index + 1)
# = dividend / divisor'
# = quotient"
# shl rax, cl
mov rax, r10 # rax = low qword of divisor
mov r9, r11 # r9 = high qword of divisor
imul r9, rbx # r9 = high qword of divisor * quotient"
mul rbx # rdx:rax = low qword of divisor * quotient"
.ifnotdef JCCLESS
add rdx, r9 # rdx:rax = divisor * quotient"
jnc 4f # divisor * quotient" < 2**128?
# (with carry, it is off by divisor,
# and quotient" is off by 1)
.if 0
sbb rbx, 0 # rbx = quotient" - 1
.else
dec rbx # rbx = quotient" - 1
.endif
sub rax, r10
sbb rdx, r11 # rdx:rax = divisor * (quotient" - 1)
4:
sub rdi, rax
sbb rsi, rdx # rsi:rdi = dividend - divisor * quotient"
# = remainder"
.else # JCCLESS
sub rdi, rax
sbb rsi, rdx # rsi:rdi = dividend
# - low qword of divisor * quotient"
sub rsi, r9 # rsi:rdi = dividend - divisor * quotient"
# = remainder"
.endif # JCCLESS
jnb 5f # remainder" >= 0?
# (with borrow, it is off by divisor,
# and quotient" is off by 1)
.if 0
sbb rbx, 0 # rbx = quotient" - 1
# = quotient
.else
dec rbx # rbx = quotient" - 1
# = quotient
.endif
add rdi, r10
adc rsi, r11 # rsi:rdi = remainder" + divisor
# = remainder
5:
test r8, r8
jz 6f # address of remainder = 0?
mov [r8], rdi
mov [r8+8], rsi # remainder = rsi:rdi
6:
mov rax, rbx # rax = (low qword of) quotient
xor edx, edx # rdx:rax = quotient
pop rbx
ret
# dividend >= divisor >= 2**127:
# quotient = 1, remainder = dividend - divisor
.special:
test r8, r8
jz 7f # address of remainder = 0?
sub rdi, r10
sbb rsi, r11 # rsi:rdi = dividend - divisor
# = remainder
mov [r8], rdi
mov [r8+8], rsi # remainder = rsi:rdi
7:
xor eax, eax
xor edx, edx
inc eax # rdx:rax = quotient = 1
ret
.size __udivmodti4, .-__udivmodti4
.type __udivmodti4, @function
.global __udivmodti4
.size __udivti3, .-__udivti3
.type __udivti3, @function
.global __udivti3
.size __umodti3, .-__umodti3
.type __umodti3, @function
.global __umodti3
.end
__udivmodti4() function for AMD64 processors, supporting the Microsoft calling
convention, using the extended precision division algorithm:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; Microsoft calling convention for AMD64 platform:
; - first 4 integer or pointer arguments (from left to right) are passed
; in registers RCX/R1, RDX/R2, R8 and R9;
; - 16-byte arguments are passed by reference;
; - surplus arguments are pushed on stack in reverse order (from right
; to left), 8-byte aligned;
; - caller allocates memory for 16-byte return value and passes pointer
; to it as (hidden) first argument, thus shifting all other arguments;
; - caller always allocates "home space" for 4 arguments on stack,
; even when less than 4 arguments are passed, but does not need to push
; first 4 arguments;
; - callee can spill first 4 arguments from registers to "home space";
; - callee can clobber "home space";
; - stack is 16-byte aligned: callee must decrement RSP by 8+n*16
; bytes when it calls other functions (CALL instruction pushes 8 bytes);
; - 64-bit integer or pointer result is returned in register RAX/R0;
; - 32-bit integer result is returned in register EAX;
; - registers RAX/R0, RCX/R1, RDX/R2, R8, R9, R10 and R11 are volatile
; and can be clobbered;
; - registers RBX/R3, RSP/R4, RBP/R5, RSI/R6, RDI/R7, R12, R13, R14 and
; R15 must be preserved.
; NOTE: raises "division exception" when divisor is 0!
.code
; rcx = oword ptr quotient
; rdx = oword ptr dividend
; r8 = oword ptr divisor
; r9 = oword ptr remainder
__udivmodti4 proc public
mov rax, [rdx] ; rax = low qword of dividend
mov rdx, [rdx+8] ; rdx = high qword of dividend
mov r10, [r8] ; r10 = low qword of divisor
mov r11, [r8+8] ; r11 = high qword of divisor
mov r8, rcx ; r8 = address of quotient
cmp rax, r10
mov rcx, rdx
sbb rcx, r11
jb trivial ; dividend < divisor?
bsr rcx, r11 ; rcx = index of most significant '1' bit
; in high qword of divisor
jnz extended ; high qword of divisor <> 0?
; divisor < 2**64
cmp rdx, r10
jae long ; high qword of dividend >= (low qword of) divisor?
; dividend < divisor * 2**64: quotient < 2**64
; perform normal division
normal:
div r10 ; rax = (low qword of) quotient,
; rdx = (low qword of) remainder
mov [r8], rax
mov [r8+8], r11 ; high qword of quotient = 0
test r9, r9
jz @f ; address of remainder = 0?
mov [r9], rdx
mov [r9+8], r11 ; high qword of remainder = 0
@@:
mov rax, r8 ; rax = address of quotient
ret
; dividend >= divisor * 2**64: quotient >= 2**64
; perform "long" alias "schoolbook" division
long:
mov rcx, rax ; rcx = low qword of dividend
mov rax, rdx ; rax = high qword of dividend
mov rdx, r11 ; rdx:rax = high qword of dividend
div r10 ; rax = high qword of quotient,
; rdx = high qword of remainder'
xchg rcx, rax ; rcx = high qword of quotient,
; rax = low qword of dividend
div r10 ; rax = low qword of quotient,
; rdx = (low qword of) remainder
mov [r8], rax
mov [r8+8], rcx ; quotient = rcx:rax
test r9, r9
jz @f ; address of remainder = 0?
mov [r9], rdx
mov [r9+8], r11 ; high qword of remainder = 0
@@:
mov rax, r8 ; rax = address of quotient
ret
; dividend < divisor: quotient = 0, remainder = dividend
trivial:
xor ecx, ecx
mov [r8], rcx
mov [r8+8], rcx ; quotient = 0
test r9, r9
jz @f ; address of remainder = 0?
mov [r9], rax
mov [r9+8], rdx ; remainder = dividend
@@:
mov rax, r8 ; rax = address of quotient
ret
; divisor >= 2**64: quotient < 2**64
extended:
xor ecx, 63 ; ecx = number of leading '0' bits
; in (high qword of) divisor
jz special ; divisor >= 2**127?
; perform "extended & adjusted" division
mov [rsp+8], rbx
mov [rsp+16], r12
mov [rsp+24], r13
mov [rsp+32], r14
mov r12, r11 ; r12 = high qword of divisor
shld r12, r10, cl ; r12 = divisor / 2**(index + 1)
; = divisor'
;; shl r10, cl
mov r13, rax
mov r14, rdx ; r14:r13 = dividend
ifndef JccLess
xor ebx, ebx ; rbx = high qword of quotient' = 0
cmp rdx, r12
jb @f ; high qword of dividend < divisor'?
; high qword of dividend >= divisor':
; subtract divisor' from high qword of dividend to prevent possible
; quotient overflow and set most significant bit of quotient"
sub rdx, r12 ; rdx = high qword of dividend - divisor'
; = high qword of dividend'
inc ebx ; rbx = high qword of quotient' = 1
@@:
elseif 0
sub rdx, r12 ; rdx = high qword of dividend - divisor'
sbb rbx, rbx ; rbx = (high qword of dividend < divisor') ? -1 : 0
and rbx, r12 ; rbx = (high qword of dividend < divisor') ? divisor' : 0
add rdx, rbx ; rdx = high qword of dividend
; - (high qword of dividend < divisor') ? 0 : divisor'
; = high qword of dividend'
neg rbx ; CF = (high qword of dividend < divisor')
sbb ebx, ebx ; ebx = (high qword of dividend < divisor') ? -1 : 0
inc ebx ; rbx = (high qword of dividend < divisor') ? 0 : 1
; = high qword of quotient'
elseif 0
sub rdx, r12 ; rdx = high qword of dividend - divisor'
cmovb rdx, r14 ; = high qword of dividend'
sbb ebx, ebx ; ebx = (high qword of dividend < divisor') ? -1 : 0
inc ebx ; rbx = (high qword of dividend < divisor') ? 0 : 1
; = high qword of quotient'
else ; JccLess
xor ebx, ebx ; rbx = high qword of quotient' = 0
sub rdx, r12 ; rdx = high qword of dividend - divisor'
cmovb rdx, r14 ; = high qword of dividend'
sbb ebx, -1 ; rbx = (high qword of dividend < divisor') ? 0 : 1
; = high qword of quotient'
endif ; JccLess
; high qword of dividend' < divisor'
div r12 ; rax = dividend' / divisor'
; = low qword of quotient',
; rdx = remainder'
shld rbx, rax, cl ; rbx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl rax, cl
ifndef JccLess
mov rax, r10 ; rax = low qword of divisor
mov r12, r11 ; r12 = high qword of divisor
imul r12, rbx ; r12 = high qword of divisor * quotient"
mul rbx ; rdx:rax = low qword of divisor * quotient"
add rdx, r12 ; rdx:rax = divisor * quotient"
jnc @f ; divisor * quotient" < 2**128?
; (with carry, it is off by divisor,
; and quotient" is off by 1)
if 0
sbb rbx, 0 ; rbx = quotient" - 1
else
dec rbx ; rbx = quotient" - 1
endif
sub rax, r10
sbb rdx, r11 ; rdx:rax = divisor * (quotient" - 1)
@@:
sub r13, rax
sbb r14, rdx ; r14:r13 = dividend - divisor * quotient"
; = remainder"
else ; JccLess
mov rax, r10 ; rax = low qword of divisor
mov r12, r11 ; r12 = high qword of divisor
imul r12, rbx ; r12 = high qword of divisor * quotient"
mul rbx ; rdx:rax = low qword of divisor * quotient"
sub r13, rax
sbb r14, rdx ; r14:r13 = dividend
; - low qword of divisor * quotient"
sub r14, r12 ; r14:r13 = dividend - divisor * quotient"
; = remainder"
endif ; JccLess
jnb @f ; remainder" >= 0?
; (with borrow, it is off by divisor,
; and quotient" is off by 1)
if 0
sbb rbx, 0 ; rbx = quotient" - 1
; = quotient
else
dec rbx ; rbx = quotient" - 1
; = quotient
endif
add r13, r10
adc r14, r11 ; r14:r13 = remainder" + divisor
; = remainder
@@:
xor eax, eax ; rax = high qword of quotient = 0
mov [r8], rbx
mov [r8+8], rax ; quotient = rax:rbx
test r9, r9
jz @f ; address of remainder = 0?
mov [r9], r13
mov [r9+8], r14 ; remainder = r14:r13
@@:
mov rbx, [rsp+8]
mov r12, [rsp+16]
mov r13, [rsp+24]
mov r14, [rsp+32]
mov rax, r8 ; rax = address of quotient
ret
; dividend >= divisor >= 2**127:
; quotient = 1, remainder = dividend - divisor
special:
mov [r8+8], rcx
inc ecx
mov [r8], rcx ; quotient = 1
test r9, r9
jz @f ; address of remainder = 0?
sub rax, r10
sbb rdx, r11 ; rdx:rax = dividend - divisor
; = remainder
mov [r9], rax
mov [r9+8], rdx
@@:
mov rax, r8 ; rax = address of quotient
ret
__udivmodti4 endp
end
__udivmodti4(), __udivti3() and __umodti3() functions for AMD64 processors,
supporting the Unix System V calling convention, using the hybrid variant of
the division algorithm:
# Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: raises "division exception" when divisor is 0!
.file "udivmodti4.s"
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = dividend
# rcx:rdx = divisor
__umodti3:
sub rsp, 24
mov r8, rsp # r8 = address of remainder
call __udivmodti4
pop rax
pop rdx # rdx:rax = remainder
pop rcx
ret
# rsi:rdi = dividend
# rcx:rdx = divisor
__udivti3:
xor r8, r8
# rsi:rdi = dividend
# rcx:rdx = divisor
# r8 = oword ptr remainder
__udivmodti4:
cmp rdi, rdx
mov rax, rsi
sbb rax, rcx
jb .trivial # dividend < divisor?
bsr r9, rcx # r9 = index of most significant '1' bit
# in high qword of divisor
jz .simple # high qword of divisor = 0?
# dividend >= divisor >= 2**64: quotient < 2**64
mov r11, rcx # r11 = high qword of divisor
bsr rcx, rsi # rcx = index of most significant '1' bit
# in high qword of dividend
# jz .trivial # high qword of dividend = 0?
# perform "shift & subtract" alias "binary long" division
.large:
sub rcx, r9 # rcx = distance of leading '1' bits
# jb .trivial # dividend < divisor?
xor r9, r9 # r9 = (low qword of) quotient' = 0
mov r10, rdx # r10 = low qword of divisor
shld r11, r10, cl
shl r10, cl # r11:r10 = divisor << distance of leading '1' bits
# = divisor'
.loop:
mov rax, rdi
mov rdx, rsi # rdx:rax = dividend'
sub rdi, r10
sbb rsi, r11 # rsi:rdi = dividend' - divisor'
# = dividend",
# CF = (dividend' < divisor')
cmovb rdi, rax
cmovb rsi, rdx # rsi:rdi = (dividend' < divisor') ? dividend' : dividend"
cmc # CF = (dividend' >= divisor')
adc r9, r9 # r9 = quotient' << 1
# + dividend' >= divisor'
# = quotient"
.if 0
shrd r10, r11, 1
shr r11, 1 # r11:r10 = divisor' >> 1
# = divisor"
.else
shr r11, 1
rcr r10, 1 # r11:r10 = divisor' >> 1
# = divisor"
.endif
dec ecx
jns .loop
test r8, r8
jz 0f # address of remainder = 0?
mov [r8], rdi
mov [r8+8], rsi # remainder = dividend"
0:
mov rax, r9 # rax = (low qword of) quotient
xor edx, edx # rdx:rax = quotient
ret
# dividend < divisor: quotient = 0, remainder = dividend
.trivial:
test r8, r8
jz 1f # address of remainder = 0?
mov [r8], rdi
mov [r8+8], rsi # remainder = dividend
1:
xor eax, eax
xor edx, edx # rdx:rax = quotient = 0
ret
# divisor < 2**64
.simple:
mov r9, rdx # r9 = (low qword of) divisor
cmp rsi, rdx
jae .long # high qword of dividend >= (low qword of) divisor?
# dividend < divisor * 2**64: quotient < 2**64
# perform normal division
.normal:
mov rdx, rsi
mov rax, rdi # rdx:rax = dividend
div r9 # rax = (low qword of) quotient,
# rdx = (low qword of) remainder
test r8, r8
jz 2f # address of remainder = 0?
mov [r8], rdx
mov [r8+8], rcx # high qword of remainder = 0
2:
mov rdx, rcx # rdx:rax = quotient
ret
# dividend >= divisor * 2**64: quotient >= 2**64
# perform "long" alias "schoolbook" division
.long:
mov rdx, rcx # rdx = 0
mov rax, rsi # rdx:rax = high qword of dividend
div r9 # rax = high qword of quotient,
# rdx = high qword of remainder'
mov r10, rax # r10 = high qword of quotient
mov rax, rdi # rax = low qword of dividend
div r9 # rax = low qword of quotient,
# rdx = (low qword of) remainder
test r8, r8
jz 3f # address of remainder = 0?
mov [r8], rdx
mov [r8+8], rcx # high qword of remainder = 0
3:
mov rdx, r10 # rdx:rax = quotient
ret
.size __udivmodti4, .-__udivmodti4
.type __udivmodti4, @function
.global __udivmodti4
.size __udivti3, .-__udivti3
.type __udivti3, @function
.global __udivti3
.size __umodti3, .-__umodti3
.type __umodti3, @function
.global __umodti3
.end
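The loop of the routine above selects between dividend′ and dividend″ with CMOVcc and feeds the inverted carry flag into the quotient with ADC instead of using a conditional jump. The same step can be sketched in portable C with a mask, here scaled down to 64÷64 bits; the names msb_index() and udivmod64_branchfree() are hypothetical, the divisor must not be 0, and whether a compiler really emits jump-free code for it is not guaranteed:

```c
#include <stdint.h>

/* index of the most significant '1' bit; precondition: x != 0 */
static unsigned msb_index(uint64_t x)
{
    unsigned i = 0;
    while (x >>= 1)
        i++;
    return i;
}

/* Hypothetical name; precondition: divisor != 0. */
uint64_t udivmod64_branchfree(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
    uint64_t quotient = 0;

    if (dividend >= divisor) {
        /* distance of the leading '1' bits */
        unsigned distance = msb_index(dividend) - msb_index(divisor);
        divisor <<= distance;               /* divisor' */
        for (unsigned i = 0; i <= distance; i++) {
            /* mask = all ones when the subtraction fits, else 0 */
            uint64_t mask = 0 - (uint64_t)(dividend >= divisor);
            dividend -= divisor & mask;     /* dividend" */
            quotient = (quotient << 1) | (mask & 1);
            divisor >>= 1;                  /* divisor" */
        }
    }
    *remainder = dividend;                  /* what is left is the remainder */
    return quotient;
}
```

The loop runs once per bit of distance between the leading '1' bits, so its cost depends on the operands, just like the assembler loop.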
__udivmodti4() function for AMD64 processors, supporting the Microsoft calling
convention, using the hybrid variant of the division algorithm:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: raises "division exception" when divisor is 0!
.code
__udivmodti4 proc public
mov rax, [rdx] ; rax = low qword of dividend
mov rdx, [rdx+8] ; rdx = high qword of dividend
mov r10, [r8] ; r10 = low qword of divisor
mov r11, [r8+8] ; r11 = high qword of divisor
mov r8, rcx ; r8 = address of quotient
cmp rax, r10
mov rcx, rdx
sbb rcx, r11
jb trivial ; dividend < divisor?
bsr rcx, r11 ; rcx = index of most significant '1' bit
; in high qword of divisor
jz simple ; high qword of divisor = 0?
; dividend >= divisor >= 2**64: quotient < 2**64
mov [rsp+8], rbx
bsr rbx, rdx ; rbx = index of most significant '1' bit
; in high qword of dividend
;; jz trivial ; high qword of dividend = 0?
; perform "shift & subtract" alias "binary long" division
large:
sub ebx, ecx ; ebx = distance of leading '1' bits
;; jb trivial ; dividend < divisor?
mov ecx, ebx
xor ebx, ebx ; rbx = (low qword of) quotient' = 0
shld r11, r10, cl
shl r10, cl ; r11:r10 = divisor << distance of leading '1' bits
; = divisor'
mov [rsp+16], r12
mov [rsp+24], r13
@@:
mov r12, rax
mov r13, rdx ; r13:r12 = dividend'
sub rax, r10
sbb rdx, r11 ; rdx:rax = dividend' - divisor'
; = dividend",
; CF = (dividend' < divisor')
cmovb rax, r12
cmovb rdx, r13 ; rdx:rax = (dividend' < divisor') ? dividend' : dividend"
cmc ; CF = (dividend' >= divisor')
adc rbx, rbx ; rbx = quotient' << 1
; + dividend' >= divisor'
; = quotient
if 0
shrd r10, r11, 1
shr r11, 1 ; r11:r10 = divisor' >> 1
; = divisor
else
shr r11, 1
rcr r10, 1 ; r11:r10 = divisor' >> 1
; = divisor
endif
dec ecx
jns @b
mov r12, [rsp+16]
mov r13, [rsp+24]
xor ecx, ecx
mov [r8], rbx
mov [r8+8], rcx ; high qword of quotient = 0
test r9, r9
jz @f ; address of remainder = 0?
mov [r9], rax
mov [r9+8], rdx ; remainder = dividend"
@@:
mov rax, r8 ; rax = address of quotient
mov rbx, [rsp+8]
ret
; dividend < (2**64 <=) divisor: quotient = 0, remainder = dividend
trivial:
xor ecx, ecx
mov [r8], rcx
mov [r8+8], rcx ; quotient = 0
test r9, r9
jz @f ; address of remainder = 0?
mov [r9], rax
mov [r9+8], rdx ; remainder = dividend
@@:
mov rax, r8 ; rax = address of quotient
ret
; divisor < 2**64
simple:
cmp rdx, r10
jae long ; high qword of dividend >= (low qword of) divisor?
; dividend < divisor * 2**64: quotient < 2**64
; perform normal division
normal:
div r10 ; rax = (low qword of) quotient,
; rdx = (low qword of) remainder
mov [r8], rax
mov [r8+8], r11 ; high qword of quotient = 0
test r9, r9
jz @f ; address of remainder = 0?
mov [r9], rdx
mov [r9+8], r11 ; high qword of remainder = 0
@@:
mov rax, r8 ; rax = address of quotient
ret
; dividend >= divisor * 2**64: quotient >= 2**64
; perform "long" alias "schoolbook" division
long:
mov rcx, rax ; rcx = low qword of dividend
mov rax, rdx ; rax = high qword of dividend
mov rdx, r11 ; rdx:rax = high qword of dividend
div r10 ; rax = high qword of quotient,
; rdx = high qword of remainder'
xchg rcx, rax ; rcx = high qword of quotient,
; rax = low qword of dividend
div r10 ; rax = low qword of quotient,
; rdx = (low qword of) remainder
mov [r8], rax
mov [r8+8], rcx ; quotient = rcx:rax
test r9, r9
jz @f ; address of remainder = 0?
mov [r9], rdx
mov [r9+8], r11 ; high qword of remainder = 0
@@:
mov rax, r8 ; rax = address of quotient
ret
__udivmodti4 endp
end
__divmodti4() function for AMD64 processors, supporting the Unix® System V
calling convention, wrapping the __udivmodti4() function:
# Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: returns ±2**127 for -2**127 / -1!
.file "divmodti4.s"
.extern __udivmodti4
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = dividend
# rcx:rdx = divisor
# r8 = oword ptr remainder
__divmodti4:
mov rax, rsi
sar rax, 63 # rax = (dividend < 0) ? -1 : 0
xor rdi, rax
xor rsi, rax # rsi:rdi = (dividend < 0) ? ~dividend : dividend
sub rdi, rax
sbb rsi, rax # rsi:rdi = (dividend < 0) ? -dividend : dividend
# = |dividend|
mov r9, rcx
sar r9, 63 # r9 = (divisor < 0) ? -1 : 0
xor rdx, r9
xor rcx, r9 # rcx:rdx = (divisor < 0) ? ~divisor : divisor
sub rdx, r9
sbb rcx, r9 # rcx:rdx = (divisor < 0) ? -divisor : divisor
# = |divisor|
push r8
push rax
xor rax, r9 # rax = (dividend < 0) ^ (divisor < 0) ? -1 : 0
# = (quotient < 0) ? -1 : 0
push rax
call __udivmodti4 # rdx:rax = |quotient|
pop r9 # r9 = (quotient < 0) ? -1 : 0
xor rax, r9
xor rdx, r9 # rdx:rax = (quotient < 0) ? |~quotient| : |quotient|
sub rax, r9
sbb rdx, r9 # rdx:rax = (quotient < 0) ? |-quotient| : |quotient|
# = quotient
pop r9 # r9 = (dividend < 0) ? -1 : 0
# = (remainder < 0) ? -1 : 0
pop r8 # r8 = address of |remainder|
test r8, r9
jz 0f # address of remainder = 0?
# remainder >= 0?
neg qword ptr [r8+8]
neg qword ptr [r8]
sbb qword ptr [r8+8], 0
# [r8] = -|remainder|
# = remainder
0:
ret
.size __divmodti4, .-__divmodti4
.type __divmodti4, @function
.global __divmodti4
.end
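The wrapper above divides the magnitudes and then restores the signs: the quotient is negative when the operands differ in sign, and the remainder takes the sign of the dividend. A minimal C sketch of the same idea follows; it assumes the __int128 extension of GCC and Clang, and divmod_sketch() is a hypothetical name, not part of the routines presented here.

```c
#include <assert.h>

typedef __int128          i128;   /* compiler extension, assumed here */
typedef unsigned __int128 u128;

/* Hypothetical C counterpart of the wrapper: divide magnitudes, then fix signs. */
static i128 divmod_sketch(i128 dividend, i128 divisor, i128 *remainder)
{
    u128 nsign = dividend < 0 ? ~(u128) 0 : 0;   /* (dividend < 0) ? -1 : 0 */
    u128 dsign = divisor < 0 ? ~(u128) 0 : 0;    /* (divisor < 0) ? -1 : 0 */
    u128 qsign = nsign ^ dsign;                  /* (quotient < 0) ? -1 : 0 */
    u128 n = ((u128) dividend ^ nsign) - nsign;  /* |dividend| */
    u128 d = ((u128) divisor ^ dsign) - dsign;   /* |divisor| */
    u128 q, r;

    assert(divisor != 0);
    q = n / d;                                   /* |quotient| */
    r = n % d;                                   /* |remainder| */
    if (remainder != 0)                          /* sign of the dividend */
        *remainder = (i128) ((r ^ nsign) - nsign);
    /* the conversion of 2**127 wraps to -2**127 with GCC and Clang */
    return (i128) ((q ^ qsign) - qsign);
}
```

Because the division itself is unsigned, -2**127 / -1 does not trap in this sketch; the magnitude 2**127 simply wraps around when converted back, which matches the NOTE at the top of the listing.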
__divmodti4() function for AMD64 processors, supporting the Microsoft calling convention, wrapping the __udivmodti4() function:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: returns ±2**127 for -2**127 / -1!
.code
; rcx = oword ptr quotient
; rdx = oword ptr dividend
; r8 = oword ptr divisor
; r9 = oword ptr remainder
__divmodti4 proc public
mov rax, [rdx+8]
mov r10, [rdx] ; rax:r10 = dividend
cqo ; rdx = (dividend < 0) ? -1 : 0
push rdx ; = (remainder < 0) ? -1 : 0
xor r10, rdx
xor rax, rdx ; rax:r10 = (dividend < 0) ? ~dividend : dividend
sub r10, rdx
sbb rax, rdx ; rax:r10 = (dividend < 0) ? -dividend : dividend
; = |dividend|
mov [rsp+16], r10
mov [rsp+24], rax
mov rax, [r8+8]
mov r8, [r8] ; rax:r8 = divisor
cqo ; rdx = (divisor < 0) ? -1 : 0
push rdx
xor r8, rdx
xor rax, rdx ; rax:r8 = (divisor < 0) ? ~divisor : divisor
sub r8, rdx
sbb rax, rdx ; rax:r8 = (divisor < 0) ? -divisor : divisor
; = |divisor|
mov [rsp+40], r8
mov [rsp+48], rax
; rcx = address of |quotient|
lea rdx, [rsp+24] ; rdx = address of |dividend|
lea r8, [rsp+40] ; r8 = address of |divisor|
push r9 ; r9 = address of |remainder|
extern __udivmodti4 :proc
call __udivmodti4 ; rax = address of |quotient|
pop r9 ; r9 = address of |remainder|
pop r10 ; r10 = (divisor < 0) ? -1 : 0
pop r11 ; r11 = (dividend < 0) ? -1 : 0
; = (remainder < 0) ? -1 : 0
test r9, r11
jz @f ; address of remainder = 0?
; remainder >= 0?
neg qword ptr [r9+8]
neg qword ptr [r9]
sbb qword ptr [r9+8], 0
; [r9] = -|remainder|
; = remainder
@@:
xor r10, r11 ; r10 = (divisor < 0) ^ (dividend < 0) ? -1 : 0
; = (quotient < 0) ? -1 : 0
jz @f ; quotient >= 0?
neg qword ptr [rax+8]
neg qword ptr [rax]
sbb qword ptr [rax+8], 0
; [rax] = -|quotient|
; = quotient
@@:
ret
__divmodti4 endp
end
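In the listing above the 128-bit operands and results live in memory and are addressed through pointers, so the sign fix-up is performed in place with the sequence NEG, NEG, SBB. A minimal C sketch of that sequence, with the hypothetical names oword_sketch and negate_in_place():

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical layout of a 128-bit integer in memory (little endian) */
typedef struct { uint64_t low, high; } oword_sketch;

/* Counterpart of: neg qword ptr [p+8]
                   neg qword ptr [p]
                   sbb qword ptr [p+8], 0 */
static void negate_in_place(oword_sketch *p)
{
    assert(p != 0);
    p->high = 0 - p->high;       /* negate the high qword */
    p->low = 0 - p->low;         /* negate the low qword; NEG sets CF unless it was 0 */
    p->high -= (p->low != 0);    /* subtract that borrow from the high qword */
}
```

The low qword is negated last so that its carry flag is still intact when SBB consumes it; the C version expresses the same borrow as the comparison (p->low != 0).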
__divti3() function for AMD64 processors, supporting the Unix® System V calling convention, using the __udivmodti4() function:
# Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: returns ±2**127 for -2**127 / -1!
.file "divti3.s"
.extern __udivmodti4
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = dividend
# rcx:rdx = divisor
__divti3:
mov rax, rsi
sar rax, 63 # rax = (dividend < 0) ? -1 : 0
xor rdi, rax
xor rsi, rax # rsi:rdi = (dividend < 0) ? ~dividend : dividend
sub rdi, rax
sbb rsi, rax # rsi:rdi = (dividend < 0) ? -dividend : dividend
# = |dividend|
mov r8, rcx
sar r8, 63 # r8 = (divisor < 0) ? -1 : 0
xor rdx, r8
xor rcx, r8 # rcx:rdx = (divisor < 0) ? ~divisor : divisor
sub rdx, r8
sbb rcx, r8 # rcx:rdx = (divisor < 0) ? -divisor : divisor
# = |divisor|
xor rax, r8 # rax = (dividend < 0) ^ (divisor < 0) ? -1 : 0
# = (quotient < 0) ? -1 : 0
push rax
xor r8, r8 # r8 = address of |remainder| = 0
call __udivmodti4 # rdx:rax = |quotient|
pop rcx # rcx = (quotient < 0) ? -1 : 0
xor rax, rcx
xor rdx, rcx # rdx:rax = (quotient < 0) ? |~quotient| : |quotient|
sub rax, rcx
sbb rdx, rcx # rdx:rax = (quotient < 0) ? |-quotient| : |quotient|
# = quotient
ret
.size __divti3, .-__divti3
.type __divti3, @function
.global __divti3
.end
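The branch-free XOR, XOR, SUB, SBB sequence used above to form |dividend|, |divisor| and the signed quotient can be sketched in C on two 64-bit limbs. This is an illustration under assumptions, not the author's code; limbs_sketch and conditional_negate() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t low, high; } limbs_sketch;   /* hypothetical */

/* mask must be 0 (keep the value) or all ones (negate the value) */
static limbs_sketch conditional_negate(limbs_sketch v, uint64_t mask)
{
    limbs_sketch result;

    assert(mask == 0 || mask == UINT64_MAX);
    v.low ^= mask;                                 /* xor rdi, rax */
    v.high ^= mask;                                /* xor rsi, rax */
    result.low = v.low - mask;                     /* sub rdi, rax */
    result.high = v.high - mask - (v.low < mask);  /* sbb rsi, rax */
    return result;
}
```

With a mask of all ones the XOR forms the one's complement and the subtraction of -1 adds 1, giving the two's complement; with a mask of 0 both steps leave the value unchanged.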
__modti3() function for AMD64 processors, supporting the Unix® System V calling convention, using the __udivmodti4() function:
# Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.file "modti3.s"
.extern __udivmodti4
.arch generic64
.code64
.intel_syntax noprefix
.text
# rsi:rdi = dividend
# rcx:rdx = divisor
__modti3:
mov rax, rcx
sar rax, 63 # rax = (divisor < 0) ? -1 : 0
xor rdx, rax
xor rcx, rax # rcx:rdx = (divisor < 0) ? ~divisor : divisor
sub rdx, rax
sbb rcx, rax # rcx:rdx = (divisor < 0) ? -divisor : divisor
# = |divisor|
mov rax, rsi
sar rax, 63 # rax = (dividend < 0) ? -1 : 0
xor rdi, rax
xor rsi, rax # rsi:rdi = (dividend < 0) ? ~dividend : dividend
sub rdi, rax
sbb rsi, rax # rsi:rdi = (dividend < 0) ? -dividend : dividend
# = |dividend|
push rax
sub rsp, 16
mov r8, rsp # r8 = address of |remainder|
call __udivmodti4 # rdx:rax = |quotient|
pop rax
pop rdx # rdx:rax = |remainder|
pop rcx # rcx = (dividend < 0) ? -1 : 0
xor rax, rcx
xor rdx, rcx # rdx:rax = (dividend < 0) ? |~remainder| : |remainder|
sub rax, rcx
sbb rdx, rcx # rdx:rax = (dividend < 0) ? |-remainder| : |remainder|
# = remainder
ret
.size __modti3, .-__modti3
.type __modti3, @function
.global __modti3
.end
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifdef _MSC_VER
typedef unsigned __int64 uint64_t;
typedef __int64 int64_t;
#if 0
typedef __int128 int128_t;
#else
typedef struct
{
uint64_t low;
int64_t high;
} int128_t;
#endif
int __cmpti2(int128_t comparand, int128_t comparator);
int128_t __absti2(int128_t value);
int128_t __absvti2(int128_t value);
int128_t __negti2(int128_t negend);
int128_t __negvti2(int128_t negend);
int128_t __ashlti3(int128_t value, int count);
int128_t __ashrti3(int128_t value, int count);
int128_t __maxti3(int128_t left, int128_t right);
int128_t __minti3(int128_t left, int128_t right);
int128_t __addti3(int128_t augend, int128_t addend);
int128_t __addvti3(int128_t augend, int128_t addend);
int128_t __multi3(int128_t multiplicand, int128_t multiplier);
int128_t __mulvti3(int128_t multiplicand, int128_t multiplier);
int128_t __subti3(int128_t minuend, int128_t subtrahend);
int128_t __subvti3(int128_t minuend, int128_t subtrahend);
#if 0
typedef unsigned __int128 uint128_t;
#else
typedef struct
{
uint64_t low, high;
} uint128_t;
#endif
int __clzti2(uint128_t value);
int __ctzti2(uint128_t value);
int __parityti2(uint128_t value);
int __popcountti2(uint128_t value);
int __ucmpti2(uint128_t comparand, uint128_t comparator);
uint128_t __bswapti2(uint128_t value);
uint128_t __reverseti2(uint128_t value);
uint128_t __lshrti3(uint128_t value, int count);
uint128_t __rotlti3(uint128_t value, int count);
uint128_t __rotrti3(uint128_t value, int count);
uint128_t __umaxti3(uint128_t left, uint128_t right);
uint128_t __uminti3(uint128_t left, uint128_t right);
uint128_t __uaddti3(uint128_t augend, uint128_t addend);
uint128_t __uaddvti3(uint128_t augend, uint128_t addend);
uint128_t __umulti3(uint128_t multiplicand, uint128_t multiplier);
uint128_t __umulvti3(uint128_t multiplicand, uint128_t multiplier);
uint128_t __usubti3(uint128_t minuend, uint128_t subtrahend);
uint128_t __usubvti3(uint128_t minuend, uint128_t subtrahend);
uint128_t __udivmodti4(uint128_t dividend, uint128_t divisor, uint128_t *remainder);
#ifdef INTERN
#pragma intrinsic(_BitScanForward64, _BitScanReverse64)
__inline
int __clzti2(uint128_t value)
{
unsigned long index; // type expected by _BitScanReverse64()
if (_BitScanReverse64(&index, value.high))
return index ^ 63;
if (_BitScanReverse64(&index, value.low))
return index ^ 127;
return 128;
}
__inline
int __ctzti2(uint128_t value)
{
unsigned long index; // type expected by _BitScanForward64()
if (_BitScanForward64(&index, value.low))
return index;
if (_BitScanForward64(&index, value.high))
return index + 64;
return 128;
}
__inline
int __parityti2(uint128_t value)
{
unsigned long long ull = value.low ^ value.high;
unsigned long ul = ull ^ (ull >> 32);
ul ^= ul >> 16;
ul ^= ul >> 8;
ul ^= ul >> 4;
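// mask the shift count: only the low 4 bits of ul carry the parity,
// and a shift by 32 or more is undefined in C
return (0x69966996 >> (ul & 31)) & 1;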
}
__inline
int __popcountti2(uint128_t value)
{
unsigned long long low = value.low, high = value.high;
low -= (low >> 1) & 0x5555555555555555;
high -= (high >> 1) & 0x5555555555555555;
low = (low & 0x3333333333333333)
+ ((low >> 2) & 0x3333333333333333);
high = (high & 0x3333333333333333)
+ ((high >> 2) & 0x3333333333333333);
low += low >> 4;
high += high >> 4;
low &= 0x0F0F0F0F0F0F0F0F;
high &= 0x0F0F0F0F0F0F0F0F;
low *= 0x0101010101010101;
high *= 0x0101010101010101;
return (low + high) >> 56;
}
__inline
int __cmpti2(int128_t comparand, int128_t comparator)
{
if (comparand.high == comparator.high)
return (comparand.low > comparator.low)
- (comparand.low < comparator.low);
return (comparand.high > comparator.high)
- (comparand.high < comparator.high);
}
__inline
int __ucmpti2(uint128_t comparand, uint128_t comparator)
{
if (comparand.high == comparator.high)
return (comparand.low > comparator.low)
- (comparand.low < comparator.low);
return (comparand.high > comparator.high)
- (comparand.high < comparator.high);
}
#pragma intrinsic(_byteswap_uint64)
__inline
uint128_t __bswapti2(uint128_t value)
{
uint128_t result;
result.low = _byteswap_uint64(value.high);
result.high = _byteswap_uint64(value.low);
return result;
}
__inline
uint128_t __reverseti2(uint128_t value)
{
uint128_t result;
result.low = _byteswap_uint64(value.high);
result.high = _byteswap_uint64(value.low);
result.low = ((result.low >> 4) & 0x0F0F0F0F0F0F0F0F)
| ((result.low << 4) & 0xF0F0F0F0F0F0F0F0);
result.high = ((result.high >> 4) & 0x0F0F0F0F0F0F0F0F)
| ((result.high << 4) & 0xF0F0F0F0F0F0F0F0);
result.low = ((result.low >> 2) & 0x3333333333333333)
| ((result.low << 2) & 0xCCCCCCCCCCCCCCCC);
result.high = ((result.high >> 2) & 0x3333333333333333)
| ((result.high << 2) & 0xCCCCCCCCCCCCCCCC);
result.low = ((result.low >> 1) & 0x5555555555555555)
| ((result.low << 1) & 0xAAAAAAAAAAAAAAAA);
result.high = ((result.high >> 1) & 0x5555555555555555)
| ((result.high << 1) & 0xAAAAAAAAAAAAAAAA);
return result;
}
__inline
int128_t __absti2(int128_t value)
{
if (value.high < 0)
{
value.low = 0 - value.low;
value.high = 0 - value.high
- (0 < value.low);
}
return value;
}
__inline
int128_t __absvti2(int128_t value)
{
if (value.high < 0)
{
value.low = 0 - value.low;
value.high = 0 - value.high
- (0 < value.low);
}
// overflow if value is negative
if (value.high < 0)
__ud2();
return value;
}
__inline
int128_t __negti2(int128_t negend)
{
int128_t negation;
negation.low = 0 - negend.low;
negation.high = 0 - negend.high
- (0 < negend.low);
return negation;
}
__inline
int128_t __negvti2(int128_t negend)
{
int128_t negation;
negation.low = 0 - negend.low;
negation.high = 0 - negend.high
- (0 < negend.low);
// overflow if negend and negation are negative
if ((negend.high & negation.high) < 0)
__ud2();
return negation;
}
__inline
int128_t __addti3(int128_t augend, int128_t addend)
{
int128_t sum;
sum.low = augend.low + addend.low;
sum.high = augend.high + addend.high
+ (sum.low < augend.low);
return sum;
}
__inline
int128_t __addvti3(int128_t augend, int128_t addend)
{
int128_t sum;
sum.low = augend.low + addend.low;
sum.high = augend.high + addend.high
+ (sum.low < augend.low);
// overflow if both augend and addend have opposite sign of sum,
// which is equivalent to augend has sign of addend
// and addend has opposite sign of sum
// (or augend has opposite sign of sum)
if (((augend.high ^ sum.high) & (addend.high ^ sum.high)) < 0)
__ud2();
return sum;
}
__inline
uint128_t __uaddti3(uint128_t augend, uint128_t addend)
{
uint128_t sum;
sum.low = augend.low + addend.low;
sum.high = augend.high + addend.high
+ (sum.low < augend.low);
return sum;
}
__inline
uint128_t __uaddvti3(uint128_t augend, uint128_t addend)
{
uint128_t sum;
sum.low = augend.low + addend.low;
sum.high = augend.high + addend.high
+ (sum.low < augend.low);
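// overflow if sum is less than augend; the low qwords must be compared
// too, since sum.high equals augend.high when addend.high is 2**64-1
// and the low qwords produce a carry
if ((sum.high < augend.high)
|| ((sum.high == augend.high) && (sum.low < augend.low)))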
__ud2();
return sum;
}
__inline
int128_t __subti3(int128_t minuend, int128_t subtrahend)
{
int128_t difference;
difference.low = minuend.low - subtrahend.low;
difference.high = minuend.high - subtrahend.high
- (minuend.low < subtrahend.low);
return difference;
}
__inline
int128_t __subvti3(int128_t minuend, int128_t subtrahend)
{
int128_t difference;
difference.low = minuend.low - subtrahend.low;
difference.high = minuend.high - subtrahend.high
- (minuend.low < subtrahend.low);
// overflow if minuend has opposite sign of subtrahend
// and minuend has opposite sign of difference
// (or subtrahend has sign of difference)
if (((minuend.high ^ subtrahend.high) & (minuend.high ^ difference.high)) < 0)
__ud2();
return difference;
}
__inline
uint128_t __usubti3(uint128_t minuend, uint128_t subtrahend)
{
uint128_t difference;
difference.low = minuend.low - subtrahend.low;
difference.high = minuend.high - subtrahend.high
- (minuend.low < subtrahend.low);
return difference;
}
__inline
uint128_t __usubvti3(uint128_t minuend, uint128_t subtrahend)
{
uint128_t difference;
difference.low = minuend.low - subtrahend.low;
difference.high = minuend.high - subtrahend.high
- (minuend.low < subtrahend.low);
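// underflow if minuend is less than subtrahend; adding the borrow to
// subtrahend.high would wrap around to 0 when subtrahend.high is 2**64-1
if ((minuend.high < subtrahend.high)
|| ((minuend.high == subtrahend.high) && (minuend.low < subtrahend.low)))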
__ud2();
return difference;
}
__inline
int128_t __maxti3(int128_t left, int128_t right)
{
if ((left.high < right.high)
|| ((left.high == right.high) && (left.low < right.low)))
return right;
return left;
}
__inline
uint128_t __umaxti3(uint128_t left, uint128_t right)
{
if ((left.high < right.high)
|| ((left.high == right.high) && (left.low < right.low)))
return right;
return left;
}
__inline
int128_t __minti3(int128_t left, int128_t right)
{
if ((left.high > right.high)
|| ((left.high == right.high) && (left.low > right.low)))
return right;
return left;
}
__inline
uint128_t __uminti3(uint128_t left, uint128_t right)
{
if ((left.high > right.high)
|| ((left.high == right.high) && (left.low > right.low)))
return right;
return left;
}
#pragma intrinsic(__shiftleft128, __shiftright128)
__inline
int128_t __ashlti3(int128_t value, int count)
{
int128_t result;
if (count < 64)
{
result.low = value.low << count;
result.high = __shiftleft128(value.low, value.high, count);
}
else if (count < 128)
{
result.low = 0;
result.high = value.low << (count - 64);
}
else
result.low = result.high = 0;
return result;
}
__inline
int128_t __ashrti3(int128_t value, int count)
{
int128_t result;
if (count < 64)
{
result.low = __shiftright128(value.low, value.high, count);
result.high = value.high >> count;
}
else if (count < 128)
{
result.low = value.high >> (count - 64);
#if 1
result.high = value.high >> 63;
#else
result.high = value.high < 0 ? -1 : 0;
#endif
}
else
#if 1
result.low = result.high = value.high >> 63;
#else
result.low = result.high = value.high < 0 ? -1 : 0;
#endif
return result;
}
__inline
uint128_t __lshrti3(uint128_t value, int count)
{
uint128_t result;
if (count < 64)
{
result.low = __shiftright128(value.low, value.high, count);
result.high = value.high >> count;
}
else if (count < 128)
{
result.low = value.high >> (count - 64);
result.high = 0;
}
else
result.low = result.high = 0;
return result;
}
__inline
uint128_t __rotlti3(uint128_t value, int count)
{
uint128_t result;
if ((count & 64) == 0)
{
result.low = __shiftleft128(value.high, value.low, count);
result.high = __shiftleft128(value.low, value.high, count);
}
else
{
result.low = __shiftleft128(value.low, value.high, count);
result.high = __shiftleft128(value.high, value.low, count);
}
return result;
}
__inline
uint128_t __rotrti3(uint128_t value, int count)
{
uint128_t result;
if ((count & 64) == 0)
{
result.low = __shiftright128(value.low, value.high, count);
result.high = __shiftright128(value.high, value.low, count);
}
else
{
result.low = __shiftright128(value.high, value.low, count);
result.high = __shiftright128(value.low, value.high, count);
}
return result;
}
#pragma intrinsic(_mul128, _umul128)
__inline
int128_t __multi3(int128_t multiplicand, int128_t multiplier)
{
int128_t product;
product.low = _umul128(multiplicand.low, multiplier.low, &product.high);
product.high += multiplicand.low * multiplier.high
+ multiplicand.high * multiplier.low;
return product;
}
__inline
int128_t __mulvti3(int128_t multiplicand, int128_t multiplier)
{
int128_t product, tmp;
#if 0
if (multiplicand.high == (long long) multiplicand.low >> 63)
{ // -2**63 <= multiplicand < 2**63
if (multiplier.high == (long long) multiplier.low >> 63)
{ // -2**63 <= multiplier < 2**63
product.low = _mul128(multiplicand.low, multiplier.low, &product.high);
return product;
}
product.low = _umul128(multiplicand.low, multiplier.low, &product.high);
tmp.low = _umul128(multiplicand.low, multiplier.high, &tmp.high);
if (multiplier.high < 0)
tmp.high -= multiplicand.low;
if ((long long) multiplicand.low < 0)
{
tmp.high -= multiplier.high
+ (tmp.low < multiplier.low);
tmp.low -= multiplier.low;
}
tmp.low += product.high;
tmp.high += tmp.low < (unsigned long long) product.high;
product.high = tmp.low;
if (tmp.high == (long long) tmp.low >> 63)
return product;
}
if (multiplier.high == (long long) multiplier.low >> 63)
{ // -2**63 <= multiplier < 2**63
product.low = _umul128(multiplicand.low, multiplier.low, &product.high);
tmp.low = _umul128(multiplicand.high, multiplier.low, &tmp.high);
if (multiplicand.high < 0)
tmp.high -= multiplier.low;
if ((long long) multiplier.low < 0)
{
tmp.high -= multiplicand.high
+ (tmp.low < multiplicand.low);
tmp.low -= multiplicand.low;
}
tmp.low += product.high;
tmp.high += tmp.low < (unsigned long long) product.high;
product.high = tmp.low;
if (tmp.high == (long long) tmp.low >> 63)
return product;
}
product.low = _umul128(multiplicand.low, multiplier.low, &product.high);
if (multiplicand.high < 0)
{
if (multiplier.high < 0)
{
if (((multiplicand.high & multiplier.high) == -1)
&& ((multiplicand.low | multiplier.low) != 0))
{
product.high -= multiplicand.low + multiplier.low;
if (product.high >= 0)
return product;
}
}
else
{
if ((multiplicand.high == -1) && (multiplier.high == 0))
{
product.high -= multiplier.low;
if (product.high < 0)
return product;
}
}
}
else
{
if (multiplier.high < 0)
{
if ((multiplicand.high == 0) && (multiplier.high == -1))
{
product.high -= multiplicand.low;
if (product.high < 0)
return product;
}
}
else
{
if ((multiplicand.high == 0) && (multiplier.high == 0))
{
if (product.high >= 0)
return product;
}
}
}
__ud2();
#else
int overflow, sign = (multiplicand.high ^ multiplier.high) < 0;
if (multiplicand.high < 0)
{
multiplicand.low = 0 - multiplicand.low;
multiplicand.high = 0 - multiplicand.high
- (0 < multiplicand.low);
}
if (multiplier.high < 0)
{
multiplier.low = 0 - multiplier.low;
multiplier.high = 0 - multiplier.high
- (0 < multiplier.low);
}
overflow = (multiplicand.high != 0) & (multiplier.high != 0);
tmp.low = _umul128(multiplicand.low, multiplier.high, &tmp.high);
overflow |= tmp.high != 0;
tmp.low += _umul128(multiplicand.high, multiplier.low, &tmp.high);
overflow |= tmp.high != 0;
product.low = _umul128(multiplicand.low, multiplier.low, &product.high);
product.high += tmp.low;
overflow |= (unsigned long long) product.high < tmp.low;
if (sign != 0)
{
product.low = 0 - product.low;
product.high = 0 - product.high
- (0 < product.low);
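// a product of 0 stays 0 after negation and is no overflow
overflow |= (product.high >= 0) & ((product.low | product.high) != 0);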
}
else
overflow |= product.high < 0;
if (overflow != 0)
__ud2();
#endif
return product;
}
__inline
uint128_t __umulti3(uint128_t multiplicand, uint128_t multiplier)
{
uint128_t product;
product.low = _umul128(multiplicand.low, multiplier.low, &product.high);
product.high += multiplicand.low * multiplier.high
+ multiplicand.high * multiplier.low;
return product;
}
__inline
uint128_t __umulvti3(uint128_t multiplicand, uint128_t multiplier)
{
uint128_t product, tmp;
int overflow = (multiplicand.high != 0) & (multiplier.high != 0);
tmp.low = _umul128(multiplicand.high, multiplier.low, &tmp.high);
overflow |= tmp.high != 0;
tmp.low += _umul128(multiplicand.low, multiplier.high, &tmp.high);
overflow |= tmp.high != 0;
product.low = _umul128(multiplicand.low, multiplier.low, &product.high);
product.high += tmp.low;
overflow |= product.high < tmp.low;
if (overflow != 0)
__ud2();
return product;
}
#if _MSC_VER >= 1920 // MSC 19.20 alias 2019
#pragma intrinsic(__shiftleft128, __shiftright128, _udiv128, _umul128, _BitScanReverse64)
uint128_t __udivmodti4(uint128_t dividend, uint128_t divisor, uint128_t *remainder)
{
uint128_t quotient;
uint64_t n0 = dividend.low, n1 = dividend.high, n2;
uint64_t d0 = divisor.low, d1 = divisor.high;
uint64_t p0, p1;
unsigned long bm, bn; // type expected by _BitScanReverse64()
if (!_BitScanReverse64(&bn, d1))
{ // *:q = n:n / 0:d
if (d0 > n1)
quotient.high = 0;
else // q:q = n:n / 0:d
quotient.high = _udiv128(0, n1, d0, &n1);
quotient.low = _udiv128(n1, n0, d0, &n0);
if (remainder != 0)
{
remainder->high = 0;
remainder->low = n0;
}
}
else
if (d1 > n1)
{ // 0:0 = n:n / d:d
quotient.low = quotient.high = 0;
if (remainder != 0)
*remainder = dividend;
}
else
{ // 0:q = n:n / d:d
bm = 63 - bn;
if (bm == 0)
{
// from "dividend.high >= divisor.high"
// and "most significant bit of divisor.high is set"
// follows "most significant bit of dividend.high is set"
// and thus "quotient.low is either 0 or 1"
//
// this special case is necessary, not an optimization!
// the condition on the next line takes advantage of that
// (due to program flow) dividend.high >= divisor.high
if ((n1 > d1) || (n0 >= d0))
{
n1 -= d1 + (n0 < d0);
n0 -= d0;
quotient.low = 1;
}
else
quotient.low = 0;
quotient.high = 0;
if (remainder != 0)
{
remainder->high = n1;
remainder->low = n0;
}
}
else
{ // normalize
#if 0
n2 = n1 >> ++bn;
n1 <<= bm;
n1 |= n0 >> bn;
#else
n2 = __shiftleft128(n1, 0, bm);
n1 = __shiftleft128(n0, n1, bm);
#endif
n0 <<= bm;
#if 0
d1 <<= bm;
d1 |= d0 >> bn;
#else
d1 = __shiftleft128(d0, d1, bm);
#endif
d0 <<= bm;
quotient.low = _udiv128(n2, n1, d1, &n1);
quotient.high = 0;
p0 = _umul128(quotient.low, d0, &p1);
if ((p1 > n1) || ((p1 == n1) && (p0 > n0)))
{
p1 -= d1 + (p0 < d0);
p0 -= d0;
quotient.low -= 1;
}
if (remainder != 0)
{
n1 -= p1 + (n0 < p0);
n0 -= p0;
#if 0
remainder->low = (n0 >> bm) | (n1 << bn);
#else
remainder->low = __shiftright128(n0, n1, bm);
#endif
remainder->high = n1 >> bm;
}
}
}
return quotient;
}
#endif
#endif // INTERN
#endif // _MSC_VER
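The sign-bit predicates which __addvti3() and __subvti3() use to detect overflow do not depend on the operand width, so they can be checked exhaustively at 8 bits. The following sketch does that; all names in it are hypothetical, and it operates on bit patterns held in uint8_t to stay clear of implementation-defined signed conversions.

```c
#include <assert.h>
#include <stdint.h>

/* addition overflows if both operands differ in sign from the sum */
static int add_overflows(uint8_t augend, uint8_t addend)
{
    uint8_t sum = (uint8_t) (augend + addend);

    return ((augend ^ sum) & (addend ^ sum) & 0x80) != 0;
}

/* subtraction overflows if the operands differ in sign
   and the difference differs in sign from the minuend */
static int sub_overflows(uint8_t minuend, uint8_t subtrahend)
{
    uint8_t difference = (uint8_t) (minuend - subtrahend);

    return ((minuend ^ subtrahend) & (minuend ^ difference) & 0x80) != 0;
}

/* interpret an 8-bit pattern as two's complement value */
static int as_signed(uint8_t bits)
{
    return bits < 0x80 ? bits : bits - 0x100;
}

/* compare both predicates against exact arithmetic for all 65536 pairs */
static void check_all_pairs(void)
{
    int a, b;

    for (a = 0; a < 256; a++)
        for (b = 0; b < 256; b++)
        {
            int sum = as_signed((uint8_t) a) + as_signed((uint8_t) b);
            int difference = as_signed((uint8_t) a) - as_signed((uint8_t) b);

            assert(add_overflows((uint8_t) a, (uint8_t) b)
                == (sum < -128 || sum > 127));
            assert(sub_overflows((uint8_t) a, (uint8_t) b)
                == (difference < -128 || difference > 127));
        }
}
```

In the 128-bit routines the same test is applied to the high qwords only, because the sign bit of a two-limb integer is the sign bit of its upper limb.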
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.code
; rcx = oword ptr |argument|
; rdx = oword ptr argument
__absti2 proc public
mov r8, [rdx]
mov rax, [rdx+8] ; rax:r8 = argument
cqo ; rdx = (argument < 0) ? -1 : 0
xor r8, rdx
xor rax, rdx ; rax:r8 = (argument < 0) ? ~argument : argument
sub r8, rdx
sbb rax, rdx ; rax:r8 = (argument < 0) ? -argument : argument
; = |argument|
mov [rcx], r8
mov [rcx+8], rax
mov rax, rcx ; rax = address of |argument|
ret
__absti2 endp
; rcx = oword ptr |argument|
; rdx = oword ptr argument
__absvti2 proc public
mov r8, [rdx]
mov rax, [rdx+8] ; rax:r8 = argument
cqo ; rdx = (argument < 0) ? -1 : 0
xor r8, rdx
xor rax, rdx ; rax:r8 = (argument < 0) ? ~argument : argument
sub r8, rdx
sbb rax, rdx ; rax:r8 = (argument < 0) ? -argument : argument
; = |argument|
jo short overflow ; |argument| = argument = ±2**127?
mov [rcx], r8
mov [rcx+8], rax
mov rax, rcx ; rax = address of |argument|
ret
overflow:
ud2
__absvti2 endp
; rcx = oword ptr result
; rdx = oword ptr argument
__bswapti2 proc public
mov rax, [rdx]
mov rdx, [rdx+8] ; rdx:rax = argument
movbe [rcx+8], rax
movbe [rcx], rdx
mov rax, rcx ; rax = address of result
ret
__bswapti2 endp
; rcx = oword ptr result
; rdx = oword ptr argument
__reverseti2 proc public
mov rax, [rdx]
mov rdx, [rdx+8] ; rdx:rax = argument
mov r9, 0AAAAAAAAAAAAAAAAh
lea r10, [rax+rax]
lea r11, [rdx+rdx] ; r11:r10 = argument << 1
and rax, r9
and rdx, r9 ; rdx:rax = argument
; & 0xAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
and r10, r9
and r11, r9 ; r11:r10 = (argument << 1)
; & 0xAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
shr rax, 1
shr rdx, 1 ; rdx:rax = (argument >> 1)
; & 0x55555555555555555555555555555555
or rax, r10
or rdx, r11 ; rdx:rax = ((argument >> 1)
; & 0x55555555555555555555555555555555)
; | ((argument << 1)
; & 0xAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA)
; = argument'
mov r9, 0CCCCCCCCCCCCCCCCh
if 0
lea r10, [4*rax]
lea r11, [4*rdx] ; r11:r10 = argument' << 2
else
mov r10, rax
mov r11, rdx ; r11:r10 = argument'
shl r10, 2
shl r11, 2 ; r11:r10 = argument' << 2
endif
and rax, r9
and rdx, r9 ; rdx:rax = argument'
; & 0xCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
and r10, r9
and r11, r9 ; r11:r10 = (argument' << 2)
; & 0xCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
shr rax, 2
shr rdx, 2 ; rdx:rax = (argument' >> 2)
; & 0x33333333333333333333333333333333
or rax, r10
or rdx, r11 ; rdx:rax = ((argument' >> 2)
; & 0x33333333333333333333333333333333)
; | ((argument' << 2)
; & 0xCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC)
; = argument"
mov r9, 0F0F0F0F0F0F0F0F0h
mov r10, rax
mov r11, rdx ; r11:r10 = argument"
shl r10, 4
shl r11, 4 ; r11:r10 = argument" << 4
and rax, r9
and rdx, r9 ; rdx:rax = argument"
; & 0xF0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0
and r10, r9
and r11, r9 ; r11:r10 = (argument" << 4)
; & 0xF0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0
shr rax, 4
shr rdx, 4 ; rdx:rax = (argument" >> 4)
; & 0x0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F
or rax, r10
or rdx, r11 ; rdx:rax = ((argument" >> 4)
; & 0x0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F)
; | ((argument" << 4)
; & 0xF0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0)
movbe [rcx+8], rax
movbe [rcx], rdx
mov rax, rcx ; rax = address of result
ret
__reverseti2 endp
; rcx = oword ptr argument
__clzti2 proc public
mov eax, 64
bsr rdx, [rcx+8] ; rdx = index of most significant '1' bit - 64
jnz short @f ; high qword of argument <> 0?
add eax, eax
bsr rdx, [rcx] ; rdx = index of most significant '1' bit
jz short return
@@:
stc
sbb eax, edx ; rax = 127 - index of most significant '1' bit
return:
ret
__clzti2 endp
; rcx = oword ptr argument
__ctzti2 proc public
mov eax, 64
bsf rdx, [rcx+8] ; rdx = index of least significant '1' bit - 64
cmovz edx, eax ; rdx = (high qword of argument <> 0)
; ? index of least significant '1' bit - 64 : 64
add edx, eax ; rdx = (high qword of argument <> 0)
; ? index of least significant '1' bit : 128
bsf rax, [rcx] ; rax = index of least significant '1' bit
cmovz eax, edx
ret
__ctzti2 endp
; rcx = oword ptr argument
__parityti2 proc public
mov rax, [rcx]
xor rax, [rcx+8]
shld rcx, rax, 32
xor eax, ecx
shld ecx, eax, 16
xor ecx, eax
xor eax, eax
xor cl, ch
setpo al ; rax = {0, 1}
ret
__parityti2 endp
; rcx = oword ptr argument
__popcountti2 proc public
mov rax, [rcx]
mov rdx, [rcx+8] ; rdx:rax = argument
mov rcx, 5555555555555555h
mov r10, rax
mov r11, rdx ; r11:r10 = argument
shr rax, 1
shr rdx, 1 ; rdx:rax = argument >> 1
and rax, rcx
and rdx, rcx ; rdx:rax = (argument >> 1)
; & 0x55555555555555555555555555555555
sub r10, rax
sub r11, rdx ; r11:r10 = argument
; - ((argument >> 1)
; & 0x55555555555555555555555555555555)
; = argument'
mov rcx, 3333333333333333h
mov rax, r10
mov rdx, r11 ; rdx:rax = argument'
and r10, rcx
and r11, rcx ; r11:r10 = argument'
; & 0x33333333333333333333333333333333
shr rax, 2
shr rdx, 2 ; rdx:rax = argument' >> 2
and rax, rcx
and rdx, rcx ; rdx:rax = (argument' >> 2)
; & 0x33333333333333333333333333333333
add r10, rax
add r11, rdx ; r11:r10 = (argument'
; & 0x33333333333333333333333333333333)
; + ((argument' >> 2)
; & 0x33333333333333333333333333333333)
; = argument"
mov rcx, 0F0F0F0F0F0F0F0Fh
mov rax, r10
mov rdx, r11 ; rdx:rax = argument"
shr r10, 4
shr r11, 4 ; r11:r10 = argument" >> 4
add rax, r10
add rdx, r11 ; rdx:rax = argument" + (argument" >> 4)
and rax, rcx
and rdx, rcx ; rdx:rax = (argument" + (argument" >> 4))
; & 0x0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F
mov rcx, 0101010101010101h
imul rax, rcx
imul rdx, rcx ; rdx:rax = ((argument" + (argument" >> 4))
; & 0x0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F)
; * 0x01010101010101010101010101010101
add rax, rdx
shr rax, 56 ; rax = [0, 128]
ret
__popcountti2 endp
; rcx = oword ptr comparand
; rdx = oword ptr comparator
__cmpti2 proc public
; NOTE: ZF after SBB reflects only the high qword, so SETG can't be
; used on the 128-bit difference; compare in both directions
xor eax, eax ; rax = 0
mov r8, [rcx]
mov rcx, [rcx+8] ; rcx:r8 = comparand
mov r9, [rdx]
mov rdx, [rdx+8] ; rdx:r9 = comparator
cmp r9, r8
mov r10, rdx
sbb r10, rcx ; flags from comparator - comparand
setl al ; al = (comparand > comparator) ? 1 : 0
cmp r8, r9
sbb rcx, rdx ; flags from comparand - comparator
setge cl ; cl = (comparand >= comparator) ? 1 : 0
add al, cl ; rax = (comparand > comparator)
; - (comparand < comparator)
; + 1
; = {0, 1, 2}
ret
__cmpti2 endp
; rcx = oword ptr comparand
; rdx = oword ptr comparator
__ucmpti2 proc public
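; NOTE: ZF after SBB reflects only the high qword, so SETA can't be
; used on the 128-bit difference; compare in both directions
xor eax, eax ; rax = 0
mov r8, [rcx]
mov rcx, [rcx+8] ; rcx:r8 = comparand
mov r9, [rdx]
mov rdx, [rdx+8] ; rdx:r9 = comparator
cmp r9, r8
mov r10, rdx
sbb r10, rcx ; CF = (comparator < comparand)
setb al ; rax = (comparand > comparator)
cmp r8, r9
sbb rcx, rdx ; CF = (comparand < comparator)
sbb eax, -1 ; rax = (comparand > comparator)
; - (comparand < comparator)
; + 1
; = {0, 1, 2}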
ret
__ucmpti2 endp
; rcx = oword ptr negation
; rdx = oword ptr negend
__negti2 proc public
xor eax, eax
mov r8, [rdx]
neg r8
sbb rax, [rdx+8]
mov [rcx], r8
mov [rcx+8], rax
mov rax, rcx ; rax = address of negation
ret
__negti2 endp
; rcx = oword ptr negation
; rdx = oword ptr negend
__negvti2 proc public
xor eax, eax
mov r8, [rdx]
neg r8
sbb rax, [rdx+8]
jo short overflow ; negation = negend = ±2**127?
mov [rcx], r8
mov [rcx+8], rax
mov rax, rcx ; rax = address of negation
ret
overflow:
ud2
__negvti2 endp
; rcx = oword ptr sum
; rdx = oword ptr augend
; r8 = oword ptr addend
__addti3 proc public
__uaddti3 proc public
mov rax, [rdx]
mov rdx, [rdx+8] ; rdx:rax = augend
add rax, [r8]
adc rdx, [r8+8] ; rdx:rax = augend + addend
; = sum
mov [rcx], rax
mov [rcx+8], rdx
mov rax, rcx ; rax = address of sum
ret
__uaddti3 endp
__addti3 endp
; rcx = oword ptr sum
; rdx = oword ptr augend
; r8 = oword ptr addend
__addvti3 proc public
mov rax, [rdx]
mov rdx, [rdx+8] ; rdx:rax = augend
add rax, [r8]
adc rdx, [r8+8] ; rdx:rax = augend + addend
; = sum
jo short overflow ; sum < -2**127?
; sum >= 2**127?
mov [rcx], rax
mov [rcx+8], rdx
mov rax, rcx ; rax = address of sum
ret
overflow:
ud2
__addvti3 endp
; rcx = oword ptr sum
; rdx = oword ptr augend
; r8 = oword ptr addend
__uaddvti3 proc public
mov rax, [rdx]
mov rdx, [rdx+8] ; rdx:rax = augend
add rax, [r8]
adc rdx, [r8+8] ; rdx:rax = augend + addend
; = sum
jc short overflow ; sum >= 2**128?
mov [rcx], rax
mov [rcx+8], rdx
mov rax, rcx ; rax = address of sum
ret
overflow:
ud2
__uaddvti3 endp
; rcx = oword ptr difference
; rdx = oword ptr minuend
; r8 = oword ptr subtrahend
__subti3 proc public
__usubti3 proc public
mov rax, [rdx]
mov rdx, [rdx+8] ; rdx:rax = minuend
sub rax, [r8]
sbb rdx, [r8+8] ; rdx:rax = minuend - subtrahend
; = difference
mov [rcx], rax
mov [rcx+8], rdx
mov rax, rcx ; rax = address of difference
ret
__usubti3 endp
__subti3 endp
; rcx = oword ptr difference
; rdx = oword ptr minuend
; r8 = oword ptr subtrahend
__subvti3 proc public
mov rax, [rdx]
mov rdx, [rdx+8] ; rdx:rax = minuend
sub rax, [r8]
sbb rdx, [r8+8] ; rdx:rax = minuend - subtrahend
; = difference
jo short overflow ; difference < -2**127?
; difference >= 2**127?
mov [rcx], rax
mov [rcx+8], rdx
mov rax, rcx ; rax = address of difference
ret
overflow:
ud2
__subvti3 endp
; rcx = oword ptr difference
; rdx = oword ptr minuend
; r8 = oword ptr subtrahend
__usubvti3 proc public
mov rax, [rdx]
mov rdx, [rdx+8] ; rdx:rax = minuend
sub rax, [r8]
sbb rdx, [r8+8] ; rdx:rax = minuend - subtrahend
; = difference
jb short overflow ; difference < 0?
mov [rcx], rax
mov [rcx+8], rdx
mov rax, rcx ; rax = address of difference
ret
overflow:
ud2
__usubvti3 endp
; rcx = oword ptr maximum
; rdx = oword ptr left argument
; r8 = oword ptr right argument
__maxti3 proc public
mov r11, [rdx+8]
mov r10, [rdx] ; r11:r10 = left argument
mov r9, [r8+8]
mov r8, [r8] ; r9:r8 = right argument
cmp r10, r8
mov rax, r11
sbb rax, r9
cmovl r11, r9
cmovl r10, r8 ; r11:r10 = (left argument < right argument)
; ? right argument : left argument
mov [rcx+8], r11
mov [rcx], r10 ; store maximum
mov rax, rcx ; rax = address of maximum
ret
__maxti3 endp
; rcx = oword ptr maximum
; rdx = oword ptr left argument
; r8 = oword ptr right argument
__umaxti3 proc public
mov r11, [rdx+8]
mov r10, [rdx] ; r11:r10 = left argument
mov r9, [r8+8]
mov r8, [r8] ; r9:r8 = right argument
cmp r10, r8
mov rax, r11
sbb rax, r9
cmovb r11, r9
cmovb r10, r8 ; r11:r10 = (left argument < right argument)
; ? right argument : left argument
mov [rcx+8], r11
mov [rcx], r10 ; store maximum
mov rax, rcx ; rax = address of maximum
ret
__umaxti3 endp
; rcx = oword ptr minimum
; rdx = oword ptr left argument
; r8 = oword ptr right argument
__minti3 proc public
mov r11, [rdx+8]
mov r10, [rdx] ; r11:r10 = left argument
mov r9, [r8+8]
mov r8, [r8] ; r9:r8 = right argument
cmp r10, r8
mov rax, r11
sbb rax, r9
cmovge r11, r9
cmovge r10, r8 ; r11:r10 = (left argument >= right argument)
; ? right argument : left argument
; NOTE: ZF reflects only the high qword after SBB, so CMOVG is not usable here
mov [rcx+8], r11
mov [rcx], r10 ; store minimum
mov rax, rcx ; rax = address of minimum
ret
__minti3 endp
; rcx = oword ptr minimum
; rdx = oword ptr left argument
; r8 = oword ptr right argument
__uminti3 proc public
mov r11, [rdx+8]
mov r10, [rdx] ; r11:r10 = left argument
mov r9, [r8+8]
mov r8, [r8] ; r9:r8 = right argument
cmp r10, r8
mov rax, r11
sbb rax, r9
cmovae r11, r9
cmovae r10, r8 ; r11:r10 = (left argument >= right argument)
; ? right argument : left argument
; NOTE: ZF reflects only the high qword after SBB, so CMOVA is not usable here
mov [rcx+8], r11
mov [rcx], r10 ; store minimum
mov rax, rcx ; rax = address of minimum
ret
__uminti3 endp
; rcx = oword ptr result
; rdx = oword ptr value
; r8 = qword count
__ashlti3 proc public
__lshlti3 proc public
mov rax, rcx ; rax = address of result
mov rcx, r8 ; rcx = count
mov r8, [rdx]
mov r9, [rdx+8] ; r9:r8 = value
ifdef JccLess
xor edx, edx
shld r9, r8, cl
shl r8, cl ; r9:r8 = value << count % 64
cmp ecx, 63
cmova r9, r8
cmova r8, rdx
cmp ecx, 127
cmova r9, rdx ; r9:r8 = value << count
mov [rax], r8
mov [rax+8], r9
ret
else ; JccLess
cmp ecx, 127
ja short zero ; count > 127?
cmp ecx, 63
ja short @f ; count > 63?
shld r9, r8, cl
shl r8, cl ; r9:r8 = value << count % 64
mov [rax], r8
mov [rax+8], r9
ret
@@:
shl r8, cl
xor ecx, ecx
mov [rax], rcx
mov [rax+8], r8
ret
zero:
xor ecx, ecx
mov [rax], rcx
mov [rax+8], rcx
ret
endif ; JccLess
__lshlti3 endp
__ashlti3 endp
; rcx = oword ptr result
; rdx = oword ptr value
; r8 = qword count
__ashrti3 proc public
mov rax, rcx ; rax = address of result
mov rcx, r8 ; rcx = count
mov r8, [rdx]
mov r9, [rdx+8] ; r9:r8 = value
ifdef JccLess
mov rdx, r9
sar rdx, 63
shrd r8, r9, cl
sar r9, cl ; r9:r8 = value >> count % 64
cmp ecx, 63
cmova r8, r9
cmova r9, rdx
cmp ecx, 127
cmova r8, rdx ; r9:r8 = value >> count
mov [rax], r8
mov [rax+8], r9
ret
else ; JccLess
cmp ecx, 127
ja short sign ; count > 127?
cmp ecx, 63
ja short @f ; count > 63?
shrd r8, r9, cl
sar r9, cl ; r9:r8 = value >> count % 64
mov [rax], r8
mov [rax+8], r9
ret
@@:
mov r8, r9
sar r8, cl
sar r9, 63
mov [rax], r8
mov [rax+8], r9
ret
sign:
sar r9, 63
mov [rax], r9
mov [rax+8], r9
ret
endif ; JccLess
__ashrti3 endp
; rcx = oword ptr result
; rdx = oword ptr value
; r8 = qword count
__lshrti3 proc public
mov rax, rcx ; rax = address of result
mov rcx, r8 ; rcx = count
mov r8, [rdx]
mov r9, [rdx+8] ; r9:r8 = value
ifdef JccLess
xor edx, edx
shrd r8, r9, cl
shr r9, cl ; r9:r8 = value >> count % 64
cmp ecx, 63
cmova r8, r9
cmova r9, rdx
cmp ecx, 127
cmova r8, rdx ; r9:r8 = value >> count
mov [rax], r8
mov [rax+8], r9
ret
else ; JccLess
cmp ecx, 127
ja short zero ; count > 127?
cmp ecx, 63
ja short @f ; count > 63?
shrd r8, r9, cl
shr r9, cl ; r9:r8 = value >> count % 64
mov [rax], r8
mov [rax+8], r9
ret
@@:
shr r9, cl
xor ecx, ecx
mov [rax], r9
mov [rax+8], rcx
ret
zero:
xor ecx, ecx
mov [rax], rcx
mov [rax+8], rcx
ret
endif ; JccLess
__lshrti3 endp
; rcx = oword ptr result
; rdx = oword ptr value
; r8 = qword count
__rotlti3 proc public
mov rax, rcx ; rax = address of result
mov rcx, r8 ; rcx = count
mov r8, [rdx]
mov r9, [rdx+8] ; r9:r8 = value
mov rdx, r8
shld r8, r9, cl
shld r9, rdx, cl ; r9:r8 = value << (count % 64)
; | value >> (128 - count % 64)
test cl, 64
jz short @f
xchg r8, r9 ; r9:r8 = value << (count % 128)
; | value >> (128 - count % 128)
@@:
mov [rax], r8
mov [rax+8], r9
ret
__rotlti3 endp
; rcx = oword ptr result
; rdx = oword ptr value
; r8 = qword count
__rotrti3 proc public
mov rax, rcx ; rax = address of result
mov rcx, r8 ; rcx = count
mov r8, [rdx]
mov r9, [rdx+8] ; r9:r8 = value
mov rdx, r8
shrd r8, r9, cl
shrd r9, rdx, cl ; r9:r8 = value >> (count % 64)
; | value << (128 - count % 64)
test cl, 64
jz short @f
xchg r8, r9 ; r9:r8 = value >> (count % 128)
; | value << (128 - count % 128)
@@:
mov [rax], r8
mov [rax+8], r9
ret
__rotrti3 endp
; rcx = oword ptr product
; rdx = oword ptr multiplicand
; r8 = oword ptr multiplier
__multi3 proc public
__umulti3 proc public
mov r11, [rdx+8] ; r11 = high qword of multiplicand
mov r10, [rdx] ; r10 = low qword of multiplicand
mov r9, [r8+8] ; r9 = high qword of multiplier
mov r8, [r8] ; r8 = low qword of multiplier
mov rax, r10
mul r8 ; rdx:rax = low qword of multiplicand
; * low qword of multiplier
imul r8, r11 ; r8 = low qword of multiplier
; * high qword of multiplicand
imul r9, r10 ; r9 = high qword of multiplier
; * low qword of multiplicand
add rdx, r8
add rdx, r9 ; rdx:rax = product % 2**128
mov [rcx], rax
mov [rcx+8], rdx
mov rax, rcx ; rax = address of product
ret
__umulti3 endp
__multi3 endp
; rcx = oword ptr product
; rdx = oword ptr multiplicand
; r8 = oword ptr multiplier
__mulvti3 proc public
mov r11, [rdx+8] ; r11 = high qword of multiplicand
mov r10, [rdx] ; r10 = low qword of multiplicand
mov r9, [r8+8] ; r9 = high qword of multiplier
mov r8, [r8] ; r8 = low qword of multiplier
mov [rsp+8], rcx
mov [rsp+16], rbx
mov rax, r11
cqo ; rdx = (multiplicand < 0) ? -1 : 0
mov rcx, rdx ; rcx = (multiplicand < 0) ? -1 : 0
xor r10, rdx
xor r11, rdx ; r11:r10 = (multiplicand < 0) ? ~multiplicand : multiplicand
sub r10, rdx
sbb r11, rdx ; r11:r10 = (multiplicand < 0) ? -multiplicand : multiplicand
; = |multiplicand|
mov rax, r9
cqo ; rdx = (multiplier < 0) ? -1 : 0
xor rcx, rdx ; rcx = (multiplier < 0) <> (multiplicand < 0) ? -1 : 0
; = (product < 0) ? -1 : 0
xor r8, rdx
xor r9, rdx ; r9:r8 = (multiplier < 0) ? ~multiplier : multiplier
sub r8, rdx
sbb r9, rdx ; r9:r8 = (multiplier < 0) ? -multiplier : multiplier
; = |multiplier|
xor ebx, ebx ; rbx = 0
cmp rbx, r11
sbb eax, eax ; eax = (high qword of |multiplicand| = 0) ? 0 : -1
; = (|multiplicand| < 2**64) ? 0 : -1
cmp rbx, r9
sbb ebx, ebx ; ebx = (high qword of |multiplier| = 0) ? 0 : -1
; = (|multiplier| < 2**64) ? 0 : -1
and ebx, eax ; ebx = (|multiplicand| < 2**64)
; & (|multiplier| < 2**64) ? 0 : -1
; = (|product| < 2**128) ? 0 : -1
mov rax, r11
mul r8 ; rdx:rax = high qword of |multiplicand|
; * low qword of |multiplier|
adc ebx, ebx ; ebx = (|product| < 2**128) ? 0 : *
mov r11, rax
mov rax, r10
mul r9 ; rdx:rax = low qword of |multiplicand|
; * high qword of |multiplier|
adc ebx, ebx ; ebx = (|product| < 2**128) ? 0 : *
add r11, rax ; r11 = high qword of |multiplicand|
; * low qword of |multiplier|
; + low qword of |multiplicand|
; * high qword of |multiplier|
;; adc ebx, ebx ; ebx = (|product| < 2**128) ? 0 : *
mov rax, r10
mul r8 ; rdx:rax = low qword of |multiplicand|
; * low qword of |multiplier|
add rdx, r11 ; rdx:rax = |product % 2**128|
; = |product| % 2**128
adc ebx, ebx ; ebx = (|product| < 2**128) ? 0 : *
if 0
xor rax, rcx
xor rdx, rcx ; rdx:rax = (product < 0) ? product % 2**128 - 1 : product % 2**128
sub rax, rcx
sbb rdx, rcx ; rdx:rax = product % 2**128
xor rcx, rdx ; rcx = (multiplicand < 0)
; ^ (multiplier < 0)
; ^ (product < 0) ? negative : positive
add rcx, rcx
else
add rax, rcx
adc rdx, rcx ; rdx:rax = (product < 0) ? ~product % 2**128 : product % 2**128
mov r11, rdx ; r11 = (multiplicand < 0)
; ^ (multiplier < 0)
; ^ (product < 0) ? negative : positive
xor rax, rcx
xor rdx, rcx ; rdx:rax = product % 2**128
add r11, r11
endif
adc ebx, ebx ; ebx = (-2**127 <= product < 2**127) ? 0 : *
jnz short overflow ; product < -2**127?
; product >= 2**127?
mov rcx, [rsp+8]
mov rbx, [rsp+16]
mov [rcx], rax
mov [rcx+8], rdx
mov rax, rcx ; rax = address of product
ret
overflow:
ud2
__mulvti3 endp
; rcx = oword ptr product
; rdx = oword ptr multiplicand
; r8 = oword ptr multiplier
__umulvti3 proc public
mov r11, [rdx+8] ; r11 = high qword of multiplicand
mov r10, [rdx] ; r10 = low qword of multiplicand
mov r9, [r8+8] ; r9 = high qword of multiplier
mov r8, [r8] ; r8 = low qword of multiplier
ifndef JccLess
test r11, r11
jz short @f ; multiplicand < 2**64?
test r9, r9
jnz short overflow ; multiplier >= 2**64?
@@:
mov rax, r11
mul r8 ; rdx:rax = high qword of multiplicand
; * low qword of multiplier
jc short overflow ; product >= 2**128?
mov r11, rax
mov rax, r10
mul r9 ; rdx:rax = low qword of multiplicand
; * high qword of multiplier
jc short overflow ; product >= 2**128?
add r11, rax ; r11 = high qword of multiplicand
; * low qword of multiplier
; + low qword of multiplicand
; * high qword of multiplier
;; jc short overflow
mov rax, r10
mul r8 ; rdx:rax = low qword of multiplicand
; * low qword of multiplier
add rdx, r11 ; rdx:rax = product % 2**128
jc short overflow ; product >= 2**128?
else ; JccLess
mov [rsp+8], rbx
if 0
mov rax, r11
mul r9 ; rdx:rax = high qword of multiplicand
; * high qword of multiplier
sbb ebx, ebx ; ebx = (product < 2**192) ? 0 : -1
neg rax
adc ebx, ebx ; ebx = (product < 2**128) ? 0 : *
else
xor eax, eax
cmp rax, r11
sbb ebx, ebx ; ebx = (high qword of multiplicand = 0) ? 0 : -1
; = (multiplicand < 2**64) ? 0 : -1
cmp rax, r9
sbb eax, eax ; eax = (high qword of multiplier = 0) ? 0 : -1
; = (multiplier < 2**64) ? 0 : -1
and ebx, eax ; ebx = (multiplicand < 2**64)
; & (multiplier < 2**64) ? 0 : -1
; = (product < 2**128) ? 0 : -1
endif
mov rax, r11
mul r8 ; rdx:rax = high qword of multiplicand
; * low qword of multiplier
adc ebx, ebx ; ebx = (product < 2**128) ? 0 : *
mov r11, rax
mov rax, r10
mul r9 ; rdx:rax = low qword of multiplicand
; * high qword of multiplier
adc ebx, ebx ; ebx = (product < 2**128) ? 0 : *
add r11, rax ; r11 = high qword of multiplicand
; * low qword of multiplier
; + low qword of multiplicand
; * high qword of multiplier
;; adc ebx, ebx ; ebx = (product < 2**128) ? 0 : *
mov rax, r10
mul r8 ; rdx:rax = low qword of multiplicand
; * low qword of multiplier
add rdx, r11 ; rdx:rax = product % 2**128
adc ebx, ebx ; ebx = (product < 2**128) ? 0 : *
jnz short overflow ; product >= 2**128?
mov rbx, [rsp+8]
endif ; JccLess
mov [rcx], rax
mov [rcx+8], rdx
mov rax, rcx ; rax = address of product
ret
overflow:
ud2
__umulvti3 endp
end
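Note: the count handling of the 128-bit shift routines above can be modelled in ANSI C as follows. This is only a minimal sketch for illustration, not part of the original sources; the type u128_t and the function shl128() are hypothetical names introduced here.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit integer stored as two 64-bit halves,
   low half first, matching the memory layout the routines above read. */
typedef struct { uint64_t low, high; } u128_t;

/* Reference model of a 128-bit logical shift left by an arbitrary count,
   following the three cases of the branching code path of __ashlti3(). */
static u128_t shl128(u128_t value, unsigned count)
{
    u128_t result = {0, 0};

    if (count > 127)            /* count > 127: all bits are shifted out */
        return result;
    if (count > 63) {           /* count > 63: low half moves into high half */
        result.high = value.low << (count - 64);
        return result;
    }
    if (count == 0)             /* avoid the undefined shift by 64 below */
        return value;
    /* 0 < count < 64: same effect as the SHLD + SHL pair */
    result.high = (value.high << count) | (value.low >> (64 - count));
    result.low = value.low << count;
    return result;
}
```

Unlike the SHLD and SHL instructions, which mask the count in register CL to 6 bits, C leaves shifts by 64 or more undefined, hence the explicit case distinction.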
__udivmoddi4() function for AMD64 processors, supporting the Microsoft calling convention, using the "shift & subtract" division:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: raises "division exception" when divisor is 0!
.code
; rcx = dividend
; rdx = divisor
; r8 = (optional) qword ptr remainder
__udivmoddi4 proc public
cmp rcx, rdx
jb short trivial ; dividend < divisor?
bsr rax, rdx ; rax = index of most significant '1' bit of divisor
jz short error ; divisor = 0?
mov r9, rcx ; r9 = dividend
bsr rcx, rcx ; rcx = index of most significant '1' bit of dividend
jz short zero ; dividend = 0?
sub ecx, eax ; ecx = distance of leading '1' bits
;; jb short trivial ; dividend < divisor?
shl rdx, cl ; rdx = divisor << distance of leading '1' bits
; = divisor'
xor eax, eax ; eax = quotient' = 0
@@:
mov r10, r9 ; r10 = dividend'
sub r9, rdx ; r9 = dividend' - divisor'
; = dividend"
; CF = (dividend' < divisor')
cmovb r9, r10 ; r9 = (dividend' < divisor') ? dividend' : dividend"
cmc ; CF = (dividend' >= divisor')
adc rax, rax ; rax = quotient' << 1
; + (dividend' >= divisor')
; = quotient
shr rdx, 1 ; rdx = divisor' >> 1
; = divisor
dec ecx
jns short @b
test r8, r8
jz short @f ; address of remainder = 0?
mov [r8], r9 ; remainder = dividend"
@@:
ret
; dividend < divisor: quotient = 0, remainder = dividend
trivial:
test r8, r8
jz short @f ; address of remainder = 0?
mov [r8], rcx ; remainder = dividend
@@:
xor eax, eax ; rax = quotient = 0
ret
; dividend = 0: quotient = 0, remainder = 0
zero:
test r8, r8
jz short @f ; address of remainder = 0?
mov [r8], r9 ; remainder = 0
@@:
mov rax, r9 ; rax = quotient = 0
ret
; divisor = 0
error:
div rdx
ret
__udivmoddi4 endp
end
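Note: the "shift & subtract" loop of the routine above can be sketched in ANSI C as follows. The names udivmod64() and msb_index() are hypothetical and introduced only for this illustration; calling abort() for a divisor of 0 is an assumption that stands in for the division exception raised by the assembler code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Index of the most significant '1' bit (what BSR yields); value must not be 0. */
static unsigned msb_index(uint64_t value)
{
    unsigned index = 0;

    while (value >>= 1)
        index++;
    return index;
}

/* Sketch of a 64÷64-bit "shift & subtract" alias "binary long" division. */
static uint64_t udivmod64(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
    uint64_t quotient = 0;
    unsigned distance;

    if (divisor == 0)
        abort();                /* assumption: stands in for the exception */
    if (dividend >= divisor) {
        /* align the leading '1' bit of the divisor with that of the dividend */
        distance = msb_index(dividend) - msb_index(divisor);
        divisor <<= distance;
        do {                    /* runs distance + 1 times */
            quotient <<= 1;
            if (dividend >= divisor) {
                dividend -= divisor;
                quotient |= 1;
            }
            divisor >>= 1;
        } while (distance-- > 0);
    }
    if (remainder)              /* address of remainder is optional */
        *remainder = dividend;
    return quotient;
}
```

The loop produces one quotient bit per iteration, so its run time depends on the distance of the leading '1' bits of dividend and divisor.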
__udivmoddi4(), __udivdi3(), __umoddi3(), __divmoddi4(), __divdi3() and __moddi3() functions:
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
uint64_t __udivmoddi4(uint64_t dividend, uint64_t divisor, uint64_t *remainder);
uint64_t __udivdi3(uint64_t dividend, uint64_t divisor);
uint64_t __umoddi3(uint64_t dividend, uint64_t divisor);
int64_t __divmoddi4(int64_t dividend, int64_t divisor, int64_t *remainder);
int64_t __divdi3(int64_t dividend, int64_t divisor);
int64_t __moddi3(int64_t dividend, int64_t divisor);
The suffixes di4 and di3 specify the number of arguments plus return value and their size: double integer denotes an 8-byte QWORD alias 64-bit uint64_t.
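Note: one common way to build a signed routine with the shape of __divmoddi4() on top of an unsigned one is sketched below. It is not claimed to be the method used by the routines presented here; udivmod_ref() and divmod64() are hypothetical names, and the unsigned helper simply uses the compiler's own operators.

```c
#include <stdint.h>

/* Hypothetical stand-in for an unsigned 64÷64-bit division routine. */
static uint64_t udivmod_ref(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
    if (remainder)
        *remainder = dividend % divisor;
    return dividend / divisor;
}

/* Signed division: divide the magnitudes, then fix the signs.
   The quotient is truncated towards 0, the remainder has the sign of the dividend. */
static int64_t divmod64(int64_t dividend, int64_t divisor, int64_t *remainder)
{
    /* 0 - x on unsigned operands yields the magnitude without signed overflow */
    uint64_t n = dividend < 0 ? 0 - (uint64_t) dividend : (uint64_t) dividend;
    uint64_t d = divisor < 0 ? 0 - (uint64_t) divisor : (uint64_t) divisor;
    uint64_t r;
    uint64_t q = udivmod_ref(n, d, &r);

    if ((dividend < 0) != (divisor < 0))
        q = 0 - q;              /* operands of different sign: negative quotient */
    if (dividend < 0)
        r = 0 - r;              /* remainder follows the dividend */
    if (remainder)
        *remainder = (int64_t) r;
    return (int64_t) q;         /* assumes two's complement conversion */
}
```

With these definitions, divmod64(-7, 2, &r) yields the quotient -3 and the remainder -1, matching the / and % operators of C99.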
Note: the compiler helper routines for the Microsoft Visual C compiler use non-standard calling or naming conventions and can therefore not be prototyped; they are for internal use by the compiler only!
Note: the other code following here supports the common, so-called cdecl calling and naming convention used by C compilers on Linux®, Unix, Windows™, plus other operating systems, and runs on 35-year-old Intel® 80386 micro-processors.
Note: the branch-free code paths, which are assembled when the macro JCCLESS is defined, actually yield lower performance!
__udivmoddi4() function for i386 processors:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: raises "division exception" when divisor is 0!
.386
.model flat, C
.code
; [esp+20] = (optional) qword ptr remainder
; [esp+16] = high dword of divisor
; [esp+12] = low dword of divisor
; [esp+8] = high dword of dividend
; [esp+4] = low dword of dividend
__udivmoddi4 proc public
push ebx
mov ebx, [esp+20] ; ebx = high dword of divisor
mov edx, [esp+12] ; edx = high dword of dividend
cmp edx, ebx
jb short trivial ; (high dword of) dividend < (high dword of) divisor?
bsr eax, ebx ; eax = index of most significant '1' bit
; in high dword of divisor
jz short simple ; high dword of divisor = 0?
; dividend >= divisor >= 2**32: quotient < 2**32
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of dividend
;; jz short trivial ; dividend < 2**32 (<= divisor)?
; perform "shift & subtract" alias "binary long" division
sub ecx, eax ; ecx = distance of leading '1' bits
;; jb short trivial ; dividend < divisor?
mov eax, [esp+16] ; eax = low dword of divisor
shld ebx, eax, cl
shl eax, cl ; ebx:eax = divisor'
push esi
push edi
mov esi, [esp+16] ; edx:esi = dividend
push ebp
xor ebp, ebp ; ebp = quotient = 0
next:
mov edi, edx ; edi = high dword of dividend
cmp esi, eax
sbb edi, ebx
jb short @f ; dividend < divisor'?
sub esi, eax
sbb edx, ebx ; edx:esi = dividend - divisor'
; = dividend'
@@:
cmc ; CF = (dividend >= divisor')
adc ebp, ebp ; ebp = quotient << 1
; + dividend >= divisor'
; = quotient'
if 0
shrd eax, ebx, 1
shr ebx, 1 ; ebx:eax = divisor' >> 1
; = divisor"
else
shr ebx, 1
rcr eax, 1 ; ebx:eax = divisor' >> 1
; = divisor"
endif
dec ecx
jns short next
mov ecx, [esp+36] ; ecx = address of remainder
test ecx, ecx
jz short @f ; address of remainder = 0?
mov [ecx+4], edx
mov [ecx], esi ; [ecx] = remainder
@@:
xor edx, edx
mov eax, ebp ; edx:eax = quotient
pop ebp
pop edi
pop esi
pop ebx
ret
; dividend < (2**32 <=) divisor: quotient = 0, remainder = dividend
trivial:
mov ecx, [esp+24] ; ecx = address of remainder
test ecx, ecx
jz short @f ; address of remainder = 0?
mov eax, [esp+8] ; eax = low dword of dividend,
; edx:eax = remainder
mov [ecx+4], edx
mov [ecx], eax ; [ecx] = remainder
@@:
xor edx, edx
xor eax, eax ; edx:eax = quotient = 0
pop ebx
ret
; remainder < divisor < 2**32
simple:
mov ecx, [esp+16] ; ecx = (low dword of) divisor
cmp ecx, edx
ja short normal ; divisor > high dword of dividend?
; perform "long" alias "schoolbook" division
long:
mov eax, edx ; eax = high dword of dividend
mov edx, ebx ; edx = 0,
; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
push eax ; [esp] = high dword of quotient
mov eax, [esp+12] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
mov ecx, [esp+28] ; ecx = address of remainder
test ecx, ecx
jz short @f ; address of remainder = 0?
mov [ecx+4], ebx ; [ecx+4] = 0 = high dword of remainder
mov [ecx], edx ; [ecx] = (low dword of) remainder
@@:
pop edx ; edx:eax = quotient
pop ebx
ret
; perform normal division
normal:
mov eax, [esp+8] ; edx:eax = dividend
div ecx ; eax = (low dword of) quotient,
; edx = (low dword of) remainder
mov ecx, [esp+24] ; ecx = address of remainder
test ecx, ecx
jz short @f ; address of remainder = 0?
mov [ecx+4], ebx ; [ecx+4] = 0 = high dword of remainder
mov [ecx], edx ; [ecx] = (low dword of) remainder
@@:
mov edx, ebx ; edx = 0,
; edx:eax = quotient
pop ebx
ret
__udivmoddi4 endp
end
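Note: the "long" alias "schoolbook" division used above for divisors less than 2**32 chains two narrowing 64÷32-bit divisions, with 32-bit numbers as digits. A minimal sketch in ANSI C follows; the function name udiv64by32() is hypothetical, and the divisor is assumed to be non-zero.

```c
#include <stdint.h>

/* Sketch of a 64÷32-bit "long" alias "schoolbook" division with two digits.
   Assumption: divisor != 0. */
static uint64_t udiv64by32(uint64_t dividend, uint32_t divisor, uint32_t *remainder)
{
    uint32_t high = (uint32_t) (dividend >> 32);
    uint32_t low = (uint32_t) dividend;
    /* first digit: high dword of dividend / divisor */
    uint32_t q_high = high / divisor;
    uint32_t r_high = high % divisor;       /* r_high < divisor */
    /* second digit: (r_high:low) / divisor; the quotient fits in 32 bits
       because r_high < divisor, so this division cannot overflow */
    uint64_t partial = ((uint64_t) r_high << 32) | low;
    uint32_t q_low = (uint32_t) (partial / divisor);

    if (remainder)
        *remainder = (uint32_t) (partial % divisor);
    return ((uint64_t) q_high << 32) | q_low;
}
```

Feeding the remainder of the first division into the upper half of the second dividend is what keeps the quotient of each DIV instruction within 32 bits.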
# Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
# Common "cdecl" calling and naming convention for i386 platform:
# - arguments are pushed on stack in reverse order (from right to left),
# 4-byte aligned;
# - 64-bit integer arguments are passed as pair of 32-bit integer arguments,
# low part below high part;
# - 64-bit integer result is returned in registers EAX (low part) and
# EDX (high part);
# - 32-bit integer or pointer result is returned in register EAX;
# - registers EAX, ECX and EDX are volatile and can be clobbered;
# - registers EBX, ESP, EBP, ESI and EDI must be preserved;
# - function names are prefixed with an underscore.
# NOTE: raises "division exception" when divisor is 0!
.file "udivmoddi4.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+20] = (optional) qword ptr remainder
# [esp+16] = high dword of divisor
# [esp+12] = low dword of divisor
# [esp+8] = high dword of dividend
# [esp+4] = low dword of dividend
__udivmoddi4:
mov ecx, [esp+8] # ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] # edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb .trivial # dividend < divisor?
bsr ecx, edx # ecx = index of most significant '1' bit
# in high dword of divisor
jnz .extended # high dword of divisor <> 0?
# remainder < divisor < 2**32
mov ecx, eax # ecx = (low dword of) divisor
mov eax, [esp+8] # eax = high dword of dividend
cmp eax, ecx
jae .long # high dword of dividend >= divisor?
# perform normal division
.normal:
mov edx, eax # edx = high dword of dividend
mov eax, [esp+4] # edx:eax = dividend
div ecx # eax = (low dword of) quotient,
# edx = (low dword of) remainder
mov ecx, [esp+20] # ecx = address of remainder
test ecx, ecx
jz 0f # address of remainder = 0?
mov [ecx], edx # [ecx] = (low dword of) remainder
xor edx, edx
mov [ecx+4], edx
0:
xor edx, edx # edx:eax = quotient
ret
# perform "long" alias "schoolbook" division
.long:
# xor edx, edx # edx:eax = high dword of dividend
div ecx # eax = high dword of quotient,
# edx = high dword of remainder'
push eax # [esp] = high dword of quotient
mov eax, [esp+8] # eax = low dword of dividend
div ecx # eax = low dword of quotient,
# edx = (low dword of) remainder
mov ecx, [esp+24] # ecx = address of remainder
test ecx, ecx
jz 1f # address of remainder = 0?
mov [ecx], edx # [ecx] = (low dword of) remainder
xor edx, edx
mov [ecx+4], edx
1:
pop edx # edx:eax = quotient
ret
# dividend < divisor: quotient = 0, remainder = dividend
.trivial:
mov ecx, [esp+20] # ecx = address of remainder
test ecx, ecx
jz 2f # address of remainder = 0?
mov eax, [esp+4]
mov edx, [esp+8] # edx:eax = dividend
mov [ecx], eax
mov [ecx+4], edx # [ecx] = remainder = dividend
2:
xor eax, eax
xor edx, edx # edx:eax = quotient = 0
ret
# dividend >= divisor >= 2**32: quotient < 2**32
.extended:
xor ecx, 31 # ecx = number of leading '0' bits
# in (high dword of) divisor
jz .special # divisor >= 2**63?
# perform "extended & adjusted" division
shld edx, eax, cl # edx = divisor / 2**(index + 1)
# = divisor'
# shl eax, cl
push ebx
mov ebx, edx # ebx = divisor'
.ifnotdef JCCLESS
xor eax, eax # eax = high dword of quotient' = 0
mov edx, [esp+12] # edx = high dword of dividend
cmp edx, ebx
jb 3f # high dword of dividend < divisor'?
# high dword of dividend >= divisor':
# subtract divisor' from high dword of dividend to prevent possible
# quotient overflow and set most significant bit of quotient"
sub edx, ebx # edx = high dword of dividend - divisor'
# = high dword of dividend'
inc eax # eax = high dword of quotient' = 1
3:
push eax # [esp] = high dword of quotient'
.else # JCCLESS
mov edx, [esp+12] # edx = high dword of dividend
cmp edx, ebx # CF = (high dword of dividend < divisor')
sbb eax, eax # eax = (high dword of dividend < divisor') ? -1 : 0
inc eax # eax = (high dword of dividend < divisor') ? 0 : 1
# = high dword of quotient'
push eax # [esp] = high dword of quotient'
.if 0
neg eax # eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
imul eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
sub edx, eax # edx = high dword of dividend
# - (high dword of dividend < divisor') ? 0 : divisor'
# = high dword of dividend'
.endif # JCCLESS
mov eax, [esp+12] # eax = low dword of dividend
# = low dword of dividend'
div ebx # eax = dividend' / divisor'
# = low dword of quotient',
# edx = remainder'
pop ebx # ebx = high dword of quotient'
shld ebx, eax, cl # ebx = quotient' / 2**(index + 1)
# = dividend / divisor'
# = quotient"
# shl eax, cl
push ebx # [esp] = quotient"
mov eax, [esp+20] # eax = low dword of divisor
mul ebx # edx:eax = low dword of divisor * quotient"
imul ebx, [esp+24] # ebx = high dword of divisor * quotient"
mov ecx, [esp+16] # ecx = high dword of dividend
sub ecx, ebx # ecx = high dword of dividend
# - high dword of divisor * quotient"
mov ebx, [esp+12] # ebx = low dword of dividend
sub ebx, eax
sbb ecx, edx # ecx:ebx = dividend - divisor * quotient"
# = remainder"
.ifnotdef JCCLESS
pop eax # eax = quotient"
jnb 4f # remainder" >= 0?
# (with borrow, it is off by divisor,
# and quotient" is off by 1)
add ebx, [esp+16]
adc ecx, [esp+20] # ecx:ebx = remainder" + divisor
# = remainder
dec eax # eax = quotient" - 1
# = quotient
4:
.else # JCCLESS
sbb eax, eax # eax = (remainder" < 0) ? -1 : 0
cdq # edx = (remainder" < 0) ? -1 : 0
add [esp], eax # [esp] = quotient" - (remainder" < 0)
# = (low dword of) quotient
and eax, [esp+20]
and edx, [esp+24] # edx:eax = (remainder" < 0) ? divisor : 0
add ebx, eax
adc ecx, edx # ecx:ebx = remainder" + divisor
# = remainder
pop eax # eax = (low dword of) quotient
.endif # JCCLESS
mov edx, [esp+24] # edx = address of remainder
test edx, edx
jz 5f # address of remainder = 0?
mov [edx], ebx
mov [edx+4], ecx # [edx] = remainder
5:
xor edx, edx # edx:eax = quotient
pop ebx
ret
# dividend >= divisor >= 2**63:
# quotient = 1, remainder = dividend - divisor
.special:
or ecx, [esp+20] # ecx = address of remainder
jz 6f # address of remainder = 0?
.if 0
neg edx
neg eax
sbb edx, 0 # edx:eax = -divisor
add eax, [esp+4]
adc edx, [esp+8] # edx:eax = dividend - divisor
# = remainder
.else
mov eax, [esp+4]
mov edx, [esp+8] # edx:eax = dividend
sub eax, [esp+12]
sbb edx, [esp+16] # edx:eax = dividend - divisor
# = remainder
.endif
mov [ecx], eax
mov [ecx+4], edx # [ecx] = remainder
6:
xor eax, eax
xor edx, edx
inc eax # edx:eax = quotient = 1
ret
.size __udivmoddi4, .-__udivmoddi4
.type __udivmoddi4, @function
.global __udivmoddi4
.end
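Note: the "extended & adjusted" division for divisors of 2**32 or more can be sketched in ANSI C as follows. This only models the arithmetic of the code above under the stated preconditions; the name udivmod64_extended() and the way the 65-bit product is checked are assumptions of this sketch.

```c
#include <stdint.h>

/* Sketch of the "extended & adjusted" division.
   Preconditions (they select this code path): divisor >= 2**32, dividend >= divisor. */
static uint64_t udivmod64_extended(uint64_t dividend, uint64_t divisor, uint64_t *remainder)
{
    uint32_t high = (uint32_t) (divisor >> 32);
    uint32_t approx;
    uint64_t quotient, low, mid;
    unsigned zeros = 0;         /* number of leading '0' bits in high dword */

    while (!(high & 0x80000000u)) {
        high <<= 1;
        zeros++;
    }
    if (zeros == 0) {           /* divisor >= 2**63: quotient = 1 */
        if (remainder)
            *remainder = dividend - divisor;
        return 1;
    }
    /* divisor' = upper 32 bits of the normalised divisor */
    approx = (uint32_t) ((divisor << zeros) >> 32);
    /* quotient" = dividend / divisor' / 2**(32 - zeros):
       never too small, at most 1 too large, and less than 2**32 */
    quotient = (dividend / approx) >> (32 - zeros);
    /* divisor * quotient" may need 65 bits: build it from 32-bit digits */
    low = (uint64_t) (uint32_t) divisor * quotient;
    mid = (divisor >> 32) * quotient + (low >> 32);
    if ((mid >> 32) != 0 || ((mid << 32) | (uint32_t) low) > dividend)
        quotient--;             /* remainder" < 0: quotient" is off by 1 */
    if (remainder)
        *remainder = dividend - divisor * quotient;
    return quotient;
}
```

Since the approximate quotient is too large by at most 1, a single conditional correction suffices; no loop is needed.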
Microsoft Visual C compiler helper routine _aulldvrm() for i386 processors:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; "stdcall" calling and naming convention for i386 platform:
; - arguments are pushed on stack in reverse order (from right to left),
; 4-byte aligned;
; - 64-bit integer arguments are passed as pair of 32-bit integer arguments,
; low part below high part;
; - 64-bit integer result is returned in registers EAX (low part) and
; EDX (high part);
; - 32-bit integer or pointer result is returned in register EAX;
; - registers EAX, ECX and EDX are volatile and can be clobbered;
; - registers EBX, EBP, ESI and EDI must be preserved;
; - register ESP (the stack pointer) must be restored by callee;
; - function names are prefixed with an underscore and suffixed with an
; at-sign, followed by the total number of bytes for all arguments.
; NOTE: returns quotient in EDX:EAX and remainder in EBX:ECX
; NOTE: raises "division exception" when divisor is 0!
.386
.model flat, C
.code
; [esp+16] = high dword of divisor
; [esp+12] = low dword of divisor
; [esp+8] = high dword of dividend
; [esp+4] = low dword of dividend
_aulldvrm proc public
mov ecx, [esp+8] ; ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] ; edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+8] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
mov eax, [esp+4] ; edx:eax = dividend
div ecx ; eax = (low dword of) quotient,
; edx = (low dword of) remainder
mov ecx, edx ; ecx = (low dword of) remainder
xor ebx, ebx ; ebx:ecx = remainder
xor edx, edx ; edx:eax = quotient
ret 16 ; callee restores stack
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
mov ebx, eax ; ebx = high dword of quotient
mov eax, [esp+4] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
mov ecx, edx ; ecx = (low dword of) remainder
mov edx, ebx ; edx:eax = quotient
xor ebx, ebx ; ebx:ecx = remainder
ret 16 ; callee restores stack
; dividend < divisor: quotient = 0, remainder = dividend
trivial:
mov ecx, [esp+4]
mov ebx, [esp+8] ; ebx:ecx = remainder = dividend
xor eax, eax
xor edx, edx ; edx:eax = quotient = 0
ret 16 ; callee restores stack
; dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor >= 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
mov ebx, edx ; ebx = divisor'
ifndef JccLess
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+8] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; division overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JccLess
mov edx, [esp+8] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JccLess
mov eax, [esp+8] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+12] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
mov ecx, [esp+16] ; ecx = high dword of divisor
imul ecx, ebx ; ecx = high dword of divisor * quotient"
push ebx ; [esp] = quotient"
mov ebx, [esp+12] ; ebx = high dword of dividend
sub ebx, ecx ; ebx = high dword of dividend
; - high dword of divisor * quotient"
mov ecx, [esp+8] ; ecx = low dword of dividend
sub ecx, eax
sbb ebx, edx ; ebx:ecx = dividend - divisor * quotient"
; = remainder"
ifndef JccLess
pop eax ; eax = quotient"
jnb short @f ; remainder" >= 0?
; (with borrow, it is off by divisor,
; and quotient" is off by 1)
add ecx, [esp+12]
adc ebx, [esp+16] ; ebx:ecx = remainder" + divisor
; = remainder
dec eax ; eax = quotient" - 1
; = (low dword of) quotient
@@:
else ; JccLess
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
cdq ; edx = (remainder" < 0) ? -1 : 0
add [esp], eax ; [esp] = quotient" - (remainder" < 0)
; = (low dword of) quotient
and eax, [esp+16]
and edx, [esp+20] ; edx:eax = (remainder" < 0) ? divisor : 0
add ecx, eax
adc ebx, edx ; ebx:ecx = remainder" + divisor
; = remainder
pop eax ; eax = (low dword of) quotient
endif ; JccLess
xor edx, edx ; edx:eax = quotient
ret 16 ; callee restores stack
; dividend >= divisor >= 2**63:
; quotient = 1, remainder = dividend - divisor
special:
mov ecx, [esp+4]
mov ebx, [esp+8] ; ebx:ecx = dividend
sub ecx, eax
sbb ebx, edx ; ebx:ecx = dividend - divisor
; = remainder
xor eax, eax
xor edx, edx
inc eax ; edx:eax = quotient = 1
ret 16 ; callee restores stack
_aulldvrm endp
end
__udivdi3() function for i386 processors:
# Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: raises "division exception" when divisor is 0!
.file "udivdi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of divisor
# [esp+12] = low dword of divisor
# [esp+8] = high dword of dividend
# [esp+4] = low dword of dividend
__udivdi3:
mov ecx, [esp+8] # ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] # edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb .trivial # dividend < divisor?
bsr ecx, edx # ecx = index of most significant '1' bit
# in high dword of divisor
jnz .extended # high dword of divisor <> 0?
# remainder < divisor < 2**32
mov ecx, eax # ecx = (low dword of) divisor
mov eax, [esp+8] # eax = high dword of dividend
cmp eax, ecx
jae .long # high dword of dividend >= divisor?
# perform normal division
.normal:
mov edx, eax # edx = high dword of dividend
mov eax, [esp+4] # edx:eax = dividend
div ecx # eax = (low dword of) quotient,
# edx = (low dword of) remainder
xor edx, edx # edx:eax = quotient
ret
# perform "long" alias "schoolbook" division
.long:
# xor edx, edx # edx:eax = high dword of dividend
div ecx # eax = high dword of quotient,
# edx = high dword of remainder'
push eax # [esp] = high dword of quotient
mov eax, [esp+8] # eax = low dword of dividend
div ecx # eax = low dword of quotient,
# edx = (low dword of) remainder
pop edx # edx:eax = quotient
ret
# dividend < divisor: quotient = 0
.trivial:
xor eax, eax
xor edx, edx # edx:eax = quotient = 0
ret
# dividend >= divisor >= 2**32: quotient < 2**32
.extended:
xor ecx, 31 # ecx = number of leading '0' bits
# in (high dword of) divisor
jz .special # divisor >= 2**63?
# perform "extended & adjusted" division
shld edx, eax, cl # edx = divisor / 2**(index + 1)
# = divisor'
# shl eax, cl
push ebx
mov ebx, edx # ebx = divisor'
.ifnotdef JCCLESS
xor eax, eax # eax = high dword of quotient' = 0
mov edx, [esp+12] # edx = high dword of dividend
cmp edx, ebx
jb 0f # high dword of dividend < divisor'?
# high dword of dividend >= divisor':
# subtract divisor' from high dword of dividend to prevent possible
# quotient overflow and set most significant bit of quotient"
sub edx, ebx # edx = high dword of dividend - divisor'
# = high dword of dividend'
inc eax # eax = high dword of quotient' = 1
0:
push eax # [esp] = high dword of quotient'
.else # JCCLESS
mov edx, [esp+12] # edx = high dword of dividend
cmp edx, ebx # CF = (high dword of dividend < divisor')
sbb eax, eax # eax = (high dword of dividend < divisor') ? -1 : 0
inc eax # eax = (high dword of dividend < divisor') ? 0 : 1
# = high dword of quotient'
push eax # [esp] = high dword of quotient'
.if 0
neg eax # eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
imul eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
sub edx, eax # edx = high dword of dividend
# - (high dword of dividend < divisor') ? 0 : divisor'
# = high dword of dividend'
.endif # JCCLESS
mov eax, [esp+12] # eax = low dword of dividend
# = low dword of dividend'
div ebx # eax = dividend' / divisor'
# = low dword of quotient',
# edx = remainder'
pop ebx # ebx = high dword of quotient'
shld ebx, eax, cl # ebx = quotient' / 2**(index + 1)
# = dividend / divisor'
# = quotient"
# shl eax, cl
mov eax, [esp+16] # eax = low dword of divisor
mul ebx # edx:eax = low dword of divisor * quotient"
.ifnotdef JCCLESS
mov ecx, [esp+20] # ecx = high dword of divisor
imul ecx, ebx # ecx = high dword of divisor * quotient"
add edx, ecx # edx:eax = divisor * quotient"
jc 1f # divisor * quotient" >= 2**64?
mov ecx, [esp+12] # ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx # CF = (dividend < divisor * quotient")
# = (remainder" < 0)
1:
sbb eax, eax # eax = (remainder" < 0) ? -1 : 0
add eax, ebx # eax = quotient" - (remainder" < 0)
# = (low dword of) quotient
xor edx, edx # edx:eax = quotient
.else # JCCLESS
mov ecx, [esp+12] # ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx # ecx:... = dividend
# - low dword of divisor * quotient"
mov eax, [esp+20] # eax = high dword of divisor
imul eax, ebx # eax = high dword of divisor * quotient"
.if 0
sub ecx, eax # ecx:... = dividend - divisor * quotient"
# = remainder"
sbb eax, eax # eax = (remainder" < 0) ? -1 : 0
add eax, ebx # eax = quotient" - (remainder" < 0)
# = (low dword of) quotient
xor edx, edx # edx:eax = quotient
.else
xor edx, edx # edx = high dword of quotient = 0
sub ecx, eax # ecx:... = dividend - divisor * quotient"
# = remainder"
mov eax, ebx # eax = quotient"
sbb eax, edx # eax = quotient" - (remainder" < 0)
# = (low dword of) quotient
.endif
.endif # JCCLESS
pop ebx
ret
# dividend >= divisor >= 2**63: quotient = 1
.special:
xor eax, eax
xor edx, edx
inc eax # edx:eax = quotient = 1
ret
.size __udivdi3, .-__udivdi3
.type __udivdi3, @function
.global __udivdi3
.end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: raises "division exception" when divisor is 0!
.386
.model flat, C
.code
; [esp+16] = high dword of divisor
; [esp+12] = low dword of divisor
; [esp+8] = high dword of dividend
; [esp+4] = low dword of dividend
__udivdi3 proc public
mov ecx, [esp+8] ; ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] ; edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+8] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
mov eax, [esp+4] ; edx:eax = dividend
div ecx ; eax = (low dword of) quotient,
; edx = (low dword of) remainder
xor edx, edx ; edx:eax = quotient
ret
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
push eax ; [esp] = high dword of quotient
mov eax, [esp+8] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
pop edx ; edx:eax = quotient
ret
; dividend < divisor: quotient = 0
trivial:
xor eax, eax
xor edx, edx ; edx:eax = quotient = 0
ret
; dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor >= 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
push ebx
mov ebx, edx ; ebx = divisor'
ifndef JccLess
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+12] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; quotient overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JccLess
mov edx, [esp+12] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JccLess
mov eax, [esp+12] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+16] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
ifndef JccLess
mov ecx, [esp+20] ; ecx = high dword of divisor
imul ecx, ebx ; ecx = high dword of divisor * quotient"
add edx, ecx ; edx:eax = divisor * quotient"
jc short @f ; divisor * quotient" >= 2**64?
mov ecx, [esp+12] ; ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx ; CF = (dividend < divisor * quotient")
; = (remainder" < 0)
@@:
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
add eax, ebx ; eax = quotient" - (remainder" < 0)
; = (low dword of) quotient
xor edx, edx ; edx:eax = quotient
else ; JccLess
mov ecx, [esp+12] ; ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx ; ecx:... = dividend
; - low dword of divisor * quotient"
mov eax, [esp+20] ; eax = high dword of divisor
imul eax, ebx ; eax = high dword of divisor * quotient"
if 0
sub ecx, eax ; ecx:... = dividend - divisor * quotient"
; = remainder"
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
add eax, ebx ; eax = quotient" - (remainder" < 0)
; = (low dword of) quotient
xor edx, edx ; edx:eax = quotient
else
xor edx, edx ; edx = high dword of quotient = 0
sub ecx, eax ; ecx:... = dividend - divisor * quotient"
; = remainder"
mov eax, ebx ; eax = quotient"
sbb eax, edx ; eax = quotient" - (remainder" < 0)
; = (low dword of) quotient
endif
endif ; JccLess
pop ebx
ret
; dividend >= divisor >= 2**63: quotient = 1
special:
xor eax, eax
xor edx, edx
inc eax ; edx:eax = quotient = 1
ret
__udivdi3 endp
end
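The "normal" and "long" alias "schoolbook" paths of the __udivdi3() listings above, taken when the divisor is below 2**32, can be sketched in C. This is a minimal illustration only, not the routine itself: div64by32() is a hypothetical stand-in for the narrowing 64÷32-bit DIV instruction, and udiv_small() is an assumed name.

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical stand-in for the i386 DIV instruction: divides the 64-bit
   number high:low by a 32-bit divisor; needs high < divisor, else the
   quotient would overflow 32 bits */
static uint32_t div64by32(uint32_t high, uint32_t low, uint32_t divisor,
                          uint32_t *remainder)
{
    uint64_t dividend = ((uint64_t) high << 32) | low;

    assert(high < divisor);
    *remainder = (uint32_t) (dividend % divisor);
    return (uint32_t) (dividend / divisor);
}

/* 64/32-bit division for divisors below 2**32 (assumed name) */
uint64_t udiv_small(uint64_t dividend, uint32_t divisor, uint32_t *remainder)
{
    uint32_t high = (uint32_t) (dividend >> 32);
    uint32_t low = (uint32_t) dividend;
    uint32_t quotient_high = 0;

    /* "long" path: divide the high digit first; its remainder is below the
       divisor and becomes the high dword of the second dividend */
    if (high >= divisor)
        quotient_high = div64by32(0, high, divisor, &high);

    /* "normal" path, also the second step of the "long" path */
    low = div64by32(high, low, divisor, remainder);
    return ((uint64_t) quotient_high << 32) | low;
}
```

Where the assembly raises a division exception for a divisor of 0, this sketch fails its assertion instead.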
Microsoft Visual C compiler helper routine _aulldiv() for i386 processors:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: raises "division exception" when divisor is 0!
.386
.model flat, C
.code
; [esp+16] = high dword of divisor
; [esp+12] = low dword of divisor
; [esp+8] = high dword of dividend
; [esp+4] = low dword of dividend
_aulldiv proc public
mov ecx, [esp+8] ; ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] ; edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+8] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
mov eax, [esp+4] ; edx:eax = dividend
div ecx ; eax = (low dword of) quotient,
; edx = (low dword of) remainder
xor edx, edx ; edx:eax = quotient
ret 16 ; callee restores stack
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
push eax ; [esp] = high dword of quotient
mov eax, [esp+8] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
pop edx ; edx:eax = quotient
ret 16 ; callee restores stack
; dividend < divisor: quotient = 0
trivial:
xor eax, eax
xor edx, edx ; edx:eax = quotient = 0
ret 16 ; callee restores stack
; dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor >= 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
push ebx
mov ebx, edx ; ebx = divisor'
ifndef JccLess
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+12] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; division overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JccLess
mov edx, [esp+12] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JccLess
mov eax, [esp+12] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+16] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
ifndef JccLess
mov ecx, [esp+20] ; ecx = high dword of divisor
imul ecx, ebx ; ecx = high dword of divisor * quotient"
add edx, ecx ; edx:eax = divisor * quotient"
jc short @f ; divisor * quotient" >= 2**64?
mov ecx, [esp+12] ; ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx ; CF = (dividend < divisor * quotient")
; = (remainder" < 0)
@@:
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
add eax, ebx ; eax = quotient" - (remainder" < 0)
; = (low dword of) quotient
xor edx, edx ; edx:eax = quotient
else ; JccLess
mov ecx, [esp+12] ; ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx ; ecx:... = dividend
; - low dword of divisor * quotient"
mov eax, [esp+20] ; eax = high dword of divisor
imul eax, ebx ; eax = high dword of divisor * quotient"
if 0
sub ecx, eax ; ecx:... = dividend - divisor * quotient"
; = remainder"
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
add eax, ebx ; eax = quotient" - (remainder" < 0)
; = (low dword of) quotient
xor edx, edx ; edx:eax = quotient
else
xor edx, edx ; edx = high dword of quotient = 0
sub ecx, eax ; ecx:... = dividend - divisor * quotient"
; = remainder"
mov eax, ebx ; eax = quotient"
sbb eax, edx ; eax = quotient" - (remainder" < 0)
; = (low dword of) quotient
endif
endif ; JccLess
pop ebx
ret 16 ; callee restores stack
; dividend >= divisor >= 2**63: quotient = 1
special:
xor eax, eax
xor edx, edx
inc eax ; edx:eax = quotient = 1
ret 16 ; callee restores stack
_aulldiv endp
end
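The "extended & adjusted" path of the listings above, taken when the divisor is at least 2**32, can be sketched in C as follows. This is a rough model under stated assumptions, not the routine itself: udiv_extended() is a hypothetical name, a portable loop replaces BSR, and the product is compared in 32-bit pieces because C has no carry flag.

```c
#include <assert.h>
#include <stdint.h>

/* quotient for dividend >= divisor >= 2**32; always below 2**32 */
uint32_t udiv_extended(uint64_t dividend, uint64_t divisor)
{
    unsigned zeros = 0;
    uint32_t divisor1, high, low, quotient;
    uint64_t quotient1, product_low, product_high;

    assert((divisor >> 32) != 0 && dividend >= divisor);

    /* number of leading '0' bits of the divisor (BSR, then XOR with 31) */
    while (((divisor << zeros) >> 63) == 0)
        zeros++;
    if (zeros == 0)
        return 1;                   /* divisor >= 2**63: quotient = 1 */

    /* divisor' = upper 32 bits of the normalised divisor */
    divisor1 = (uint32_t) ((divisor << zeros) >> 32);
    high = (uint32_t) (dividend >> 32);
    low = (uint32_t) dividend;

    /* subtract divisor' once if needed, so that the 64/32-bit division
       below cannot overflow; this yields bit 32 of quotient' */
    quotient1 = 0;
    if (high >= divisor1) {
        high -= divisor1;
        quotient1 = 1;
    }
    quotient1 = (quotient1 << 32) | ((((uint64_t) high << 32) | low) / divisor1);

    /* quotient" = quotient' / 2**(32 - zeros): exact or 1 too large */
    quotient = (uint32_t) (quotient1 >> (32 - zeros));

    /* compare divisor * quotient" with the dividend, in 32-bit pieces */
    product_low = (divisor & 0xFFFFFFFFU) * quotient;
    product_high = (divisor >> 32) * quotient + (product_low >> 32);
    if ((product_high >> 32) != 0
     || ((product_high << 32) | (product_low & 0xFFFFFFFFU)) > dividend)
        quotient--;                 /* remainder" < 0: estimate was 1 too large */

    return quotient;
}
```

For example, with dividend 0x3FFFFFFFC and divisor 0x1FFFFFFFF the estimate quotient" is 2, the product 0x3FFFFFFFE exceeds the dividend, and the adjusted quotient is 1.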
__umoddi3() function for i386 processors:
# Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: raises "division exception" when divisor is 0!
.file "umoddi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of divisor
# [esp+12] = low dword of divisor
# [esp+8] = high dword of dividend
# [esp+4] = low dword of dividend
__umoddi3:
mov ecx, [esp+8] # ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] # edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb .trivial # dividend < divisor?
bsr ecx, edx # ecx = index of most significant '1' bit
# in high dword of divisor
jnz .extended # high dword of divisor <> 0?
# remainder < divisor < 2**32
mov ecx, eax # ecx = (low dword of) divisor
mov eax, [esp+8] # eax = high dword of dividend
cmp eax, ecx
jae .long # high dword of dividend >= divisor?
# perform normal division
.normal:
mov edx, eax # edx = high dword of dividend
mov eax, [esp+4] # edx:eax = dividend
div ecx # eax = (low dword of) quotient,
# edx = (low dword of) remainder
mov eax, edx # eax = (low dword of) remainder
xor edx, edx # edx:eax = remainder
ret
# perform "long" alias "schoolbook" division
.long:
# xor edx, edx # edx:eax = high dword of dividend
div ecx # eax = high dword of quotient,
# edx = high dword of remainder'
mov eax, [esp+4] # eax = low dword of dividend
div ecx # eax = low dword of quotient,
# edx = (low dword of) remainder
mov eax, edx # eax = (low dword of) remainder
xor edx, edx # edx:eax = remainder
ret
# dividend < divisor: remainder = dividend
.trivial:
mov eax, [esp+4]
mov edx, [esp+8] # edx:eax = remainder = dividend
ret
# dividend >= divisor >= 2**32: quotient < 2**32
.extended:
xor ecx, 31 # ecx = number of leading '0' bits
# in (high dword of) divisor
jz .special # divisor >= 2**63?
# perform "extended & adjusted" division
shld edx, eax, cl # edx = divisor / 2**(index + 1)
# = divisor'
# shl eax, cl
push ebx
mov ebx, edx # ebx = divisor'
.ifnotdef JCCLESS
xor eax, eax # eax = high dword of quotient' = 0
mov edx, [esp+12] # edx = high dword of dividend
cmp edx, ebx
jb 0f # high dword of dividend < divisor'?
# high dword of dividend >= divisor':
# subtract divisor' from high dword of dividend to prevent possible
# quotient overflow and set most significant bit of quotient"
sub edx, ebx # edx = high dword of dividend - divisor'
# = high dword of dividend'
inc eax # eax = high dword of quotient' = 1
0:
push eax # [esp] = high dword of quotient'
.else # JCCLESS
mov edx, [esp+12] # edx = high dword of dividend
cmp edx, ebx # CF = (high dword of dividend < divisor')
sbb eax, eax # eax = (high dword of dividend < divisor') ? -1 : 0
inc eax # eax = (high dword of dividend < divisor') ? 0 : 1
# = high dword of quotient'
push eax # [esp] = high dword of quotient'
.if 0
neg eax # eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
imul eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
sub edx, eax # edx = high dword of dividend
# - (high dword of dividend < divisor') ? 0 : divisor'
# = high dword of dividend'
.endif # JCCLESS
mov eax, [esp+12] # eax = low dword of dividend
# = low dword of dividend'
div ebx # eax = dividend' / divisor'
# = low dword of quotient',
# edx = remainder'
pop ebx # ebx = high dword of quotient'
shld ebx, eax, cl # ebx = quotient' / 2**(index + 1)
# = dividend / divisor'
# = quotient"
# shl eax, cl
mov eax, [esp+16] # eax = low dword of divisor
mul ebx # edx:eax = low dword of divisor * quotient"
imul ebx, [esp+20] # ebx = high dword of divisor * quotient"
mov ecx, [esp+12] # ecx = high dword of dividend
sub ecx, ebx # ecx = high dword of dividend
# - high dword of divisor * quotient"
mov ebx, [esp+8] # ebx = low dword of dividend
sub ebx, eax
sbb ecx, edx # ecx:ebx = dividend - divisor * quotient"
# = remainder"
.ifnotdef JCCLESS
jnb 1f # remainder" >= 0?
# (with borrow, it is off by divisor,
# and quotient" is off by 1)
add ebx, [esp+16]
adc ecx, [esp+20] # ecx:ebx = remainder" + divisor
# = remainder
1:
mov eax, ebx
mov edx, ecx # edx:eax = remainder
.else # JCCLESS
sbb eax, eax # eax = (remainder" < 0) ? -1 : 0
cdq # edx = (remainder" < 0) ? -1 : 0
and eax, [esp+16]
and edx, [esp+20] # edx:eax = (remainder" < 0) ? divisor : 0
add eax, ebx
adc edx, ecx # edx:eax = remainder" + divisor
# = remainder
.endif # JCCLESS
pop ebx
ret
# dividend >= divisor >= 2**63: remainder = dividend - divisor
.special:
.if 0
mov eax, [esp+4]
mov edx, [esp+8] # edx:eax = dividend
sub eax, [esp+12]
sbb edx, [esp+16] # edx:eax = dividend - divisor
# = remainder
.else
neg edx
neg eax
sbb edx, ecx # edx:eax = -divisor
add eax, [esp+4]
adc edx, [esp+8] # edx:eax = dividend - divisor
# = remainder
.endif
ret
.size __umoddi3, .-__umoddi3
.type __umoddi3, @function
.global __umoddi3
.end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: raises "division exception" when divisor is 0!
.386
.model flat, C
.code
; [esp+16] = high dword of divisor
; [esp+12] = low dword of divisor
; [esp+8] = high dword of dividend
; [esp+4] = low dword of dividend
__umoddi3 proc public
mov ecx, [esp+8] ; ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] ; edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+8] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
mov eax, [esp+4] ; edx:eax = dividend
div ecx ; eax = (low dword of) quotient,
; edx = (low dword of) remainder
mov eax, edx ; eax = (low dword of) remainder
xor edx, edx ; edx:eax = remainder
ret
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
mov eax, [esp+4] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
mov eax, edx ; eax = (low dword of) remainder
xor edx, edx ; edx:eax = remainder
ret
; dividend < divisor: remainder = dividend
trivial:
mov eax, [esp+4]
mov edx, [esp+8] ; edx:eax = remainder = dividend
ret
; dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor >= 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
push ebx
mov ebx, edx ; ebx = divisor'
ifndef JccLess
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+12] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; quotient overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JccLess
mov edx, [esp+12] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JccLess
mov eax, [esp+12] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+16] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
imul ebx, [esp+20] ; ebx = high dword of divisor * quotient"
mov ecx, [esp+12] ; ecx = high dword of dividend
sub ecx, ebx ; ecx = high dword of dividend
; - high dword of divisor * quotient"
mov ebx, [esp+8] ; ebx = low dword of dividend
sub ebx, eax
sbb ecx, edx ; ecx:ebx = dividend - divisor * quotient"
; = remainder"
ifndef JccLess
jnb short @f ; remainder" >= 0?
; (with borrow, it is off by divisor,
; and quotient" is off by 1)
add ebx, [esp+16]
adc ecx, [esp+20] ; ecx:ebx = remainder" + divisor
; = remainder
@@:
mov eax, ebx
mov edx, ecx ; edx:eax = remainder
else ; JccLess
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
cdq ; edx = (remainder" < 0) ? -1 : 0
and eax, [esp+16]
and edx, [esp+20] ; edx:eax = (remainder" < 0) ? divisor : 0
add eax, ebx
adc edx, ecx ; edx:eax = remainder" + divisor
; = remainder
endif ; JccLess
pop ebx
ret
; dividend >= divisor >= 2**63: remainder = dividend - divisor
special:
if 0
mov eax, [esp+4]
mov edx, [esp+8] ; edx:eax = dividend
sub eax, [esp+12]
sbb edx, [esp+16] ; edx:eax = dividend - divisor
; = remainder
else
neg edx
neg eax
sbb edx, ecx ; edx:eax = -divisor
add eax, [esp+4]
adc edx, [esp+8] ; edx:eax = dividend - divisor
; = remainder
endif
ret
__umoddi3 endp
end
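The JccLess variant of the final remainder correction in the __umoddi3() listings above replaces the conditional jump with a mask that selects either the divisor or 0. A rough C analogue is sketched below; umod_adjust() is a hypothetical name. Since C has no carry flag, the sketch derives the mask from the sign bit of the wrapped difference, an assumption that holds only because the divisor is below 2**63 on this path and the estimate is either exact or 1 too large.

```c
#include <assert.h>
#include <stdint.h>

/* remainder from a quotient estimate that is exact or 1 too large */
uint64_t umod_adjust(uint64_t dividend, uint64_t divisor, uint32_t estimate)
{
    uint64_t remainder, mask;

    assert(divisor < (UINT64_C(1) << 63));

    /* remainder" modulo 2**64; wraps around when the estimate is too large */
    remainder = dividend - divisor * estimate;

    /* all ones if remainder" < 0, else 0 (models SBB EAX, EAX and CDQ) */
    mask = (uint64_t) 0 - (remainder >> 63);

    /* add the divisor back only when the estimate was 1 too large */
    return remainder + (divisor & mask);
}
```

With dividend 0x3FFFFFFFC and divisor 0x1FFFFFFFF, both the exact quotient 1 and the estimate 2 yield the remainder 0x1FFFFFFFD.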
Microsoft Visual C compiler helper routine _aullrem() for i386 processors:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: raises "division exception" when divisor is 0!
.386
.model flat, C
.code
; [esp+16] = high dword of divisor
; [esp+12] = low dword of divisor
; [esp+8] = high dword of dividend
; [esp+4] = low dword of dividend
_aullrem proc public
mov ecx, [esp+8] ; ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] ; edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+8] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
mov eax, [esp+4] ; edx:eax = dividend
div ecx ; eax = (low dword of) quotient,
; edx = (low dword of) remainder
mov eax, edx ; eax = (low dword of) remainder
xor edx, edx ; edx:eax = remainder
ret 16 ; callee restores stack
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
mov eax, [esp+4] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
mov eax, edx ; eax = (low dword of) remainder
xor edx, edx ; edx:eax = remainder
ret 16 ; callee restores stack
; dividend < divisor: remainder = dividend
trivial:
mov eax, [esp+4]
mov edx, [esp+8] ; edx:eax = remainder = dividend
ret 16 ; callee restores stack
; dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor >= 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
push ebx
mov ebx, edx ; ebx = divisor'
ifndef JccLess
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+12] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; division overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JccLess
mov edx, [esp+12] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JccLess
mov eax, [esp+12] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+16] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
imul ebx, [esp+20] ; ebx = high dword of divisor * quotient"
mov ecx, [esp+12] ; ecx = high dword of dividend
sub ecx, ebx ; ecx = high dword of dividend
; - high dword of divisor * quotient"
mov ebx, [esp+8] ; ebx = low dword of dividend
sub ebx, eax
sbb ecx, edx ; ecx:ebx = dividend - divisor * quotient"
; = remainder"
ifndef JccLess
jnb short @f ; remainder" >= 0?
; (with borrow, it is off by divisor,
; and quotient" is off by 1)
add ebx, [esp+16]
adc ecx, [esp+20] ; ecx:ebx = remainder" + divisor
; = remainder
@@:
mov eax, ebx
mov edx, ecx ; edx:eax = remainder
else ; JccLess
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
cdq ; edx = (remainder" < 0) ? -1 : 0
and eax, [esp+16]
and edx, [esp+20] ; edx:eax = (remainder" < 0) ? divisor : 0
add eax, ebx
adc edx, ecx ; edx:eax = remainder" + divisor
; = remainder
endif ; JccLess
pop ebx
ret 16 ; callee restores stack
; dividend >= divisor >= 2**63: remainder = dividend - divisor
special:
if 0
mov eax, [esp+4]
mov edx, [esp+8] ; edx:eax = dividend
sub eax, [esp+12]
sbb edx, [esp+16] ; edx:eax = dividend - divisor
; = remainder
else
neg edx
neg eax
sbb edx, ecx ; edx:eax = -divisor
add eax, [esp+4]
adc edx, [esp+8] ; edx:eax = dividend - divisor
; = remainder
endif
ret 16 ; callee restores stack
_aullrem endp
end
__divmoddi4() function for i386 processors:
# Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: returns ±2**63 for -2**63 / -1 and 0 for -2**63 % -1!
# NOTE: raises "division exception" when divisor is 0!
.file "divmoddi4.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+20] = (optional) qword ptr remainder
# [esp+16] = high dword of divisor
# [esp+12] = low dword of divisor
# [esp+8] = high dword of dividend
# [esp+4] = low dword of dividend
__divmoddi4:
mov eax, [esp+16]
mov ecx, [esp+12] # eax:ecx = divisor
cdq # edx = (divisor < 0) ? -1 : 0
xor ecx, edx
xor eax, edx # eax:ecx = (divisor < 0) ? ~divisor : divisor
sub ecx, edx
sbb eax, edx # eax:ecx = (divisor < 0) ? -divisor : divisor
# = |divisor|
push ebx
push edx # [esp] = (divisor < 0)
push [esp+28] # [esp] = address of remainder
push eax
push ecx # [esp] = |divisor|
mov eax, [esp+28]
mov ecx, [esp+24] # eax:ecx = dividend
cdq # edx = (dividend < 0) ? -1 : 0
xor ecx, edx
xor eax, edx # eax:ecx = (dividend < 0) ? ~dividend : dividend
sub ecx, edx
sbb eax, edx # eax:ecx = (dividend < 0) ? -dividend : dividend
# = |dividend|
mov ebx, edx # ebx = (dividend < 0) ? -1 : 0
# = (remainder < 0) ? -1 : 0
push eax
push ecx # [esp] = |dividend|
call __udivmoddi4 # edx:eax = |quotient|
add esp, 16
pop ecx # ecx = address of remainder
test ecx, ebx
jz 0f # address of remainder = 0?
# remainder >= 0?
neg dword ptr [ecx+4]
neg dword ptr [ecx]
sbb dword ptr [ecx+4], 0 # [ecx] = remainder
0:
pop ecx # ecx = (divisor < 0) ? -1 : 0
xor ecx, ebx # ecx = (divisor < 0) ^ (dividend < 0) ? -1 : 0
# = (quotient < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx # edx:eax = (quotient < 0) ? ~|quotient| : |quotient|
sub eax, ecx
sbb edx, ecx # edx:eax = (quotient < 0) ? -|quotient| : |quotient|
# = quotient
pop ebx
ret
.size __divmoddi4, .-__divmoddi4
.type __divmoddi4, @function
.global __divmoddi4
.end
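The sign handling of the __divmoddi4() listing above can be sketched in C: both operands are replaced by their absolute values using the XOR and subtract idiom with a sign mask, the unsigned division is performed, and the signs are applied to the results the same way. This is an illustration under assumptions: divmod_signed() is a hypothetical name, the native / and % operators stand in for the call to __udivmoddi4(), and the conversions from uint64_t back to int64_t assume two's complement wrap-around, which is implementation-defined in older C standards.

```c
#include <assert.h>
#include <stdint.h>

/* signed 64-bit division via unsigned division (assumed name) */
int64_t divmod_signed(int64_t dividend, int64_t divisor, int64_t *remainder)
{
    /* sign masks: all ones if the operand is negative, else 0 (models CDQ) */
    uint64_t divisor_sign = (uint64_t) 0 - (uint64_t) (divisor < 0);
    uint64_t dividend_sign = (uint64_t) 0 - (uint64_t) (dividend < 0);
    uint64_t quotient_sign = divisor_sign ^ dividend_sign;

    /* (x ^ mask) - mask negates x if mask is all ones, else leaves x as is */
    uint64_t d = ((uint64_t) divisor ^ divisor_sign) - divisor_sign;
    uint64_t n = ((uint64_t) dividend ^ dividend_sign) - dividend_sign;

    uint64_t q, r;

    assert(divisor != 0);
    q = n / d;                      /* stands in for __udivmoddi4() */
    r = n % d;

    /* the remainder takes the sign of the dividend */
    if (remainder != 0)
        *remainder = (int64_t) ((r ^ dividend_sign) - dividend_sign);

    /* the quotient is negative if exactly one operand is negative */
    return (int64_t) ((q ^ quotient_sign) - quotient_sign);
}
```

On a two's complement implementation, dividing -2**63 by -1 returns -2**63 with a remainder of 0, matching the note at the head of the listing.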
Microsoft Visual C compiler helper routine _alldvrm() for i386 processors:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: returns quotient in EDX:EAX and remainder in EBX:ECX
; NOTE: returns ±2**63 for -2**63 / -1!
; NOTE: raises "division exception" when divisor is 0!
.386
.model flat, C
.code
; [esp+16] = high dword of divisor
; [esp+12] = low dword of divisor
; [esp+8] = high dword of dividend
; [esp+4] = low dword of dividend
_alldvrm proc public
; determine sign of dividend and compute |dividend|
mov edx, [esp+8]
mov eax, [esp+4] ; edx:eax = dividend
mov ebx, edx
sar ebx, 31 ; ebx = (dividend < 0) ? -1 : 0
; = (remainder < 0) ? -1 : 0
xor eax, ebx
xor edx, ebx ; edx:eax = (dividend < 0) ? ~dividend : dividend
sub eax, ebx
sbb edx, ebx ; edx:eax = (dividend < 0) ? -dividend : dividend
; = |dividend|
mov [esp+4], eax ; write |dividend| back on stack
mov [esp+8], edx
; determine sign of divisor and compute |divisor|
mov edx, [esp+16]
mov eax, [esp+12] ; edx:eax = divisor
mov ecx, edx
sar ecx, 31 ; ecx = (divisor < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx ; edx:eax = (divisor < 0) ? ~divisor : divisor
sub eax, ecx
sbb edx, ecx ; edx:eax = (divisor < 0) ? -divisor : divisor
; = |divisor|
mov [esp+12], eax ; write |divisor| back on stack
mov [esp+16], edx
xor ecx, ebx ; ecx = (divisor < 0) ^ (dividend < 0) ? -1 : 0
; = (quotient < 0) ? -1 : 0
push ecx ; [esp] = (quotient < 0) ? -1 : 0
push ebx ; [esp] = (remainder < 0) ? -1 : 0
mov ecx, [esp+16] ; ecx = high dword of dividend
cmp [esp+12], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+16] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
xor ebx, ebx ; ebx = high dword of quotient = 0
jmp short next
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
mov ebx, eax ; ebx = high dword of quotient
next:
mov eax, [esp+12] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
mov ecx, edx ; ecx = (low dword of) |remainder|
mov edx, ebx ; edx:eax = |quotient|
;; xor ebx, ebx ; ebx:ecx = |remainder|
if 0
mov ebx, [esp+4] ; ebx = (quotient < 0) ? -1 : 0
xor eax, ebx
xor edx, ebx
sub eax, ebx
sbb edx, ebx ; edx:eax = quotient
pop ebx ; ebx = (remainder < 0) ? -1 : 0
xor ecx, ebx
sub ecx, ebx
sbb ebx, ebx ; ebx:ecx = remainder
else
pop ebx ; ebx = (remainder < 0) ? -1 : 0
xor ecx, ebx
sub ecx, ebx
sbb ebx, ebx ; ebx:ecx = remainder
xor eax, [esp]
xor edx, [esp]
sub eax, [esp]
sbb edx, [esp] ; edx:eax = quotient
endif
add esp, 4
ret 16 ; callee restores stack
; dividend < divisor: quotient = 0, remainder = dividend
trivial:
pop eax ; eax = (remainder < 0) ? -1 : 0
mov ecx, [esp+8]
mov ebx, [esp+12] ; ebx:ecx = |remainder| = |dividend|
xor ecx, eax
xor ebx, eax
sub ecx, eax
sbb ebx, eax ; ebx:ecx = remainder
pop edx ; edx = (quotient < 0) ? -1 : 0
xor eax, eax
xor edx, edx ; edx:eax = quotient = 0
ret 16 ; callee restores stack
; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor = 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
mov ebx, edx ; ebx = divisor'
ifndef JccLess
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; division overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JccLess
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JccLess
mov eax, [esp+16] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
push ebx ; [esp] = quotient"
mov eax, [esp+24] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
imul ebx, [esp+28] ; ebx = high dword of divisor * quotient"
add edx, ebx ; edx:eax = divisor * quotient"
mov ecx, [esp+16]
mov ebx, [esp+20] ; ebx:ecx = dividend
sub ecx, eax
sbb ebx, edx ; ebx:ecx = dividend - divisor * quotient"
; = remainder"
ifndef JccLess
pop eax ; eax = quotient"
jnb short @f ; remainder" >= 0?
; (with borrow, it is off by divisor,
; and quotient" is off by 1)
add ecx, [esp+20]
adc ebx, [esp+24] ; ebx:ecx = remainder" + divisor
; = |remainder|
dec eax ; eax = quotient" - 1
; = low dword of |quotient|
@@:
else ; JccLess
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
cdq ; edx = (remainder" < 0) ? -1 : 0
add [esp], eax ; [esp] = quotient" - (remainder" < 0)
; = low dword of |quotient|
and eax, [esp+24]
and edx, [esp+28] ; edx:eax = (remainder" < 0) ? divisor : 0
add ecx, eax
adc ebx, edx ; ebx:ecx = remainder" + divisor
; = |remainder|
pop eax ; eax = (low dword of) |quotient|
endif ; JccLess
;; xor edx, edx ; edx:eax = |quotient|
pop edx ; edx = (remainder < 0) ? -1 : 0
xor ecx, edx
xor ebx, edx
sub ecx, edx
sbb ebx, edx ; ebx:ecx = remainder
pop edx ; edx = (quotient < 0) ? -1 : 0
xor eax, edx
sub eax, edx
sbb edx, edx ; edx:eax = quotient
ret 16 ; callee restores stack
; dividend = divisor = -2**63: quotient = 1, remainder = 0
special:
pop ebx ; ebx = sign of remainder = -1
inc ebx
;; xor ecx, ecx ; ebx:ecx = remainder = 0
pop eax ; eax = sign of quotient = 0
inc eax ; eax = (low dword of) quotient = 1
xor edx, edx ; edx:eax = quotient = 1
ret 16 ; callee restores stack
_alldvrm endp
end
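The sign handling used by the signed routines (build a mask that is −1 or 0, XOR with it, then subtract it) can be modelled in C. The following is a minimal sketch under stated assumptions, not the code of the routines themselves: the names sign_mask, apply_sign, magnitude, divrem64 and signed_divrem are hypothetical, and the C operators / and % merely stand in for the unsigned division core.

```c
#include <assert.h>
#include <stdint.h>

/* All-ones mask if x is negative, else 0: the role of CDQ or SAR reg, 31. */
static uint64_t sign_mask(int64_t x)
{
    return (uint64_t)0 - (uint64_t)(x < 0);
}

/* Conditional negation: (v ^ 0) - 0 = v, while (v ^ -1) - (-1) = ~v + 1 = -v. */
static uint64_t apply_sign(uint64_t value, uint64_t mask)
{
    return (value ^ mask) - mask;
}

/* |x| as an unsigned number; -2**63 maps to 2**63 without signed overflow. */
static uint64_t magnitude(int64_t x)
{
    return apply_sign((uint64_t)x, sign_mask(x));
}

typedef struct { int64_t quot, rem; } divrem64;   /* hypothetical result type */

static divrem64 signed_divrem(int64_t dividend, int64_t divisor)
{
    assert(divisor != 0);
    uint64_t rem_mask  = sign_mask(dividend);            /* remainder follows the dividend */
    uint64_t quot_mask = rem_mask ^ sign_mask(divisor);  /* quotient: XOR of both signs */
    uint64_t n = magnitude(dividend);
    uint64_t d = magnitude(divisor);
    divrem64 result;
    /* Converting a value above INT64_MAX back to int64_t is implementation-defined;
       it wraps on the usual two's complement compilers. */
    result.quot = (int64_t)apply_sign(n / d, quot_mask);
    result.rem  = (int64_t)apply_sign(n % d, rem_mask);
    return result;
}
```

With a wrapping conversion, signed_divrem(INT64_MIN, -1) yields a quotient of −2**63 and a remainder of 0, which corresponds to the "returns ±2**63 for -2**63 / -1" note in the listings.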
__divdi3() function for i386 processors:
# Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: returns ±2**63 for -2**63 / -1!
# NOTE: raises "division exception" when divisor is 0!
.file "divdi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of divisor
# [esp+12] = low dword of divisor
# [esp+8] = high dword of dividend
# [esp+4] = low dword of dividend
__divdi3:
# determine sign of dividend and compute |dividend|
mov eax, [esp+8]
mov ecx, [esp+4] # eax:ecx = dividend
cdq # edx = (dividend < 0) ? -1 : 0
xor ecx, edx
xor eax, edx # eax:ecx = (dividend < 0) ? ~dividend : dividend
sub ecx, edx
sbb eax, edx # eax:ecx = (dividend < 0) ? -dividend : dividend
# = |dividend|
mov [esp+4], ecx # write |dividend| back on stack
mov [esp+8], eax
push edx # [esp] = (dividend < 0) ? -1 : 0
# determine sign of divisor and compute |divisor|
mov edx, [esp+20]
mov eax, [esp+16] # edx:eax = divisor
mov ecx, edx
sar ecx, 31 # ecx = (divisor < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx # edx:eax = (divisor < 0) ? ~divisor : divisor
sub eax, ecx
sbb edx, ecx # edx:eax = (divisor < 0) ? -divisor : divisor
# = |divisor|
mov [esp+16], eax # write |divisor| back on stack
mov [esp+20], edx
xor [esp], ecx # [esp] = (dividend < 0) ^ (divisor < 0) ? -1 : 0
# = (quotient < 0) ? -1 : 0
mov ecx, [esp+12] # ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx
jb .trivial # dividend < divisor?
bsr ecx, edx # ecx = index of most significant '1' bit
# in high dword of divisor
jnz .extended # high dword of divisor <> 0?
# remainder < divisor < 2**32
mov ecx, eax # ecx = (low dword of) divisor
mov eax, [esp+12] # eax = high dword of dividend
cmp eax, ecx
jae .long # high dword of dividend >= divisor?
# perform normal division
.normal:
mov edx, eax # edx = high dword of dividend
mov eax, [esp+8] # edx:eax = dividend
div ecx # eax = (low dword of) quotient,
# edx = (low dword of) remainder
# xor edx, edx # edx:eax = |quotient|
jmp .quotient
# perform "long" alias "schoolbook" division
.long:
# xor edx, edx # edx:eax = high dword of dividend
div ecx # eax = high dword of quotient,
# edx = high dword of remainder'
push eax # [esp] = high dword of quotient
mov eax, [esp+12] # eax = low dword of dividend
div ecx # eax = low dword of quotient,
# edx = (low dword of) remainder
pop edx # edx:eax = |quotient|
pop ecx # ecx = (quotient < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx
sub eax, ecx
sbb edx, ecx # edx:eax = quotient
ret
# dividend < divisor: quotient = 0
.trivial:
pop ecx # ecx = (quotient < 0) ? -1 : 0
xor eax, eax
xor edx, edx # edx:eax = quotient = 0
ret
# 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
.extended:
xor ecx, 31 # ecx = number of leading '0' bits
# in (high dword of) divisor
jz .special # divisor = 2**63?
# perform "extended & adjusted" division
shld edx, eax, cl # edx = divisor / 2**(index + 1)
# = divisor'
# shl eax, cl
push ebx
mov ebx, edx # ebx = divisor'
.ifnotdef JCCLESS
xor eax, eax # eax = high dword of quotient' = 0
mov edx, [esp+16] # edx = high dword of dividend
cmp edx, ebx
jb 0f # high dword of dividend < divisor'?
# high dword of dividend >= divisor':
# subtract divisor' from high dword of dividend to prevent possible
# quotient overflow and set most significant bit of quotient"
sub edx, ebx # edx = high dword of dividend - divisor'
# = high dword of dividend'
inc eax # eax = high dword of quotient' = 1
0:
push eax # [esp] = high dword of quotient'
.else # JCCLESS
mov edx, [esp+16] # edx = high dword of dividend
cmp edx, ebx # CF = (high dword of dividend < divisor')
sbb eax, eax # eax = (high dword of dividend < divisor') ? -1 : 0
inc eax # eax = (high dword of dividend < divisor') ? 0 : 1
# = high dword of quotient'
push eax # [esp] = high dword of quotient'
.if 0
neg eax # eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
imul eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
sub edx, eax # edx = high dword of dividend
# - (high dword of dividend < divisor') ? 0 : divisor'
# = high dword of dividend'
.endif # JCCLESS
mov eax, [esp+16] # eax = low dword of dividend
# = low dword of dividend'
div ebx # eax = dividend' / divisor'
# = low dword of quotient',
# edx = remainder'
pop ebx # ebx = high dword of quotient'
shld ebx, eax, cl # ebx = quotient' / 2**(index + 1)
# = dividend / divisor'
# = quotient"
# shl eax, cl
mov eax, [esp+20] # eax = low dword of divisor
mul ebx # edx:eax = low dword of divisor * quotient"
mov ecx, [esp+24] # ecx = high dword of divisor
imul ecx, ebx # ecx = high dword of divisor * quotient"
add edx, ecx # edx:eax = divisor * quotient"
# jc 1f # divisor * quotient" >= 2**64?
mov ecx, [esp+16] # ecx = high dword of dividend
cmp [esp+12], eax
sbb ecx, edx # CF = (dividend < divisor * quotient")
# = (remainder" < 0)
1:
sbb eax, eax # eax = (remainder" < 0) ? -1 : 0
add eax, ebx # eax = quotient" - (remainder" < 0)
# = (low dword of) |quotient|
# xor edx, edx # edx:eax = |quotient|
pop ebx
.quotient:
pop edx # edx = (quotient < 0) ? -1 : 0
xor eax, edx
sub eax, edx
sbb edx, edx # edx:eax = quotient
ret
# dividend = divisor = -2**63: quotient = 1
.special:
pop eax # eax = sign of quotient = 0
inc eax # eax = (low dword of) quotient = 1
xor edx, edx # edx:eax = quotient = 1
ret
.size __divdi3, .-__divdi3
.type __divdi3, @function
.global __divdi3
.end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: returns ±2**63 for -2**63 / -1!
; NOTE: raises "division exception" when divisor is 0!
.386
.model flat, C
.code
; [esp+16] = high dword of divisor
; [esp+12] = low dword of divisor
; [esp+8] = high dword of dividend
; [esp+4] = low dword of dividend
__divdi3 proc public
; determine sign of dividend and compute |dividend|
mov eax, [esp+8]
mov ecx, [esp+4] ; eax:ecx = dividend
cdq ; edx = (dividend < 0) ? -1 : 0
xor ecx, edx
xor eax, edx ; eax:ecx = (dividend < 0) ? ~dividend : dividend
sub ecx, edx
sbb eax, edx ; eax:ecx = (dividend < 0) ? -dividend : dividend
; = |dividend|
mov [esp+4], ecx ; write |dividend| back on stack
mov [esp+8], eax
push edx ; [esp] = (dividend < 0) ? -1 : 0
; determine sign of divisor and compute |divisor|
mov edx, [esp+20]
mov eax, [esp+16] ; edx:eax = divisor
mov ecx, edx
sar ecx, 31 ; ecx = (divisor < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx ; edx:eax = (divisor < 0) ? ~divisor : divisor
sub eax, ecx
sbb edx, ecx ; edx:eax = (divisor < 0) ? -divisor : divisor
; = |divisor|
mov [esp+16], eax ; write |divisor| back on stack
mov [esp+20], edx
xor [esp], ecx ; [esp] = (dividend < 0) ^ (divisor < 0) ? -1 : 0
; = (quotient < 0) ? -1 : 0
mov ecx, [esp+12] ; ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+12] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
mov eax, [esp+8] ; edx:eax = dividend
div ecx ; eax = (low dword of) quotient,
; edx = (low dword of) remainder
;; xor edx, edx ; edx:eax = |quotient|
jmp short quotient
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
push eax ; [esp] = high dword of quotient
mov eax, [esp+12] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
pop edx ; edx:eax = |quotient|
pop ecx ; ecx = (quotient < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx
sub eax, ecx
sbb edx, ecx ; edx:eax = quotient
ret
; dividend < divisor: quotient = 0
trivial:
pop ecx ; ecx = (quotient < 0) ? -1 : 0
xor eax, eax
xor edx, edx ; edx:eax = quotient = 0
ret
; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor = 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
push ebx
mov ebx, edx ; ebx = divisor'
ifndef JccLess
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; quotient overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JccLess
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JccLess
mov eax, [esp+16] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+20] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
mov ecx, [esp+24] ; ecx = high dword of divisor
imul ecx, ebx ; ecx = high dword of divisor * quotient"
add edx, ecx ; edx:eax = divisor * quotient"
mov ecx, [esp+16] ; ecx = high dword of dividend
cmp [esp+12], eax
sbb ecx, edx ; CF = (dividend < divisor * quotient")
; = (remainder" < 0)
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
add eax, ebx ; eax = quotient" - (remainder" < 0)
; = (low dword of) |quotient|
;; xor edx, edx ; edx:eax = |quotient|
pop ebx
quotient:
pop edx ; edx = (quotient < 0) ? -1 : 0
xor eax, edx
sub eax, edx
sbb edx, edx ; edx:eax = quotient
ret
; dividend = divisor = -2**63: quotient = 1
special:
pop eax ; eax = sign of quotient = 0
inc eax ; eax = (low dword of) quotient = 1
xor edx, edx ; edx:eax = quotient = 1
ret
__divdi3 endp
end
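The "long" alias "schoolbook" path, taken when the divisor is less than 2**32, chains two narrowing divisions and feeds the remainder of the first into the second. A minimal C sketch of this step, assuming the hypothetical name udiv64by32 and using 64-bit C arithmetic in place of the two DIV instructions:

```c
#include <assert.h>
#include <stdint.h>

/* Divide a 64-bit dividend by a 32-bit divisor with 32-bit "digits". */
static uint64_t udiv64by32(uint64_t dividend, uint32_t divisor, uint32_t *remainder)
{
    assert(divisor != 0);
    uint32_t high = (uint32_t)(dividend >> 32);
    uint32_t low  = (uint32_t)dividend;

    /* First DIV: 0:high / divisor. In the "normal" path high < divisor,
       so this quotient digit is 0 and the division is skipped. */
    uint32_t quotient_high  = high / divisor;
    uint32_t remainder_high = high % divisor;

    /* Second DIV: remainder_high:low / divisor. Since remainder_high < divisor,
       the quotient fits in 32 bits and the division cannot overflow. */
    uint64_t partial = ((uint64_t)remainder_high << 32) | low;
    uint32_t quotient_low = (uint32_t)(partial / divisor);
    *remainder = (uint32_t)(partial % divisor);

    return ((uint64_t)quotient_high << 32) | quotient_low;
}
```

For example, udiv64by32(10000000000, 3, &r) returns 3333333333 and stores 1 in r.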
Microsoft Visual C compiler helper routine _alldiv() for i386 processors:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: returns ±2**63 for -2**63 / -1!
; NOTE: raises "division exception" when divisor is 0!
.386
.model flat, C
.code
; [esp+16] = high dword of divisor
; [esp+12] = low dword of divisor
; [esp+8] = high dword of dividend
; [esp+4] = low dword of dividend
_alldiv proc public
; determine sign of dividend and compute |dividend|
mov eax, [esp+8]
mov ecx, [esp+4] ; eax:ecx = dividend
cdq ; edx = (dividend < 0) ? -1 : 0
xor ecx, edx
xor eax, edx ; eax:ecx = (dividend < 0) ? ~dividend : dividend
sub ecx, edx
sbb eax, edx ; eax:ecx = (dividend < 0) ? -dividend : dividend
; = |dividend|
mov [esp+4], ecx ; write |dividend| back on stack
mov [esp+8], eax
push edx ; [esp] = (dividend < 0) ? -1 : 0
; determine sign of divisor and compute |divisor|
mov edx, [esp+20]
mov eax, [esp+16] ; edx:eax = divisor
mov ecx, edx
sar ecx, 31 ; ecx = (divisor < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx ; edx:eax = (divisor < 0) ? ~divisor : divisor
sub eax, ecx
sbb edx, ecx ; edx:eax = (divisor < 0) ? -divisor : divisor
; = |divisor|
mov [esp+16], eax ; write |divisor| back on stack
mov [esp+20], edx
xor [esp], ecx ; [esp] = (dividend < 0) ^ (divisor < 0) ? -1 : 0
; = (quotient < 0) ? -1 : 0
mov ecx, [esp+12] ; ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+12] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
mov eax, [esp+8] ; edx:eax = dividend
div ecx ; eax = (low dword of) quotient,
; edx = (low dword of) remainder
;; xor edx, edx ; edx:eax = |quotient|
jmp short quotient
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
push eax ; [esp] = high dword of quotient
mov eax, [esp+12] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
pop edx ; edx:eax = |quotient|
pop ecx ; ecx = (quotient < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx
sub eax, ecx
sbb edx, ecx ; edx:eax = quotient
ret 16 ; callee restores stack
; dividend < divisor: quotient = 0
trivial:
pop ecx ; ecx = (quotient < 0) ? -1 : 0
xor eax, eax
xor edx, edx ; edx:eax = quotient = 0
ret 16 ; callee restores stack
; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor = 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
push ebx
mov ebx, edx ; ebx = divisor'
ifndef JccLess
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; division overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JccLess
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JccLess
mov eax, [esp+16] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+20] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
mov ecx, [esp+24] ; ecx = high dword of divisor
imul ecx, ebx ; ecx = high dword of divisor * quotient"
add edx, ecx ; edx:eax = divisor * quotient"
mov ecx, [esp+16] ; ecx = high dword of dividend
cmp [esp+12], eax
sbb ecx, edx ; CF = (dividend < divisor * quotient")
; = (remainder" < 0)
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
add eax, ebx ; eax = quotient" - (remainder" < 0)
; = (low dword of) |quotient|
;; xor edx, edx ; edx:eax = |quotient|
pop ebx
quotient:
pop edx ; edx = (quotient < 0) ? -1 : 0
xor eax, edx
sub eax, edx
sbb edx, edx ; edx:eax = quotient
ret 16 ; callee restores stack
; dividend = divisor = -2**63: quotient = 1
special:
pop eax ; eax = sign of quotient = 0
inc eax ; eax = (low dword of) quotient = 1
xor edx, edx ; edx:eax = quotient = 1
ret 16 ; callee restores stack
_alldiv endp
end
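The "extended & adjusted" path for divisors of 2**32 or more can be modelled in C as follows. This is a sketch under stated assumptions rather than a transcription of the listing: the name udiv64_extended is hypothetical, both operands are assumed to be magnitudes not greater than 2**63 (as in the signed routines), and a single C division replaces the overflow-guarded DIV instruction.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t udiv64_extended(uint64_t dividend, uint64_t divisor)
{
    const uint64_t limit = (uint64_t)1 << 63;
    /* Preconditions assumed by this sketch. */
    assert(divisor >> 32 != 0 && divisor <= dividend && dividend <= limit);

    /* Number of leading '0' bits in the high dword of the divisor (BSR, XOR 31). */
    uint32_t high = (uint32_t)(divisor >> 32);
    unsigned zeros = 0;
    while ((high & 0x80000000u) == 0) {
        high <<= 1;
        ++zeros;
    }

    /* divisor' = upper 32 bits of the normalised divisor (SHLD). */
    uint32_t divisor1 = (uint32_t)((divisor << zeros) >> 32);

    /* quotient' = dividend / divisor', which may need 33 bits. */
    uint64_t quotient1 = dividend / divisor1;

    /* quotient" = quotient' / 2**(32 - zeros); it is never smaller than
       the true quotient and at most 1 greater. */
    uint64_t quotient2 = quotient1 >> (32 - zeros);

    /* Adjust: if divisor * quotient" exceeds the dividend, it was 1 too large. */
    if (quotient2 * divisor > dividend)
        --quotient2;
    return quotient2;
}
```

Dividing 2**63 by 2**32 + 1 exercises the adjustment: the estimate is 2**31, the product exceeds the dividend, and the result is corrected to 2**31 − 1.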
__moddi3() function for i386 processors:
# Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
# NOTE: raises "division exception" when divisor is 0!
.file "moddi3.s"
.arch generic32
.code32
.intel_syntax noprefix
.text
# [esp+16] = high dword of divisor
# [esp+12] = low dword of divisor
# [esp+8] = high dword of dividend
# [esp+4] = low dword of dividend
__moddi3:
# determine sign of dividend and compute |dividend|
mov eax, [esp+8]
mov ecx, [esp+4] # eax:ecx = dividend
cdq # edx = (dividend < 0) ? -1 : 0
xor ecx, edx
xor eax, edx # eax:ecx = (dividend < 0) ? ~dividend : dividend
sub ecx, edx
sbb eax, edx # eax:ecx = (dividend < 0) ? -dividend : dividend
# = |dividend|
mov [esp+4], ecx # write |dividend| back on stack
mov [esp+8], eax
push edx # [esp] = (dividend < 0) ? -1 : 0
# determine sign of divisor and compute |divisor|
mov edx, [esp+20]
mov eax, [esp+16] # edx:eax = divisor
mov ecx, edx
sar ecx, 31 # ecx = (divisor < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx # edx:eax = (divisor < 0) ? ~divisor : divisor
sub eax, ecx
sbb edx, ecx # edx:eax = (divisor < 0) ? -divisor : divisor
# = |divisor|
mov [esp+16], eax # write |divisor| back on stack
mov [esp+20], edx
mov ecx, [esp+12] # ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx
jb .trivial # dividend < divisor?
bsr ecx, edx # ecx = index of most significant '1' bit
# in high dword of divisor
jnz .extended # high dword of divisor <> 0?
# remainder < divisor < 2**32
mov ecx, eax # ecx = (low dword of) divisor
mov eax, [esp+12] # eax = high dword of dividend
cmp eax, ecx
jae .long # high dword of dividend >= divisor?
# perform normal division
.normal:
mov edx, eax # edx = high dword of dividend
jmp .next
# perform "long" alias "schoolbook" division
.long:
# xor edx, edx # edx:eax = high dword of dividend
div ecx # eax = high dword of quotient,
# edx = high dword of remainder'
.next:
mov eax, [esp+8] # eax = low dword of dividend
div ecx # eax = low dword of quotient,
# edx = (low dword of) remainder
mov eax, edx # eax = (low dword of) |remainder|
# xor edx, edx # edx:eax = |remainder|
pop edx # edx = (remainder < 0) ? -1 : 0
xor eax, edx
sub eax, edx
sbb edx, edx # edx:eax = remainder
ret
# dividend < divisor: remainder = dividend
.trivial:
mov eax, [esp+8]
mov edx, [esp+12] # edx:eax = |remainder| = |dividend|
jmp .remainder
# 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
.extended:
xor ecx, 31 # ecx = number of leading '0' bits
# in (high dword of) divisor
jz .special # divisor = 2**63?
# perform "extended & adjusted" division
shld edx, eax, cl # edx = divisor / 2**(index + 1)
# = divisor'
# shl eax, cl
push ebx
mov ebx, edx # ebx = divisor'
.ifnotdef JCCLESS
xor eax, eax # eax = high dword of quotient' = 0
mov edx, [esp+16] # edx = high dword of dividend
cmp edx, ebx
jb 0f # high dword of dividend < divisor'?
# high dword of dividend >= divisor':
# subtract divisor' from high dword of dividend to prevent possible
# quotient overflow and set most significant bit of quotient"
sub edx, ebx # edx = high dword of dividend - divisor'
# = high dword of dividend'
inc eax # eax = high dword of quotient' = 1
0:
push eax # [esp] = high dword of quotient'
.else # JCCLESS
mov edx, [esp+16] # edx = high dword of dividend
cmp edx, ebx # CF = (high dword of dividend < divisor')
sbb eax, eax # eax = (high dword of dividend < divisor') ? -1 : 0
inc eax # eax = (high dword of dividend < divisor') ? 0 : 1
# = high dword of quotient'
push eax # [esp] = high dword of quotient'
.if 0
neg eax # eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.else
imul eax, ebx # eax = (high dword of dividend < divisor') ? 0 : divisor'
.endif
sub edx, eax # edx = high dword of dividend
# - (high dword of dividend < divisor') ? 0 : divisor'
# = high dword of dividend'
.endif # JCCLESS
mov eax, [esp+16] # eax = low dword of dividend
# = low dword of dividend'
div ebx # eax = dividend' / divisor'
# = low dword of quotient',
# edx = remainder'
pop ebx # ebx = high dword of quotient'
shld ebx, eax, cl # ebx = quotient' / 2**(index + 1)
# = dividend / divisor'
# = quotient"
# shl eax, cl
mov eax, [esp+20] # eax = low dword of divisor
mul ebx # edx:eax = low dword of divisor * quotient"
imul ebx, [esp+24] # ebx = high dword of divisor * quotient"
add edx, ebx # edx:eax = divisor * quotient"
mov ecx, [esp+12]
mov ebx, [esp+16] # ebx:ecx = dividend
sub ecx, eax
sbb ebx, edx # ebx:ecx = dividend - divisor * quotient"
# = remainder"
.ifnotdef JCCLESS
jnb 1f # remainder" >= 0?
# (with borrow, it is off by divisor,
# and quotient" is off by 1)
add ecx, [esp+20]
adc ebx, [esp+24] # ebx:ecx = remainder" + divisor
# = |remainder|
1:
mov eax, ecx
mov edx, ebx # edx:eax = |remainder|
.else # JCCLESS
sbb eax, eax # eax = (remainder" < 0) ? -1 : 0
cdq # edx = (remainder" < 0) ? -1 : 0
and eax, [esp+20]
and edx, [esp+24] # edx:eax = (remainder" < 0) ? divisor : 0
add eax, ecx
adc edx, ebx # edx:eax = remainder" + divisor
# = |remainder|
.endif # JCCLESS
pop ebx
.remainder:
pop ecx # ecx = (remainder < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx
sub eax, ecx
sbb edx, ecx # edx:eax = remainder
ret
# dividend = divisor = -2**63: remainder = 0
.special:
pop eax # eax = sign of remainder = -1
inc eax
xor edx, edx # edx:eax = remainder = 0
ret
.size __moddi3, .-__moddi3
.type __moddi3, @function
.global __moddi3
.end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: raises "division exception" when divisor is 0!
.386
.model flat, C
.code
; [esp+16] = high dword of divisor
; [esp+12] = low dword of divisor
; [esp+8] = high dword of dividend
; [esp+4] = low dword of dividend
__moddi3 proc public
; determine sign of dividend and compute |dividend|
mov eax, [esp+8]
mov ecx, [esp+4] ; eax:ecx = dividend
cdq ; edx = (dividend < 0) ? -1 : 0
xor ecx, edx
xor eax, edx ; eax:ecx = (dividend < 0) ? ~dividend : dividend
sub ecx, edx
sbb eax, edx ; eax:ecx = (dividend < 0) ? -dividend : dividend
; = |dividend|
mov [esp+4], ecx ; write |dividend| back on stack
mov [esp+8], eax
push edx ; [esp] = (dividend < 0) ? -1 : 0
; determine sign of divisor and compute |divisor|
mov edx, [esp+20]
mov eax, [esp+16] ; edx:eax = divisor
mov ecx, edx
sar ecx, 31 ; ecx = (divisor < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx ; edx:eax = (divisor < 0) ? ~divisor : divisor
sub eax, ecx
sbb edx, ecx ; edx:eax = (divisor < 0) ? -divisor : divisor
; = |divisor|
mov [esp+16], eax ; write |divisor| back on stack
mov [esp+20], edx
mov ecx, [esp+12] ; ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+12] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
jmp short next
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
next:
mov eax, [esp+8] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
mov eax, edx ; eax = (low dword of) |remainder|
;; xor edx, edx ; edx:eax = |remainder|
pop edx ; edx = (remainder < 0) ? -1 : 0
xor eax, edx
sub eax, edx
sbb edx, edx ; edx:eax = remainder
ret
; dividend < divisor: remainder = dividend
trivial:
mov eax, [esp+8]
mov edx, [esp+12] ; edx:eax = |remainder| = |dividend|
jmp short remainder
; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor = 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
push ebx
mov ebx, edx ; ebx = divisor'
ifndef JccLess
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; quotient overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JccLess
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JccLess
mov eax, [esp+16] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+20] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
imul ebx, [esp+24] ; ebx = high dword of divisor * quotient"
add edx, ebx ; edx:eax = divisor * quotient"
mov ecx, [esp+12]
mov ebx, [esp+16] ; ebx:ecx = dividend
sub ecx, eax
sbb ebx, edx ; ebx:ecx = dividend - divisor * quotient"
; = remainder"
ifndef JccLess
jnb short @f ; remainder" >= 0?
; (with borrow, it is off by divisor,
; and quotient" is off by 1)
add ecx, [esp+20]
adc ebx, [esp+24] ; ebx:ecx = remainder" + divisor
; = |remainder|
@@:
mov eax, ecx
mov edx, ebx ; edx:eax = |remainder|
else ; JccLess
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
cdq ; edx = (remainder" < 0) ? -1 : 0
and eax, [esp+20]
and edx, [esp+24] ; edx:eax = (remainder" < 0) ? divisor : 0
add eax, ecx
adc edx, ebx ; edx:eax = remainder" + divisor
; = |remainder|
endif ; JccLess
pop ebx
remainder:
pop ecx ; ecx = (remainder < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx
sub eax, ecx
sbb edx, ecx ; edx:eax = remainder
ret
; dividend = divisor = -2**63: remainder = 0
special:
pop eax ; eax = sign of remainder = -1
inc eax
xor edx, edx ; edx:eax = remainder = 0
ret
__moddi3 endp
end
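The remainder routines derive the remainder from the approximate quotient by multiplying back, subtracting, and adding the divisor once more when the subtraction borrows. A minimal C sketch of the branch-free variant, assuming the hypothetical name urem64_from_estimate, an estimate that is either the true quotient or 1 greater, and a product that fits in 64 bits:

```c
#include <assert.h>
#include <stdint.h>

static uint64_t urem64_from_estimate(uint64_t dividend, uint64_t divisor,
                                     uint64_t estimate)
{
    assert(divisor != 0);
    /* divisor * quotient" (MUL, IMUL, ADD in the listings) */
    uint64_t product = divisor * estimate;
    /* remainder" = dividend - divisor * quotient"; wraps around on borrow */
    uint64_t remainder = dividend - product;
    /* all-ones mask if the subtraction borrowed, else 0 (SBB reg, reg) */
    uint64_t borrow = (uint64_t)0 - (uint64_t)(dividend < product);
    /* add the divisor back only when the estimate was 1 too large */
    return remainder + (divisor & borrow);
}
```

Both urem64_from_estimate(100, 7, 14) and urem64_from_estimate(100, 7, 15) return 2: in the second call the subtraction wraps around and the masked addition of the divisor undoes it.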
Microsoft Visual C compiler helper routine _allrem() for i386 processors:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: raises "division exception" when divisor is 0!
.386
.model flat, C
.code
; [esp+16] = high dword of divisor
; [esp+12] = low dword of divisor
; [esp+8] = high dword of dividend
; [esp+4] = low dword of dividend
_allrem proc public
; determine sign of dividend and compute |dividend|
mov eax, [esp+8]
mov ecx, [esp+4] ; eax:ecx = dividend
cdq ; edx = (dividend < 0) ? -1 : 0
xor ecx, edx
xor eax, edx ; eax:ecx = (dividend < 0) ? ~dividend : dividend
sub ecx, edx
sbb eax, edx ; eax:ecx = (dividend < 0) ? -dividend : dividend
; = |dividend|
mov [esp+4], ecx ; write |dividend| back on stack
mov [esp+8], eax
push edx ; [esp] = (dividend < 0) ? -1 : 0
; determine sign of divisor and compute |divisor|
mov edx, [esp+20]
mov eax, [esp+16] ; edx:eax = divisor
mov ecx, edx
sar ecx, 31 ; ecx = (divisor < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx ; edx:eax = (divisor < 0) ? ~divisor : divisor
sub eax, ecx
sbb edx, ecx ; edx:eax = (divisor < 0) ? -divisor : divisor
; = |divisor|
mov [esp+16], eax ; write |divisor| back on stack
mov [esp+20], edx
mov ecx, [esp+12] ; ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+12] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
jmp short next
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
next:
mov eax, [esp+8] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
mov eax, edx ; eax = (low dword of) |remainder|
;; xor edx, edx ; edx:eax = |remainder|
pop edx ; edx = (remainder < 0) ? -1 : 0
xor eax, edx
sub eax, edx
sbb edx, edx ; edx:eax = remainder
ret 16 ; callee restores stack
; dividend < divisor: remainder = dividend
trivial:
mov eax, [esp+8]
mov edx, [esp+12] ; edx:eax = |remainder| = |dividend|
jmp short remainder
; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor = 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
push ebx
mov ebx, edx ; ebx = divisor'
ifndef JccLess
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; division overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JccLess
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JccLess
mov eax, [esp+16] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+20] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
imul ebx, [esp+24] ; ebx = high dword of divisor * quotient"
add edx, ebx ; edx:eax = divisor * quotient"
mov ecx, [esp+12]
mov ebx, [esp+16] ; ebx:ecx = dividend
sub ecx, eax
sbb ebx, edx ; ebx:ecx = dividend - divisor * quotient"
; = remainder"
ifndef JccLess
jnb short @f ; remainder" >= 0?
; (with borrow, it is off by divisor,
; and quotient" is off by 1)
add ecx, [esp+20]
adc ebx, [esp+24] ; ebx:ecx = remainder" + divisor
; = remainder
@@:
mov eax, ecx
mov edx, ebx ; edx:eax = |remainder|
else ; JccLess
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
cdq ; edx = (remainder" < 0) ? -1 : 0
and eax, [esp+20]
and edx, [esp+24] ; edx:eax = (remainder" < 0) ? divisor : 0
add eax, ecx
adc edx, ebx ; edx:eax = remainder" + divisor
; = |remainder|
endif ; JccLess
pop ebx
remainder:
pop ecx ; ecx = (remainder < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx
sub eax, ecx
sbb edx, ecx ; edx:eax = remainder
ret 16 ; callee restores stack
; dividend = divisor = -2**63: remainder = 0
special:
pop eax ; eax = sign of remainder = -1
inc eax
xor edx, edx ; edx:eax = remainder = 0
ret 16 ; callee restores stack
_allrem endp
end
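The "extended & adjusted" step of the routine above can be modelled in C as follows. This is only a sketch for unsigned operands under stated assumptions: the name umod64_sketch() is hypothetical, the 64÷32-bit division is written as a plain C division instead of the conditional subtraction plus single DIV instruction, divisors of 2**63 and above are handled by a shortcut of my own, and the final correction compares the wrapped remainder with the divisor instead of testing the borrow flag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: remainder for divisors >= 2**32 only */
static uint64_t umod64_sketch(uint64_t dividend, uint64_t divisor)
{
    uint32_t high = (uint32_t) (divisor >> 32);
    unsigned zeros = 0;

    assert(high != 0);              /* smaller divisors take the DIV paths */

    if (dividend < divisor)         /* "trivial": remainder = dividend */
        return dividend;

    if (divisor >> 63)              /* quotient can only be 1 */
        return dividend - divisor;

    /* number of leading '0' bits in high dword of divisor (1 to 31 here),
       what BSR followed by XOR ..., 31 computes in the assembly */
    while (((high << zeros) & 0x80000000u) == 0)
        zeros++;

    /* divisor' = upper 32 bits of the normalised divisor */
    uint32_t divisor1 = (uint32_t) ((divisor << zeros) >> 32);

    /* quotient' = dividend / divisor' has at most 33 bits */
    uint64_t quotient1 = dividend / divisor1;

    /* quotient" is either the true quotient or 1 too large */
    uint64_t quotient2 = quotient1 >> (32 - zeros);

    /* remainder" = dividend - divisor * quotient", modulo 2**64 */
    uint64_t remainder = dividend - divisor * quotient2;

    /* a wrapped (negative) remainder" is off by the divisor */
    if (remainder >= divisor)
        remainder += divisor;

    return remainder;
}
```

The estimate is never too small because divisor' times 2**(32 − zeros) is the divisor with its low bits cleared, i.e. at most the divisor; since the cleared bits are small compared with the divisor, the estimate is at most 1 too large, so one conditional addition suffices.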
__muldi3()
alias __umuldi3()
function for
i386 processors:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
; [esp+16] = high dword of multiplier
; [esp+12] = low dword of multiplier
; [esp+8] = high dword of multiplicand
; [esp+4] = low dword of multiplicand
__muldi3 proc public
__umuldi3 proc public
push ebx
mov edx, [esp+16] ; edx = low dword of multiplier
mov ecx, [esp+12] ; ecx = high dword of multiplicand
imul ecx, edx ; ecx = high dword of multiplicand
; * low dword of multiplier
mov eax, [esp+8] ; eax = low dword of multiplicand
mov ebx, [esp+20] ; ebx = high dword of multiplier
imul ebx, eax ; ebx = high dword of multiplier
; * low dword of multiplicand
mul edx ; edx:eax = low dword of multiplicand
; * low dword of multiplier
add ecx, ebx ; ecx = high dword of multiplicand
; * low dword of multiplier
; + high dword of multiplier
; * low dword of multiplicand
add edx, ecx ; edx:eax = product % 2**64
pop ebx
ret
__umuldi3 endp
__muldi3 endp
end
Microsoft Visual C compiler helper
routine _allmul()
for i386 processors:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
; [esp+16] = high dword of multiplier
; [esp+12] = low dword of multiplier
; [esp+8] = high dword of multiplicand
; [esp+4] = low dword of multiplicand
_allmul proc public
mov eax, [esp+4] ; eax = low dword of multiplicand
mov edx, [esp+8] ; edx = high dword of multiplicand
imul edx, [esp+12] ; edx = high dword of multiplicand
; * low dword of multiplier
mov ecx, [esp+16] ; ecx = high dword of multiplier
imul ecx, eax ; ecx = high dword of multiplier
; * low dword of multiplicand
add ecx, edx ; ecx = high dword of multiplier
; * low dword of multiplicand
; + high dword of multiplicand
; * low dword of multiplier
mul dword ptr [esp+12]
; edx:eax = low dword of multiplicand
; * low dword of multiplier
add edx, ecx ; edx:eax = product % 2**64
ret 16 ; callee restores stack
_allmul endp
end
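Both multiplication routines above compute the product modulo 2**64 from one widening 32×32-bit multiplication and two truncating ones. A minimal C sketch of that decomposition, using the hypothetical name mul64_sketch() and assuming a 32-bit unsigned int:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64x64-bit multiplication built from 32-bit halves */
static uint64_t mul64_sketch(uint64_t multiplicand, uint64_t multiplier)
{
    uint32_t a_low = (uint32_t) multiplicand;
    uint32_t a_high = (uint32_t) (multiplicand >> 32);
    uint32_t b_low = (uint32_t) multiplier;
    uint32_t b_high = (uint32_t) (multiplier >> 32);

    /* MUL: low dword * low dword gives a full 64-bit product */
    uint64_t low = (uint64_t) a_low * b_low;

    /* two IMULs: only the low 32 bits of the cross products matter,
       and high dword * high dword lies entirely above bit 63 */
    uint32_t cross = a_high * b_low + b_high * a_low;

    /* ADD: the cross products go into the high dword */
    return low + ((uint64_t) cross << 32);
}
```

Since only the low 64 bits are kept, the same code serves signed and unsigned operands, which is why the routines can be aliases of each other.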
Microsoft Visual C compiler helper
routine _allshl()
for i386 processors:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: applies shift count % 64
.386
.model flat, C
.code
; edx:eax = value
; ecx = count
_allshl proc public
test cl, 32
jnz short @f ; count > 31?
shld edx, eax, cl
shl eax, cl
ret
@@:
mov edx, eax
shl edx, cl
xor eax, eax
ret
_allshl endp
end
Microsoft Visual C compiler helper
routine _allshr()
for i386 processors:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: applies shift count % 64
.386
.model flat, C
.code
; edx:eax = value
; ecx = count
_allshr proc public
test cl, 32
jnz short @f ; count > 31?
shrd eax, edx, cl
sar edx, cl
ret
@@:
mov eax, edx
sar eax, cl
sar edx, 31
ret
_allshr endp
end
Microsoft Visual C compiler helper
routine _aullshr()
for i386 processors:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: applies shift count % 64
.386
.model flat, C
.code
; edx:eax = value
; ecx = count
_aullshr proc public
test cl, 32
jnz short @f ; count > 31?
shrd eax, edx, cl
shr edx, cl
ret
@@:
mov eax, edx
shr eax, cl
xor edx, edx
ret
_aullshr endp
end
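The three shift routines above share one pattern: bit 5 of the count selects between a double-register shift and a move of one dword into the other. A minimal C sketch of the left shift and the logical right shift, with the hypothetical names shl64_sketch() and shr64_sketch(); unlike the processor's shift instructions, C must treat a count of 0 separately to avoid an undefined shift by 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit left shift on 32-bit halves */
static uint64_t shl64_sketch(uint64_t value, unsigned count)
{
    uint32_t low = (uint32_t) value;
    uint32_t high = (uint32_t) (value >> 32);

    if (count & 32) {               /* TEST CL, 32: count % 64 > 31 */
        high = low << (count & 31); /* low dword moves into high dword */
        low = 0;
    } else if (count & 31) {        /* SHLD + SHL */
        high = high << (count & 31) | low >> (32 - (count & 31));
        low <<= count & 31;
    }
    return (uint64_t) high << 32 | low;
}

/* Hypothetical model of a 64-bit logical right shift on 32-bit halves */
static uint64_t shr64_sketch(uint64_t value, unsigned count)
{
    uint32_t low = (uint32_t) value;
    uint32_t high = (uint32_t) (value >> 32);

    if (count & 32) {               /* TEST CL, 32: count % 64 > 31 */
        low = high >> (count & 31); /* high dword moves into low dword */
        high = 0;
    } else if (count & 31) {        /* SHRD + SHR */
        low = low >> (count & 31) | high << (32 - (count & 31));
        high >>= count & 31;
    }
    return (uint64_t) high << 32 | low;
}
```

The arithmetic right shift differs only in filling the vacated high dword with copies of the sign bit instead of zeros. Like the assembly, both functions apply the count modulo 64.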
INTEGER.CAB: the console programs *.com measure execution times in nano-seconds, while the console programs *.exe measure processor clock cycles.
The makefile INTEGER.MAK for Microsoft’s NMAKE.EXE performs all following steps, using only slightly different filenames; it contains the sources presented above and below as inline files and was used to create the cabinet file INTEGER.CAB.
For division, the left columns show the execution times for 128-bit or 64-bit uniformly distributed pseudo-random dividend and divisor, i.e. the (rather unlikely) special case with numbers of (almost) equal magnitude, while the right columns show the execution times for 128-bit to 65-bit or 64-bit to 33-bit pseudo-random dividend and divisor respectively, i.e. the (more likely) general case with numbers of different magnitude.
128÷128-bit division (first three routines) and 64÷64-bit division (last two routines), with a left and a right column per routine:
Processor | __udivmodti4() eSKamation, left | __udivmodti4() eSKamation, right | __udivmodti4() LLVM, left | __udivmodti4() LLVM, right | __udivmodti4() eSKamation, left | __udivmodti4() eSKamation, right | __udivmoddi4() eSKamation, left | __udivmoddi4() eSKamation, right | DIV (AMD, Intel), left | DIV (AMD, Intel), right
---|---|---|---|---|---|---|---|---|---|---
AMD EPYC™ 7713 | 9 | 10 | 25 | 130 | 11 | 33 | 13 | 23 | 3 | 3
AMD Ryzen™9 3900XT | 19 | 19 | 39 | 190 | 20 | 56 | 21 | 39 | 13 | 16
AMD Ryzen™5 3600 | 20 | 20 | 41 | 201 | 21 | 59 | 22 | 41 | 14 | 17
AMD Ryzen™7 2700X | 20 | 19 | 44 | 212 | 23 | 63 | 21 | 41 | 14 | 17
AMD Radeon™R3 | 38 | 41 | 56 | 300 | 26 | 78 | 25 | 59 | 11 | 22
Intel Core i5-9500 | 74 | t.b.s. | t.b.s. | t.b.s. | 15 | t.b.s. | 13 | t.b.s. | 30 | t.b.s.
Intel Core i7-8550U | 31 | 32 | 24 | 122 | 11 | 37 | 12 | 21 | 28 | 28
Intel Core i5-7400 | 55 | 56 | 41 | 214 | 18 | 65 | 15 | 32 | 28 | 29
Intel Core i5-6600 | 91 | t.b.s. | t.b.s. | t.b.s. | 20 | t.b.s. | 16 | t.b.s. | 37 | t.b.s.
Intel Core i5-4670 | 53 | 55 | 44 | 217 | 22 | 74 | 20 | 39 | 31 | 32
Intel Core™2 Duo P8700 | 60 | 60 | 62 | 296 | 27 | 117 | 33 | 60 | 28 | 29
64÷64-bit division (first four routines, with a left and a right column each) and 64×64-bit multiplication (last three routines, one column each):
Processor | __udivmoddi4() eSKamation, left | __udivmoddi4() eSKamation, right | __udivmoddi4() LLVM, left | __udivmoddi4() LLVM, right | _aulldvrm() eSKamation, left | _aulldvrm() eSKamation, right | _aulldvrm() Microsoft, left | _aulldvrm() Microsoft, right | _aullmul() Microsoft | __umuldi3() LLVM | __umuldi3() eSKamation
---|---|---|---|---|---|---|---|---|---|---|---
AMD EPYC™ 7713 | 9 | 8 | 34 | 63 | 11 | 10 | 47 | 28 | 3 | 4 | •
AMD Ryzen™9 3900XT | 17 | 13 | 53 | 99 | 19 | 14 | 70 | 41 | 4 | 7 | •
AMD Ryzen™5 3600 | 18 | 13 | 56 | 105 | 20 | 14 | 73 | 44 | 3 | 7 | •
AMD Ryzen™7 2700X | 19 | 15 | 61 | 114 | 22 | 17 | 82 | 58 | 5 | 9 | 1
AMD Radeon™R3 | 42 | 29 | 79 | 165 | 45 | 31 | 101 | 71 | 6 | 13 | 1
Intel Core i5-9500 | 16 | t.b.s. | t.b.s. | t.b.s. | t.b.s. | t.b.s. | 101 | t.b.s. | 5 | t.b.s. | •
Intel Core i7-8550U | 10 | 9 | 26 | 52 | 11 | 9 | 72 | 46 | 3 | 5 | 1
Intel Core i5-7400 | 16 | 14 | 41 | 83 | 19 | 15 | 115 | 78 | 5 | 8 | 1
Intel Core i5-6600 | 19 | t.b.s. | t.b.s. | t.b.s. | t.b.s. | t.b.s. | 125 | t.b.s. | 5 | t.b.s. | •
Intel Core i5-4670 | 21 | 19 | 50 | 97 | 24 | 19 | 124 | 84 | 8 | 10 | 1
Intel Core™2 Duo P8700 | 25 | 19 | 82 | 145 | 30 | 24 | 136 | 98 | 9 | 18 | 1
Note: the deviation of the measurements for my own __udivmoddi4() and _aulldvrm() division routines is due to their different calling conventions.
The extended precision algorithm shows (almost) constant runtime, independent of the magnitude of dividend and divisor, and the best overall performance.
On AMD Ryzen processors, the 128÷128-bit division routine using the hybrid algorithm is always slower than the __udivmodti4() routine using the extended precision algorithm – about 3 times slower with dividend and divisor of (largely) different magnitude!
Contrary to this, on Intel Core processors, the __udivmodti4() 128÷128-bit division routine using the hybrid algorithm is up to 3 times faster than the routine using the extended precision algorithm – but only with dividend and divisor of (nearly) equal magnitude, and slower otherwise.
On modern Core processors, the __udivmoddi4() 64÷64-bit division routines presented above run twice as fast as their native 64-bit DIV instruction, especially in 32-bit mode!
The 64÷64-bit division routine _aulldvrm() for 32-bit i386 processors, which Microsoft dares to ship with Windows, for example in NTDLL.DLL and their MSVCRT libraries, sucks: it is 4 to 6 times slower than a properly implemented division routine!
The same holds for their 64×64-bit multiplication routine _allmul() alias _aullmul() for 32-bit i386 processors, which consumes up to 9 clock cycles.
The 64×64-bit multiplication routine __muldi3() shipped by LLVM in their clang_rt.builtins-i386.lib library is even worse and wastes up to 18 clock cycles, while their 64÷64-bit division routine __udivmoddi4() and their 128÷128-bit division routine __udivmodti4() suck similarly: they are 3 to 13 (in words: thirteen) times slower than properly implemented division routines!
In their own false words:
The builtins library provides optimized implementations of this and other low-level routines, either in target-independent C form, or as a heavily-optimized assembly.
__udivmodti4()
function.
With the preprocessor macro NATIVE defined, the second C program measures the execution time for one billion divisions of uniformly distributed 64-bit pseudo-random numbers and one billion divisions of pseudo-random numbers in the interval from 2**64−1 to 2**32 with the DIV instruction, else with the shift & subtract algorithm, both disguised as the __udivmoddi4() function.
Note: with the preprocessor macro CYCLES defined, both programs measure the execution time in processor clock cycles and run on 64-bit editions of Windows Vista® and newer versions, else they measure the execution time in nano-seconds and run on 64-bit editions of Windows™ XP and newer versions.
// Copyright © 2018-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
extern
__uint128_t __udivmodti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder);
__attribute__ ((noinline))
static
__uint128_t __unopti4(__uint128_t dividend, __uint128_t divisor, __uint128_t *remainder)
{
if (remainder != NULL)
*remainder = divisor;
return dividend;
}
__attribute__ ((always_inline))
static
__uint128_t lfsr128r(void)
{
// 128-bit linear feedback shift register (Galois form) using
// primitive polynomial 0xB64E4D3FA8E7331B:D871FA30D46D4DBA,
// initialised with bit-vector of prime numbers:
// 2**prime is set for each prime in [0, 127]
static __uint128_t lfsr = (__uint128_t) 0x800228A202088288 << 64 | 0x28208A20A08A28AC;
const __uint128_t poly = (__uint128_t) 0xB64E4D3FA8E7331B << 64 | 0xD871FA30D46D4DBA;
const __uint128_t mask = 0 - (lfsr & 1);
return lfsr = (lfsr >> 1) ^ (poly & mask);
}
__attribute__ ((always_inline))
static
__uint128_t lfsr128l(void)
{
// 128-bit linear feedback shift register (Galois form) using
// primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D,
// initialised with 2**128 / golden ratio
static __uint128_t lfsr = (__uint128_t) 0x9E3779B97F4A7C15 << 64 | 0xF39CC0605CEDC834;
const __uint128_t poly = (__uint128_t) 0x5DB2B62B0C5F8E1B << 64 | 0xD8CCE715FCB2726D;
const __uint128_t sign = (__int128_t) lfsr >> 127;
return lfsr = (lfsr << 1) ^ (poly & sign);
}
__attribute__ ((always_inline))
static
__uint128_t lfsr64(void)
{
// 64-bit linear feedback shift register (Galois form) using
// primitive polynomial 0xAD93D23594C935A9 (CRC-64 "Jones"),
// initialised with 2**64 / golden ratio
static uint64_t lfsr = 0x9E3779B97F4A7C15;
const uint64_t sign = (int64_t) lfsr >> 63;
return lfsr = (lfsr << 1) ^ (0xAD93D23594C935A9 & sign);
}
__attribute__ ((always_inline))
static
__uint128_t lfsr32(void)
{
// 32-bit linear feedback shift register (Galois form) using
// primitive polynomial 0xDB710641 (CRC-32 IEEE),
// initialised with 2**32 / golden ratio
static uint32_t lfsr = 0x9E3779B9;
const uint32_t sign = (int32_t) lfsr >> 31;
return lfsr = (lfsr << 1) ^ (0xDB710641 & sign);
}
int main(void)
{
clock_t t0, t1, t2, tt;
uint32_t m, n;
__uint128_t dividend, divisor = ~0, remainder;
volatile __uint128_t quotient;
for (m = 0u; m < 64u; m += m + 1u)
{
t0 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr128l();
dividend >>= dividend & m;
quotient = __unopti4(dividend, divisor, NULL);
divisor = lfsr128r();
divisor >>= divisor & m;
quotient = __unopti4(dividend, divisor, &remainder);
}
t1 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr128l();
dividend >>= dividend & m;
quotient = __udivmodti4(dividend, divisor, NULL);
divisor = lfsr128r();
divisor >>= divisor & m;
quotient = __udivmodti4(dividend, divisor, &remainder);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__unopti4() %4lu.%06lu 0\n"
"__udivmodti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu nano-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
t0 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr128l();
quotient = __unopti4(dividend, divisor, NULL);
divisor = lfsr64();
quotient = __unopti4(dividend, divisor, &remainder);
}
t1 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr128l();
quotient = __udivmodti4(dividend, divisor, NULL);
divisor = lfsr64();
quotient = __udivmodti4(dividend, divisor, &remainder);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__unopti4() %4lu.%06lu 0\n"
"__udivmodti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu nano-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
t0 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr128l();
quotient = __unopti4(dividend, divisor, NULL);
divisor = lfsr32();
quotient = __unopti4(dividend, divisor, &remainder);
}
t1 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr128l();
quotient = __udivmodti4(dividend, divisor, NULL);
divisor = lfsr32();
quotient = __udivmodti4(dividend, divisor, &remainder);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__unopti4() %4lu.%06lu 0\n"
"__udivmodti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu nano-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
t0 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr64();
quotient = __unopti4(dividend, divisor, NULL);
divisor = lfsr32();
quotient = __unopti4(dividend, divisor, &remainder);
}
t1 = clock();
for (n = 500000000u; n > 0u; n--)
{
dividend = lfsr64();
quotient = __udivmodti4(dividend, divisor, NULL);
divisor = lfsr32();
quotient = __udivmodti4(dividend, divisor, &remainder);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__unopti4() %4lu.%06lu 0\n"
"__udivmodti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu nano-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
t0 = clock();
for (n = 500000u; n > 0u; n--)
{
quotient = __unopti4(~0, 3, NULL);
quotient = __unopti4(~0, 3, &remainder);
}
t1 = clock();
for (n = 500000u; n > 0u; n--)
{
quotient = __udivmodti4(~0, 3, NULL);
quotient = __udivmodti4(~0, 3, &remainder);
}
t2 = clock();
tt = t2 - t0;
t2 -= t1;
t1 -= t0;
t0 = t2 - t1;
printf("\n"
"__unopti4() %4lu.%06lu 0\n"
"__udivmodti4() %4lu.%06lu %4lu.%06lu\n"
" %4lu.%06lu micro-seconds\n",
t1 / CLOCKS_PER_SEC, (t1 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t2 / CLOCKS_PER_SEC, (t2 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
t0 / CLOCKS_PER_SEC, (t0 % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC,
tt / CLOCKS_PER_SEC, (tt % CLOCKS_PER_SEC) * 1000000u / CLOCKS_PER_SEC);
}
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef _M_AMD64
#pragma message("For AMD64 platform only!")
#endif
#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
typedef ULONGLONG QWORD;
typedef struct
{
QWORD qwLow, qwHigh;
} OWORD;
const struct
{
OWORD owDividend, owDivisor, owQuotient, owRemainder;
} owTable[] = {{0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{1ULL, 0ULL, 1ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 0ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{2ULL, 0ULL, 1ULL, 0ULL, 2ULL, 0ULL, 0ULL, 0ULL},
{2ULL, 0ULL, 2ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{2ULL, 0ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, 2ULL, 0ULL},
{2ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 2ULL, 0ULL},
{~0ULL, 0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
{~0ULL, 0ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{~0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
{0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL},
{0ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
{0ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 1ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL, 1ULL},
{0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 1ULL},
{0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
{0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
{0ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
{1ULL, 1ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL},
{1ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 2ULL, 0ULL},
{1ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 1ULL, 1ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{1ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 1ULL},
{1ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
{1ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
{1ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
{~0ULL, 1ULL, 1ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL},
{~0ULL, 1ULL, ~0ULL, 0ULL, 2ULL, 0ULL, 1ULL, 0ULL},
{~0ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, 1ULL, 1ULL, 1ULL, 1ULL, 0ULL, ~1ULL, 0ULL},
{~0ULL, 1ULL, ~0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{~0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
{~0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
{~0ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
{~0ULL, 0xFULL, 0xFULL, 0ULL, 0x1111111111111111ULL, 1ULL, 0ULL, 0ULL},
{~0xFULL, ~1ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, ~0xFULL, ~1ULL},
{0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL},
{0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL},
{0ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, ~0ULL, 1ULL, 1ULL, ~1ULL, 0ULL, 2ULL, 0ULL},
{0ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{0ULL, ~0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, ~0ULL},
{0ULL, ~0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, ~0ULL},
{1ULL, ~0ULL, 1ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL},
{1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL, 0ULL},
{1ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, ~0ULL, 1ULL, 1ULL, ~1ULL, 0ULL, 3ULL, 0ULL},
{1ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
{1ULL, ~0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{1ULL, ~0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, ~0ULL},
{~0ULL, ~0ULL, 1ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL},
{~0ULL, ~0ULL, ~0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL},
{~0ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, ~0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
{~0ULL, ~0ULL, 1ULL, 3ULL, 0x5555555555555555ULL, 0ULL, 0xAAAAAAAAAAAAAAAAULL, 0ULL},
{~0ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, ~0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, ~1ULL, 0ULL},
{~0ULL, ~0ULL, ~0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{~0ULL, ~0ULL, ~1ULL, ~0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
{0xBF25975319080000ULL, 0x530EDA741C71D4C3ULL, 0x14C34AB4676E4BABULL, 0x0000004DE2CAB081ULL, 0x0000000001110001ULL, 0ULL, 0x00000000003EB455ULL, 0ULL},
{0xBF25975319080000ULL, 0x530EDA741C71D4C3ULL, 0x0000000001110001ULL, 0ULL, 0x14C34AB4676E4BABULL, 0x0000004DE2CAB081ULL, 0x00000000003EB455ULL, 0ULL}};
#pragma intrinsic(_umul128)
__declspec(noinline)
OWORD __umulti3(OWORD owMultiplicand, OWORD owMultiplier)
{
OWORD owProduct;
owProduct.qwLow = _umul128(owMultiplicand.qwLow, owMultiplier.qwLow, &owProduct.qwHigh);
owProduct.qwHigh += owMultiplicand.qwLow * owMultiplier.qwHigh
+ owMultiplicand.qwHigh * owMultiplier.qwLow;
return owProduct;
}
__declspec(noinline)
OWORD __unopti4(OWORD owDividend, OWORD owDivisor, OWORD *owRemainder)
{
if (owRemainder != NULL)
*owRemainder = owDivisor;
return owDividend;
}
OWORD __udivmodti4(OWORD owDividend, OWORD owDivisor, OWORD *owRemainder);
#pragma intrinsic(__shiftleft128, __shiftright128)
__forceinline
VOID lfsr128l(OWORD *ow)
{
#ifndef XORSHIFT
// 128-bit linear feedback shift register (Galois form) using
// primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D
QWORD qw = (LONGLONG) ow->qwHigh >> 63;
ow->qwHigh = __shiftleft128(ow->qwLow, ow->qwHigh, 1)
^ (qw & 0x5DB2B62B0C5F8E1BULL);
ow->qwLow = (qw & 0xD8CCE715FCB2726DULL) ^ (ow->qwLow << 1);
#elif 1
// 128-bit linear feedback shift register (XorShift form)
// using shift constants from Richard Peirce Brent
QWORD qw = ow->qwHigh;
ow->qwHigh = ow->qwLow;
ow->qwLow ^= ow->qwLow << 33;
qw ^= qw << 28;
ow->qwLow ^= ow->qwLow >> 31;
qw ^= qw >> 29;
ow->qwLow ^= qw;
#else
// 128-bit linear feedback shift register (XorShift form)
// using shift constants from Melissa O'Neill
ow->qwHigh ^= __shiftleft128(ow->qwLow, ow->qwHigh, 26);
ow->qwLow ^= ow->qwLow << 26;
ow->qwLow ^= __shiftright128(ow->qwLow, ow->qwHigh, 61);
ow->qwHigh ^= ow->qwHigh >> 61;
ow->qwHigh ^= __shiftleft128(ow->qwLow, ow->qwHigh, 37);
ow->qwLow ^= ow->qwLow << 37;
#endif
}
__forceinline
VOID lfsr128r(OWORD *ow)
{
#ifndef XORSHIFT
// 128-bit linear feedback shift register (Galois form) using
// primitive polynomial 0xB64E4D3FA8E7331B:D871FA30D46D4DBA
QWORD qw = 0ULL - (ow->qwLow & 1ULL);
ow->qwLow = __shiftright128(ow->qwLow, ow->qwHigh, 1)
^ (qw & 0xD871FA30D46D4DBAULL);
ow->qwHigh = (qw & 0xB64E4D3FA8E7331BULL) ^ (ow->qwHigh >> 1);
#elif 1
// 128-bit linear feedback shift register (XorShift form)
// using shift constants from Sebastiano Vigna
QWORD qw = ow->qwHigh;
ow->qwHigh = ow->qwLow;
qw ^= qw << 23;
ow->qwLow ^= ow->qwLow >> 26;
qw ^= qw >> 17;
ow->qwLow ^= qw;
#else
// 128-bit linear feedback shift register (XorShift form)
// using shift constants from Melissa O'Neill
ow->qwHigh ^= __shiftleft128(ow->qwLow, ow->qwHigh, 11);
ow->qwLow ^= ow->qwLow << 11;
ow->qwLow ^= __shiftright128(ow->qwLow, ow->qwHigh, 61);
ow->qwHigh ^= ow->qwHigh >> 61;
ow->qwHigh ^= __shiftleft128(ow->qwLow, ow->qwHigh, 45);
ow->qwLow ^= ow->qwLow << 45;
#endif
}
__forceinline
VOID scale128(OWORD *owOut, OWORD *owIn)
{
owOut->qwLow = __shiftright128(owIn->qwLow, owIn->qwHigh, owIn->qwLow /* & 63 */);
owOut->qwHigh = owIn->qwHigh >> (owIn->qwLow /* & 63 */);
}
__declspec(safebuffers)
BOOL CDECL PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
CHAR szFormat[1024];
DWORD dwFormat;
DWORD dwOutput;
va_list vaInput;
va_start(vaInput, lpFormat);
dwFormat = wvsprintf(szFormat, lpFormat, vaInput);
va_end(vaInput);
if ((dwFormat == 0UL)
|| !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
return FALSE;
return dwOutput == dwFormat;
}
__declspec(noreturn)
__declspec(safebuffers)
VOID CDECL mainCRTStartup(VOID)
{
DWORD dw, dwCPUID[12];
QWORD qwT0, qwT1, qwT2, qwT3;
QWORD qwTx, qwTy, qwTz;
OWORD owDividend, owDivisor = {~0ULL, ~0ULL}, owQuotient, owRemainder;
// 2**128 / square root of 2
OWORD owLeft = {0x597D89B3754ABE9FULL, 0xB504F333F9DE6484ULL};
// 2**128 / square root of 3
OWORD owRight = {0x0C7C0F257D92BE83ULL, 0x93CD3A2C8198E269ULL};
HANDLE hThread = GetCurrentThread();
HANDLE hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
if (hOutput == INVALID_HANDLE_VALUE)
ExitProcess(GetLastError());
if (SetThreadIdealProcessor(hThread, 0UL) == -1L)
ExitProcess(GetLastError());
__cpuid(dwCPUID, 0x80000000UL);
if (*dwCPUID >= 0x80000004UL)
{
__cpuid(dwCPUID, 0x80000002UL);
__cpuid(dwCPUID + 4, 0x80000003UL);
__cpuid(dwCPUID + 8, 0x80000004UL);
}
else
__movsb(dwCPUID, "unidentified processor", sizeof("unidentified processor"));
PrintFormat(hOutput, "\r\nTesting 128-bit division...\r\n");
for (dw = 1UL; dw < sizeof(owTable) / sizeof(*owTable); dw++)
{
PrintFormat(hOutput, "\r%lu", dw);
#if 0
if ((owTable[dw].owDivisor.qwHigh | owTable[dw].owDivisor.qwLow) == 0ULL)
continue;
#endif
owQuotient = __udivmodti4(owTable[dw].owDividend, owTable[dw].owDivisor, &owRemainder);
if ((owQuotient.qwHigh != owTable[dw].owQuotient.qwHigh)
|| (owQuotient.qwLow != owTable[dw].owQuotient.qwLow))
PrintFormat(hOutput,
"\t0x%016I64X:%016I64X\a / %016I64X:%016I64X\r\n"
"\t0x%016I64X:%016I64X\r\n"
"\t0x%016I64X:%016I64X\r\n",
owTable[dw].owDividend.qwHigh, owTable[dw].owDividend.qwLow,
owTable[dw].owDivisor.qwHigh, owTable[dw].owDivisor.qwLow,
owTable[dw].owQuotient.qwHigh, owTable[dw].owQuotient.qwLow,
owQuotient.qwHigh, owQuotient.qwLow);
if ((owRemainder.qwHigh != owTable[dw].owRemainder.qwHigh)
|| (owRemainder.qwLow != owTable[dw].owRemainder.qwLow))
PrintFormat(hOutput,
"\t0x%016I64X:%016I64X\a %% %016I64X:%016I64X\r\n"
"\t0x%016I64X:%016I64X\r\n"
"\t0x%016I64X:%016I64X\r\n",
owTable[dw].owDividend.qwHigh, owTable[dw].owDividend.qwLow,
owTable[dw].owDivisor.qwHigh, owTable[dw].owDivisor.qwLow,
owTable[dw].owRemainder.qwHigh, owTable[dw].owRemainder.qwLow,
owRemainder.qwHigh, owRemainder.qwLow);
}
PrintFormat(hOutput, "\r\nTiming 128-bit division on %.48hs\r\n", dwCPUID);
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT0))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
lfsr128l(&owLeft);
owQuotient = __unopti4(owLeft, owRight, NULL);
lfsr128r(&owRight);
owQuotient = __unopti4(owLeft, owRight, &owRemainder);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT1))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
lfsr128l(&owLeft);
owQuotient = __udivmodti4(owLeft, owRight, NULL);
lfsr128r(&owRight);
owQuotient = __udivmodti4(owLeft, owRight, &owRemainder);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT2))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
lfsr128l(&owLeft);
owQuotient = __umulti3(owLeft, owRight);
lfsr128r(&owRight);
owQuotient = __umulti3(owLeft, owRight);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT3))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
ExitProcess(GetLastError());
qwTz = qwT3 - qwT0;
qwT3 -= qwT2;
qwT2 -= qwT1;
qwT1 -= qwT0;
qwTy = qwT3 - qwT1;
qwTx = qwT2 - qwT1;
#ifdef CYCLES
PrintFormat(hOutput,
"\r\n"
"__unopti4() %6I64u.%09I64u 0\r\n"
"__udivmodti4() %6I64u.%09I64u %6I64u.%09I64u\r\n"
"__umulti3() %6I64u.%09I64u %6I64u.%09I64u\r\n"
" %6I64u.%09I64u clock cycles\r\n",
qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
qwTx / 1000000000ULL, qwTx % 1000000000ULL,
qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
qwTy / 1000000000ULL, qwTy % 1000000000ULL,
qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
PrintFormat(hOutput,
"\r\n"
"__unopti4() %6I64u.%07I64u 0\r\n"
"__udivmodti4() %6I64u.%07I64u %6I64u.%07I64u\r\n"
"__umulti3() %6I64u.%07I64u %6I64u.%07I64u\r\n"
" %6I64u.%07I64u nano-seconds\r\n",
qwT1 / 10000000ULL, qwT1 % 10000000ULL,
qwT2 / 10000000ULL, qwT2 % 10000000ULL,
qwTx / 10000000ULL, qwTx % 10000000ULL,
qwT3 / 10000000ULL, qwT3 % 10000000ULL,
qwTy / 10000000ULL, qwTy % 10000000ULL,
qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT0))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
lfsr128l(&owLeft);
scale128(&owDividend, &owLeft);
owQuotient = __unopti4(owDividend, owDivisor, NULL);
lfsr128r(&owRight);
scale128(&owDivisor, &owRight);
owQuotient = __unopti4(owDividend, owDivisor, &owRemainder);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT1))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
lfsr128l(&owLeft);
scale128(&owDividend, &owLeft);
owQuotient = __udivmodti4(owDividend, owDivisor, NULL);
lfsr128r(&owRight);
scale128(&owDivisor, &owRight);
owQuotient = __udivmodti4(owDividend, owDivisor, &owRemainder);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT2))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
lfsr128l(&owLeft);
scale128(&owDividend, &owLeft);
owQuotient = __umulti3(owDividend, owDivisor);
lfsr128r(&owRight);
scale128(&owDivisor, &owRight);
owQuotient = __umulti3(owDividend, owDivisor);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT3))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
ExitProcess(GetLastError());
qwTz = qwT3 - qwT0;
qwT3 -= qwT2;
qwT2 -= qwT1;
qwT1 -= qwT0;
qwTy = qwT3 - qwT1;
qwTx = qwT2 - qwT1;
#ifdef CYCLES
PrintFormat(hOutput,
"\r\n"
"__unopti4() %6I64u.%09I64u 0\r\n"
"__udivmodti4() %6I64u.%09I64u %6I64u.%09I64u\r\n"
"__umulti3() %6I64u.%09I64u %6I64u.%09I64u\r\n"
" %6I64u.%09I64u clock cycles\r\n",
qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
qwTx / 1000000000ULL, qwTx % 1000000000ULL,
qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
qwTy / 1000000000ULL, qwTy % 1000000000ULL,
qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
PrintFormat(hOutput,
"\r\n"
"__unopti4() %6I64u.%07I64u 0\r\n"
"__udivmodti4() %6I64u.%07I64u %6I64u.%07I64u\r\n"
"__umulti3() %6I64u.%07I64u %6I64u.%07I64u\r\n"
" %6I64u.%07I64u nano-seconds\r\n",
qwT1 / 10000000ULL, qwT1 % 10000000ULL,
qwT2 / 10000000ULL, qwT2 % 10000000ULL,
qwTx / 10000000ULL, qwTx % 10000000ULL,
qwT3 / 10000000ULL, qwT3 % 10000000ULL,
qwTy / 10000000ULL, qwTy % 10000000ULL,
qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
ExitProcess(0UL);
}
DWORD_PTR __security_cookie = 3141592653589793241ULL >> 16; // π * 10**18 / 2**16
const IMAGE_LOAD_CONFIG_DIRECTORY64 _load_config_used = {sizeof(_load_config_used),
'DATE', // = 2006-04-15 20:15:01 UTC
_MSC_VER / 100, _MSC_VER % 100,
0UL, 0UL, 0UL,
0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
0UL,
0U, 0U,
0ULL,
&__security_cookie,
0ULL, 0ULL};
VOID __security_check_cookie(DWORD_PTR _stackcookie)
{
if (_stackcookie != __security_cookie)
__ud2();
}
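The timing code above takes four timestamps T0 to T3 around three runs (empty routine, division, multiplication) and subtracts the run of the empty routine to remove the overhead of the calls and the operand generation. Each loop performs 2 × 500000000 = 10⁹ calls, so a total in clock cycles divided by 10⁹ is the cost per call, and a total in 100 ns units divided by 10⁷ is the total in seconds, which equals nanoseconds per call. The following is a minimal portable sketch of that bookkeeping only; the names `net_timing`, `net_times` and `print_fixed` are hypothetical and not part of the original programs, and clamping both net values at 0 is an assumption of this sketch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the values derived from T0..T3. */
struct net_timing {
    uint64_t noop;      /* T1 - T0: empty routine, pure overhead     */
    uint64_t divide;    /* T2 - T1: division run, overhead included  */
    uint64_t multiply;  /* T3 - T2: multiplication run               */
    uint64_t net_div;   /* divide - noop, clamped at 0               */
    uint64_t net_mul;   /* multiply - noop, clamped at 0             */
    uint64_t total;     /* T3 - T0: all three runs                   */
};

/* Derive gross and net run times from four monotonic timestamps. */
struct net_timing net_times(uint64_t t0, uint64_t t1, uint64_t t2, uint64_t t3)
{
    struct net_timing t;

    assert(t0 <= t1 && t1 <= t2 && t2 <= t3);
    t.noop     = t1 - t0;
    t.divide   = t2 - t1;
    t.multiply = t3 - t2;
    t.total    = t3 - t0;
    /* a fast routine can measure below the empty run: avoid wrap-around */
    t.net_div  = t.divide   > t.noop ? t.divide   - t.noop : 0;
    t.net_mul  = t.multiply > t.noop ? t.multiply - t.noop : 0;
    return t;
}

/* Format ticks / scale as a fixed-point number with `digits` decimals,
   where scale must equal 10**digits. */
int print_fixed(char *buffer, size_t size, uint64_t ticks, uint64_t scale, int digits)
{
    return snprintf(buffer, size, "%" PRIu64 ".%0*" PRIu64,
                    ticks / scale, digits, ticks % scale);
}
```

For clock cycles, `print_fixed(buffer, sizeof(buffer), t.net_div, 1000000000ULL, 9)` would print cycles per call; for thread times in 100 ns units, a scale of `10000000ULL` with 7 digits would print nanoseconds per call.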
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef _M_AMD64
#pragma message("For AMD64 platform only!")
#endif
#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
typedef ULONGLONG QWORD;
const struct
{
QWORD qwDividend[2], qwDivisor[2], qwQuotient[2], qwRemainder[2];
} owTable[] = {{0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 0ULL},
{1ULL, 0ULL, 1ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 0ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL},
{2ULL, 0ULL, 1ULL, 0ULL, 2ULL, 0ULL, 0ULL, 0ULL},
{2ULL, 0ULL, 2ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{2ULL, 0ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, 2ULL, 0ULL},
{2ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 2ULL, 0ULL},
{~0ULL, 0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
{~0ULL, 0ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{~0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 0ULL},
{0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL},
{0ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
{0ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{0ULL, 1ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL, 1ULL},
{0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL, 1ULL},
{0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
{0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
{0ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, 1ULL},
{1ULL, 1ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL},
{1ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL, 2ULL, 0ULL},
{1ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, 1ULL, 0ULL},
{1ULL, 1ULL, 1ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{1ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 1ULL, 1ULL},
{1ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
{1ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
{1ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL},
{~0ULL, 1ULL, 1ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL},
{~0ULL, 1ULL, ~0ULL, 0ULL, 2ULL, 0ULL, 1ULL, 0ULL},
{~0ULL, 1ULL, 0ULL, 1ULL, 1ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, 1ULL, 1ULL, 1ULL, 1ULL, 0ULL, ~1ULL, 0ULL},
{~0ULL, 1ULL, ~0ULL, 1ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{~0ULL, 1ULL, 0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
{~0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
{~0ULL, 1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, ~0ULL, 1ULL},
{~0ULL, 0xFULL, 0xFULL, 0ULL, 0x1111111111111111ULL, 1ULL, 0ULL, 0ULL},
{~0xFULL, ~1ULL, ~1ULL, ~0ULL, 0ULL, 0ULL, ~0xFULL, ~1ULL},
{0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, ~0ULL, 0ULL, 0ULL},
{0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 0ULL, 0ULL},
{0ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
{0ULL, ~0ULL, 1ULL, 1ULL, ~1ULL, 0ULL, 2ULL, 0ULL},
{0ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{0ULL, ~0ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL, ~0ULL},
{0ULL, ~0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 0ULL, ~0ULL},
{1ULL, ~0ULL, 1ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 0ULL},
{1ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, 1ULL, 0ULL},
{1ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, 1ULL, 0ULL},
{1ULL, ~0ULL, 1ULL, 1ULL, ~1ULL, 0ULL, 3ULL, 0ULL},
{1ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
{1ULL, ~0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{1ULL, ~0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL, 1ULL, ~0ULL},
{~0ULL, ~0ULL, 1ULL, 0ULL, ~0ULL, ~0ULL, 0ULL, 0ULL},
{~0ULL, ~0ULL, ~0ULL, 0ULL, 1ULL, 1ULL, 0ULL, 0ULL},
{~0ULL, ~0ULL, 0ULL, 1ULL, ~0ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, ~0ULL, 1ULL, 1ULL, ~0ULL, 0ULL, 0ULL, 0ULL},
{~0ULL, ~0ULL, 1ULL, 3ULL, 0x5555555555555555ULL, 0ULL, 0xAAAAAAAAAAAAAAAAULL, 0ULL},
{~0ULL, ~0ULL, 0ULL, ~0ULL, 1ULL, 0ULL, ~0ULL, 0ULL},
{~0ULL, ~0ULL, 1ULL, ~0ULL, 1ULL, 0ULL, ~1ULL, 0ULL},
{~0ULL, ~0ULL, ~0ULL, ~0ULL, 1ULL, 0ULL, 0ULL, 0ULL},
{~0ULL, ~0ULL, ~1ULL, ~0ULL, 1ULL, 0ULL, 1ULL, 0ULL},
{0xBF25975319080000ULL, 0x530EDA741C71D4C3ULL, 0x14C34AB4676E4BABULL, 0x0000004DE2CAB081ULL, 0x0000000001110001ULL, 0ULL, 0x00000000003EB455ULL, 0ULL},
{0xBF25975319080000ULL, 0x530EDA741C71D4C3ULL, 0x0000000001110001ULL, 0ULL, 0x14C34AB4676E4BABULL, 0x0000004DE2CAB081ULL, 0x00000000003EB455ULL, 0ULL}};
#pragma intrinsic(_umul128)
__declspec(noinline)
QWORD *__umulti3(QWORD qwProduct[2], QWORD qwMultiplicand[2], QWORD qwMultiplier[2])
{
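// lower 128 bits of the product: the full 128-bit product of both low
// halves, plus the low 64 bits of the two cross products in the high half
qwProduct[0] = _umul128(qwMultiplicand[0], qwMultiplier[0], qwProduct + 1);
qwProduct[1] += qwMultiplicand[0] * qwMultiplier[1]
+ qwMultiplicand[1] * qwMultiplier[0];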
return qwProduct;
}
__declspec(noinline)
QWORD *__unopti4(QWORD qwQuotient[2], QWORD qwDividend[2], QWORD qwDivisor[2], QWORD qwRemainder[2])
{
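// no-op stand-in for __udivmodti4(): store through the remainder
// pointer, leave both input operands untouched
if (qwRemainder != NULL)
*qwRemainder = *qwDividend;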
return qwQuotient;
}
QWORD *__udivmodti4(QWORD qwQuotient[2], QWORD qwDividend[2], QWORD qwDivisor[2], QWORD qwRemainder[2]);
#pragma intrinsic(__shiftleft128, __shiftright128)
__forceinline
VOID lfsr128l(QWORD qw[2])
{
#ifndef XORSHIFT
// 128-bit linear feedback shift register (Galois form) using
// primitive polynomial 0x5DB2B62B0C5F8E1B:D8CCE715FCB2726D
QWORD qwMask = (LONGLONG) (qw[1]) >> 63;
qw[1] = __shiftleft128(qw[0], qw[1], 1)
^ (qwMask & 0x5DB2B62B0C5F8E1BULL);
qw[0] = (qwMask & 0xD8CCE715FCB2726DULL) ^ (qw[0] << 1);
#elif 1
// 128-bit linear feedback shift register (XorShift form)
// using shift constants from Richard Peirce Brent
QWORD qwTemp = qw[1];
qw[1] = qw[0];
qw[0] ^= qw[0] << 33;
qwTemp ^= qwTemp << 28;
qw[0] ^= qw[0] >> 31;
qwTemp ^= qwTemp >> 29;
qw[0] ^= qwTemp;
#else
// 128-bit linear feedback shift register (XorShift form)
// using shift constants from Melissa O'Neill
qw[1] ^= __shiftleft128(qw[0], qw[1], 26);
qw[0] ^= qw[0] << 26;
qw[0] ^= __shiftright128(qw[0], qw[1], 61);
qw[1] ^= qw[1] >> 61;
qw[1] ^= __shiftleft128(qw[0], qw[1], 37);
qw[0] ^= qw[0] << 37;
#endif
}
__forceinline
VOID lfsr128r(QWORD qw[2])
{
#ifndef XORSHIFT
// 128-bit linear feedback shift register (Galois form) using
// primitive polynomial 0xB64E4D3FA8E7331B:D871FA30D46D4DBA
QWORD qwMask = 0ULL - (qw[0] & 1ULL);
qw[0] = __shiftright128(qw[0], qw[1], 1)
^ (qwMask & 0xD871FA30D46D4DBAULL);
qw[1] = (qwMask & 0xB64E4D3FA8E7331BULL) ^ (qw[1] >> 1);
#elif 1
// 128-bit linear feedback shift register (XorShift form)
// using shift constants from Sebastiano Vigna
QWORD qwTemp = qw[1];
qw[1] = qw[0];
qwTemp ^= qwTemp << 23;
qw[0] ^= qw[0] >> 26;
qwTemp ^= qwTemp >> 17;
qw[0] ^= qwTemp;
#else
// 128-bit linear feedback shift register (XorShift form)
// using shift constants from Melissa O'Neill
qw[1] ^= __shiftleft128(qw[0], qw[1], 11);
qw[0] ^= qw[0] << 11;
qw[0] ^= __shiftright128(qw[0], qw[1], 61);
qw[1] ^= qw[1] >> 61;
qw[1] ^= __shiftleft128(qw[0], qw[1], 45);
qw[0] ^= qw[0] << 45;
#endif
}
__forceinline
VOID scale128(QWORD qwOut[2], QWORD qwIn[2])
{
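// the shift count is taken modulo 64, as SHRD and SHR do in hardware;
// the explicit mask avoids an undefined shift by 64 or more bits in C
qwOut[0] = __shiftright128(qwIn[0], qwIn[1], (BYTE) (qwIn[0] & 63ULL));
qwOut[1] = qwIn[1] >> (qwIn[0] & 63ULL);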
}
__declspec(safebuffers)
BOOL CDECL PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
CHAR szFormat[1024];
DWORD dwFormat;
DWORD dwOutput;
va_list vaInput;
va_start(vaInput, lpFormat);
dwFormat = wvsprintf(szFormat, lpFormat, vaInput);
va_end(vaInput);
if ((dwFormat == 0UL)
|| !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
return FALSE;
return dwOutput == dwFormat;
}
__declspec(noreturn)
VOID CDECL mainCRTStartup(VOID)
{
DWORD dw, dwCPUID[12];
QWORD qwT0, qwT1, qwT2, qwT3;
QWORD qwTx, qwTy, qwTz;
QWORD qwDividend[2], qwDivisor[2], qwQuotient[2], qwRemainder[2];
// 2**128 / golden ratio
QWORD qwLeft[2] = {0xF39CC0605CEDC834ULL, 0x9E3779B97F4A7C15ULL};
// bit-vector of prime numbers:
// bit p (value 2**p) is set for each prime p in [0, 127]
QWORD qwRight[2] = {0x28208A20A08A28ACULL, 0x800228A202088288ULL};
HANDLE hThread = GetCurrentThread();
HANDLE hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
if (hOutput == INVALID_HANDLE_VALUE)
ExitProcess(GetLastError());
if (SetThreadIdealProcessor(hThread, 0UL) == -1L)
ExitProcess(GetLastError());
__cpuid(dwCPUID, 0x80000000UL);
if (*dwCPUID >= 0x80000004UL)
{
__cpuid(dwCPUID, 0x80000002UL);
__cpuid(dwCPUID + 4, 0x80000003UL);
__cpuid(dwCPUID + 8, 0x80000004UL);
}
else
__movsb(dwCPUID, "unidentified processor", sizeof("unidentified processor"));
PrintFormat(hOutput, "\r\nTesting 128-bit division...\r\n");
// start at 1: entry 0 of the table is 0 / 0, a division by zero
for (dw = 1UL; dw < sizeof(owTable) / sizeof(*owTable); dw++)
{
PrintFormat(hOutput, "\r%lu", dw);
#if 0
if ((owTable[dw].qwDivisor[1] | owTable[dw].qwDivisor[0]) == 0ULL)
continue;
#endif
__udivmodti4(qwQuotient, owTable[dw].qwDividend, owTable[dw].qwDivisor, qwRemainder);
if ((qwQuotient[1] != owTable[dw].qwQuotient[1])
|| (qwQuotient[0] != owTable[dw].qwQuotient[0]))
PrintFormat(hOutput,
"\t0x%016I64X:%016I64X\a / %016I64X:%016I64X\r\n"
"\t0x%016I64X:%016I64X\r\n"
"\t0x%016I64X:%016I64X\r\n",
owTable[dw].qwDividend[1], owTable[dw].qwDividend[0],
owTable[dw].qwDivisor[1], owTable[dw].qwDivisor[0],
owTable[dw].qwQuotient[1], owTable[dw].qwQuotient[0],
qwQuotient[1], qwQuotient[0]);
if ((qwRemainder[1] != owTable[dw].qwRemainder[1])
|| (qwRemainder[0] != owTable[dw].qwRemainder[0]))
PrintFormat(hOutput,
"\t0x%016I64X:%016I64X\a %% %016I64X:%016I64X\r\n"
"\t0x%016I64X:%016I64X\r\n"
"\t0x%016I64X:%016I64X\r\n",
owTable[dw].qwDividend[1], owTable[dw].qwDividend[0],
owTable[dw].qwDivisor[1], owTable[dw].qwDivisor[0],
owTable[dw].qwRemainder[1], owTable[dw].qwRemainder[0],
qwRemainder[1], qwRemainder[0]);
}
PrintFormat(hOutput, "\r\nTiming 128-bit division on %.48hs\r\n", dwCPUID);
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT0))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
lfsr128l(qwLeft);
__unopti4(qwQuotient, qwLeft, qwRight, NULL);
lfsr128r(qwRight);
__unopti4(qwQuotient, qwLeft, qwRight, qwRemainder);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT1))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
lfsr128l(qwLeft);
__udivmodti4(qwQuotient, qwLeft, qwRight, NULL);
lfsr128r(qwRight);
__udivmodti4(qwQuotient, qwLeft, qwRight, qwRemainder);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT2))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
lfsr128l(qwLeft);
__umulti3(qwQuotient, qwLeft, qwRight);
lfsr128r(qwRight);
__umulti3(qwQuotient, qwLeft, qwRight);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT3))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
ExitProcess(GetLastError());
qwTz = qwT3 - qwT0;
qwT3 -= qwT2;
qwT2 -= qwT1;
qwT1 -= qwT0;
qwTy = qwT3 - qwT1;
qwTx = qwT2 - qwT1;
#ifdef CYCLES
PrintFormat(hOutput,
"\r\n"
"__unopti4() %6I64u.%09I64u 0\r\n"
"__udivmodti4() %6I64u.%09I64u %6I64u.%09I64u\r\n"
"__umulti3() %6I64u.%09I64u %6I64u.%09I64u\r\n"
" %6I64u.%09I64u clock cycles\r\n",
qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
qwTx / 1000000000ULL, qwTx % 1000000000ULL,
qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
qwTy / 1000000000ULL, qwTy % 1000000000ULL,
qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
PrintFormat(hOutput,
"\r\n"
"__unopti4() %6I64u.%07I64u 0\r\n"
"__udivmodti4() %6I64u.%07I64u %6I64u.%07I64u\r\n"
"__umulti3() %6I64u.%07I64u %6I64u.%07I64u\r\n"
" %6I64u.%07I64u nano-seconds\r\n",
qwT1 / 10000000ULL, qwT1 % 10000000ULL,
qwT2 / 10000000ULL, qwT2 % 10000000ULL,
qwTx / 10000000ULL, qwTx % 10000000ULL,
qwT3 / 10000000ULL, qwT3 % 10000000ULL,
qwTy / 10000000ULL, qwTy % 10000000ULL,
qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT0))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
lfsr128l(qwLeft);
scale128(qwDividend, qwLeft);
__unopti4(qwQuotient, qwDividend, qwDivisor, NULL);
lfsr128r(qwRight);
scale128(qwDivisor, qwRight);
__unopti4(qwQuotient, qwDividend, qwDivisor, qwRemainder);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT1))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
lfsr128l(qwLeft);
scale128(qwDividend, qwLeft);
__udivmodti4(qwQuotient, qwDividend, qwDivisor, NULL);
lfsr128r(qwRight);
scale128(qwDivisor, qwRight);
__udivmodti4(qwQuotient, qwDividend, qwDivisor, qwRemainder);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT2))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
lfsr128l(qwLeft);
scale128(qwDividend, qwLeft);
__umulti3(qwQuotient, qwDividend, qwDivisor);
lfsr128r(qwRight);
scale128(qwDivisor, qwRight);
__umulti3(qwQuotient, qwDividend, qwDivisor);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT3))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
ExitProcess(GetLastError());
qwTz = qwT3 - qwT0;
qwT3 -= qwT2;
qwT2 -= qwT1;
qwT1 -= qwT0;
qwTy = qwT3 - qwT1;
qwTx = qwT2 - qwT1;
#ifdef CYCLES
PrintFormat(hOutput,
"\r\n"
"__unopti4() %6I64u.%09I64u 0\r\n"
"__udivmodti4() %6I64u.%09I64u %6I64u.%09I64u\r\n"
"__umulti3() %6I64u.%09I64u %6I64u.%09I64u\r\n"
" %6I64u.%09I64u clock cycles\r\n",
qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
qwTx / 1000000000ULL, qwTx % 1000000000ULL,
qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
qwTy / 1000000000ULL, qwTy % 1000000000ULL,
qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
PrintFormat(hOutput,
"\r\n"
"__unopti4() %6I64u.%07I64u 0\r\n"
"__udivmodti4() %6I64u.%07I64u %6I64u.%07I64u\r\n"
"__umulti3() %6I64u.%07I64u %6I64u.%07I64u\r\n"
" %6I64u.%07I64u nano-seconds\r\n",
qwT1 / 10000000ULL, qwT1 % 10000000ULL,
qwT2 / 10000000ULL, qwT2 % 10000000ULL,
qwTx / 10000000ULL, qwTx % 10000000ULL,
qwT3 / 10000000ULL, qwT3 % 10000000ULL,
qwTy / 10000000ULL, qwTy % 10000000ULL,
qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
ExitProcess(0UL);
}
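The table entries of the program above store each 128-bit number as two 64-bit words, element [0] holding the lower and element [1] the upper half. As a cross-check that does not depend on the assembler routine, the following is a minimal portable sketch of a bitwise shift & subtract division on such pairs. It is a simplified illustration, not the author's `__udivmodti4()`: it always runs all 128 iterations instead of first aligning the divisor with the dividend, and the names `u128`, `u128_ge`, `u128_sub` and `udivmod128` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 128-bit pair: lo corresponds to [0], hi to [1]. */
typedef struct { uint64_t lo, hi; } u128;

/* a >= b for 128-bit pairs */
static int u128_ge(u128 a, u128 b)
{
    return a.hi > b.hi || (a.hi == b.hi && a.lo >= b.lo);
}

/* a - b modulo 2**128, with borrow from the lower into the upper half */
static u128 u128_sub(u128 a, u128 b)
{
    u128 r;

    r.lo = a.lo - b.lo;
    r.hi = a.hi - b.hi - (a.lo < b.lo);
    return r;
}

/* Shift & subtract division: one quotient bit per iteration,
   most significant bit first. The divisor must not be 0. */
u128 udivmod128(u128 dividend, u128 divisor, u128 *remainder)
{
    u128 quotient = {0, 0};
    u128 rest = {0, 0};
    int bit;

    assert((divisor.lo | divisor.hi) != 0);
    for (bit = 127; bit >= 0; bit--) {
        /* next dividend bit, taken from the upper or the lower half */
        uint64_t next = bit >= 64 ? (dividend.hi >> (bit - 64)) & 1
                                  : (dividend.lo >> bit) & 1;

        /* rest = 2 * rest + next; rest never exceeds the dividend bits
           consumed so far, so this shift cannot overflow 128 bits */
        rest.hi = (rest.hi << 1) | (rest.lo >> 63);
        rest.lo = (rest.lo << 1) | next;
        /* make room for the next quotient bit */
        quotient.hi = (quotient.hi << 1) | (quotient.lo >> 63);
        quotient.lo <<= 1;
        if (u128_ge(rest, divisor)) {
            rest = u128_sub(rest, divisor);
            quotient.lo |= 1;
        }
    }
    if (remainder != NULL)
        *remainder = rest;
    return quotient;
}
```

Every table entry with a non-zero divisor should satisfy dividend = quotient × divisor + remainder with remainder < divisor, so feeding an entry's dividend and divisor to `udivmod128()` is expected to reproduce its quotient and remainder.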
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef _M_AMD64
#pragma message("For AMD64 platform only!")
#endif
#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
typedef ULONGLONG QWORD;
#ifndef NATIVE
#define _(DIVIDEND, DIVISOR) {(DIVIDEND), (DIVISOR), (DIVIDEND) / (DIVISOR), (DIVIDEND) % (DIVISOR)}
const struct
{
QWORD qwDividend, qwDivisor, qwQuotient, qwRemainder;
} qwTable[] = {_(0x0000000000000000ULL, 0x0000000000000001ULL),
_(0x0000000000000001ULL, 0x0000000000000001ULL),
_(0x0000000000000002ULL, 0x0000000000000001ULL),
_(0x0000000000000002ULL, 0x0000000000000002ULL),
_(0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
_(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFFULL),
_(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x0000000000000002ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFDULL),
_(0x000000000FFFFFFFULL, 0x0000000000000001ULL),
_(0x0000000FFFFFFFFFULL, 0x000000000000000FULL),
_(0x0000000FFFFFFFFFULL, 0x0000000000000010ULL),
_(0x0000000000000100ULL, 0x000000000FFFFFFFULL),
_(0x00FFFFFFF0000000ULL, 0x0000000010000000ULL),
_(0x07FFFFFF80000000ULL, 0x0000000080000000ULL),
_(0x7FFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x7FFFFFFEFFFFFFF0ULL, 0x0000FFFFFFFFFFFEULL),
_(0x7FFFFFFEFFFFFFF0ULL, 0x7FFFFFFEFFFFFFF0ULL),
_(0x7FFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
_(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFDULL),
_(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL),
_(0x8000000000000000ULL, 0x0000000000000001ULL),
_(0x8000000000000000ULL, 0x0000000000000002ULL),
_(0x8000000000000000ULL, 0x0000000000000003ULL),
_(0x8000000000000000ULL, 0x00000000FFFFFFFDULL),
_(0x8000000000000000ULL, 0x00000000FFFFFFFEULL),
_(0x8000000000000000ULL, 0x00000000FFFFFFFFULL),
_(0x8000000000000000ULL, 0x0000000100000000ULL),
_(0x8000000000000000ULL, 0x0000000100000001ULL),
_(0x8000000000000000ULL, 0x0000000100000002ULL),
_(0x8000000000000000ULL, 0x0000000100000003ULL),
_(0x8000000000000000ULL, 0xFFFFFFFF00000000ULL),
_(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFDULL),
_(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
_(0x8000000080000000ULL, 0x0000000080000000ULL),
_(0x8000000080000001ULL, 0x0000000080000001ULL),
_(0xFFFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0xFFFFFFFFFFFFFFFCULL, 0x00000000FFFFFFFEULL),
_(0xFFFFFFFFFFFFFFFCULL, 0x0000000100000002ULL),
_(0xFFFFFFFFFFFFFFFEULL, 0x0000000080000000ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000001ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000002ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000003ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFDULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFEULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFFULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000001ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000002ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000003ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x00000001C0000001ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000380000003ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x7FFFFFFFFFFFFFFFULL),
_(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
_(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL)};
#undef _
#ifndef INTERN
QWORD __udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder);
#else
__declspec(noinline)
QWORD __udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
QWORD quotient;
DWORD index1, index2;
if (_BitScanReverse64(&index2, divisor))
if (_BitScanReverse64(&index1, dividend))
#if 0
if (index1 >= index2)
#else
if (dividend >= divisor)
#endif
{
// dividend >= divisor > 0,
// 64 > index1 >= index2 >= 0
// (number of leading '0' bits = 63 - index)
divisor <<= index1 - index2;
quotient = 0ULL;
do
{
quotient <<= 1;
if (dividend >= divisor)
{
dividend -= divisor;
quotient |= 1ULL;
}
divisor >>= 1;
} while (index1 >= ++index2);
if (remainder != NULL)
*remainder = dividend;
return quotient;
}
else // divisor > dividend > 0:
// quotient = 0, remainder = dividend
{
if (remainder != NULL)
*remainder = dividend;
return 0ULL;
}
else // divisor > dividend == 0:
// quotient = 0, remainder = 0
{
if (remainder != NULL)
*remainder = 0ULL;
return 0ULL;
}
else // divisor == 0:
// let the native division raise the divide-by-zero exception
{
if (remainder != NULL)
*remainder = dividend % divisor;
return dividend / divisor;
}
}
#endif // INTERN
#else
__declspec(noinline)
QWORD __udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
if (remainder != NULL)
*remainder = dividend % divisor;
return dividend / divisor;
}
#endif // NATIVE
__declspec(noinline)
QWORD __unopdi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
if (remainder != NULL)
*remainder = divisor;
return dividend;
}
__declspec(noinline)
QWORD __umuldi4(QWORD multiplicand, QWORD multiplier, QWORD *dummy)
{
if (dummy != NULL)
*dummy = 0ULL;
return multiplicand * multiplier;
}
__declspec(safebuffers)
BOOL CDECL PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
CHAR szFormat[1024];
DWORD dwFormat;
DWORD dwOutput;
va_list vaInput;
va_start(vaInput, lpFormat);
dwFormat = wvsprintf(szFormat, lpFormat, vaInput);
va_end(vaInput);
if ((dwFormat == 0UL)
|| !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
return FALSE;
return dwOutput == dwFormat;
}
__declspec(noreturn)
VOID CDECL mainCRTStartup(VOID)
{
DWORD dw, dwCPUID[12];
QWORD qwT0, qwT1, qwT2, qwT3;
QWORD qwTx, qwTy, qwTz;
volatile
QWORD qwQuotient;
QWORD qwRemainder, qwDividend, qwDivisor = ~0ULL;
// 2**64 / golden ratio
QWORD qwLeft = 0x9E3779B97F4A7C15ULL;
// bit-vector of prime numbers:
// bit p (value 2**p) is set for each prime p in [0, 63]
QWORD qwRight = 0x28208A20A08A28ACULL;
HANDLE hThread = GetCurrentThread();
HANDLE hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
if (hOutput == INVALID_HANDLE_VALUE)
ExitProcess(GetLastError());
if (SetThreadIdealProcessor(hThread, 0UL) == -1L)
ExitProcess(GetLastError());
__cpuid(dwCPUID, 0x80000000UL);
if (*dwCPUID >= 0x80000004UL)
{
__cpuid(dwCPUID, 0x80000002UL);
__cpuid(dwCPUID + 4, 0x80000003UL);
__cpuid(dwCPUID + 8, 0x80000004UL);
}
else
__movsb(dwCPUID, "unidentified processor", sizeof("unidentified processor"));
#ifndef NATIVE
#ifndef INTERN
PrintFormat(hOutput, "\r\nTesting 64-bit assembler division...\r\n");
#else
PrintFormat(hOutput, "\r\nTesting 64-bit C division...\r\n");
#endif
for (dw = 0UL; dw < sizeof(qwTable) / sizeof(*qwTable); dw++)
{
PrintFormat(hOutput, "\r%lu", dw);
qwQuotient = __udivmoddi4(qwTable[dw].qwDividend, qwTable[dw].qwDivisor, &qwRemainder);
if (qwQuotient != qwTable[dw].qwQuotient)
PrintFormat(hOutput,
"\t%I64u / %I64u:\a quotient %I64u not equal to %I64u\r\n",
qwTable[dw].qwDividend, qwTable[dw].qwDivisor, qwQuotient, qwTable[dw].qwQuotient);
if (qwQuotient > qwTable[dw].qwDividend)
PrintFormat(hOutput,
"\t%I64u / %I64u:\a quotient %I64u greater than dividend\r\n",
qwTable[dw].qwDividend, qwTable[dw].qwDivisor, qwQuotient);
if (qwRemainder != qwTable[dw].qwRemainder)
PrintFormat(hOutput,
"\t%I64u %% %I64u:\a remainder %I64u not equal to %I64u\r\n",
qwTable[dw].qwDividend, qwTable[dw].qwDivisor, qwRemainder, qwTable[dw].qwRemainder);
if (qwRemainder >= qwTable[dw].qwDivisor)
PrintFormat(hOutput,
"\t%I64u %% %I64u:\a remainder %I64u not less than divisor\r\n",
qwTable[dw].qwDividend, qwTable[dw].qwDivisor, qwRemainder);
}
#ifndef INTERN
PrintFormat(hOutput, "\r\nTiming 64-bit assembler division on %.48hs\r\n", dwCPUID);
#else
PrintFormat(hOutput, "\r\nTiming 64-bit C division on %.48hs\r\n", dwCPUID);
#endif
#else
PrintFormat(hOutput, "\r\nTiming 64-bit native division on %.48hs\r\n", dwCPUID);
#endif // NATIVE
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT0))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from George Marsaglia
qwLeft ^= qwLeft << 14;
qwLeft ^= qwLeft >> 31;
qwLeft ^= qwLeft << 45;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0xAD93D23594C935A9
qwLeft = (qwLeft << 1)
^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
qwQuotient = __unopdi4(qwLeft, qwRight, NULL);
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from Richard Peirce Brent
qwRight ^= qwRight << 10;
qwRight ^= qwRight >> 15;
qwRight ^= qwRight << 4;
qwRight ^= qwRight >> 13;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0x2B5926535897936B
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
qwQuotient = __unopdi4(qwLeft, qwRight, &qwRemainder);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT1))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from George Marsaglia
qwLeft ^= qwLeft << 14;
qwLeft ^= qwLeft >> 31;
qwLeft ^= qwLeft << 45;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0xAD93D23594C935A9
qwLeft = (qwLeft << 1)
^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
qwQuotient = __udivmoddi4(qwLeft, qwRight, NULL);
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from Richard Peirce Brent
qwRight ^= qwRight << 10;
qwRight ^= qwRight >> 15;
qwRight ^= qwRight << 4;
qwRight ^= qwRight >> 13;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0x2B5926535897936B
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
qwQuotient = __udivmoddi4(qwLeft, qwRight, &qwRemainder);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT2))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from George Marsaglia
qwLeft ^= qwLeft << 14;
qwLeft ^= qwLeft >> 31;
qwLeft ^= qwLeft << 45;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0xAD93D23594C935A9
qwLeft = (qwLeft << 1)
^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
qwQuotient = __umuldi4(qwLeft, qwRight, NULL);
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from Richard Peirce Brent
qwRight ^= qwRight << 10;
qwRight ^= qwRight >> 15;
qwRight ^= qwRight << 4;
qwRight ^= qwRight >> 13;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0x2B5926535897936B
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
qwQuotient = __umuldi4(qwLeft, qwRight, &qwRemainder);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT3))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
ExitProcess(GetLastError());
qwTz = qwT3 - qwT0;
qwT3 -= qwT2;
qwT2 -= qwT1;
qwT1 -= qwT0;
qwTy = qwT3 > qwT1 ? qwT3 - qwT1 : 0ULL;
qwTx = qwT2 - qwT1;
#ifdef CYCLES
PrintFormat(hOutput,
"\r\n"
"__unopdi4() %6I64u.%09I64u 0\r\n"
"__udivmoddi4() %6I64u.%09I64u %6I64u.%09I64u\r\n"
"__umuldi4() %6I64u.%09I64u %6I64u.%09I64u\r\n"
" %6I64u.%09I64u clock cycles\r\n",
qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
qwTx / 1000000000ULL, qwTx % 1000000000ULL,
qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
qwTy / 1000000000ULL, qwTy % 1000000000ULL,
qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
PrintFormat(hOutput,
"\r\n"
"__unopdi4() %6I64u.%07I64u 0\r\n"
"__udivmoddi4() %6I64u.%07I64u %6I64u.%07I64u\r\n"
"__umuldi4() %6I64u.%07I64u %6I64u.%07I64u\r\n"
" %6I64u.%07I64u nano-seconds\r\n",
qwT1 / 10000000ULL, qwT1 % 10000000ULL,
qwT2 / 10000000ULL, qwT2 % 10000000ULL,
qwTx / 10000000ULL, qwTx % 10000000ULL,
qwT3 / 10000000ULL, qwT3 % 10000000ULL,
qwTy / 10000000ULL, qwTy % 10000000ULL,
qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT0))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from George Marsaglia
qwLeft ^= qwLeft << 14;
qwLeft ^= qwLeft >> 31;
qwLeft ^= qwLeft << 45;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0xAD93D23594C935A9
qwLeft = (qwLeft << 1)
^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
qwDividend = qwLeft >> (qwLeft & 31ULL);
qwQuotient = __unopdi4(qwDividend, qwDivisor, &qwRemainder);
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from Richard Peirce Brent
qwRight ^= qwRight << 10;
qwRight ^= qwRight >> 15;
qwRight ^= qwRight << 4;
qwRight ^= qwRight >> 13;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0x2B5926535897936B
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
qwDivisor = qwRight >> (qwRight & 31ULL);
qwQuotient = __unopdi4(qwDividend, qwDivisor, &qwRemainder);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT1))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from George Marsaglia
qwLeft ^= qwLeft << 14;
qwLeft ^= qwLeft >> 31;
qwLeft ^= qwLeft << 45;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0xAD93D23594C935A9
qwLeft = (qwLeft << 1)
^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
qwDividend = qwLeft >> (qwLeft & 31ULL);
qwQuotient = __udivmoddi4(qwDividend, qwDivisor, &qwRemainder);
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from Richard Peirce Brent
qwRight ^= qwRight << 10;
qwRight ^= qwRight >> 15;
qwRight ^= qwRight << 4;
qwRight ^= qwRight >> 13;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0x2B5926535897936B
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
qwDivisor = qwRight >> (qwRight & 31ULL);
qwQuotient = __udivmoddi4(qwDividend, qwDivisor, &qwRemainder);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT2))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from George Marsaglia
qwLeft ^= qwLeft << 14;
qwLeft ^= qwLeft >> 31;
qwLeft ^= qwLeft << 45;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0xAD93D23594C935A9
qwLeft = (qwLeft << 1)
^ (((LONGLONG) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
qwDividend = qwLeft >> (qwLeft & 31ULL);
qwQuotient = __umuldi4(qwDividend, qwDivisor, &qwRemainder);
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from Richard Peirce Brent
qwRight ^= qwRight << 10;
qwRight ^= qwRight >> 15;
qwRight ^= qwRight << 4;
qwRight ^= qwRight >> 13;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0x2B5926535897936B
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
qwDivisor = qwRight >> (qwRight & 31ULL);
qwQuotient = __umuldi4(qwDividend, qwDivisor, &qwRemainder);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT3))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
ExitProcess(GetLastError());
qwTz = qwT3 - qwT0;
qwT3 -= qwT2;
qwT2 -= qwT1;
qwT1 -= qwT0;
qwTy = qwT3 > qwT1 ? qwT3 - qwT1 : 0ULL;
qwTx = qwT2 - qwT1;
#ifdef CYCLES
PrintFormat(hOutput,
"\r\n"
"__unopdi4() %6I64u.%09I64u 0\r\n"
"__udivmoddi4() %6I64u.%09I64u %6I64u.%09I64u\r\n"
"__umuldi3() %6I64u.%09I64u %6I64u.%09I64u\r\n"
" %6I64u.%09I64u clock cycles\r\n",
qwT1 / 1000000000ULL, qwT1 % 1000000000ULL,
qwT2 / 1000000000ULL, qwT2 % 1000000000ULL,
qwTx / 1000000000ULL, qwTx % 1000000000ULL,
qwT3 / 1000000000ULL, qwT3 % 1000000000ULL,
qwTy / 1000000000ULL, qwTy % 1000000000ULL,
qwTz / 1000000000ULL, qwTz % 1000000000ULL);
#else
PrintFormat(hOutput,
"\r\n"
"__unopdi4() %6I64u.%07I64u 0\r\n"
"__udivmoddi4() %6I64u.%07I64u %6I64u.%07I64u\r\n"
"__umuldi3() %6I64u.%07I64u %6I64u.%07I64u\r\n"
" %6I64u.%07I64u nano-seconds\r\n",
qwT1 / 10000000ULL, qwT1 % 10000000ULL,
qwT2 / 10000000ULL, qwT2 % 10000000ULL,
qwTx / 10000000ULL, qwTx % 10000000ULL,
qwT3 / 10000000ULL, qwT3 % 10000000ULL,
qwTy / 10000000ULL, qwTy % 10000000ULL,
qwTz / 10000000ULL, qwTz % 10000000ULL);
#endif
ExitProcess(0UL);
}
Save the first C source presented above as 128-amd64.c and the second C source as 64-amd64.c in an arbitrary, preferably empty directory, save the second assembler source presented above as udivmodti4.asm, the fourth assembler source as udivmodti4-hybrid.asm, and the eighth assembler source as udivmoddi4.asm in this directory too, then start the command prompt of the Windows software development kit for the AMD64 platform there, run the following command lines to build 64.exe, 64-div.exe, 128.exe and 128-hybrid.exe, and execute them:
ML64.EXE /Brepro /Cp /Cx /c /W3 /X udivmoddi4.asm
CL.EXE /Brepro /c /DCYCLES /GAFwy /O2y /W4 /Zl 64-amd64.c
LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:AMD64 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64.exe /RELEASE /SUBSYSTEM:CONSOLE 64-amd64.obj udivmoddi4.obj kernel32.lib user32.lib
CL.EXE /Brepro /c /DCYCLES /DNATIVE /GAFwy /O2y /W4 /Zl 64-amd64.c
LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:AMD64 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-div.exe /RELEASE /SUBSYSTEM:CONSOLE 64-amd64.obj kernel32.lib user32.lib
ML64.EXE /Brepro /Cp /Cx /c /DJccLess /W3 /X udivmodti4.asm
ML64.EXE /Brepro /Cp /Cx /c /W3 /X udivmodti4-hybrid.asm
CL.EXE /Brepro /c /DCYCLES /GAFwy /O2y /W4 /Zl 128-amd64.c
LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:AMD64 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:128.exe /RELEASE /SUBSYSTEM:CONSOLE 128-amd64.obj udivmodti4.obj kernel32.lib user32.lib
LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:AMD64 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:128-hybrid.exe /RELEASE /SUBSYSTEM:CONSOLE 128-amd64.obj udivmodti4-hybrid.obj kernel32.lib user32.lib
.\128.exe
.\128-hybrid.exe
.\64.exe
.\64-div.exe

For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: all 64-bit programs are pure Win32 console applications and build without the MSVCRT libraries.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: udivmoddi4.asm
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
64-amd64.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
64-amd64.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: udivmodti4.asm
Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: udivmodti4-hybrid.asm
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
128-amd64.c
128-amd64.c(197) : warning C4244: 'function' : conversion from 'QWORD' to 'BYTE', possible loss of data
128-amd64.c(266) : warning C4090: 'function' : different 'const' qualifiers
128-amd64.c(266) : warning C4090: 'function' : different 'const' qualifiers
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.

Testing 128-bit division... 80

Timing 128-bit division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

__unopti4() 8.406722043 0
__udivmodti4() 68.135103197 59.728381154
__umulti3() 15.426767694 7.020045651
91.968592934 clock cycles

__unopti4() 10.955079756 0
__udivmodti4() 71.382337619 60.427257863
__umulti3() 20.658580753 9.703500997
102.995998128 clock cycles

Testing 128-bit division... 80

Timing 128-bit division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

__unopti4() 8.500168812 0
__udivmodti4() 35.628062934 27.127894122
__umulti3() 15.499977787 6.999808975
59.628209533 clock cycles

__unopti4() 10.962429071 0
__udivmodti4() 127.868276342 116.905847271
__umulti3() 20.865134980 9.902705909
159.695840393 clock cycles

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

__unopdi4() 7.046572321 0
__udivmoddi4() 39.744549176 32.697976855
__umuldi3() 8.225293991 1.178721670
55.016415488 clock cycles

__unopdi4() 7.939823193 0
__udivmoddi4() 67.565681569 59.625858376
__umuldi3() 8.377724642 0.437901449
83.883229404 clock cycles

Timing 64-bit native division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

__unopdi4() 7.267861652 0
__udivmoddi4() 35.352622330 28.084760678
__umuldi3() 7.988646681 0.720785029
50.609130663 clock cycles

__unopdi4() 7.972264793 0
__udivmoddi4() 37.397281198 29.425016405
__umuldi3() 8.457423360 0.485158567
53.826969351 clock cycles

Now without the preprocessor macro CYCLES defined:
[…]
Testing 128-bit division... 80

Timing 128-bit division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

__unopti4() 3.3852217 0
__udivmodti4() 27.3937756 24.0085539
__umulti3() 6.2556401 2.8704184
37.0346374 nano-seconds

__unopti4() 4.3992282 0
__udivmodti4() 28.3453817 23.9461535
__umulti3() 8.3772537 3.9780255
41.1218636 nano-seconds

Testing 128-bit division... 80

Timing 128-bit division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

__unopti4() 3.3852217 0
__udivmodti4() 14.5080930 11.1228713
__umulti3() 6.2868403 2.9016186
24.1801550 nano-seconds

__unopti4() 4.3368278 0
__udivmodti4() 51.9015327 47.5647049
__umulti3() 8.3772537 4.0404259
64.6156142 nano-seconds

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

__unopdi4() 3.0264194 0
__udivmoddi4() 14.6952942 11.6688748
__umuldi3() 3.1512202 0.1248008
20.8729338 nano-seconds

__unopdi4() 3.0888198 0
__udivmoddi4() 26.7229713 23.6341515
__umuldi3() 3.5256226 0.4368028
33.3374137 nano-seconds

Timing 64-bit native division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

__unopdi4() 2.7300175 0
__udivmoddi4() 14.4144924 11.6844749
__umuldi3() 3.1356201 0.4056026
20.2801300 nano-seconds

__unopdi4() 3.1044199 0
__udivmoddi4() 15.1320970 12.0276771
__umuldi3() 3.5100225 0.4056026
21.7465394 nano-seconds
[…]
Testing 128-bit division... 80

Timing 128-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopti4() 6.787912997 0
__udivmodti4() 59.706410079 52.918497082
__umulti3() 9.539191229 2.751278232
76.033514305 clock cycles

__unopti4() 9.064762228 0
__udivmodti4() 64.037532443 54.972770215
__umulti3() 12.381997632 3.317235404
85.484292303 clock cycles

Testing 128-bit division... 80

Timing 128-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopti4() 6.721662277 0
__udivmodti4() 29.001432494 22.279770217
__umulti3() 9.428926160 2.707263883
45.152020931 clock cycles

__unopti4() 8.960905247 0
__udivmodti4() 83.353539778 74.392634531
__umulti3() 12.205547423 3.244642176
104.519992448 clock cycles

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopdi4() 5.247617465 0
__udivmoddi4() 24.890194689 19.642577224
__umuldi3() 6.295783447 1.048165982
36.433595601 clock cycles

__unopdi4() 5.941744662 0
__udivmoddi4() 44.505047583 38.563302921
__umuldi3() 7.127920907 1.186176245
57.574713152 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz

__unopdi4() 4.928616197 0
__udivmoddi4() 36.111315035 31.182698838
__umuldi3() 6.104300272 1.175684075
47.144231504 clock cycles

__unopdi4() 5.832601506 0
__udivmoddi4() 37.489066778 31.656465272
__umuldi3() 6.694749067 0.862147561
50.016417351 clock cycles
[…]
Testing 128-bit division... 80

Timing 128-bit division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz

__unopti4() 6.137470372 0
__udivmodti4() 97.558400914 91.420930542
__umulti3() 7.721820100 1.584349728
111.417691386 clock cycles

Testing 128-bit division... 80

Timing 128-bit division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz

__unopti4() 6.169866194 0
__udivmodti4() 26.153439674 19.983573480
__umulti3() 7.535166722 1.365300528
39.858472590 clock cycles

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz

__unopdi4() 3.936338838 0
__udivmoddi4() 20.136963782 16.200624944
__umuldi3() 4.070675478 0.134336640
28.143978098 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz

__unopdi4() 3.938140068 0
__udivmoddi4() 40.965728700 37.027588632
__umuldi3() 4.276305996 0.338165928
49.180174764 clock cycles
[…]
Testing 128-bit division... 80

Timing 128-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopti4() 5.739886508 0
__udivmodti4() 60.265247522 54.525361014
__umulti3() 8.030493537 2.290607029
74.035627567 clock cycles

__unopti4() 8.376397925 0
__udivmodti4() 63.878099605 55.501701680
__umulti3() 10.674320936 2.297923011
82.928818466 clock cycles

Testing 128-bit division... 80

Timing 128-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopti4() 5.759768704 0
__udivmodti4() 24.207704489 18.447935785
__umulti3() 7.991973095 2.232204391
37.959446288 clock cycles

__unopti4() 8.356751289 0
__udivmodti4() 73.164876383 64.808125094
__umulti3() 10.667141168 2.310389879
92.188768840 clock cycles

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopdi4() 5.215172319 0
__udivmoddi4() 20.464980809 15.249808490
__umuldi3() 4.339737255 0.000000000
30.019890383 clock cycles

__unopdi4() 6.034145232 0
__udivmoddi4() 37.823775351 31.789630119
__umuldi3() 5.595748061 0.000000000
49.453668644 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz

__unopdi4() 4.311459182 0
__udivmoddi4() 32.456528199 28.145069017
__umuldi3() 5.798396574 1.486937392
42.566383955 clock cycles

__unopdi4() 5.594526971 0
__udivmoddi4() 34.625131407 29.030604436
__umuldi3() 5.613384251 0.018857280
45.833042629 clock cycles
[…]
Testing 128-bit division... 80

Timing 128-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopti4() 3.384227961 0
__udivmodti4() 34.535941576 31.151713615
__umulti3() 4.561600376 1.177372415
42.481769913 clock cycles

__unopti4() 4.958807640 0
__udivmodti4() 36.796688055 31.837880415
__umulti3() 6.071705006 1.112897366
47.827200701 clock cycles

Testing 128-bit division... 80

Timing 128-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopti4() 3.285746669 0
__udivmodti4() 14.265474814 10.979728145
__umulti3() 4.595527857 1.309781188
22.146749340 clock cycles

__unopti4() 4.911969125 0
__udivmodti4() 42.153665292 37.241696167
__umulti3() 6.065414902 1.153445777
53.131049319 clock cycles

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopdi4() 4.833738399 0
__udivmoddi4() 16.775388790 11.941650391
__umuldi3() 2.665548625 0.000000000
24.274675814 clock cycles

__unopdi4() 4.713733765 0
__udivmoddi4() 25.241120837 20.527387072
__umuldi3() 3.737127811 0.000000000
33.691982413 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz

__unopdi4() 5.110940314 0
__udivmoddi4() 33.205803994 28.094863680
__umuldi3() 5.463078377 0.352138063
43.779822685 clock cycles

__unopdi4() 5.326425239 0
__udivmoddi4() 32.876005661 27.549580422
__umuldi3() 5.327849707 0.001424468
43.530280607 clock cycles
[…]
Testing 128-bit division... 80

Timing 128-bit division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz

__unopti4() 5.540168007 0
__udivmodti4() 79.168789872 73.628621865
__umulti3() 6.110792248 0.570624241
90.819750127 clock cycles

Testing 128-bit division... 80

Timing 128-bit division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz

__unopti4() 5.564817542 0
__udivmodti4() 20.911815694 15.346998152
__umulti3() 6.121753970 0.556936428
32.598387206 clock cycles

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz

__unopdi4() 3.170010417 0
__udivmoddi4() 16.466358752 13.296348335
__umuldi3() 3.334743188 0.164732771
22.971112357 clock cycles

Timing 64-bit native division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz

__unopdi4() 3.157196080 0
__udivmoddi4() 33.291178787 30.133982707
__umuldi3() 3.494135522 0.336939442
39.942510389 clock cycles
[…]
Testing 128-bit division... 80

Timing 128-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopti4() 8.868646128 0
__udivmodti4() 46.689956608 37.821310480
__umulti3() 21.770531089 12.901884961
77.329133825 clock cycles

__unopti4() 11.520301596 0
__udivmodti4() 52.089834710 40.569533114
__umulti3() 17.011904024 5.491602428
80.622040330 clock cycles

Testing 128-bit division... 80

Timing 128-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopti4() 8.881698241 0
__udivmodti4() 35.183735229 26.302036988
__umulti3() 21.749787857 12.868089616
65.815221327 clock cycles

__unopti4() 11.526274366 0
__udivmodti4() 89.981335374 78.455061008
__umulti3() 18.157572955 6.631298589
119.665182695 clock cycles

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopdi4() 10.051541116 0
__udivmoddi4() 34.927090936 24.875549820
__umuldi3() 9.377512848 0.000000000
54.356144900 clock cycles

__unopdi4() 8.748480388 0
__udivmoddi4() 67.648442691 58.899962303
__umuldi3() 8.607450535 0.000000000
85.004373614 clock cycles

Timing 64-bit native division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopdi4() 10.965153851 0
__udivmoddi4() 21.871554419 10.906400568
__umuldi3() 8.548907922 0.000000000
41.385616192 clock cycles

__unopdi4() 8.868903219 0
__udivmoddi4() 31.380700111 22.511796892
__umuldi3() 9.254742318 0.385839099
49.504345648 clock cycles

Now without the preprocessor macro CYCLES defined:
[…]
Testing 128-bit division... 80

Timing 128-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopti4() 4.0781250 0
__udivmodti4() 23.7656250 19.6875000
__umulti3() 11.1250000 7.0468750
38.9687500 nano-seconds

__unopti4() 5.2968750 0
__udivmodti4() 23.0625000 17.7656250
__umulti3() 7.4218750 2.1250000
35.7812500 nano-seconds

Testing 128-bit division... 80

Timing 128-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopti4() 4.0937500 0
__udivmodti4() 18.9062500 14.8125000
__umulti3() 9.7968750 5.7031250
32.7968750 nano-seconds

__unopti4() 5.0156250 0
__udivmodti4() 37.1875000 32.1718750
__umulti3() 7.5000000 2.4843750
49.7031250 nano-seconds

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopdi4() 3.3906250 0
__udivmoddi4() 16.2968750 12.9062500
__umuldi3() 4.8750000 1.4843750
24.5625000 nano-seconds

__unopdi4() 5.1093750 0
__udivmoddi4() 24.6718750 19.5625000
__umuldi3() 5.2500000 0.1406250
35.0312500 nano-seconds

Timing 64-bit native division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G

__unopdi4() 3.1875000 0
__udivmoddi4() 8.7812500 5.5937500
__umuldi3() 3.1093750 0.0000000
15.0781250 nano-seconds

__unopdi4() 3.5000000 0
__udivmoddi4() 11.6093750 8.1093750
__umuldi3() 3.2968750 0.0000000
18.4062500 nano-seconds
[…]
Testing 128-bit division... 80

Timing 128-bit division on AMD Ryzen 7 2700X Eight-Core Processor

__unopti4() 6.486039210 0
__udivmodti4() 26.669040891 20.183001681
__umulti3() 9.301024928 2.814985718
42.456105029 clock cycles

__unopti4() 9.463028389 0
__udivmodti4() 28.686989038 19.223960649
__umulti3() 12.162446193 2.699417804
50.312463620 clock cycles

Testing 128-bit division... 80

Timing 128-bit division on AMD Ryzen 7 2700X Eight-Core Processor

__unopti4() 6.759914430 0
__udivmodti4() 29.315665619 22.555751189
__umulti3() 9.908795028 3.148880598
45.984375077 clock cycles

__unopti4() 10.063938544 0
__udivmodti4() 73.125239046 63.061300502
__umulti3() 12.680222751 2.616284207
95.869400341 clock cycles

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on AMD Ryzen 7 2700X Eight-Core Processor

__unopdi4() 5.169526411 0
__udivmoddi4() 26.320155604 21.150629193
__umuldi3() 5.166327910 0.000000000
36.656009925 clock cycles

__unopdi4() 5.596172042 0
__udivmoddi4() 47.084314600 41.488142558
__umuldi3() 5.595622223 0.000000000
58.276108865 clock cycles

Timing 64-bit native division on AMD Ryzen 7 2700X Eight-Core Processor

__unopdi4() 5.242663759 0
__udivmoddi4() 19.351564974 14.108901215
__umuldi3() 6.522515592 1.279851833
31.116744325 clock cycles

__unopdi4() 5.654228263 0
__udivmoddi4() 22.197831810 16.543603547
__umuldi3() 6.958791467 1.304563204
34.810851540 clock cycles
[…]
Testing 128-bit division... 80

Timing 128-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopti4() 6.898510224 0
__udivmodti4() 26.925326748 20.026816524
__umulti3() 9.177466284 2.278956060
43.001303256 clock cycles

__unopti4() 9.476578368 0
__udivmodti4() 29.322601849 19.846023481
__umulti3() 11.710555056 2.233976688
50.509735273 clock cycles

Testing 128-bit division... 80

Timing 128-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopti4() 6.865702096 0
__udivmodti4() 27.542023885 20.676321789
__umulti3() 9.108802297 2.243100201
43.516528278 clock cycles

__unopti4() 9.442571687 0
__udivmodti4() 68.794109504 59.351537817
__umulti3() 11.703519360 2.260947673
89.940200551 clock cycles

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4() 4.722583824 0
__udivmoddi4() 26.829651312 22.107067488
__umuldi3() 4.722048143 0.000000000
36.274283279 clock cycles

__unopdi4() 5.156534846 0
__udivmoddi4() 46.419813577 41.263278731
__umuldi3() 5.153521140 0.000000000
56.729869563 clock cycles

Timing 64-bit native division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4() 4.721372604 0
__udivmoddi4() 19.197411303 14.476038699
__umuldi3() 5.582367577 0.860994973
29.501151484 clock cycles

__unopdi4() 5.153817924 0
__udivmoddi4() 22.015109233 16.861291309
__umuldi3() 6.009193188 0.855375264
33.178120345 clock cycles

And without the preprocessor macro CYCLES defined:
[…]
Testing 128-bit division... 80

Timing 128-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopti4() 2.0000000 0
__udivmodti4() 10.7343750 8.7343750
__umulti3() 2.5312500 0.5312500
15.2656250 nano-seconds

__unopti4() 2.6250000 0
__udivmodti4() 8.1093750 5.4843750
__umulti3() 3.2031250 0.5781250
13.9375000 nano-seconds

Testing 128-bit division... 80

Timing 128-bit division on AMD Ryzen 5 3600 6-Core Processor

__unopti4() 2.0156250 0
__udivmodti4() 8.7500000 6.7343750
__umulti3() 2.5312500 0.5156250
13.2968750 nano-seconds

__unopti4() 2.6250000 0
__udivmodti4() 20.0468750 17.4218750
__umulti3() 3.1875000 0.5625000
25.8593750 nano-seconds

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4() 1.3125000 0
__udivmoddi4() 7.2812500 5.9687500
__umuldi3() 1.3281250 0.0156250
9.9218750 nano-seconds

__unopdi4() 1.4375000 0
__udivmoddi4() 12.3281250 10.8906250
__umuldi3() 1.3125000 0.0000000
15.0781250 nano-seconds

Timing 64-bit native division on AMD Ryzen 5 3600 6-Core Processor

__unopdi4() 1.3125000 0
__udivmoddi4() 5.3281250 4.0156250
__umuldi3() 1.5625000 0.2500000
8.2031250 nano-seconds

__unopdi4() 1.4218750 0
__udivmoddi4() 6.1250000 4.7031250
__umuldi3() 1.6718750 0.2500000
9.2187500 nano-seconds
[…]
Testing 128-bit division... 80

Timing 128-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopti4() 6.639901585 0
__udivmodti4() 25.407730112 18.767828527
__umulti3() 8.787561378 2.147659793
40.835193075 clock cycles

__unopti4() 9.009978790 0
__udivmodti4() 27.641956656 18.631977866
__umulti3() 11.091630160 2.081651370
47.743565606 clock cycles

Testing 128-bit division... 80

Timing 128-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopti4() 6.564753997 0
__udivmodti4() 26.123169316 19.558415319
__umulti3() 8.788291128 2.223537131
41.476214441 clock cycles

__unopti4() 9.088542617 0
__udivmodti4() 65.108671217 56.020128600
__umulti3() 11.086648437 1.998105820
85.283862271 clock cycles

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4() 4.487864035 0
__udivmoddi4() 25.416913553 20.929049518
__umuldi3() 4.530544610 0.042680575
34.435322198 clock cycles

__unopdi4() 4.909401335 0
__udivmoddi4() 43.696358312 38.786956977
__umuldi3() 4.915678491 0.006277156
53.521438138 clock cycles

Timing 64-bit native division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4() 4.486075560 0
__udivmoddi4() 17.826014215 13.339938655
__umuldi3() 5.293388605 0.807313045
27.605478380 clock cycles

__unopdi4() 4.913181349 0
__udivmoddi4() 20.468841039 15.555659690
__umuldi3() 5.707274649 0.794093300
31.089297037 clock cycles

And without the preprocessor macro CYCLES defined:
[…]
Testing 128-bit division... 80

Timing 128-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopti4() 1.8437500 0
__udivmodti4() 6.7187500 4.8750000
__umulti3() 2.3125000 0.4687500
10.8750000 nano-seconds

__unopti4() 2.3906250 0
__udivmodti4() 7.2343750 4.8437500
__umulti3() 2.9531250 0.5625000
12.5781250 nano-seconds

Testing 128-bit division... 80

Timing 128-bit division on AMD Ryzen 9 3900XT 12-Core Processor

__unopti4() 1.7968750 0
__udivmodti4() 7.1875000 5.3906250
__umulti3() 2.3125000 0.5156250
11.2968750 nano-seconds

__unopti4() 2.3437500 0
__udivmodti4() 17.2812500 14.9375000
__umulti3() 2.9062500 0.5625000
22.5312500 nano-seconds

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4() 1.1718750 0
__udivmoddi4() 6.5468750 5.3750000
__umuldi3() 1.3281250 0.1562500
9.0468750 nano-seconds

__unopdi4() 1.2968750 0
__udivmoddi4() 11.0937500 9.7968750
__umuldi3() 1.1562500 0.0000000
13.5468750 nano-seconds

Timing 64-bit native division on AMD Ryzen 9 3900XT 12-Core Processor

__unopdi4() 1.2031250 0
__udivmoddi4() 4.7031250 3.5000000
__umuldi3() 1.3593750 0.1562500
7.2656250 nano-seconds

__unopdi4() 1.2968750 0
__udivmoddi4() 5.4062500 4.1093750
__umuldi3() 1.6406250 0.3437500
8.3437500 nano-seconds
[…]
Testing 128-bit division... 80

Timing 128-bit division on AMD EPYC 7713 64-Core Processor

__unopti4() 4.210726820 0
__udivmodti4() 13.480688960 9.269962140
__umulti3() 8.231189400 4.020462580
25.922605180 clock cycles

__unopti4() 6.144526480 0
__udivmodti4() 16.105263640 9.960737160
__umulti3() 7.747329420 1.602802940
29.997119540 clock cycles

Testing 128-bit division... 80

Timing 128-bit division on AMD EPYC 7713 64-Core Processor

__unopti4() 4.213688400 0
__udivmodti4() 15.444323440 11.230635040
__umulti3() 8.231448620 4.017760220
27.889460460 clock cycles

__unopti4() 6.144599980 0
__udivmodti4() 38.763362940 32.618762960
__umulti3() 7.749265980 1.604666000
52.657228900 clock cycles

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on AMD EPYC 7713 64-Core Processor

__unopdi4() 3.566402000 0
__udivmoddi4() 16.570870520 13.004468520
__umuldi3() 3.561979020 0.000000000
23.699251540 clock cycles

__unopdi4() 3.889461880 0
__udivmoddi4() 26.724160800 22.834698920
__umuldi3() 3.889344680 0.000000000
34.502967360 clock cycles

Timing 64-bit native division on AMD EPYC 7713 64-Core Processor

__unopdi4() 3.557871840 0
__udivmoddi4() 6.782881360 3.225009520
__umuldi3() 4.205765360 0.647893520
14.546518560 clock cycles

__unopdi4() 3.882580440 0
__udivmoddi4() 6.782383720 2.899803280
__umuldi3() 4.532050080 0.649469640
15.197014240 clock cycles

And without the preprocessor macro CYCLES defined:
[…]
Testing 128-bit division... 80

Timing 128-bit division on AMD EPYC 7713 64-Core Processor

__unopti4() 2.1718750 0
__udivmodti4() 6.7656250 4.5937500
__umulti3() 4.0781250 1.9062500
13.0156250 nano-seconds

__unopti4() 2.9687500 0
__udivmodti4() 8.1875000 5.2187500
__umulti3() 3.8906250 0.9218750
15.0468750 nano-seconds

Testing 128-bit division... 80

Timing 128-bit division on AMD EPYC 7713 64-Core Processor

__unopti4() 2.1406250 0
__udivmodti4() 8.1875000 6.0468750
__umulti3() 4.0781250 1.9375000
14.4062500 nano-seconds

__unopti4() 2.9687500 0
__udivmodti4() 20.0000000 17.0312500
__umulti3() 3.9062500 0.9375000
26.8750000 nano-seconds

Testing 64-bit assembler division... 57

Timing 64-bit assembler division on AMD EPYC 7713 64-Core Processor

__unopdi4() 1.8125000 0
__udivmoddi4() 8.4218750 6.6093750
__umuldi3() 1.8281250 0.0156250
12.0625000 nano-seconds

__unopdi4() 1.9843750 0
__udivmoddi4() 13.5000000 11.5156250
__umuldi3() 1.8125000 0.0000000
17.2968750 nano-seconds

Timing 64-bit native division on AMD EPYC 7713 64-Core Processor

__unopdi4() 1.8281250 0
__udivmoddi4() 3.4531250 1.6250000
__umuldi3() 2.1406250 0.3125000
7.4218750 nano-seconds

__unopdi4() 1.9843750 0
__udivmoddi4() 3.4531250 1.4687500
__umuldi3() 2.3125000 0.3281250
7.7500000 nano-seconds
__udivmoddi4()
function.
Note: with the preprocessor macro HELPER defined, it uses the compiler helper routines _alldiv(), _alldvrm(), _allmul(), _allrem(), _aulldiv(), _aulldvrm() and _aullrem() instead, which the Microsoft Visual C compiler calls to perform 64-bit division and multiplication.
Note: with the preprocessor macro CYCLES defined, it measures the execution time in processor clock cycles and runs on Windows Vista® and newer versions; otherwise it measures the execution time in nano-seconds and runs on all versions of Windows™ NT.
Note: it uses the same pseudo-random number generators as the second C program for 64-bit processors, so their results are directly comparable.
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef _M_IX86
#pragma message("For I386 platform only!")
#endif
#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
typedef LONGLONG SQWORD;
typedef ULONGLONG QWORD;
#define _(DIVIDEND, DIVISOR) {(DIVIDEND), (DIVISOR), (DIVIDEND) / (DIVISOR), (DIVIDEND) % (DIVISOR)}
const struct _ull
{
QWORD ullDividend, ullDivisor, ullQuotient, ullRemainder;
} ullTable[] = {_(0x0000000000000000ULL, 0x0000000000000001ULL),
_(0x0000000000000001ULL, 0x0000000000000001ULL),
_(0x0000000000000002ULL, 0x0000000000000001ULL),
_(0x0000000000000002ULL, 0x0000000000000002ULL),
_(0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
_(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFFULL),
_(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x0000000000000002ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFDULL),
_(0x000000000FFFFFFFULL, 0x0000000000000001ULL),
_(0x0000000FFFFFFFFFULL, 0x000000000000000FULL),
_(0x0000000FFFFFFFFFULL, 0x0000000000000010ULL),
_(0x0000000000000100ULL, 0x000000000FFFFFFFULL),
_(0x00FFFFFFF0000000ULL, 0x0000000010000000ULL),
_(0x07FFFFFF80000000ULL, 0x0000000080000000ULL),
_(0x7FFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x7FFFFFFEFFFFFFF0ULL, 0x0000FFFFFFFFFFFEULL),
_(0x7FFFFFFEFFFFFFF0ULL, 0x7FFFFFFEFFFFFFF0ULL),
_(0x7FFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
_(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFDULL),
_(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL),
_(0x8000000000000000ULL, 0x0000000000000001ULL),
_(0x8000000000000000ULL, 0x0000000000000002ULL),
_(0x8000000000000000ULL, 0x0000000000000003ULL),
_(0x8000000000000000ULL, 0x00000000FFFFFFFDULL),
_(0x8000000000000000ULL, 0x00000000FFFFFFFEULL),
_(0x8000000000000000ULL, 0x00000000FFFFFFFFULL),
_(0x8000000000000000ULL, 0x0000000100000000ULL),
_(0x8000000000000000ULL, 0x0000000100000001ULL),
_(0x8000000000000000ULL, 0x0000000100000002ULL),
_(0x8000000000000000ULL, 0x0000000100000003ULL),
_(0x8000000000000000ULL, 0xFFFFFFFF00000000ULL),
_(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFDULL),
_(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
_(0x8000000080000000ULL, 0x0000000080000000ULL),
_(0x8000000080000001ULL, 0x0000000080000001ULL),
_(0xFFFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0xFFFFFFFFFFFFFFFCULL, 0x00000000FFFFFFFEULL),
_(0xFFFFFFFFFFFFFFFCULL, 0x0000000100000002ULL),
_(0xFFFFFFFFFFFFFFFEULL, 0x0000000080000000ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000001ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000002ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000003ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFDULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFEULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFFULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000001ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000002ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000003ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x00000001C0000001ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000380000003ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x7FFFFFFFFFFFFFFFULL),
_(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
_(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL)};
const struct _ll
{
SQWORD llDividend, llDivisor, llQuotient, llRemainder;
} llTable[] = {_(0x0000000000000000LL, 0x0000000000000001LL), // 0, 1
_(0x0000000000000001LL, 0x0000000000000001LL), // 1, 1
_(0x0000000000000000LL, 0xFFFFFFFFFFFFFFFFLL), // 0, -1
_(0x0000000000000001LL, 0xFFFFFFFFFFFFFFFFLL), // 1, -1
_(0x0000000000000001LL, 0xFFFFFFFFFFFFFFFELL), // 1, -2
_(0x0000000000000002LL, 0xFFFFFFFFFFFFFFFELL), // 2, -2
_(0x000000000FFFFFFFLL, 0x0000000000000001LL),
_(0x0000000FFFFFFFFFLL, 0x000000000000000FLL),
_(0x0000000FFFFFFFFFLL, 0x0000000000000010LL),
_(0x0000000000000100LL, 0x000000000FFFFFFFLL),
_(0x00FFFFFFF0000000LL, 0x0000000010000000LL),
_(0x07FFFFFF80000000LL, 0x0000000080000000LL),
_(0x7FFFFFFEFFFFFFF0LL, 0xFFFFFFFFFFFFFFFELL),
_(0x7FFFFFFEFFFFFFF0LL, 0x0000FFFFFFFFFFFELL),
_(0x7FFFFFFEFFFFFFF0LL, 0x7FFFFFFEFFFFFFF0LL),
_(0x7FFFFFFFFFFFFFFFLL, 0x8000000000000000LL), // llmax, llmin
_(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFDLL), // llmax, -3
_(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFELL), // llmax, -2
_(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFFLL), // llmax, -1
_(0x8000000000000000LL, 0x0000000000000001LL), // llmin, 1
_(0x8000000000000000LL, 0x0000000000000002LL), // llmin, 2
_(0x8000000000000000LL, 0x0000000000000003LL), // llmin, 3
_(0x8000000000000000LL, 0x00000000FFFFFFFELL),
_(0x8000000000000000LL, 0x00000000FFFFFFFFLL),
_(0x8000000000000000LL, 0x0000000100000000LL),
_(0x8000000000000000LL, 0x0000000100000001LL),
_(0x8000000000000000LL, 0x0000000100000002LL),
_(0x8000000000000000LL, 0x8000000000000000LL), // llmin, llmin
_(0x8000000000000000LL, 0xFFFFFFFF00000000LL),
_(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFDLL), // llmin, -3
_(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFELL), // llmin, -2
_(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFFLL), // llmin, -1
_(0x8000000080000000LL, 0x0000000080000000LL),
_(0x8000000080000001LL, 0x0000000080000001LL),
_(0xFFFFFFFEFFFFFFF0LL, 0xFFFFFFFFFFFFFFFELL),
_(0xFFFFFFFFFFFFFFFELL, 0x0000000080000000LL),
_(0xFFFFFFFFFFFFFFFELL, 0x0000000000000001LL), // -2, 1
_(0xFFFFFFFFFFFFFFFELL, 0x0000000000000002LL), // -2, 2
_(0xFFFFFFFFFFFFFFFELL, 0xFFFFFFFFFFFFFFFELL), // -2, -2
_(0xFFFFFFFFFFFFFFFELL, 0xFFFFFFFFFFFFFFFFLL), // -2, -1
_(0xFFFFFFFFFFFFFFFFLL, 0x0000000000000001LL), // -1, 1
_(0xFFFFFFFFFFFFFFFFLL, 0x0000000000000002LL), // -1, 2
_(0xFFFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFELL), // -1, -2
_(0xFFFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFFLL)}; // -1, -1
#undef _
#ifndef HELPER
SQWORD __divdi3(SQWORD dividend, SQWORD divisor);
SQWORD __moddi3(SQWORD dividend, SQWORD divisor);
SQWORD __muldi3(SQWORD multiplicand, SQWORD multiplier);
QWORD __udivdi3(QWORD dividend, QWORD divisor);
QWORD __umoddi3(QWORD dividend, QWORD divisor);
QWORD __umuldi3(QWORD multiplicand, QWORD multiplier);
QWORD __udivmoddi4(QWORD dividend, QWORD divisor, QWORD *remainder);
__declspec(noinline)
QWORD __unopdi4(QWORD dividend, QWORD divisor, QWORD *remainder)
{
if (remainder != NULL)
*remainder = divisor;
return dividend;
}
#else
__declspec(naked)
__declspec(noinline)
QWORD WINAPI _aullnop(QWORD left, QWORD right)
{
__asm ret 16
}
#endif // HELPER
__forceinline // companion for __emulu()
struct
{
DWORD ulQuotient, ulRemainder;
} WINAPI __edivmodu(QWORD ullDividend, DWORD ulDivisor)
{
__asm mov eax, dword ptr ullDividend
__asm mov edx, dword ptr ullDividend+4
__asm div ulDivisor
}
__declspec(safebuffers)
BOOL CDECL PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
CHAR szFormat[1024];
DWORD dwFormat;
DWORD dwOutput;
va_list vaInput;
va_start(vaInput, lpFormat);
dwFormat = wvsprintf(szFormat, lpFormat, vaInput);
va_end(vaInput);
if ((dwFormat == 0UL)
|| !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
return FALSE;
return dwOutput == dwFormat;
}
__declspec(noreturn)
VOID CDECL mainCRTStartup(VOID)
{
DWORD dw, dwCPUID[12];
QWORD qwT0, qwT1, qwT2, qwT3;
QWORD qwTx, qwTy, qwTz;
QWORD ullQuotient, ullRemainder;
SQWORD llQuotient, llRemainder;
volatile
#ifdef HELPER
QWORD qwQuotient, qwRemainder;
QWORD qwDividend, qwDivisor = ~0ULL;
#else
QWORD qwQuotient;
QWORD qwDividend, qwDivisor = ~0ULL, qwRemainder;
#endif
// 2**64 / golden ratio
QWORD qwLeft = 0x9E3779B97F4A7C15ULL;
// bit-vector of prime numbers:
// 2**prime is set for each prime in [0, 63]
QWORD qwRight = 0x28208A20A08A28ACULL;
HANDLE hThread = GetCurrentThread();
HANDLE hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
if (hOutput == INVALID_HANDLE_VALUE)
ExitProcess(GetLastError());
if (SetThreadIdealProcessor(hThread, 0UL) == -1L)
ExitProcess(GetLastError());
__cpuid(dwCPUID, 0x80000000UL);
if (*dwCPUID >= 0x80000004UL)
{
__cpuid(dwCPUID, 0x80000002UL);
__cpuid(dwCPUID + 4, 0x80000003UL);
__cpuid(dwCPUID + 8, 0x80000004UL);
}
else
__movsb(dwCPUID, "unidentified processor", sizeof("unidentified processor"));
PrintFormat(hOutput, "\r\nTesting 64-bit division...\r\n");
for (dw = 0UL; dw < sizeof(ullTable) / sizeof(*ullTable); dw++)
{
PrintFormat(hOutput, "\r%lu", dw);
#ifndef HELPER
ullQuotient = __udivmoddi4(ullTable[dw].ullDividend, ullTable[dw].ullDivisor, &ullRemainder);
#else
ullQuotient = ullTable[dw].ullDividend / ullTable[dw].ullDivisor;
ullRemainder = ullTable[dw].ullDividend % ullTable[dw].ullDivisor;
#endif
if (ullQuotient != ullTable[dw].ullQuotient)
PrintFormat(hOutput,
"\t%I64u / %I64u:\a quotient %I64u not equal %I64u\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient, ullTable[dw].ullQuotient);
if (ullQuotient > ullTable[dw].ullDividend)
PrintFormat(hOutput,
"\t%I64u / %I64u:\a quotient %I64u greater dividend\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient);
if (ullRemainder != ullTable[dw].ullRemainder)
PrintFormat(hOutput,
"\t%I64u %% %I64u:\a remainder %I64u not equal %I64u\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);
if (ullRemainder >= ullTable[dw].ullDivisor)
PrintFormat(hOutput,
"\t%I64u %% %I64u:\a remainder %I64u not less divisor\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder);
}
PrintFormat(hOutput, "\r\nTesting unsigned 64-bit division...\r\n");
for (dw = 0UL; dw < sizeof(ullTable) / sizeof(*ullTable); dw++)
{
PrintFormat(hOutput, "\r%lu", dw);
#ifndef HELPER
ullQuotient = __udivdi3(ullTable[dw].ullDividend, ullTable[dw].ullDivisor);
#else
ullQuotient = ullTable[dw].ullDividend / ullTable[dw].ullDivisor;
#endif
if (ullQuotient != ullTable[dw].ullQuotient)
PrintFormat(hOutput,
"\t%I64u / %I64u:\a quotient %I64u not equal %I64u\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient, ullTable[dw].ullQuotient);
if (ullQuotient > ullTable[dw].ullDividend)
PrintFormat(hOutput,
"\t%I64u / %I64u:\a quotient %I64u greater dividend\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient);
#ifndef HELPER
ullRemainder = ullTable[dw].ullDividend - __muldi3(ullTable[dw].ullDivisor, ullQuotient);
#else
ullRemainder = ullTable[dw].ullDividend - ullTable[dw].ullDivisor * ullQuotient;
#endif
if (ullRemainder != ullTable[dw].ullRemainder)
PrintFormat(hOutput,
"\t%I64u / %I64u:\a remainder %I64u not equal %I64u\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);
#ifndef HELPER
ullRemainder = __umoddi3(ullTable[dw].ullDividend, ullTable[dw].ullDivisor);
#else
ullRemainder = ullTable[dw].ullDividend % ullTable[dw].ullDivisor;
#endif
if (ullRemainder != ullTable[dw].ullRemainder)
PrintFormat(hOutput,
"\t%I64u %% %I64u:\a remainder %I64u not equal %I64u\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);
if (ullRemainder >= ullTable[dw].ullDivisor)
PrintFormat(hOutput,
"\t%I64u %% %I64u:\a remainder %I64u not less divisor\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder);
}
PrintFormat(hOutput, "\r\nTesting signed 64-bit division...\r\n");
for (dw = 0UL; dw < sizeof(llTable) / sizeof(*llTable); dw++)
{
PrintFormat(hOutput, "\r%lu", dw);
#ifndef HELPER
llQuotient = __divdi3(llTable[dw].llDividend, llTable[dw].llDivisor);
#else
llQuotient = llTable[dw].llDividend / llTable[dw].llDivisor;
#endif
if (llQuotient != llTable[dw].llQuotient)
PrintFormat(hOutput,
"\t%I64d / %I64d:\a quotient %I64d not equal %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient, llTable[dw].llQuotient);
if ((llTable[dw].llDividend < 0LL) && (llQuotient < llTable[dw].llDividend)
|| (llTable[dw].llDividend >= 0LL) && (llQuotient > llTable[dw].llDividend))
PrintFormat(hOutput,
"\t%I64d / %I64d:\a quotient %I64d greater dividend\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient);
#ifndef HELPER
llRemainder = llTable[dw].llDividend - __muldi3(llTable[dw].llDivisor, llQuotient);
#else
llRemainder = llTable[dw].llDividend - llTable[dw].llDivisor * llQuotient;
#endif
if (llRemainder != llTable[dw].llRemainder)
PrintFormat(hOutput,
"\t%I64d / %I64d:\a remainder %I64d not equal %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);
if ((llRemainder != 0LL)
&& ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
PrintFormat(hOutput,
"\t%I64d / %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);
#ifndef HELPER
llRemainder = __moddi3(llTable[dw].llDividend, llTable[dw].llDivisor);
#else
llRemainder = llTable[dw].llDividend % llTable[dw].llDivisor;
#endif
if (llRemainder != llTable[dw].llRemainder)
PrintFormat(hOutput,
"\t%I64d %% %I64d:\a remainder %I64d not equal %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);
if ((llTable[dw].llDivisor < 0LL) && (llRemainder <= llTable[dw].llDivisor)
|| (llTable[dw].llDivisor > 0LL) && (llRemainder >= llTable[dw].llDivisor))
PrintFormat(hOutput,
"\t%I64d %% %I64d:\a remainder %I64d not less divisor\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder);
if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
PrintFormat(hOutput,
"\t%I64d %% %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);
}
PrintFormat(hOutput, "\r\nTiming 64-bit division on %.48hs\r\n", dwCPUID);
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT0))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from George Marsaglia
qwLeft ^= qwLeft << 14;
qwLeft ^= qwLeft >> 31;
qwLeft ^= qwLeft << 45;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0xAD93D23594C935A9
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
#ifndef HELPER
qwQuotient = __unopdi4(qwLeft, qwRight, NULL);
#else
qwQuotient = _aullnop(qwLeft, qwRight);
#endif
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from Richard Peirce Brent
qwRight ^= qwRight << 10;
qwRight ^= qwRight >> 15;
qwRight ^= qwRight << 4;
qwRight ^= qwRight >> 13;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0x2B5926535897936B
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
#ifndef HELPER
qwQuotient = __unopdi4(qwLeft, qwRight, &qwRemainder);
#else
qwQuotient = _aullnop(qwLeft, qwRight);
#endif
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT1))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from George Marsaglia
qwLeft ^= qwLeft << 14;
qwLeft ^= qwLeft >> 31;
qwLeft ^= qwLeft << 45;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0xAD93D23594C935A9
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
#ifndef HELPER
qwQuotient = __udivmoddi4(qwLeft, qwRight, NULL);
#else
qwQuotient = qwLeft / qwRight;
qwRemainder = qwLeft % qwRight;
#endif
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from Richard Peirce Brent
qwRight ^= qwRight << 10;
qwRight ^= qwRight >> 15;
qwRight ^= qwRight << 4;
qwRight ^= qwRight >> 13;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0x2B5926535897936B
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
#ifndef HELPER
qwQuotient = __udivmoddi4(qwLeft, qwRight, &qwRemainder);
#else
qwQuotient = qwLeft / qwRight;
qwRemainder = qwLeft % qwRight;
#endif
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT2))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from George Marsaglia
qwLeft ^= qwLeft << 14;
qwLeft ^= qwLeft >> 31;
qwLeft ^= qwLeft << 45;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0xAD93D23594C935A9
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
#ifndef HELPER
qwQuotient = __umuldi3(qwLeft, qwRight);
#else
qwQuotient = qwLeft * qwRight;
#endif
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from Richard Peirce Brent
qwRight ^= qwRight << 10;
qwRight ^= qwRight >> 15;
qwRight ^= qwRight << 4;
qwRight ^= qwRight >> 13;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0x2B5926535897936B
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
#ifndef HELPER
qwQuotient = __umuldi3(qwLeft, qwRight);
#else
qwQuotient = qwLeft * qwRight;
#endif
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT3))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
ExitProcess(GetLastError());
qwTz = qwT3 - qwT0;
qwT3 -= qwT2;
qwT2 -= qwT1;
qwT1 -= qwT0;
qwTy = qwT3 > qwT1 ? qwT3 - qwT1 : 0ULL;
qwTx = qwT2 - qwT1;
#ifndef HELPER
#ifdef CYCLES
PrintFormat(hOutput,
"\r\n"
"__unopdi4() %6lu.%09lu 0\r\n"
"__udivmoddi4() %6lu.%09lu %6lu.%09lu\r\n"
"__umuldi3() %6lu.%09lu %6lu.%09lu\r\n"
" %6lu.%09lu clock cycles\r\n",
__edivmodu(qwT1, 1000000000UL),
__edivmodu(qwT2, 1000000000UL),
__edivmodu(qwTx, 1000000000UL),
__edivmodu(qwT3, 1000000000UL),
__edivmodu(qwTy, 1000000000UL),
__edivmodu(qwTz, 1000000000UL));
#else
PrintFormat(hOutput,
"\r\n"
"__unopdi4() %6lu.%07lu 0\r\n"
"__udivmoddi4() %6lu.%07lu %6lu.%07lu\r\n"
"__umuldi3() %6lu.%07lu %6lu.%07lu\r\n"
" %6lu.%07lu nano-seconds\r\n",
__edivmodu(qwT1, 10000000UL),
__edivmodu(qwT2, 10000000UL),
__edivmodu(qwTx, 10000000UL),
__edivmodu(qwT3, 10000000UL),
__edivmodu(qwTy, 10000000UL),
__edivmodu(qwTz, 10000000UL));
#endif // CYCLES
#else // HELPER
#ifdef CYCLES
PrintFormat(hOutput,
"\r\n"
"_aullnop() %6lu.%09lu 0\r\n"
"_aulldvrm() %6lu.%09lu %6lu.%09lu\r\n"
"_aullmul() %6lu.%09lu %6lu.%09lu\r\n"
" %6lu.%09lu clock cycles\r\n",
__edivmodu(qwT1, 1000000000UL),
__edivmodu(qwT2, 1000000000UL),
__edivmodu(qwTx, 1000000000UL),
__edivmodu(qwT3, 1000000000UL),
__edivmodu(qwTy, 1000000000UL),
__edivmodu(qwTz, 1000000000UL));
#else
PrintFormat(hOutput,
"\r\n"
"_aullnop() %6lu.%07lu 0\r\n"
"_aulldvrm() %6lu.%07lu %6lu.%07lu\r\n"
"_aullmul() %6lu.%07lu %6lu.%07lu\r\n"
" %6lu.%07lu nano-seconds\r\n",
__edivmodu(qwT1, 10000000UL),
__edivmodu(qwT2, 10000000UL),
__edivmodu(qwTx, 10000000UL),
__edivmodu(qwT3, 10000000UL),
__edivmodu(qwTy, 10000000UL),
__edivmodu(qwTz, 10000000UL));
#endif // CYCLES
#endif // HELPER
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT0))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT0))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from George Marsaglia
qwLeft ^= qwLeft << 14;
qwLeft ^= qwLeft >> 31;
qwLeft ^= qwLeft << 45;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0xAD93D23594C935A9
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
#ifndef HELPER
qwQuotient = __unopdi4(qwDividend, qwDivisor, NULL);
#else
qwQuotient = _aullnop(qwDividend, qwDivisor);
#endif
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from Richard Peirce Brent
qwRight ^= qwRight << 10;
qwRight ^= qwRight >> 15;
qwRight ^= qwRight << 4;
qwRight ^= qwRight >> 13;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0x2B5926535897936B
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
#ifndef HELPER
qwQuotient = __unopdi4(qwDividend, qwDivisor, &qwRemainder);
#else
qwQuotient = _aullnop(qwDividend, qwDivisor);
#endif
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT1))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT1))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from George Marsaglia
qwLeft ^= qwLeft << 14;
qwLeft ^= qwLeft >> 31;
qwLeft ^= qwLeft << 45;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0xAD93D23594C935A9
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
#ifndef HELPER
qwQuotient = __udivmoddi4(qwDividend, qwDivisor, NULL);
#else
qwQuotient = qwDividend / qwDivisor;
qwRemainder = qwDividend % qwDivisor;
#endif
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from Richard Peirce Brent
qwRight ^= qwRight << 10;
qwRight ^= qwRight >> 15;
qwRight ^= qwRight << 4;
qwRight ^= qwRight >> 13;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0x2B5926535897936B
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
#ifndef HELPER
qwQuotient = __udivmoddi4(qwDividend, qwDivisor, &qwRemainder);
#else
qwQuotient = qwDividend / qwDivisor;
qwRemainder = qwDividend % qwDivisor;
#endif
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT2))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT2))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from George Marsaglia
qwLeft ^= qwLeft << 14;
qwLeft ^= qwLeft >> 31;
qwLeft ^= qwLeft << 45;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0xAD93D23594C935A9
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
#endif
qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
#ifndef HELPER
qwQuotient = __umuldi3(qwDividend, qwDivisor);
#else
qwQuotient = qwDividend * qwDivisor;
#endif
#ifdef XORSHIFT
// 64-bit linear feedback shift register (XorShift form)
// using shift constants from Richard Peirce Brent
qwRight ^= qwRight << 10;
qwRight ^= qwRight >> 15;
qwRight ^= qwRight << 4;
qwRight ^= qwRight >> 13;
#else
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0x2B5926535897936B
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
#endif
qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
#ifndef HELPER
qwQuotient = __umuldi3(qwDividend, qwDivisor);
#else
qwQuotient = qwDividend * qwDivisor;
#endif
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT3))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwTx, (LPFILETIME) &qwTy, (LPFILETIME) &qwTz, (LPFILETIME) &qwT3))
#endif
ExitProcess(GetLastError());
qwTz = qwT3 - qwT0;
qwT3 -= qwT2;
qwT2 -= qwT1;
qwT1 -= qwT0;
qwTy = qwT3 > qwT1 ? qwT3 - qwT1 : 0ULL;
qwTx = qwT2 - qwT1;
#ifndef HELPER
#ifdef CYCLES
PrintFormat(hOutput,
"\r\n"
"__unopdi4() %6lu.%09lu 0\r\n"
"__udivmoddi4() %6lu.%09lu %6lu.%09lu\r\n"
"__umuldi3() %6lu.%09lu %6lu.%09lu\r\n"
" %6lu.%09lu clock cycles\r\n",
__edivmodu(qwT1, 1000000000UL),
__edivmodu(qwT2, 1000000000UL),
__edivmodu(qwTx, 1000000000UL),
__edivmodu(qwT3, 1000000000UL),
__edivmodu(qwTy, 1000000000UL),
__edivmodu(qwTz, 1000000000UL));
#else
PrintFormat(hOutput,
"\r\n"
"__unopdi4() %6lu.%07lu 0\r\n"
"__udivmoddi4() %6lu.%07lu %6lu.%07lu\r\n"
"__umuldi3() %6lu.%07lu %6lu.%07lu\r\n"
" %6lu.%07lu nano-seconds\r\n",
__edivmodu(qwT1, 10000000UL),
__edivmodu(qwT2, 10000000UL),
__edivmodu(qwTx, 10000000UL),
__edivmodu(qwT3, 10000000UL),
__edivmodu(qwTy, 10000000UL),
__edivmodu(qwTz, 10000000UL));
#endif // CYCLES
#else // HELPER
#ifdef CYCLES
PrintFormat(hOutput,
"\r\n"
"_aullnop() %6lu.%09lu 0\r\n"
"_aulldvrm() %6lu.%09lu %6lu.%09lu\r\n"
"_aullmul() %6lu.%09lu %6lu.%09lu\r\n"
" %6lu.%09lu clock cycles\r\n",
__edivmodu(qwT1, 1000000000UL),
__edivmodu(qwT2, 1000000000UL),
__edivmodu(qwTx, 1000000000UL),
__edivmodu(qwT3, 1000000000UL),
__edivmodu(qwTy, 1000000000UL),
__edivmodu(qwTz, 1000000000UL));
#else
PrintFormat(hOutput,
"\r\n"
"_aullnop() %6lu.%07lu 0\r\n"
"_aulldvrm() %6lu.%07lu %6lu.%07lu\r\n"
"_aullmul() %6lu.%07lu %6lu.%07lu\r\n"
" %6lu.%07lu nano-seconds\r\n",
__edivmodu(qwT1, 10000000UL),
__edivmodu(qwT2, 10000000UL),
__edivmodu(qwTx, 10000000UL),
__edivmodu(qwT3, 10000000UL),
__edivmodu(qwTy, 10000000UL),
__edivmodu(qwTz, 10000000UL));
#endif // CYCLES
#endif // HELPER
ExitProcess(0UL);
}
DWORD_PTR __security_cookie = 3141592654UL; // π * 10**9
extern LPVOID __safe_se_handler_table[];
extern BYTE __safe_se_handler_count;
const IMAGE_LOAD_CONFIG_DIRECTORY32 _load_config_used = {sizeof(_load_config_used),
'DATE', // = 2006-04-15 20:15:01 UTC
_MSC_VER / 100, _MSC_VER % 100,
0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL,
0U, 0U,
0UL,
&__security_cookie,
__safe_se_handler_table,
&__safe_se_handler_count};
__declspec(naked)
VOID __fastcall __security_check_cookie(DWORD_PTR _stackcookie)
{
__asm
{
cmp ecx, __security_cookie
jne corrupt
ret
corrupt:
ud2
}
}
Save this C source as 64-i386.c in an arbitrary, preferably empty directory, save the 16 32-bit assembler sources presented above as alldiv.asm, alldvrm.asm, allmul.asm, allrem.asm, allshl.asm, allshr.asm, aulldiv.asm, aulldvrm.asm, aullrem.asm, aullshr.asm, divdi3.asm, moddi3.asm, muldi3.asm, udivdi3.asm, umoddi3.asm and udivmoddi4.asm respectively there too, and optionally copy clang_rt.builtins-i386.lib from an installation of LLVM’s Clang. Then start the command prompt of the Windows software development kit for the I386 platform in this directory and run the following command lines to build the benchmark programs 64-i386.exe, 64-helper.exe, 64-msft.exe and optionally 64-llvm.exe, and execute them:
CL.EXE /Brepro /c /DCYCLES /GAFwy /O2y /W4 /Zl 64-i386.c
ML.EXE /Brepro /Cp /Cx /c /W3 /X divdi3.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X moddi3.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X muldi3.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X udivdi3.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X umoddi3.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X udivmoddi4.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X alldiv.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X alldvrm.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X allmul.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X allrem.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X allshl.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X allshr.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X aulldiv.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X aulldvrm.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X aullrem.asm
ML.EXE /Brepro /Cp /Cx /c /W3 /X aullshr.asm
LINK.EXE /LIB /BREPRO /MACHINE:I386 /NODEFAULTLIB /OUT:64-i386.lib divdi3.obj moddi3.obj muldi3.obj udivdi3.obj umoddi3.obj udivmoddi4.obj alldiv.obj alldvrm.obj allmul.obj allrem.obj allshl.obj allshr.obj aulldiv.obj aulldvrm.obj aullrem.obj aullshr.obj
LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:I386 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-i386.exe /RELEASE /SUBSYSTEM:CONSOLE 64-i386.obj 64-i386.lib kernel32.lib user32.lib
CL.EXE /Brepro /c /DCYCLES /DHELPER /GAFwy /O2y /W4 /Zl 64-i386.c
LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:I386 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-helper.exe /RELEASE /SUBSYSTEM:CONSOLE 64-i386.obj 64-i386.lib kernel32.lib user32.lib
LINK.EXE /LIB /BREPRO /DEF /EXPORT:_alldiv /EXPORT:_alldvrm /EXPORT:_allmul /EXPORT:_allrem /EXPORT:_allshl /EXPORT:_allshr /EXPORT:_aulldiv /EXPORT:_aulldvrm /EXPORT:_aullrem /EXPORT:_aullshr /MACHINE:I386 /NAME:NTDLL /NODEFAULTLIB /OUT:64-msft.lib
LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:I386 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-msft.exe /RELEASE /SUBSYSTEM:CONSOLE 64-i386.obj 64-msft.lib kernel32.lib user32.lib
IF EXIST clang_rt.builtins-i386.lib LINK.EXE /LINK /BREPRO /DYNAMICBASE /ENTRY:mainCRTStartup /MACHINE:I386 /NOCOFFGRPINFO /NODEFAULTLIB /NXCOMPAT /OPT:REF /OUT:64-llvm.exe /RELEASE /SUBSYSTEM:CONSOLE 64-i386.obj clang_rt.builtins-i386.lib kernel32.lib user32.lib
.\64-i386.exe
.\64-helper.exe
.\64-msft.exe
IF EXIST 64-llvm.exe .\64-llvm.exe
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: all 32-bit programs are pure Win32 console applications and build without the MSVCRT libraries.
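The three self-test loops in the listing above compare each quotient and remainder with the table values and with the defining properties of truncating division. The following portable sketch restates those properties with the standard C library, which the benchmark itself deliberately avoids. The names magnitude(), check_unsigned(), check_signed() and demo_checks() are hypothetical and not part of the listing, and evaluating the identity modulo 2**64 is an assumption made here so that the wrapped result of llmin / -1 from the table is accepted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// hypothetical helper: magnitude of a signed value, computed without overflow
static uint64_t magnitude(int64_t value)
{
    return value < 0 ? UINT64_C(0) - (uint64_t) value : (uint64_t) value;
}

// unsigned division: dividend == quotient * divisor + remainder,
// quotient not greater than dividend, remainder less than divisor
static bool check_unsigned(uint64_t dividend, uint64_t divisor,
                           uint64_t quotient, uint64_t remainder)
{
    return divisor != 0
        && quotient <= dividend
        && remainder < divisor
        && dividend - quotient * divisor == remainder;
}

// signed (truncating) division: same identity, evaluated modulo 2**64
// (assumption); a non-zero remainder carries the sign of the dividend,
// and its magnitude is less than the magnitude of the divisor
static bool check_signed(int64_t dividend, int64_t divisor,
                         int64_t quotient, int64_t remainder)
{
    return divisor != 0
        && (uint64_t) dividend - (uint64_t) quotient * (uint64_t) divisor
           == (uint64_t) remainder
        && magnitude(remainder) < magnitude(divisor)
        && (remainder == 0 || (remainder < 0) == (dividend < 0));
}

// usage sketch: a truncated result passes, a floored result does not
void demo_checks(void)
{
    assert(check_unsigned(7, 2, 3, 1));
    assert(check_signed(-7, 2, -3, -1));
    assert(!check_signed(-7, 2, -4, 1));
}
```

The sign test mirrors the one in the listing, which compares the sign of a non-zero remainder with the sign of the dividend.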
Note: the trivial transformation of the assembler sources with directives for Unix’ or GNU’s as into assembler sources for Microsoft’s ML.EXE is left as an exercise to the reader. For reference see the Microsoft Macro Assembler Reference.
Note: linking the program 64-msft.exe with the compiler helper routines built from their source code blcrtasm.asm is also left as an exercise to the reader.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
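The timing loops in the listing above feed the routines with operands from two 64-bit linear feedback shift registers in Galois form, one shifting left and one shifting right. The following sketch restates both steps in portable C. The function names lfsr_left(), lfsr_right() and demo_lfsr() are hypothetical, while the two feedback constants are those named in the comments of the listing. Unlike the listing, which builds the mask of the left-shifting register with a right shift of a signed value, the sketch derives both masks from unsigned arithmetic only.

```c
#include <assert.h>
#include <stdint.h>

// one step of the left-shifting register: the feedback constant is
// applied when the most significant bit is shifted out
static uint64_t lfsr_left(uint64_t state)
{
    uint64_t mask = UINT64_C(0) - (state >> 63);    // all ones or zero
    return (state << 1) ^ (mask & UINT64_C(0xAD93D23594C935A9));
}

// one step of the right-shifting register: the feedback constant is
// applied when the least significant bit is shifted out
static uint64_t lfsr_right(uint64_t state)
{
    uint64_t mask = UINT64_C(0) - (state & 1);      // all ones or zero
    return (state >> 1) ^ (mask & UINT64_C(0x2B5926535897936B));
}

// usage sketch: the bit shifted out selects the feedback constant
void demo_lfsr(void)
{
    assert(lfsr_left(UINT64_C(0x8000000000000000)) == UINT64_C(0xAD93D23594C935A9));
    assert(lfsr_right(1) == UINT64_C(0x2B5926535897936B));
}
```

Both registers stay at zero once they reach zero, so the seeds must be non-zero; the listing starts from 0x9E3779B97F4A7C15 and 0x28208A20A08A28AC.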
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
64-i386.c
64-i386.c(823) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD_PTR *'
64-i386.c(824) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'LPVOID *'
64-i386.c(825) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'BYTE *'
64-i386.c(828) : warning C4100: '_stackcookie' : unreferenced formal parameter
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: divdi3.asm
…
Assembling: udivmoddi4.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: alldiv.asm
…
Assembling: aullshr.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
64-i386.c
64-i386.c(150) : warning C4100: 'right' : unreferenced formal parameter
64-i386.c(150) : warning C4100: 'left' : unreferenced formal parameter
64-i386.c(823) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD_PTR *'
64-i386.c(824) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'LPVOID *'
64-i386.c(825) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'BYTE *'
64-i386.c(828) : warning C4100: '_stackcookie' : unreferenced formal parameter
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
Microsoft (R) Program Maintenance Utility Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Creating library 64-msft.lib and object 64-msft.exp
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

__unopdi4()         12.849202656       0
__udivmoddi4()      37.561920358      24.712717702
__umuldi3()         14.358287749       1.509085093
                    64.769410763 clock cycles

__unopdi4()         18.308137879       0
__udivmoddi4()      37.448476732      19.140338853
__umuldi3()         19.587959635       1.279821756
                    75.344574246 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

_aullnop()           9.108604673       0
_aulldvrm()         39.178505498      30.069900825
_aullmul()          14.272042690       5.163438017
                    62.559152861 clock cycles

_aullnop()          14.043325395       0
_aulldvrm()         38.404302453      24.360977058
_aullmul()          19.309414816       5.266089421
                    71.757042664 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

_aullnop()           9.005029514       0
_aulldvrm()        145.500002260     136.494972746
_aullmul()          17.647885499       8.642855985
                   172.152917273 clock cycles

_aullnop()          13.827490013       0
_aulldvrm()        111.386159799      97.558669786
_aullmul()          22.663936806       8.836446793
                   147.877586618 clock cycles

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

__unopdi4()         12.857257576       0
__udivmoddi4()      94.499937193      81.642679617
__umuldi3()         30.708206573      17.850948997
                   138.065401342 clock cycles

__unopdi4()         17.108234965       0
__udivmoddi4()     161.966266965     144.858032000
__umuldi3()         35.101783471      17.993548506
                   214.176285401 clock cycles

Also without the preprocessor macro CYCLES defined:
[…]

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

__unopdi4()          4.5864294       0
__udivmoddi4()      15.5064994      10.9200700
__umuldi3()          5.6004359       1.0140065
                    25.6933647 nano-seconds

__unopdi4()          7.1760460       0
__udivmoddi4()      14.9760960       7.8000500
__umuldi3()          7.7376496       0.5616036
                    29.8897916 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

_aullnop()           3.6660235       0
_aulldvrm()         15.8029013      12.1368778
_aullmul()           5.5380355       1.8720120
                    25.0069603 nano-seconds

_aullnop()           5.9592382       0
_aulldvrm()         15.4752992       9.5160610
_aullmul()           7.9716511       2.0124129
                    29.4061885 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

_aullnop()           3.6660235       0
_aulldvrm()         58.4691748      54.8031513
_aullmul()           7.2696466       3.6036231
                    69.4048449 nano-seconds

_aullnop()           5.7564369       0
_aulldvrm()         44.0546824      38.2982455
_aullmul()           9.3132597       3.5568228
                    59.1243790 nano-seconds

Testing 64-bit division...
57
Testing unsigned 64-bit division...
57
Testing signed 64-bit division...
43
Timing 64-bit division on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz

__unopdi4()          5.1480330       0
__udivmoddi4()      37.3934397      32.2454067
__umuldi3()         12.4176796       7.2696466
                    54.9591523 nano-seconds

__unopdi4()          6.8640440       0
__udivmoddi4()      64.9432163      58.0791723
__umuldi3()         14.1648908       7.3008468
                    85.9721511 nano-seconds
[…]
Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz
__unopdi4()     8.710364463  0
__udivmoddi4()  29.568165444  20.857800981
__umuldi3()     10.016409737  1.306045274
                48.294939644 clock cycles

__unopdi4()     11.899356861  0
__udivmoddi4()  31.305810062  19.406453201
__umuldi3()     14.074341743  2.174984882
                57.279508666 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz
_aullnop()      6.281171716  0
_aulldvrm()     30.299316500  24.018144784
_aullmul()      10.415490092  4.134318376
                46.995978308 clock cycles

_aullnop()      10.305468488  0
_aulldvrm()     29.560666513  19.255198025
_aullmul()      15.232518004  4.927049516
                55.098653005 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz
_aullnop()      6.282753357  0
_aulldvrm()     130.221916499  123.939163142
_aullmul()      12.560291961  6.277538604
                149.064961817 clock cycles

_aullnop()      10.305609251  0
_aulldvrm()     93.916607827  83.610998576
_aullmul()      17.949963126  7.644353875
                122.172180204 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz
__unopdi4()     8.794296716  0
__udivmoddi4()  58.334799420  49.540502704
__umuldi3()     16.971398673  8.177101957
                84.100494809 clock cycles

__unopdi4()     11.806963851  0
__udivmoddi4()  108.598490949  96.791527098
__umuldi3()     22.271070710  10.464106859
                142.676525510 clock cycles
[…]
Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz
__unopdi4()     9.513493832  0
__udivmoddi4()  28.904259242  19.390765410
__umuldi3()     9.111766044  0.000000000
                47.529519118 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i5-6600 CPU @ 3.30GHz
_aullnop()      8.466770864  0
_aulldvrm()     133.568853734  125.102082870
_aullmul()      13.159542118  4.692771254
                155.195166716 clock cycles
[…]
Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 57
Timing 64-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz
__unopdi4()     8.176246060  0
__udivmoddi4()  24.540802967  16.364556907
__umuldi3()     8.774901071  0.598655011
                41.491950098 clock cycles

__unopdi4()     10.752357791  0
__udivmoddi4()  24.479256622  13.726898831
__umuldi3()     12.175662023  1.423304232
                47.407276436 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz
_aullnop()      6.042327017  0
_aulldvrm()     24.822108405  18.779781388
_aullmul()      8.850690256  2.808363239
                39.715125678 clock cycles

_aullnop()      9.036137407  0
_aulldvrm()     24.078298463  15.042161056
_aullmul()      12.182378249  3.146240842
                45.296814119 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz
_aullnop()      6.043049544  0
_aulldvrm()     121.360828766  115.317779222
_aullmul()      11.284504325  5.241454781
                138.688382635 clock cycles

_aullnop()      9.038334480  0
_aulldvrm()     87.144452426  78.106117946
_aullmul()      14.460059957  5.421725477
                110.642846863 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz
__unopdi4()     8.182234986  0
__udivmoddi4()  49.594440527  41.412205541
__umuldi3()     15.480297393  7.298062407
                73.256972906 clock cycles

__unopdi4()     10.785032002  0
__udivmoddi4()  93.296232493  82.511200491
__umuldi3()     19.044985770  8.259953768
                123.126250265 clock cycles
[…]
Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
__unopdi4()     4.758041439  0
__udivmoddi4()  14.900456178  10.142414739
__umuldi3()     5.118839780  0.360798341
                24.777337397 clock cycles

__unopdi4()     6.264035993  0
__udivmoddi4()  14.991681122  8.727645129
__umuldi3()     7.116819579  0.852783586
                28.372536694 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
_aullnop()      3.511911560  0
_aulldvrm()     14.931006596  11.419095036
_aullmul()      5.185640855  1.673729295
                23.628559011 clock cycles

_aullnop()      5.267329959  0
_aulldvrm()     14.287518025  9.020188066
_aullmul()      7.649375365  2.382045406
                27.204223349 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
_aullnop()      3.888929085  0
_aulldvrm()     75.630752291  71.741823206
_aullmul()      7.039982148  3.151053063
                86.559663524 clock cycles

_aullnop()      5.706960149  0
_aulldvrm()     51.744648850  46.037688701
_aullmul()      8.437223071  2.730262922
                65.888832070 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
__unopdi4()     4.761309343  0
__udivmoddi4()  30.499168861  25.737859518
__umuldi3()     9.146076148  4.384766805
                44.406554352 clock cycles

__unopdi4()     6.320688165  0
__udivmoddi4()  58.207913400  51.887225235
__umuldi3()     11.342916480  5.022228315
                75.871518045 clock cycles
[…]
Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
__unopdi4()     7.677882315  0
__udivmoddi4()  23.667828663  15.989946348
__umuldi3()     7.422475230  0.000000000
                38.768186208 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on Intel(R) Core(TM) i5-9500 CPU @ 3.00GHz
_aullnop()      5.841766912  0
_aulldvrm()     106.658152285  100.816385373
_aullmul()      10.760192090  4.918425178
                123.260111287 clock cycles
[…]
Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G
__unopdi4()     14.243392667  0
__udivmoddi4()  55.997587943  41.754195276
__umuldi3()     13.500837936  0.000000000
                83.741818546 clock cycles

__unopdi4()     17.199216332  0
__udivmoddi4()  45.874249502  28.675033170
__umuldi3()     18.633292583  1.434076251
                81.706758417 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G
_aullnop()      9.027738059  0
_aulldvrm()     120.796853280  111.769115221
_aullmul()      15.186058308  6.158320249
                145.010649647 clock cycles

_aullnop()      15.578224772  0
_aulldvrm()     90.215115103  74.636890331
_aullmul()      21.860576148  6.282351376
                127.653916023 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G
_aullnop()      10.246997781  0
_aulldvrm()     54.808625176  44.561627395
_aullmul()      13.902460030  3.655462249
                78.958082987 clock cycles

_aullnop()      15.956642108  0
_aulldvrm()     47.420239312  31.463597204
_aullmul()      22.055934131  6.099292023
                85.432815551 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G
__unopdi4()     11.401600724  0
__udivmoddi4()  90.877376326  79.475775602
__umuldi3()     24.230306820  12.828706096
                126.509283870 clock cycles

__unopdi4()     16.350275112  0
__udivmoddi4()  181.099752347  164.749477235
__umuldi3()     28.889429738  12.539154626
                226.339457197 clock cycles
[…]
Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G
__unopdi4()     5.1406250  0
__udivmoddi4()  20.2656250  15.1250000
__umuldi3()     5.7343750  0.5937500
                31.1406250 nano-seconds

__unopdi4()     7.1718750  0
__udivmoddi4()  19.9375000  12.7656250
__umuldi3()     8.0468750  0.8750000
                35.1562500 nano-seconds

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G
_aullnop()      3.9531250  0
_aulldvrm()     20.4062500  16.4531250
_aullmul()      6.0468750  2.0937500
                30.4062500 nano-seconds

_aullnop()      6.8281250  0
_aulldvrm()     20.7812500  13.9531250
_aullmul()      8.8281250  2.0000000
                36.4375000 nano-seconds

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G
_aullnop()      3.9375000  0
_aulldvrm()     49.3437500  45.4062500
_aullmul()      6.6406250  2.7031250
                59.9218750 nano-seconds

_aullnop()      7.0781250  0
_aulldvrm()     42.0000000  34.9218750
_aullmul()      9.4843750  2.4062500
                58.5625000 nano-seconds

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD A4-9125 RADEON R3, 4 COMPUTE CORES 2C+2G
__unopdi4()     5.1093750  0
__udivmoddi4()  39.4843750  34.3750000
__umuldi3()     10.5468750  5.4375000
                55.1406250 nano-seconds

__unopdi4()     7.2656250  0
__udivmoddi4()  69.2343750  61.9687500
__umuldi3()     12.7656250  5.5000000
                89.2656250 nano-seconds
[…]
Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 7 2700X Eight-Core Processor
__unopdi4()     8.637489867  0
__udivmoddi4()  27.828655455  19.191165588
__umuldi3()     9.761457334  1.123967467
                46.227602656 clock cycles

__unopdi4()     11.229091635  0
__udivmoddi4()  26.703517279  15.474425644
__umuldi3()     12.675702170  1.446610535
                50.608311084 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 7 2700X Eight-Core Processor
_aullnop()      6.031238132  0
_aulldvrm()     27.804057740  21.772819608
_aullmul()      10.548285859  4.517047727
                44.383581731 clock cycles

_aullnop()      9.489672570  0
_aulldvrm()     27.331796039  17.842123469
_aullmul()      11.909754514  2.420081944
                48.731223123 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 7 2700X Eight-Core Processor
_aullnop()      6.040463934  0
_aulldvrm()     88.367491909  82.327027975
_aullmul()      10.869423368  4.828959434
                105.277379211 clock cycles

_aullnop()      9.492139142  0
_aulldvrm()     67.661042025  58.168902883
_aullmul()      14.151106441  4.658967299
                91.304287608 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 7 2700X Eight-Core Processor
__unopdi4()     8.218757132  0
__udivmoddi4()  68.973908219  60.755151087
__umuldi3()     16.682145911  8.463388779
                93.874811262 clock cycles

__unopdi4()     11.236472949  0
__udivmoddi4()  125.545346907  114.308873958
__umuldi3()     20.463970255  9.227497306
                157.245790111 clock cycles
[…]
Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor
__unopdi4()     8.156522700  0
__udivmoddi4()  26.587471432  18.430948732
__umuldi3()     8.227731420  0.071208720
                42.971725552 clock cycles

__unopdi4()     11.171514133  0
__udivmoddi4()  24.479854346  13.308340213
__umuldi3()     11.409089652  0.237575519
                47.060458131 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor
_aullnop()      6.010190784  0
_aulldvrm()     26.050758627  20.040567843
_aullmul()      8.197509891  2.187319107
                40.258459302 clock cycles

_aullnop()      9.910357023  0
_aulldvrm()     24.299246162  14.388889139
_aullmul()      11.622902328  1.712545305
                45.832505513 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor
_aullnop()      6.010334245  0
_aulldvrm()     79.145175680  73.134841435
_aullmul()      9.026602597  3.016268352
                94.182112522 clock cycles

_aullnop()      9.909955188  0
_aulldvrm()     53.473960584  43.564005396
_aullmul()      13.012305555  3.102350367
                76.396221327 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor
__unopdi4()     8.154862312  0
__udivmoddi4()  64.296126258  56.141263946
__umuldi3()     14.782954357  6.628092045
                87.233942927 clock cycles

__unopdi4()     11.159903449  0
__udivmoddi4()  115.862991234  104.703087785
__umuldi3()     18.352062682  7.192159233
                145.374957365 clock cycles
[…]
Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor
__unopdi4()     2.2656250  0
__udivmoddi4()  7.4062500  5.1406250
__umuldi3()     2.2812500  0.0156250
                11.9531250 nano-seconds

__unopdi4()     3.1093750  0
__udivmoddi4()  6.7812500  3.6718750
__umuldi3()     3.2343750  0.1250000
                13.1250000 nano-seconds

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor
_aullnop()      1.6718750  0
_aulldvrm()     7.2500000  5.5781250
_aullmul()      2.2812500  0.6093750
                11.2031250 nano-seconds

_aullnop()      2.7343750  0
_aulldvrm()     6.7812500  4.0468750
_aullmul()      3.2343750  0.5000000
                12.7500000 nano-seconds

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor
_aullnop()      1.6718750  0
_aulldvrm()     22.0156250  20.3437500
_aullmul()      2.5156250  0.8437500
                26.2031250 nano-seconds

_aullnop()      2.7500000  0
_aulldvrm()     14.9218750  12.1718750
_aullmul()      3.6093750  0.8593750
                21.2812500 nano-seconds

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 5 3600 6-Core Processor
__unopdi4()     2.2656250  0
__udivmoddi4()  18.0000000  15.7343750
__umuldi3()     4.2968750  2.0312500
                24.5625000 nano-seconds

__unopdi4()     3.1093750  0
__udivmoddi4()  32.2812500  29.1718750
__umuldi3()     5.1875000  2.0781250
                40.5781250 nano-seconds
[…]
Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor
__unopdi4()     7.820339238  0
__udivmoddi4()  25.219830234  17.399490996
__umuldi3()     7.896016504  0.075677266
                40.936185976 clock cycles

__unopdi4()     10.683594514  0
__udivmoddi4()  23.327167298  12.643572784
__umuldi3()     10.966053464  0.282458950
                44.976815276 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor
_aullnop()      5.782054086  0
_aulldvrm()     24.732162011  18.950107925
_aullmul()      7.867011224  2.084957138
                38.381227321 clock cycles

_aullnop()      9.460103500  0
_aulldvrm()     23.216316090  13.756212590
_aullmul()      11.151011086  1.690907586
                43.827430676 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor
_aullnop()      5.815668129  0
_aulldvrm()     75.679497601  69.863829472
_aullmul()      9.474285823  3.658617694
                90.969451553 clock cycles

_aullnop()      9.458013729  0
_aulldvrm()     50.089870151  40.631856422
_aullmul()      12.508264043  3.050250314
                72.056147923 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor
__unopdi4()     7.817188050  0
__udivmoddi4()  61.087041419  53.269853369
__umuldi3()     14.145779236  6.328591186
                83.050008705 clock cycles

__unopdi4()     10.659368382  0
__udivmoddi4()  109.785429868  99.126061486
__umuldi3()     17.585432466  6.926064084
                138.030230716 clock cycles
[…]
Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor
__unopdi4()     2.0781250  0
__udivmoddi4()  6.6562500  4.5781250
__umuldi3()     2.0937500  0.0156250
                10.8281250 nano-seconds

__unopdi4()     2.8125000  0
__udivmoddi4()  6.2343750  3.4218750
__umuldi3()     2.9843750  0.1718750
                12.0312500 nano-seconds

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor
_aullnop()      1.5312500  0
_aulldvrm()     6.5312500  5.0000000
_aullmul()      2.1718750  0.6406250
                10.2343750 nano-seconds

_aullnop()      2.5000000  0
_aulldvrm()     6.1250000  3.6250000
_aullmul()      2.9375000  0.4375000
                11.5625000 nano-seconds

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor
_aullnop()      1.5000000  0
_aulldvrm()     19.7187500  18.2187500
_aullmul()      2.5000000  1.0000000
                23.7187500 nano-seconds

_aullnop()      2.4843750  0
_aulldvrm()     13.2031250  10.7187500
_aullmul()      3.2812500  0.7968750
                18.9687500 nano-seconds

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD Ryzen 9 3900XT 12-Core Processor
__unopdi4()     2.0625000  0
__udivmoddi4()  16.2343750  14.1718750
__umuldi3()     3.8906250  1.8281250
                22.1875000 nano-seconds

__unopdi4()     2.8125000  0
__udivmoddi4()  29.0781250  26.2656250
__umuldi3()     4.7187500  1.9062500
                36.6093750 nano-seconds
[…]
Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor
__unopdi4()     5.783152900  0
__udivmoddi4()  15.055998960  9.272846060
__umuldi3()     5.495972020  0.000000000
                26.335123880 clock cycles

__unopdi4()     7.998956160  0
__udivmoddi4()  15.909793620  7.910837460
__umuldi3()     7.757748040  0.000000000
                31.666497820 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor
_aullnop()      4.537734560  0
_aulldvrm()     15.874445940  11.336711380
_aullmul()      6.147943420  1.610208860
                26.560123920 clock cycles

_aullnop()      6.793898680  0
_aulldvrm()     16.717796680  9.923898000
_aullmul()      9.055848460  2.261949780
                32.567543820 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor
_aullnop()      4.527671860  0
_aulldvrm()     51.431490520  46.903818660
_aullmul()      6.800312040  2.272640180
                62.759474420 clock cycles

_aullnop()      6.791780420  0
_aulldvrm()     34.746860380  27.955079960
_aullmul()      9.737921840  2.946141420
                51.276562640 clock cycles

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor
__unopdi4()     5.787192880  0
__udivmoddi4()  39.760556600  33.973363720
__umuldi3()     10.045714200  4.258521320
                55.593463680 clock cycles

__unopdi4()     7.803949780  0
__udivmoddi4()  70.401916480  62.597966700
__umuldi3()     12.308324680  4.504374900
                90.514190940 clock cycles
[…]
Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor
__unopdi4()     2.7968750  0
__udivmoddi4()  7.5000000  4.7031250
__umuldi3()     2.7968750  0.0000000
                13.0937500 nano-seconds

__unopdi4()     4.1406250  0
__udivmoddi4()  7.9843750  3.8437500
__umuldi3()     3.9687500  0.0000000
                16.0937500 nano-seconds

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor
_aullnop()      2.3437500  0
_aulldvrm()     8.1875000  5.8437500
_aullmul()      3.2031250  0.8593750
                13.7343750 nano-seconds

_aullnop()      3.5468750  0
_aulldvrm()     8.7500000  5.2031250
_aullmul()      4.7187500  1.1718750
                17.0156250 nano-seconds

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor
_aullnop()      2.3593750  0
_aulldvrm()     26.7656250  24.4062500
_aullmul()      3.5468750  1.1875000
                32.6718750 nano-seconds

_aullnop()      3.5468750  0
_aulldvrm()     17.9843750  14.4375000
_aullmul()      5.0937500  1.5468750
                26.6250000 nano-seconds

Testing 64-bit division... 57
Testing unsigned 64-bit division... 57
Testing signed 64-bit division... 43
Timing 64-bit division on AMD EPYC 7713 64-Core Processor
__unopdi4()     2.8593750  0
__udivmoddi4()  20.8125000  17.9531250
__umuldi3()     5.2187500  2.3593750
                28.8906250 nano-seconds

__unopdi4()     4.3437500  0
__udivmoddi4()  36.9062500  32.5625000
__umuldi3()     6.4062500  2.0625000
                47.6562500 nano-seconds
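The listings above do not spell out how their columns relate. Judging from the figures alone, the second column appears to be a routine's time minus the time of the empty routine (__unopdi4() or _aullnop()), clamped at zero, and the last line appears to be the sum of the three first-column values. The following C sketch restates that reading; it is an inference from the printed numbers, not taken from the benchmark's source, and the helper names net_cost(), run_total() and self_check() are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Net time of a routine: measured time minus the time of the empty
   routine, clamped at zero (assumption inferred from the listings,
   where a routine faster than the empty one is shown as 0.000000000). */
double net_cost(double measured, double baseline)
{
    return measured > baseline ? measured - baseline : 0.0;
}

/* Last line of a listing: sum of the three first-column values
   (again an assumption inferred from the printed figures). */
double run_total(double nop, double division, double multiplication)
{
    return nop + division + multiplication;
}

/* Check the reading against figures copied from a
   Core 2 Duo P8700 listing measured in clock cycles. */
void self_check(void)
{
    double net = net_cost(94.499937193, 12.857257576);
    double sum = run_total(12.857257576, 94.499937193, 30.708206573);

    /* listing shows 81.642679617 and 138.065401342 */
    assert(net > 81.642679 && net < 81.642680);
    assert(sum > 138.065401 && sum < 138.065402);

    printf("net %.9f, total %.9f clock cycles\n", net, sum);
}
```

Calling self_check() from a main() of your own recomputes both values; the assertions only allow for floating-point rounding in the last printed digits.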
Use the X.509 certificate to send S/MIME encrypted mail.
Note: email in weird format and without a proper sender name is likely to be discarded!
I dislike HTML (and even weirder formats) in email;
I prefer to receive plain text.
I also expect to see your full (real) name as sender, not your
nickname.
I abhor top posts and expect inline quotes in replies.
“as is” without any warranty, neither express nor implied.
“cookies” in the web browser.
The web service is operated and provided by Telekom Deutschland GmbH.
The web service provider stores a session cookie in the web browser
and records every visit of this web site with the following data in
an access log on their server(s):