__chkstk Routine
_alloca Routine
_chkstk Routine
_allmul Routine
_alldiv Routine
_alldvrm Routine
_allrem Routine
_aulldiv Routine
_aulldvrm Routine
_aullrem Routine
_aullshr Routine
_allshl Routine
_allshr Routine
_all* and _aull* Routines in Leaked Source
_all* and _aull* Routines
_rotl64() and _rotr64() Intrinsic Functions for i386 Platform
_allrol() and _allror() Functions in i386 Assembler
_abs64() Intrinsic Function for i386 Platform
_allabs() Function in i386 Assembler
_allneg() Function in i386 Assembler
_allsgn() Function in i386 Assembler
_allcmp() and _aullcmp() Functions in i386 Assembler
_allmax() and _aullmax() Functions in i386 Assembler
_allmin() and _aullmin() Functions in i386 Assembler
acos(), asin(), atan(), atan2(), cos(), cosh(), exp(), fmod(), log(), log10(), pow(), sin(), sinh(), sqrt(), tan() and tanh() Standard Functions for i386 Platform
_CI* and _ftol* Routines
memchr() Standard Function for i386 Platform
Smart Implementation in i386 Assembler
Smart Implementation in AMD64 Assembler
mem*() Standard Functions
memcpy() and memset() with Intrinsic Functions
strchr() Standard Function for i386 Platform
strlen() Standard Function for i386 Platform
strrchr() and strstr() Standard Functions for i386 Platform
str*() Standard Functions
wcs*() Standard Functions
_load_config_used and __security_check_cookie() Function (/GS Support)
main() and wmain() Support
.CRT Section Usage
.rtc Section Usage

Additionally presented are properly written implementations which are several times faster, especially for 64÷64-bit integer division on the i386 platform.
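For code running on the AMD64 alias x64 processor architecture, the Microsoft Visual C compiler generates calls to the (almost) undocumented helper routine __chkstk for memory allocations on the stack, and to the standard functions memcpy() and memset() for assignment and initialisation of arrays and structures.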
For code running on the i386 alias x86 processor architecture, the Microsoft Visual C compiler generates calls to the (almost) undocumented helper routines _alloca and _chkstk for memory allocations on the stack, to the standard functions memcpy() and memset() for assignment and initialisation of arrays and structures, to the (almost) undocumented helper routines _alldiv, _alldvrm, _allmul, _allrem, _allshl and _allshr for signed 64-bit integer division, multiplication and shift operations, to the also (almost) undocumented helper routines _aulldiv, _aulldvrm, _aullrem and _aullshr for unsigned 64-bit integer division and shift operations, and to the helper routines _CIacos, _CIasin, _CIatan, _CIatan2, _CIcos, _CIcosh, _CIexp, _CIfmod, _CIlog, _CIlog10, _CIpow, _CIsin, _CIsinh, _CIsqrt, _CItan, _CItanh and _ftol for transcendental as well as trigonometric floating-point functions and for floating-point to integer conversion.
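The kind of C source that leads to such helper calls can be sketched as follows. This is a minimal illustration, not compiler output: the function names (mul64, div64 and so on) are hypothetical, and the helper named in each comment is the one the text above associates with the operation. Whether a given compiler version emits a call or inline code for a particular expression, for example a shift by a constant, is an assumption that should be checked in the generated assembly.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdint.h>

/* Each operation is wider than a 32-bit register, so on i386 the compiler
   has to expand it inline or call a helper routine. */
int64_t  mul64 (int64_t a,  int64_t b)  { return a * b;  }  /* _allmul  */
int64_t  div64 (int64_t a,  int64_t b)  { return a / b;  }  /* _alldiv  */
int64_t  rem64 (int64_t a,  int64_t b)  { return a % b;  }  /* _allrem  */
uint64_t udiv64(uint64_t a, uint64_t b) { return a / b;  }  /* _aulldiv */
uint64_t urem64(uint64_t a, uint64_t b) { return a % b;  }  /* _aullrem */
uint64_t ushr64(uint64_t a, unsigned n) { return a >> n; }  /* _aullshr */
```

Usage: `assert(div64(-7, 2) == -3);` and `assert(urem64(10, 4) == 2);` hold on any conforming C99 compiler; only the choice between helper call and inline code is platform specific.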
Internal CRT globals and functions
Note: except for the mem*() and str*() standard functions, all helper routines use a non-standard calling or naming convention and can’t be called from C or C++ sources by their name!
These routines are provided in the object file chkstk.obj, the object libraries libcmt.lib, libcmtd.lib, msvcrt.lib and msvcrtd.lib, partially also in runtmchk.lib; their i386 assembler sources are provided in the files alloca16.asm, chkstk.asm, lldiv.asm, lldvrm.asm, llmul.asm, llrem.asm, llshl.asm, llshr.asm, ulldiv.asm, ulldvrm.asm, ullrem.asm, ullshr.asm, memchr.asm, memcpy.asm, memset.asm, strchr.asm, strlen.asm etc., and their ANSI C sources are provided in the files strcat.c, strrchr.c, strstr.c etc.
Note: many of these routines are exported from NTDLL.dll. Import libraries amd64.lib and i386.lib can be generated with the following 2 command lines:
LINK.EXE /LIB /DEF /EXPORT:__C_specific_handler /EXPORT:__chkstk /EXPORT:atoi /EXPORT:atol /EXPORT:isalnum /EXPORT:isalpha /EXPORT:iscntrl /EXPORT:isdigit /EXPORT:isgraph /EXPORT:islower /EXPORT:isprint /EXPORT:ispunct /EXPORT:isspace /EXPORT:isupper /EXPORT:isxdigit /EXPORT:iswalpha /EXPORT:iswctype /EXPORT:iswdigit /EXPORT:iswlower /EXPORT:iswspace /EXPORT:iswxdigit /EXPORT:memchr /EXPORT:memcmp /EXPORT:memcpy /EXPORT:memmove /EXPORT:memset /EXPORT:qsort /EXPORT:strcat /EXPORT:strcat_s /EXPORT:strchr /EXPORT:strcmp /EXPORT:strcpy /EXPORT:strcpy_s /EXPORT:strcspn /EXPORT:strlen /EXPORT:strncat /EXPORT:strncat_s /EXPORT:strncmp /EXPORT:strncpy /EXPORT:strncpy_s /EXPORT:strnlen /EXPORT:strpbrk /EXPORT:strrchr /EXPORT:strspn /EXPORT:strstr /EXPORT:strtok_s /EXPORT:strtol /EXPORT:strtoul /EXPORT:tolower /EXPORT:toupper /EXPORT:towlower /EXPORT:towupper /EXPORT:wcscat /EXPORT:wcscat_s /EXPORT:wcschr /EXPORT:wcscmp /EXPORT:wcscpy /EXPORT:wcscpy_s /EXPORT:wcscspn /EXPORT:wcslen /EXPORT:wcsncat /EXPORT:wcsncat_s /EXPORT:wcsncmp /EXPORT:wcsncpy /EXPORT:wcsncpy_s /EXPORT:wcsnlen /EXPORT:wcspbrk /EXPORT:wcsspn /EXPORT:wcsstr /EXPORT:wcstol /EXPORT:wcstoul /MACHINE:AMD64 /NAME:NTDLL /NODEFAULTLIB /OUT:amd64.lib
LINK.EXE /LIB /DEF /EXPORT:_CIcos /EXPORT:_CIlog /EXPORT:_CIpow /EXPORT:_CIsin /EXPORT:_CIsqrt /EXPORT:_alldiv /EXPORT:_alldvrm /EXPORT:_allmul /EXPORT:_alloca_probe /EXPORT:_alloca_probe_8 /EXPORT:_alloca_probe_16 /EXPORT:_allrem /EXPORT:_allshl /EXPORT:_allshr /EXPORT:_aulldiv /EXPORT:_aulldvrm /EXPORT:_aullrem /EXPORT:_aullshr /EXPORT:_chkstk /EXPORT:_fltused /EXPORT:_ftol /EXPORT:atoi /EXPORT:atol /EXPORT:isalnum /EXPORT:isalpha /EXPORT:iscntrl /EXPORT:isdigit /EXPORT:isgraph /EXPORT:islower /EXPORT:isprint /EXPORT:ispunct /EXPORT:isspace /EXPORT:isupper /EXPORT:isxdigit /EXPORT:iswalpha /EXPORT:iswctype /EXPORT:iswdigit /EXPORT:iswlower /EXPORT:iswspace /EXPORT:iswxdigit /EXPORT:memchr /EXPORT:memcmp /EXPORT:memcpy /EXPORT:memmove /EXPORT:memset /EXPORT:qsort /EXPORT:strcat /EXPORT:strcat_s /EXPORT:strchr /EXPORT:strcmp /EXPORT:strcpy /EXPORT:strcpy_s /EXPORT:strcspn /EXPORT:strlen /EXPORT:strncat /EXPORT:strncat_s /EXPORT:strncmp /EXPORT:strncpy /EXPORT:strncpy_s /EXPORT:strnlen /EXPORT:strpbrk /EXPORT:strrchr /EXPORT:strspn /EXPORT:strstr /EXPORT:strtok_s /EXPORT:strtol /EXPORT:strtoul /EXPORT:tolower /EXPORT:toupper /EXPORT:towlower /EXPORT:towupper /EXPORT:wcscat /EXPORT:wcscat_s /EXPORT:wcschr /EXPORT:wcscmp /EXPORT:wcscpy /EXPORT:wcscpy_s /EXPORT:wcscspn /EXPORT:wcslen /EXPORT:wcsncat /EXPORT:wcsncat_s /EXPORT:wcsncmp /EXPORT:wcsncpy /EXPORT:wcsncpy_s /EXPORT:wcsnlen /EXPORT:wcspbrk /EXPORT:wcsspn /EXPORT:wcsstr /EXPORT:wcstol /EXPORT:wcstoul /MACHINE:I386 /NAME:NTDLL /NODEFAULTLIB /OUT:i386.lib

LIB Reference

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

   Creating library amd64.lib and object amd64.exp
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

   Creating library i386.lib and object i386.exp

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
__chkstk Routine

The documentation for the __chkstk routine states:

Called by the compiler when you have more than one page of local variables in your function.

Remarks

__chkstk Routine is a helper routine for the C compiler. For x86 compilers, __chkstk Routine is called when the local variables exceed 4K bytes; for x64 compilers it is 8K.

This function is not defined in an SDK header and must be declared by the caller. This function is exported from kernelbase.dll.

OUCH¹: contrary to the first highlighted statement ("for x64 compilers it is 8K"), the correct number for AMD64 alias x64 processors is 4k too; 8k […]

OUCH²: contrary to the second highlighted statement ("must be declared by the caller"), which is complete and dangerous nonsense, this routine uses a non-standard calling convention; it must not be declared and can not be called from C or C++ sources by its name!
The MSDN article x64 Prolog and Epilog specifies:

The __chkstk helper will not modify any registers other than R10, R11, and the condition codes. In particular, it will return RAX unchanged and leave all nonvolatile registers and argument-passing registers unmodified.

Note: the Visual C compiler calls it through the _alloca() intrinsic function, using register RAX for its argument.
Overview of x64 Calling Conventions
Register Usage
Types and Storage
Scalar Types
Aggregates and Unions
Examples of Structure Alignment
Bitfields
Calling Convention
Parameter Passing
Varargs
Unprototyped Functions
Caller/Callee Saved Registers
Function Pointers
Legacy Floating-Point Support
FPSCR
MXCSR
Stack Usage
Start the command prompt of the Visual C development environment for the AMD64 platform, then execute the following 3 command lines to locate the object file chkstk.obj and display its disassembly:

FOR %? IN (chkstk.obj) DO SET chkstk=%~$LIB:?
DIR "%chkstk%"
LINK.EXE /DUMP /DISASM "%chkstk%"

SET chkstk=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\amd64\chkstk.obj

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\amd64

02/18/2011  03:08 PM             1,922 chkstk.obj
               1 File(s)          1,922 bytes
               0 Dir(s)   9,876,543,210 bytes free

Microsoft (R) COFF/PE Dumper Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Dump of file C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\amd64\chkstk.obj

File Type: COFF OBJECT

$$000000:
  0000000000000000: CC                 int         3
  0000000000000001: CC                 int         3
  0000000000000002: CC                 int         3
  0000000000000003: CC                 int         3
  0000000000000004: CC                 int         3
  0000000000000005: CC                 int         3
  0000000000000006: 66 66 0F 1F 84 00  nop         word ptr [rax+rax+00000000h]
                    00 00 00 00
__chkstk:
  0000000000000010: 48 83 EC 10        sub         rsp,10h                        [deleted]
  0000000000000014: 4C 89 14 24        mov         qword ptr [rsp],r10            [deleted]
  0000000000000018: 4C 89 5C 24 08     mov         qword ptr [rsp+8],r11          [deleted]
  000000000000001D: 4D 33 DB           xor         r11,r11
  0000000000000020: 4C 8D 54 24 18     lea         r10,[rsp+18h]                  [deleted]
  0000000000000020: 4C 8D 54 24 08     lea         r10,[rsp+8]                    [inserted]
  0000000000000025: 4C 2B D0           sub         r10,rax
  0000000000000028: 4D 0F 42 D3        cmovb       r10,r11
  000000000000002C: 65 4C 8B 1C 25 10  mov         r11,qword ptr gs:[00000010h]   [deleted]
                    00 00 00
  000000000000002C: 65 4D 8B 52 10     mov         r11,qword ptr gs:[r11+10h]     [inserted]
  0000000000000035: 4D 3B D3           cmp         r10,r11
  0000000000000038: 73 16              jae         cs20
  000000000000003A: 66 41 81 E2 00 F0  and         r10w,0F000h                    [deleted]
cs10:
  0000000000000040: 4D 8D 9B 00 F0 FF  lea         r11,[r11+FFFFF000h]
                    FF
  0000000000000047: 41 C6 03 00        mov         byte ptr [r11],0               [deleted]
  0000000000000047: 4D 85 1B           test        qword ptr [r11],r11            [inserted]
  000000000000004B: 4D 3B D3           cmp         r10,r11
  000000000000004E: 75 F0              jne         cs10                           [deleted]
  000000000000004E: 72 F0              jnae        cs10                           [inserted]
cs20:
  0000000000000050: 4C 8B 14 24        mov         r10,qword ptr [rsp]            [deleted]
  0000000000000054: 4C 8B 5C 24 08     mov         r11,qword ptr [rsp+8]          [deleted]
  0000000000000059: 48 83 C4 10        add         rsp,10h                        [deleted]
  000000000000005D: C3                 ret

  Summary

           0 .data
         3A8 .debug$S
          70 .debug$T
           C .pdata
          5E .text
           8 .xdata

19 (plus 7) instructions in 78 (plus 18) bytes.
OUCH¹: the __chkstk routine saves and restores the volatile registers R10 and R11 without necessity, and very clumsily too; instead of using 2 PUSH plus 2 POP instructions with just 8 bytes, it decrements respectively increments the stack pointer with SUB and ADD instructions and writes respectively reads the stack with 4 MOV instructions, wasting 26 bytes!

OUCH²: replacing the deleted JNE instruction at address 0x4E with the inserted JNAE alias JB instruction makes the deleted AND instruction at address 0x3A superfluous and saves 6 bytes!

OUCH³: replacing the deleted MOV instruction at address 0x47 with the inserted TEST instruction avoids a superfluous memory write and saves 1 byte!

Note: 7 of the 19 instructions and 37 of the total 78 code bytes are completely superfluous; they only waste memory, processor cycles and every user’s time!

Oops: replacing the deleted MOV instruction at address 0x2C with the inserted one saves 4 more bytes!
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; * The software is provided "as is" without any warranty, neither express
; nor implied.
; * In no event will the author be held liable for any damage(s) arising
; from the use of the software.
; * Redistribution of the software is allowed only in unmodified form.
; * Permission is granted to use the software solely for personal private
; and non-commercial purposes.
; * An individuals use of the software in his or her capacity or function
; as an agent, (independent) contractor, employee, member or officer of
; a business, corporation or organization (commercial or non-commercial)
; does not qualify as personal private and non-commercial purpose.
; * Without written approval from the author the software must not be used
; for a business, for commercial, corporate, governmental, military or
; organizational purposes of any kind, or in a commercial, corporate,
; governmental, military or organizational environment of any kind.
_nt_tib struct 8 ; thread information block
chain qword ? ; address of first exception registration record
base qword ? ; stack base
limit qword ? ; stack limit
qword ? ; address of subsystem thread information block
fiber qword ? ; fiber data
pointer qword ? ; arbitrary user pointer
self qword ? ; address of _nt_tib
_nt_tib ends
.code
; MSC internal intrinsic _alloca() alias __chkstk():
; receives argument in rax
; NOTE: _alloca() must preserve rax and all argument registers;
; it can raise 'stack overflow' exception!
;; alias <_alloca> = <__chkstk>
__chkstk proc public ; qword __chkstk(qword size)
mov r10, gs:[_nt_tib.limit]
; r10 = (current) stack limit
lea r11, [rsp+8] ; r11 = stack pointer of caller
sub r11, rax ; r11 = new stack pointer
jnb short limit
overflow:
xor r11, r11 ; r11 = 0
probe:
sub r10, 4096 ; r10 = address of guard page
test r10, [r10] ; r10 = new stack limit via 'guard page' exception
limit:
cmp r10, r11
ja short probe ; stack limit > new stack pointer?
ret
__chkstk endp
end
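The probe loop of the assembler routine above can be modelled in portable C. This is only a sketch under stated assumptions: addresses are plain integers, the stack_model type and the probe_stack name are hypothetical, and the memory read that triggers the guard page exception is replaced by a counter. The wraparound case, where the real routine probes down into unmapped memory until the stack overflow exception is raised, is reduced here to returning 0 without probing.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Hypothetical model of the thread's stack state. */
typedef struct {
    uint64_t limit;   /* current stack limit (lowest committed address) */
    unsigned probes;  /* number of guard pages touched so far           */
} stack_model;

/* Returns the new stack pointer for an allocation of `size` bytes below
   `sp` (the caller's stack pointer), lowering the stack limit one page
   at a time until it no longer lies above the new stack pointer. */
uint64_t probe_stack(stack_model *model, uint64_t sp, uint64_t size)
{
    if (size > sp)
        return 0;                   /* wraparound: see the lead-in */

    uint64_t new_sp = sp - size;

    while (model->limit > new_sp) {
        model->limit -= PAGE_SIZE;  /* address of the guard page */
        model->probes++;            /* the real routine reads memory here */
    }
    return new_sp;
}
```

With a stack limit of 0x10000 and a caller's stack pointer of 0x12000, an allocation of 0x100 bytes needs no probe at all, while an allocation of 0x3001 bytes lowers the limit by 2 pages.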
Microsoft Macro Assembler Reference
Save the AMD64 assembler source presented above as chkstk.asm in an arbitrary, preferably empty directory, then execute the following 3 command lines to generate the object file chkstk.obj and put it into the new object library amd64.lib:

SET ML=/c /W3 /X
ML64.EXE chkstk.asm
LINK.EXE /LIB /OUT:amd64.lib chkstk.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window.

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: chkstk.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
_alloca Routine

The documentation for the _alloca compiler helper routine states:

Allocates memory on the stack. […]

void *_alloca( size_t size );

[…]

The _alloca routine returns a void pointer to the allocated space, which is suitably aligned for storage of any type of object. If size is 0, _alloca allocates a zero-length item and returns a valid pointer to that item.

A stack overflow exception is generated if the space can't be allocated. […]

OUCH: the _alloca_probe alias _chkstk routine returns only an unaligned pointer; only the _alloca_probe_8 and _alloca_probe_16 routines return an aligned pointer, the first suitable to store MMX™ variables, the second suitable to store SSE variables!

CAVEAT: for constant arguments less than 64, the Visual C compiler generates calls to the _alloca_probe routine!
Start the command prompt of the Visual C development environment for the i386 platform, then execute the following 4 command lines to locate the assembler source file alloca16.asm and display its content:

FOR %? IN (msvcrt.lib) DO SET msvcrt=%~$LIB:?
SET source=%msvcrt:\lib\msvcrt.lib=\crt\src%
DIR "%source%\intel\alloca16.asm"
TYPE "%source%\intel\alloca16.asm"

SET msvcrt=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\msvcrt.lib

SET source=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             2,241 alloca16.asm
               1 File(s)          2,241 bytes
               0 Dir(s)   9,876,543,210 bytes free

        page    ,132
        title   alloca16 - aligned C stack checking routine
;***
;chkstk.asm - aligned C stack checking routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       Provides 16 and 8 bit aligned alloca routines.
;
;*******************************************************************************

.xlist
        include cruntime.inc
.list

extern  _chkstk:near

; size of a page of memory

        CODESEG

page
;***
; _alloca_probe_16, _alloca_probe_8 - align allocation to 16/8 byte boundary
;
;Purpose:
;       Adjust allocation size so the ESP returned from chkstk will be aligned
;       to 16/8 bit boundary. Call chkstk to do the real allocation.
;
;Entry:
;       EAX = size of local frame
;
;Exit:
;       Adjusted EAX.
;
;Uses:
;       EAX
;
;*******************************************************************************

public  _alloca_probe_8

_alloca_probe_16 proc                   ; 16 byte aligned alloca

        push    ecx
        lea     ecx, [esp] + 8          ; TOS before entering this function
        sub     ecx, eax                ; New TOS
        and     ecx, (16 - 1)           ; Distance from 16 bit align (align down)
        add     eax, ecx                ; Increase allocation size
        sbb     ecx, ecx                ; ecx = 0xFFFFFFFF if size wrapped around
        or      eax, ecx                ; cap allocation size on wraparound
        pop     ecx                     ; Restore ecx
        jmp     _chkstk

alloca_8:                               ; 8 byte aligned alloca
_alloca_probe_8 = alloca_8

        push    ecx
        lea     ecx, [esp] + 8          ; TOS before entering this function
        sub     ecx, eax                ; New TOS
        and     ecx, (8 - 1)            ; Distance from 8 bit align (align down)
        add     eax, ecx                ; Increase allocation Size
        sbb     ecx, ecx                ; ecx = 0xFFFFFFFF if size wrapped around
        or      eax, ecx                ; cap allocation size on wraparound
        pop     ecx                     ; Restore ecx
        jmp     _chkstk

_alloca_probe_16 endp

        end

18 instructions in 44 bytes (plus 4 bytes for alignment).

Oops: since both routines are contained in one (linkable) function, they occupy 48 bytes instead of 32 bytes; together with the referenced _chkstk routine they occupy 96 bytes.
_chkstk Routine

The documentation for the _chkstk compiler helper routine states:

_chkstk Routine is a helper routine for the C compiler. For x86 compilers, _chkstk Routine is called when the local variables exceed 4K bytes; for x64 compilers it is 8K.

OUCH: contrary to the highlighted statement ("for x64 compilers it is 8K"), the correct number for x64 alias AMD64 processors is 4096 too; 8192 […]

The documentation for the /Gs compiler option states:

A stack probe is a sequence of code that the compiler inserts at the beginning of a function call. When initiated, a stack probe reaches benignly into memory by the amount of space required to store the function's local variables. This probe causes the operating system to transparently page in more stack memory if necessary, before the rest of the function runs.

By default, the compiler generates code that initiates a stack probe when a function requires more than one page of stack space. This default is equivalent to a compiler option of /Gs4096 for x86, x64, ARM, and ARM64 platforms. This value allows an application and the Windows memory manager to increase the amount of memory committed to the program stack dynamically at run time.

Execute the following 2 command lines to display the content of the assembler source file chkstk.asm shipped with the Visual C compiler:

DIR "%source%\intel\chkstk.asm"
TYPE "%source%\intel\chkstk.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             3,465 chkstk.asm
               1 File(s)          3,465 bytes
               0 Dir(s)   9,876,543,210 bytes free

        page    ,132
        title   chkstk - C stack checking routine
;***
;chkstk.asm - C stack checking routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       Provides support for automatic stack checking in C procedures
;       when stack checking is enabled.
;
;*******************************************************************************

.xlist
        include cruntime.inc
.list

; size of a page of memory

_PAGESIZE_      equ     1000h

        CODESEG

page
;***
;_chkstk - check stack upon procedure entry
;
;Purpose:
;       Provide stack checking on procedure entry. Method is to simply probe
;       each page of memory required for the stack in descending order. This
;       causes the necessary pages of memory to be allocated via the guard
;       page scheme, if possible. In the event of failure, the OS raises the
;       _XCPT_UNABLE_TO_GROW_STACK exception.
;
;       NOTE:  Currently, the (EAX < _PAGESIZE_) code path falls through
;       to the "lastpage" label of the (EAX >= _PAGESIZE_) code path.  This
;       is small; a minor speed optimization would be to special case
;       this up top.  This would avoid the painful save/restore of
;       ecx and would shorten the code path by 4-6 instructions.
;
;Entry:
;       EAX = size of local frame
;
;Exit:
;       ESP = new stackframe, if successful
;
;Uses:
;       EAX
;
;Exceptions:
;       _XCPT_GUARD_PAGE_VIOLATION - May be raised on a page probe. NEVER TRAP
;                                    THIS!!!! It is used by the OS to grow the
;                                    stack on demand.
;       _XCPT_UNABLE_TO_GROW_STACK - The stack cannot be grown. More precisely,
;                                    the attempt by the OS memory manager to
;                                    allocate another guard page in response
;                                    to a _XCPT_GUARD_PAGE_VIOLATION has
;                                    failed.
;
;*******************************************************************************

public  _alloca_probe

_chkstk proc

_alloca_probe    =  _chkstk

        push    ecx

; Calculate new TOS.

        lea     ecx, [esp] + 8 - 4      ; TOS before entering function + size for ret value
        sub     ecx, eax                ; new TOS

; Handle allocation size that results in wraparound.
; Wraparound will result in StackOverflow exception.

        cmc                                                             [inserted]
        sbb     eax, eax                ; 0 if CF==0, ~0 if CF==1
        not     eax                     ; ~0 if TOS did not wrapped around, 0 otherwise   [deleted]
        and     ecx, eax                ; set to 0 if wraparound

        mov     eax, esp                ; current TOS
        and     eax, not ( _PAGESIZE_ - 1) ; Round down to current page boundary

cs10:
        cmp     ecx, eax                ; Is new TOS
        jb      short cs20              ; in probed page?
        mov     eax, ecx                ; yes.
        pop     ecx
        xchg    esp, eax                ; update esp
        mov     eax, dword ptr [eax]    ; get return address            [deleted]
        mov     dword ptr [esp], eax    ; and put it at new TOS         [deleted]
        push    [eax]                                                   [inserted]
        ret

; Find next lower page and probe
cs20:
        sub     eax, _PAGESIZE_         ; decrease by PAGESIZE
        test    dword ptr [eax],eax     ; probe page.
        jmp     short cs10

_chkstk endp

        end

19 instructions in 43 bytes (plus 5 bytes for alignment).
Oops¹: every programmer really should know that two’s-complement binary arithmetic exhibits the identity −value = not (value − 1)!

Oops²: instead of the deleted NOT instruction, the CMC instruction inserted before the SBB instruction should be used, saving 1 byte.
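The two notes above can be checked with a few lines of C. This is a sketch in which 32-bit unsigned arithmetic stands in for the registers and a 0/1 integer stands in for the carry flag; the function names are hypothetical. It shows the identity −value = not (value − 1) and that complementing the carry before SBB yields the same mask as SBB followed by NOT.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdint.h>

/* -value == ~(value - 1), computed modulo 2**32. */
uint32_t negate_via_not(uint32_t value)
{
    return ~(value - 1u);
}

/* SUB sets the carry flag on borrow; SBB EAX, EAX turns it into 0 or ~0;
   NOT EAX then inverts the mask. */
uint32_t mask_sbb_not(uint32_t tos, uint32_t size)
{
    uint32_t carry = (tos < size);  /* CF after SUB ECX, EAX */
    uint32_t mask  = 0u - carry;    /* SBB EAX, EAX          */
    return ~mask;                   /* NOT EAX               */
}

/* CMC complements the carry flag first, so SBB alone produces the mask. */
uint32_t mask_cmc_sbb(uint32_t tos, uint32_t size)
{
    uint32_t carry = (tos < size);  /* CF after SUB ECX, EAX */
    carry ^= 1u;                    /* CMC                   */
    return 0u - carry;              /* SBB EAX, EAX          */
}
```

Both mask functions return ~0 when the subtraction did not wrap around and 0 when it did, so the following AND leaves the new top of stack unchanged or clears it.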
OOPS: on Pentium® and later processors, instead of the 2 deleted MOV instructions the single inserted PUSH instruction should be used, saving 4 bytes; the term - 4 of the initial LEA instruction must then be removed to account for the additional 4 bytes pushed onto the stack!

OUCH: if the new TOS is within an already allocated stack page, this stupid implementation still performs superfluous page probes, i.e. superfluous slow memory accesses, loading stale data into the cache hierarchy in the best case and triggering page faults transferring stale data into memory in the worst case!
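The cost named in the last note can be made concrete by counting page probes. The sketch below is a model only: addresses are plain integers, both function names are hypothetical, and the example values are made up. The first function follows the shipped routine, which starts at the page of the current stack pointer; the second starts at the stack limit, so pages that are already committed are never touched.

```c
#include <assert.h>   /* only needed for the checks in the usage note */
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Shipped routine: round the current stack pointer down to its page,
   then probe every lower page until the new top of stack is reached. */
unsigned probes_from_current_page(uint32_t sp, uint32_t new_tos)
{
    uint32_t page = sp & ~(PAGE_SIZE - 1u);
    unsigned count = 0;

    while (new_tos < page) {
        page -= PAGE_SIZE;
        count++;
    }
    return count;
}

/* Alternative: probe only pages below the current stack limit. */
unsigned probes_below_limit(uint32_t limit, uint32_t new_tos)
{
    unsigned count = 0;

    while (limit > new_tos) {
        limit -= PAGE_SIZE;
        count++;
    }
    return count;
}
```

With a stack pointer of 0x0012FF7C, a stack limit of 0x00120000 and an allocation of 0x5000 bytes, the first approach probes 5 pages although all of them lie above the stack limit; the second approach probes none.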
FOR %? IN (chkstk.obj) DO SET chkstk=%~$LIB:?
DIR "%chkstk%"
LINK.EXE /DUMP /DISASM "%chkstk%"

SET chkstk=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\chkstk.obj

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib

02/18/2011  03:52 PM             1,377 chkstk.obj
               1 File(s)          1,377 bytes
               0 Dir(s)   9,876,543,210 bytes free

Microsoft (R) COFF/PE Dumper Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

Dump of file C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\chkstk.obj

File Type: COFF OBJECT

__chkstk:
  00000000: 51                 push        ecx
  00000001: 8D 4C 24 04        lea         ecx,[esp+4]
  00000005: 2B C8              sub         ecx,eax
  00000007: 1B C0              sbb         eax,eax
  00000009: F7 D0              not         eax
  0000000B: 23 C8              and         ecx,eax
  0000000D: 8B C4              mov         eax,esp
  0000000F: 25 00 F0 FF FF     and         eax,0FFFFF000h
cs10:
  00000014: 3B C8              cmp         ecx,eax
  00000016: 72 0A              jb          cs20
  00000018: 8B C1              mov         eax,ecx
  0000001A: 59                 pop         ecx
  0000001B: 94                 xchg        eax,esp
  0000001C: 8B 00              mov         eax,dword ptr [eax]
  0000001E: 89 04 24           mov         dword ptr [esp],eax
  00000021: C3                 ret
cs20:
  00000022: 2D 00 10 00 00     sub         eax,1000h
  00000027: 85 00              test        dword ptr [eax],eax
  00000029: EB E9              jmp         cs10

  Summary

           0 .data
         2EC .debug$S
          24 .debug$T
          2B .text
[…] SHL instruction; it uses 17 instructions in 40 bytes (plus 8 bytes for alignment) when the text macro ALLOCA is undefined, and 18 instructions in 43 bytes (plus 5 bytes for alignment) when the text macro ALLOCA is defined as 8 or 16:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
_nt_tib struct 4 ; thread information block
chain dword ? ; address of first exception registration record
base dword ? ; stack base
limit dword ? ; stack limit
dword ? ; address of subsystem thread information block
fiber dword ? ; fiber data
pointer dword ? ; arbitrary user pointer
self dword ? ; address of _nt_tib
_nt_tib ends
.code
; MSC internal intrinsic _alloca() alias _chkstk():
; receives argument in eax, returns result in esp
; NOTE: _alloca() must preserve all registers except eax;
; it can raise 'stack overflow' exception!
ifndef ALLOCA
_chkstk proc public ; void _chkstk(dword size)
_alloca_probe proc public ; void _alloca_probe(dword size)
elseifidn ALLOCA, %8
_alloca_probe_8 proc public ; void _alloca_probe_8(dword size)
elseifidn ALLOCA, %16
_alloca_probe_16 proc public ; void _alloca_probe_16(dword size)
endif
push ebx ; decrement stack pointer, save register
lea ebx, [esp+8] ; ebx = stack pointer of caller
sub ebx, eax ; ebx = new (unaligned) stack pointer
cmc ; CF = ~(ebx < 0)
sbb eax, eax ; eax = (ebx < 0) ? 0 : -1
ifndef ALLOCA
elseifidn ALLOCA, %8
shl eax, 3 ; eax = (ebx < 0) ? 0 : -8
elseifidn ALLOCA, %16
shl eax, 4 ; eax = (ebx < 0) ? 0 : -16
endif
and eax, ebx ; eax = (ebx < 0) ? 0 : new (aligned) stack pointer
assume fs :flat
mov ebx, fs:[_nt_tib.limit]
; ebx = (current) stack limit
cmp ebx, eax
jna short ready ; stack limit <= new stack pointer?
probe:
sub ebx, 4096 ; ebx = address of guard page
test ebx, [ebx] ; ebx = new stack limit via 'guard page' exception
cmp ebx, eax
ja short probe ; new stack limit > new stack pointer?
ready:
pop ebx ; restore register
xchg eax, esp ; esp = new stack pointer,
; eax = old stack pointer
; = address of return address
push [eax] ; decrement stack pointer, write return address
ret
ifndef ALLOCA
_alloca_probe endp
_chkstk endp
elseifidn ALLOCA, %8
_alloca_probe_8 endp
elseifidn ALLOCA, %16
_alloca_probe_16 endp
else
echo ALLOCA must be 8 or 16 when defined!
.err ALLOCA
endif
end
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as alloca.asm in an arbitrary, preferably empty directory, then execute the following 5 command lines to generate the 3 object files alloca.obj, alloca8.obj plus alloca16.obj and put them into the new object library i386.lib:

SET ML=/c /safeseh /W3 /X
ML.EXE alloca.asm
ML.EXE /DALLOCA=8 /Foalloca8.obj alloca.asm
ML.EXE /DALLOCA=16 /Foalloca16.obj alloca.asm
LINK.EXE /LIB /OUT:i386.lib alloca.obj alloca8.obj alloca16.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.

Note: the command lines can be copied and pasted as a block into a Command Processor window.

Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: alloca.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: alloca.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: alloca.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
_allmul Routine

The documentation for the _allmul compiler helper routine states:

Multiplies two LONGLONG or ULONGLONG integers. For example, to multiply two int64 values the compiler might generate a call to the _allmul routine.

Remarks

The _allmul routine is a helper routine for the C compiler. Whether the compiler uses _allmul is completely dependent on the optimization set.

This routine is used only on x86 platforms.

OUCH: contrary to the highlighted statement, the Visual C compiler generates calls to the _allmul routine unconditionally, independent of any optimisation, when it encounters a multiplication where at least one of its operands is a (signed or unsigned) 64-bit integer!

Execute the following 2 command lines to display the content of the assembler source file llmul.asm shipped with the Visual C compiler:

DIR "%source%\intel\llmul.asm"
TYPE "%source%\intel\llmul.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             2,570 llmul.asm
               1 File(s)          2,570 bytes
               0 Dir(s)   9,876,543,210 bytes free

        title   llmul - long multiply routine
;***
;llmul.asm - long multiply routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       Defines long multiply routine
;       Both signed and unsigned routines are the same, since multiply's
;       work out the same in 2's complement
;       creates the following routine:
;           __allmul
;
;*******************************************************************************

.xlist
include cruntime.inc
include mm.inc
.list

;***
;llmul - long multiply routine
;
;Purpose:
;       Does a long multiply (same for signed/unsigned)
;       Parameters are not changed.
;
;Entry:
;       Parameters are passed on the stack:
;               1st pushed: multiplier (QWORD)
;               2nd pushed: multiplicand (QWORD)
;
;Exit:
;       EDX:EAX - product of multiplier and multiplicand
;       NOTE: parameters are removed from the stack
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

_allmul PROC NEAR
.FPO (0, 4, 0, 0, 0, 0)

A       EQU     [esp + 4]       ; stack address of a
B       EQU     [esp + 12]      ; stack address of b

;
;       AHI, BHI : upper 32 bits of A and B
;       ALO, BLO : lower 32 bits of A and B
;
;             ALO * BLO
;       ALO * BHI
; +     BLO * AHI
; ---------------------
;

        mov     eax,HIWORD(A)
        mov     ecx,HIWORD(B)
        or      ecx,eax         ;test for both hiwords zero.
        mov     ecx,LOWORD(B)
        jnz     short hard      ;both are zero, just mult ALO and BLO

        mov     eax,LOWORD(A)
        mul     ecx

        ret     16              ; callee restores the stack

hard:
        push    ebx
.FPO (1, 4, 0, 0, 0, 0)

; must redefine A and B since esp has been altered

A2      EQU     [esp + 8]       ; stack address of a
B2      EQU     [esp + 16]      ; stack address of b

        mul     ecx             ;eax has AHI, ecx has BLO, so AHI * BLO
        mov     ebx,eax         ;save result

        mov     eax,LOWORD(A2)
        mul     dword ptr HIWORD(B2) ;ALO * BHI
        add     ebx,eax         ;ebx = ((ALO * BHI) + (AHI * BLO))

        mov     eax,LOWORD(A2)  ;ecx = BLO
        mul     ecx             ;so edx:eax = ALO*BLO
        add     edx,ebx         ;now edx has all the LO*HI stuff

        pop     ebx

        ret     16              ; callee restores the stack

_allmul ENDP

        end

19 instructions in 52 bytes (plus 12 bytes for alignment).
Ouch¹: since only the low parts of the products of the low and high parts of the arguments are needed, the 2 highlighted widening MUL instructions (AHI * BLO and ALO * BHI) should be replaced with 2 faster IMUL instructions!

Ouch²: on processors featuring speculative execution, i.e. Pentium®Pro (introduced November 1, 1995) and newer, which execute 2 IMUL or MUL instructions faster than a mispredicted conditional branch, the test whether the high parts of both arguments are 0 is superfluous and impairs performance!
[…] SPACE is undefined, else 12 instructions in 37 bytes (plus 11 bytes for alignment), but no non-volatile register:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
; MSC internal _allmul():
; receives arguments on stack, returns product modulo 2**64 in edx:eax
_allmul proc public ; sqword _allmul(sqword multiplicand, sqword multiplier)
mov ecx, [esp+16] ; ecx = high dword of multiplier
mov edx, [esp+8] ; edx = high dword of multiplicand
mov eax, [esp+4] ; eax = low dword of multiplicand
or ecx, edx
ifdef SPACE
jz short zero ; high dwords are 0?
else ; SPACE
jnz short notzero ; high dwords are not 0?
mul dword ptr [esp+12]
; edx:eax = low dword of multiplicand
; * low dword of multiplier
ret 16 ; callee restores stack
notzero:
endif ; SPACE
imul edx, [esp+12] ; edx = high dword of multiplicand
; * low dword of multiplier
mov ecx, [esp+16] ; ecx = high dword of multiplier
imul ecx, eax ; ecx = high dword of multiplier
; * low dword of multiplicand
add ecx, edx ; ecx = high dword of multiplier
; * low dword of multiplicand
; + high dword of multiplicand
; * low dword of multiplier
zero:
mul dword ptr [esp+12]
; edx:eax = low dword of multiplicand
; * low dword of multiplier
add edx, ecx ; edx:eax = product % 2**64
ret 16 ; callee restores stack
_allmul endp
end
A proper implementation for processors which feature speculative
execution uses 9 instructions in 29 bytes (plus 3 bytes for
alignment) without conditional branch instead of the 19 instructions
in 52 bytes used by Microsoft’s
poor implementation:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
; MSC internal _allmul():
; receives arguments on stack, returns product modulo 2**64 in edx:eax
_allmul proc public ; sqword _allmul(sqword multiplicand, sqword multiplier)
mov eax, [esp+4] ; eax = low dword of multiplicand
mov edx, [esp+8] ; edx = high dword of multiplicand
imul edx, [esp+12] ; edx = high dword of multiplicand
; * low dword of multiplier
mov ecx, [esp+16] ; ecx = high dword of multiplier
imul ecx, eax ; ecx = high dword of multiplier
; * low dword of multiplicand
add ecx, edx ; ecx = high dword of multiplier
; * low dword of multiplicand
; + high dword of multiplicand
; * low dword of multiplier
mul dword ptr [esp+12]
; edx:eax = low dword of multiplicand
; * low dword of multiplier
add edx, ecx ; edx:eax = product % 2**64
ret 16 ; callee restores stack
_allmul endp
end
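The arithmetic behind the routines above can be modelled in C. This is only a minimal sketch, not Microsoft's or the author's code; the function name mul64_sketch is made up for illustration. It shows that the 2 cross products are needed only modulo 2**32 (the job of the IMUL instructions), and that only the product of the low dwords has to be widening (the job of the single MUL instruction):

```c
#include <stdint.h>

/* Hypothetical helper (the name is an assumption, not from the text):
   64x64-bit multiplication modulo 2**64 built from 32-bit parts. */
uint64_t mul64_sketch(uint64_t a, uint64_t b)
{
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);

    /* only the low 32 bits of both cross products contribute,
       so truncating (non-widening) multiplications suffice */
    uint32_t cross = ahi * blo + alo * bhi;

    /* the product of the low dwords is the only widening multiplication */
    uint64_t low = (uint64_t)alo * blo;

    /* the cross products are added to the high dword of the result */
    return low + ((uint64_t)cross << 32);
}
```

The sketch has no test for zero high dwords; like the branch-free assembler variant, it performs all 3 multiplications unconditionally.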
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as
allmul.asm
in the directory where you created the
object library i386.lib
before, then execute the
following 2 command lines to generate the object file
allmul.obj
and add it to the existing object library
i386.lib
:
ML.EXE allmul.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib allmul.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: allmul.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
_alldiv Routine
The documentation for the _alldiv compiler helper routine states:
Divides two LONGLONG integers. For example, to divide two int64 values the compiler might generate a call to _alldiv Routine.
Remarks
_alldiv Routine is a helper routine for the C compiler. Whether the compiler calls _alldiv Routine is completely dependent on the optimization set.
OUCH: contrary to the highlighted statement, the Visual C compiler generates calls to the _alldiv routine unconditionally, independent from any optimisation, when it encounters a division where at least one of its operands is a signed 64-bit integer!
Execute the following 2 command lines to display the content of the
assembler source file lldiv.asm
shipped with the
Visual C compiler:
DIR "%source%\intel\lldiv.asm"
TYPE "%source%\intel\lldiv.asm"
Volume in drive C has no label.
Volume Serial Number is 1957-0427
Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011 03:40 PM 6,670 lldiv.asm
1 File(s) 6,670 bytes
0 Dir(s) 9,876,543,210 bytes free
title lldiv - signed long divide routine
;***
;lldiv.asm - signed long divide routine
;
; Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
; defines the signed long divide routine
; __alldiv
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;lldiv - signed long divide
;
;Purpose:
; Does a signed long divide of the arguments. Arguments are
; not changed.
;
;Entry:
; Arguments are passed on the stack:
; 1st pushed: divisor (QWORD)
; 2nd pushed: dividend (QWORD)
;
;Exit:
; EDX:EAX contains the quotient (dividend/divisor)
; NOTE: this routine removes the parameters from the stack.
;
;Uses:
; ECX
;
;Exceptions:
;
;*******************************************************************************
CODESEG
_alldiv PROC NEAR
.FPO (3, 4, 0, 0, 0, 0)
push edi
- push esi
push ebx
; Set up the local stack and save the index registers. When this is done
; the stack frame will look as follows (assuming that the expression a/b will
; generate a call to lldiv(a, b)):
;
; -----------------
; | |
; |---------------|
; | |
; |--divisor (b)--|
; | |
; |---------------|
; | |
; |--dividend (a)-|
; | |
; |---------------|
; | return addr** |
; |---------------|
; | EDI |
; |---------------|
- ; | ESI |
- ; |---------------|
; ESP---->| EBX |
; -----------------
;
- DVND equ [esp + 16] ; stack address of dividend (a)
- DVSR equ [esp + 24] ; stack address of divisor (b)
+ DVND equ [esp + 12]
+ DVSR equ [esp + 20]
; Determine sign of the result (edi = 0 if result is positive, non-zero
; otherwise) and make operands positive.
xor edi,edi ; result sign assumed positive
mov eax,HIWORD(DVND) ; hi word of a
- or eax,eax ; test to see if signed
+ test eax,eax
jge short L1 ; skip rest if a is already positive
inc edi ; complement result sign flag
mov edx,LOWORD(DVND) ; lo word of a
neg eax ; make a positive
neg edx
sbb eax,0
mov HIWORD(DVND),eax ; save positive value
mov LOWORD(DVND),edx
L1:
mov eax,HIWORD(DVSR) ; hi word of b
- or eax,eax ; test to see if signed
+ test eax,eax
jge short L2 ; skip rest if b is already positive
inc edi ; complement the result sign flag
mov edx,LOWORD(DVSR) ; lo word of a
neg eax ; make b positive
neg edx
sbb eax,0
mov HIWORD(DVSR),eax ; save positive value
mov LOWORD(DVSR),edx
L2:
;
; Now do the divide. First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
; NOTE - eax currently contains the high order word of DVSR
;
- or eax,eax ; check to see if divisor < 4194304K
+ test eax,eax
jnz short L3 ; nope, gotta do this the hard way
mov ecx,LOWORD(DVSR) ; load divisor
mov eax,HIWORD(DVND) ; load high word of dividend
xor edx,edx
div ecx ; eax <- high order bits of quotient
mov ebx,eax ; save high bits of quotient
mov eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
div ecx ; eax <- low order bits of quotient
mov edx,ebx ; edx:eax <- quotient
jmp short L4 ; set sign, restore stack and return
;
; Here we do it the hard way. Remember, eax contains the high word of DVSR
;
L3:
mov ebx,eax ; ebx:ecx <- divisor
mov ecx,LOWORD(DVSR)
mov edx,HIWORD(DVND) ; edx:eax <- dividend
mov eax,LOWORD(DVND)
L5:
shr ebx,1 ; shift divisor right one bit
rcr ecx,1
shr edx,1 ; shift dividend right one bit
rcr eax,1
- or ebx,ebx
+ test ebx,ebx
jnz short L5 ; loop until divisor < 4194304K
div ecx ; now divide, ignore remainder
- mov esi,eax ; save quotient
+ mov ebx,eax
;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;
- mul dword ptr HIWORD(DVSR) ; QUOT * HIWORD(DVSR)
- mov ecx,eax
+ mov ecx,HIWORD(DVSR)
+ imul ecx,ebx
mov eax,LOWORD(DVSR)
- mul esi ; QUOT * LOWORD(DVSR)
+ mul ebx
add edx,ecx ; EDX:EAX = QUOT * DVSR
jc short L6 ; carry means Quotient is off by 1
;
; do long compare here between original dividend and the result of the
; multiply in edx:eax. If original is larger or equal, we are ok, otherwise
; subtract one (1) from the quotient.
;
- cmp edx,HIWORD(DVND) ; compare hi words of result and original
- ja short L6 ; if result > original, do subtract
- jb short L7 ; if result < original, we are ok
cmp eax,LOWORD(DVND) ; hi words are equal, compare lo words
+ sbb edx,HIWORD(DVND)
jbe short L7 ; if less or equal we are ok, else subtract
L6:
- dec esi ; subtract 1 from quotient
+ dec ebx
L7:
xor edx,edx ; edx:eax <- quotient
- mov eax,esi
+ mov eax,ebx
;
; Just the cleanup left to do. edx:eax contains the quotient. Set the sign
; according to the save value, cleanup the stack, and return.
;
L4:
dec edi ; check to see if result is negative
jnz short L8 ; if EDI == 0, result should be negative
neg edx ; otherwise, negate the result
neg eax
sbb edx,0
;
; Restore the saved registers and return.
;
L8:
pop ebx
- pop esi
pop edi
ret 16
_alldiv ENDP
end
In the listing above, lines prefixed with - are deleted and lines prefixed with + are inserted.
With 70 instructions in 170 bytes (plus 6 bytes for alignment), this routine has several major and minor flaws: 3 major flaws on all kinds of processors, and 4 more only on processors which feature speculative execution!
Note: unlike the IDIV instruction, which raises a divide error (#DE) exception when dividing −2⁶³, the smallest signed 64-bit integer, by −1, this routine just returns the (wrong) quotient −2⁶³!
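Why the quotient wraps instead of raising an exception can be seen from a small C model. This is a sketch under assumptions, not Microsoft's code: the name sdiv_model is hypothetical, and it assumes (as virtually all current compilers do) that converting an out-of-range uint64_t value to int64_t wraps modulo 2**64. The model divides the unsigned magnitudes and negates the quotient afterwards, just like the routine does:

```c
#include <stdint.h>

/* Hypothetical model (name is an assumption): signed 64-bit division
   carried out on unsigned magnitudes; the divisor must not be 0. */
int64_t sdiv_model(int64_t dividend, int64_t divisor)
{
    uint64_t n = (uint64_t)dividend;
    uint64_t d = (uint64_t)divisor;
    int negative = 0;

    if (dividend < 0) { n = 0 - n; negative ^= 1; }   /* n = |dividend| */
    if (divisor  < 0) { d = 0 - d; negative ^= 1; }   /* d = |divisor|  */

    uint64_t q = n / d;          /* unsigned division can not overflow */
    if (negative)
        q = 0 - q;

    /* |-2**63| / |-1| = 2**63 does not fit into a signed 64-bit integer
       and is reinterpreted as -2**63 (implementation-defined conversion) */
    return (int64_t)q;
}
```

With this model, sdiv_model(INT64_MIN, -1) yields INT64_MIN, whereas the native expression INT64_MIN / -1 is undefined behaviour in C.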
OOPS¹: instead of the 4 deleted
OR
instructions which
perform superfluous writes, the 4 inserted
TEST
instructions should be
used.
OOPS²: instead of the deleted
first widening
MUL
instruction and the following deleted
MOV
instruction, the inserted
MOV
instruction loading the high part of
the divisor into register ECX
followed by the
inserted faster
IMUL
instruction should be
used.
OUCH¹: instead of register ESI
register EBX
should be used, saving a pair of
PUSH
and POP
instructions
and 2 bytes!
OUCH²: for divisors less than 2³² and a dividend less than 2³²×divisor, i.e. if the quotient is less than 2³², instead of the long alias schoolbook division performed with the 2 highlighted chained DIV instructions (each slower than a mispredicted conditional branch) after the conditional branch to label L3:, a single DIV instruction is sufficient, saving about 40 to 240 processor cycles!
OUCH³: instead of the highlighted (brain)dead slow loop with 2 pairs of SHR and RCR instructions after label L5:, 2 pairs of SHRD and SHR instructions with their shift count determined by a BSR instruction should be used!
Note: this
BSR
instruction would also
replace the deleted
OR
instruction
respectively the inserted
TEST
instruction after label
L2:
.
OUCH⁴: on processors which feature
speculative execution, instead of the 3 highlighted
conditional branches to the labels L1:
,
L2:
and L8:
, which are
slow when mispredicted, and the following
NEG
plus
SBB
instructions to negate the arguments as well as the quotient, a
branchless and thus faster instruction sequence should be used!
OUCH⁵: on processors which feature speculative execution, instead of the 2 CMP instructions and the 3 conditional branches before label L6:, which are slow when mispredicted, a faster instruction sequence with fewer or no conditional branches should be used!
Note: with the modifications shown in the source, this routine has 66 instructions in 164 bytes (plus 12 bytes for alignment).
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
; MSC internal _alldiv():
; receives arguments on stack, returns quotient in edx:eax
; NOTE: _alldiv() can raise 'division by zero' exception; it does
; not raise 'integer overflow' exception on quotient overflow,
; but returns ±2**63 for -2**63 / -1!
_alldiv proc public ; sqword _alldiv(sqword dividend, sqword divisor)
xor ecx, ecx ; ecx = sign of quotient = 0
; determine sign of dividend and compute |dividend|
mov edx, [esp+8] ; edx = high dword of dividend
test edx, edx
jns short @f ; (high dword of) dividend >= 0?
mov eax, [esp+4] ; eax = low dword of dividend
neg edx
neg eax
sbb edx, ecx ; edx:eax = -dividend = |dividend|
mov [esp+4], eax
mov [esp+8], edx ; write |dividend| back on stack
dec ecx ; ecx = sign of dividend = -1
@@:
; determine sign of divisor and compute |divisor|
mov eax, [esp+12]
mov edx, [esp+16] ; edx:eax = divisor
test edx, edx
jns short @f ; (high dword of) divisor >= 0?
neg edx
neg eax
sbb edx, 0 ; edx:eax = -divisor = |divisor|
mov [esp+12], eax
mov [esp+16], edx ; write |divisor| back on stack
not ecx ; ecx = sign of dividend
; ^ sign of divisor
; = sign of quotient
@@:
push ecx ; [esp] = (quotient < 0) ? -1 : 0
mov ecx, [esp+12] ; ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+12] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
mov eax, [esp+8] ; edx:eax = dividend
div ecx ; eax = (low dword of) quotient,
; edx = (low dword of) remainder
;; xor edx, edx ; edx:eax = |quotient|
jmp short quotient
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
push eax ; [esp] = high dword of quotient
mov eax, [esp+12] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
pop edx ; edx:eax = |quotient|
pop ecx ; ecx = (quotient < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx
sub eax, ecx
sbb edx, ecx ; edx:eax = quotient
ret 16 ; callee restores stack
; dividend < divisor: quotient = 0
trivial:
pop ecx ; ecx = (quotient < 0) ? -1 : 0
xor eax, eax
cdq ; edx:eax = quotient = 0
ret 16 ; callee restores stack
; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor = 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
push ebx
mov ebx, edx ; ebx = divisor'
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; division overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
mov eax, [esp+16] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+20] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
mov ecx, [esp+24] ; ecx = high dword of divisor
imul ecx, ebx ; ecx = high dword of divisor * quotient"
add edx, ecx ; edx:eax = divisor * quotient"
jc short @f ; divisor * quotient" >= 2**64?
mov ecx, [esp+16] ; ecx = high dword of dividend
cmp [esp+12], eax
sbb ecx, edx ; CF = (dividend < divisor * quotient")
; = (remainder" < 0)
@@:
sbb eax, eax ; eax = (quotient < quotient") ? -1 : 0
add eax, ebx ; eax = quotient" - (remainder" < 0)
; = (low dword of) |quotient|
;; xor edx, edx ; edx:eax = |quotient|
pop ebx
quotient:
pop edx ; edx = (quotient < 0) ? -1 : 0
xor eax, edx
sub eax, edx
sbb edx, edx ; edx:eax = quotient
ret 16 ; callee restores stack
; dividend = divisor = -2**63: quotient = 1
special:
pop eax ; eax = sign of quotient = 0
inc eax ; eax = (low dword of) quotient = 1
cdq ; edx:eax = quotient = 1
ret 16 ; callee restores stack
_alldiv endp
end
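The "extended & adjusted" path of the routine above, taken for divisors of 2**32 or more, can be modelled in C. This is only a sketch under assumptions: the name udiv_extended_sketch is hypothetical, the leading zero count is found with a plain loop instead of BSR, and the native 64-bit division by the 32-bit value dprime stands in for the DIV instruction plus the preceding subtraction that prevents its overflow:

```c
#include <stdint.h>

/* Hypothetical model (name is an assumption): unsigned 64-bit division
   for a divisor >= 2**32, i.e. with a non-zero high dword. */
uint64_t udiv_extended_sketch(uint64_t dividend, uint64_t divisor)
{
    /* number of leading '0' bits of the divisor
       (the assembler uses BSR followed by XOR with 31) */
    unsigned shift = 0;
    while (((divisor << shift) >> 63) == 0)
        shift++;

    /* divisor' = upper 32 bits of the normalised divisor */
    uint32_t dprime = (uint32_t)((divisor << shift) >> 32);

    /* quotient' = dividend / divisor' needs up to 33 bits;
       quotient" = quotient' / 2**(32 - shift) fits into 32 bits */
    uint64_t q = (dividend / dprime) >> (32 - shift);

    /* quotient" is exact or 1 too large: multiply back and adjust */
    uint64_t product = q * divisor;
    int overflow = (q != 0) && (product / q != divisor);
    if (overflow || product > dividend)
        q--;

    return q;
}
```

The estimate can be too large because divisor' drops the low bits of the divisor; the multiplication and comparison at the end correspond to the MUL, IMUL, CMP and SBB sequence of the assembler source. The sketch must not be called with a divisor below 2**32.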
A proper (and several times faster) implementation for processors
which feature speculative execution, minimising the number of
(mispredictable) conditional branches, uses 86 instructions in 208
bytes, including 13 instructions in 29 bytes for the special and
trivial cases not covered by Microsoft’s
poor implementation:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
; MSC internal _alldiv():
; receives arguments on stack, returns quotient in edx:eax
; NOTE: _alldiv() can raise 'division by zero' exception; it does
; not raise 'integer overflow' exception on quotient overflow,
; but returns ±2**63 for -2**63 / -1!
_alldiv proc public ; sqword _alldiv(sqword dividend, sqword divisor)
; determine sign of dividend and compute |dividend|
mov eax, [esp+8]
mov ecx, [esp+4] ; eax:ecx = dividend
cdq ; edx = (dividend < 0) ? -1 : 0
xor ecx, edx
xor eax, edx ; eax:ecx = (dividend < 0) ? ~dividend : dividend
sub ecx, edx
sbb eax, edx ; eax:ecx = (dividend < 0) ? -dividend : dividend
; = |dividend|
mov [esp+4], ecx ; write |dividend| back on stack
mov [esp+8], eax
push edx ; [esp] = (dividend < 0) ? -1 : 0
; determine sign of divisor and compute |divisor|
mov edx, [esp+20]
mov eax, [esp+16] ; edx:eax = divisor
mov ecx, edx
sar ecx, 31 ; ecx = (divisor < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx ; edx:eax = (divisor < 0) ? ~divisor : divisor
sub eax, ecx
sbb edx, ecx ; edx:eax = (divisor < 0) ? -divisor : divisor
; = |divisor|
mov [esp+16], eax ; write |divisor| back on stack
mov [esp+20], edx
xor [esp], ecx ; [esp] = (dividend < 0) ^ (divisor < 0) ? -1 : 0
; = (quotient < 0) ? -1 : 0
mov ecx, [esp+12] ; ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+12] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
mov eax, [esp+8] ; edx:eax = dividend
div ecx ; eax = (low dword of) quotient,
; edx = (low dword of) remainder
;; xor edx, edx ; edx:eax = |quotient|
jmp short quotient
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
push eax ; [esp] = high dword of quotient
mov eax, [esp+12] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
pop edx ; edx:eax = |quotient|
pop ecx ; ecx = (quotient < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx
sub eax, ecx
sbb edx, ecx ; edx:eax = quotient
ret 16 ; callee restores stack
; dividend < divisor: quotient = 0
trivial:
pop ecx ; ecx = (quotient < 0) ? -1 : 0
xor eax, eax
xor edx, edx ; edx:eax = quotient = 0
ret 16 ; callee restores stack
; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor = 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
push ebx
mov ebx, edx ; ebx = divisor'
ifndef JCCLESS
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; division overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JCCLESS
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JCCLESS
mov eax, [esp+16] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+20] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
mov ecx, [esp+24] ; ecx = high dword of divisor
imul ecx, ebx ; ecx = high dword of divisor * quotient"
add edx, ecx ; edx:eax = divisor * quotient"
mov ecx, [esp+16] ; ecx = high dword of dividend
cmp [esp+12], eax
sbb ecx, edx ; CF = (dividend < divisor * quotient")
; = (remainder" < 0)
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
add eax, ebx ; eax = quotient" - (remainder" < 0)
; = (low dword of) |quotient|
;; xor edx, edx ; edx:eax = |quotient|
pop ebx
quotient:
pop edx ; edx = (quotient < 0) ? -1 : 0
xor eax, edx
sub eax, edx
sbb edx, edx ; edx:eax = quotient
ret 16 ; callee restores stack
; dividend = divisor = -2**63: quotient = 1
special:
pop eax ; eax = sign of quotient = 0
inc eax ; eax = (low dword of) quotient = 1
xor edx, edx ; edx:eax = quotient = 1
ret 16 ; callee restores stack
_alldiv endp
end
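The branch-free sign handling used at the start and the end of the routine above can be expressed in C as well. This is a minimal sketch; the names abs64_branchless and apply_sign_branchless are hypothetical. Where the assembler source obtains the mask with CDQ or SAR, the sketch derives it from a comparison, since right shifts of negative values are implementation-defined in C:

```c
#include <stdint.h>

/* Hypothetical helper: |x| as unsigned value, without conditional branch */
uint64_t abs64_branchless(int64_t x)
{
    uint64_t mask = 0 - (uint64_t)(x < 0);    /* (x < 0) ? all ones : 0 */
    return ((uint64_t)x ^ mask) - mask;       /* (x < 0) ? ~x + 1 : x   */
}

/* Hypothetical helper: negate the magnitude if 'negative' is non-zero */
int64_t apply_sign_branchless(uint64_t magnitude, int negative)
{
    uint64_t mask = 0 - (uint64_t)(negative != 0);
    return (int64_t)((magnitude ^ mask) - mask);
}
```

Both helpers use the same identity as the XOR, SUB and SBB sequences: exclusive-or with a mask of all ones inverts the value, and subtracting the mask (−1) adds 1, which together yield the two's complement negation.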
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as
alldiv.asm
in the directory where you created the
object library i386.lib
before, then execute the
following 2 command lines to generate the object file
alldiv.obj
and add it to the existing object library
i386.lib
:
ML.EXE alldiv.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib alldiv.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: alldiv.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
_alldvrm Routine
Execute the following 2 command lines to display the content of the assembler source file lldvrm.asm shipped with the Visual C compiler:
DIR "%source%\intel\lldvrm.asm"
TYPE "%source%\intel\lldvrm.asm"
Volume in drive C has no label.
Volume Serial Number is 1957-0427
Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011 03:40 PM 8,557 lldvrm.asm
1 File(s) 8,557 bytes
0 Dir(s) 9,876,543,210 bytes free
title lldvrm - signed long divide and remainder routine
;***
;lldvrm.asm - signed long divide and remainder routine
;
; Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
; defines the signed long divide and remainder routine
; __alldvrm
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;lldvrm - signed long divide and remainder
;
;Purpose:
; Does a signed long divide and remainder of the arguments. Arguments are
; not changed.
;
;Entry:
; Arguments are passed on the stack:
; 1st pushed: divisor (QWORD)
; 2nd pushed: dividend (QWORD)
;
;Exit:
; EDX:EAX contains the quotient (dividend/divisor)
; EBX:ECX contains the remainder (divided % divisor)
; NOTE: this routine removes the parameters from the stack.
;
;Uses:
; ECX
;
;Exceptions:
;
;*******************************************************************************
CODESEG
_alldvrm PROC NEAR
.FPO (3, 4, 0, 0, 1, 0)
push edi
push esi
push ebp
; Set up the local stack and save the index registers. When this is done
; the stack frame will look as follows (assuming that the expression a/b will
; generate a call to alldvrm(a, b)):
;
; -----------------
; | |
; |---------------|
; | |
; |--divisor (b)--|
; | |
; |---------------|
; | |
; |--dividend (a)-|
; | |
; |---------------|
; | return addr** |
; |---------------|
; | EDI |
; |---------------|
; | ESI |
; |---------------|
; ESP---->| EBP |
; -----------------
;
DVND equ [esp + 16] ; stack address of dividend (a)
DVSR equ [esp + 24] ; stack address of divisor (b)
; Determine sign of the quotient (edi = 0 if result is positive, non-zero
; otherwise) and make operands positive.
; Sign of the remainder is kept in ebp.
xor edi,edi ; result sign assumed positive
xor ebp,ebp ; result sign assumed positive
mov eax,HIWORD(DVND) ; hi word of a
or eax,eax ; test to see if signed
jge short L1 ; skip rest if a is already positive
inc edi ; complement result sign flag
inc ebp ; complement result sign flag
mov edx,LOWORD(DVND) ; lo word of a
neg eax ; make a positive
neg edx
sbb eax,0
mov HIWORD(DVND),eax ; save positive value
mov LOWORD(DVND),edx
L1:
mov eax,HIWORD(DVSR) ; hi word of b
or eax,eax ; test to see if signed
jge short L2 ; skip rest if b is already positive
inc edi ; complement the result sign flag
mov edx,LOWORD(DVSR) ; lo word of a
neg eax ; make b positive
neg edx
sbb eax,0
mov HIWORD(DVSR),eax ; save positive value
mov LOWORD(DVSR),edx
L2:
;
; Now do the divide. First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
; NOTE - eax currently contains the high order word of DVSR
;
or eax,eax ; check to see if divisor < 4194304K
jnz short L3 ; nope, gotta do this the hard way
mov ecx,LOWORD(DVSR) ; load divisor
mov eax,HIWORD(DVND) ; load high word of dividend
xor edx,edx
div ecx ; eax <- high order bits of quotient
mov ebx,eax ; save high bits of quotient
mov eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
div ecx ; eax <- low order bits of quotient
mov esi,eax ; ebx:esi <- quotient
;
; Now we need to do a multiply so that we can compute the remainder.
;
mov eax,ebx ; set up high word of quotient
mul dword ptr LOWORD(DVSR) ; HIWORD(QUOT) * DVSR
mov ecx,eax ; save the result in ecx
mov eax,esi ; set up low word of quotient
mul dword ptr LOWORD(DVSR) ; LOWORD(QUOT) * DVSR
add edx,ecx ; EDX:EAX = QUOT * DVSR
jmp short L4 ; complete remainder calculation
;
; Here we do it the hard way. Remember, eax contains the high word of DVSR
;
L3:
mov ebx,eax ; ebx:ecx <- divisor
mov ecx,LOWORD(DVSR)
mov edx,HIWORD(DVND) ; edx:eax <- dividend
mov eax,LOWORD(DVND)
L5:
shr ebx,1 ; shift divisor right one bit
rcr ecx,1
shr edx,1 ; shift dividend right one bit
rcr eax,1
or ebx,ebx
jnz short L5 ; loop until divisor < 4194304K
div ecx ; now divide, ignore remainder
mov esi,eax ; save quotient
;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;
mul dword ptr HIWORD(DVSR) ; QUOT * HIWORD(DVSR)
mov ecx,eax
mov eax,LOWORD(DVSR)
mul esi ; QUOT * LOWORD(DVSR)
add edx,ecx ; EDX:EAX = QUOT * DVSR
jc short L6 ; carry means Quotient is off by 1
;
; do long compare here between original dividend and the result of the
; multiply in edx:eax. If original is larger or equal, we are ok, otherwise
; subtract one (1) from the quotient.
;
cmp edx,HIWORD(DVND) ; compare hi words of result and original
ja short L6 ; if result > original, do subtract
jb short L7 ; if result < original, we are ok
cmp eax,LOWORD(DVND) ; hi words are equal, compare lo words
jbe short L7 ; if less or equal we are ok, else subtract
L6:
dec esi ; subtract 1 from quotient
sub eax,LOWORD(DVSR) ; subtract divisor from result
sbb edx,HIWORD(DVSR)
L7:
xor ebx,ebx ; ebx:esi <- quotient
L4:
;
; Calculate remainder by subtracting the result from the original dividend.
; Since the result is already in a register, we will do the subtract in the
; opposite direction and negate the result if necessary.
;
sub eax,LOWORD(DVND) ; subtract dividend from result
sbb edx,HIWORD(DVND)
;
; Now check the result sign flag to see if the result is supposed to be positive
; or negative. It is currently negated (because we subtracted in the 'wrong'
; direction), so if the sign flag is set we are done, otherwise we must negate
; the result to make it positive again.
;
dec ebp ; check result sign flag
jns short L9 ; result is ok, set up the quotient
neg edx ; otherwise, negate the result
neg eax
sbb edx,0
;
; Now we need to get the quotient into edx:eax and the remainder into ebx:ecx.
;
L9:
mov ecx,edx
mov edx,ebx
mov ebx,ecx
mov ecx,eax
mov eax,esi
;
; Just the cleanup left to do. edx:eax contains the quotient. Set the sign
; according to the save value, cleanup the stack, and return.
;
dec edi ; check to see if result is negative
jnz short L8 ; if EDI == 0, result should be negative
neg edx ; otherwise, negate the result
neg eax
sbb edx,0
;
; Restore the saved registers and return.
;
L8:
pop ebp
pop esi
pop edi
ret 16
_alldvrm ENDP
end
91 instructions in 223 bytes (plus 1 byte for alignment).
OUCH: the highlighted comment with the following
code is a remarkable gem – the remainder is already present
in register EDX
!
Note: unlike the IDIV instruction, which raises a divide error (#DE) exception when dividing −2⁶³, the smallest signed 64-bit integer, by −1, this routine just returns the (wrong) quotient −2⁶³ and the (correct) remainder 0, i.e. the only integer smaller in magnitude than the divisor −1!
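That the remainder needs no multiplication and subtraction for divisors less than 2**32 can be illustrated with a C model of the long alias schoolbook division. This is only a sketch; the name udivrem_long_sketch is hypothetical, and each pair of / and % operators stands for one DIV instruction, which delivers the quotient in EAX and the remainder in EDX:

```c
#include <stdint.h>

/* Hypothetical model (name is an assumption): 64-bit by 32-bit
   "schoolbook" division; the divisor must not be 0. */
uint64_t udivrem_long_sketch(uint64_t dividend, uint32_t divisor,
                             uint32_t *remainder)
{
    uint32_t hi = (uint32_t)(dividend >> 32);
    uint32_t lo = (uint32_t)dividend;

    /* first DIV: 0:hi / divisor */
    uint32_t qhi = hi / divisor;
    uint32_t rhi = hi % divisor;

    /* second DIV: rhi:lo / divisor; can not overflow since rhi < divisor */
    uint64_t rest = ((uint64_t)rhi << 32) | lo;
    uint32_t qlo = (uint32_t)(rest / divisor);

    /* the remainder of the second DIV is already the final remainder */
    *remainder = (uint32_t)(rest % divisor);

    return ((uint64_t)qhi << 32) | qlo;
}
```

The remainder is always less than the 32-bit divisor, so its high dword is 0 and only its sign has to be applied afterwards.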
A proper implementation, for processors which feature speculative execution when the macro JCCLESS is defined, else for processors which don’t feature speculative execution, uses 111 instructions in 268 bytes (plus 4 bytes for alignment) respectively 108 instructions in 260 bytes (plus 12 bytes for alignment), including 22 instructions in 50 bytes for the special and trivial cases not covered by Microsoft’s poor implementation:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
; MSC internal _alldvrm():
; receives arguments on stack, returns quotient in edx:eax and remainder in ebx:ecx
; NOTE: _alldvrm() can raise 'division by zero' exception; it does
; not raise 'integer overflow' exception on quotient overflow,
; but returns ±2**63 for -2**63 / -1 and 0 for -2**63 % -1!
_alldvrm proc public ; sqword _alldvrm(sqword dividend, sqword divisor)
; determine sign of dividend and compute |dividend|
mov edx, [esp+8]
mov eax, [esp+4] ; edx:eax = dividend
mov ebx, edx
sar ebx, 31 ; ebx = (dividend < 0) ? -1 : 0
; = (remainder < 0) ? -1 : 0
xor eax, ebx
xor edx, ebx ; edx:eax = (dividend < 0) ? ~dividend : dividend
sub eax, ebx
sbb edx, ebx ; edx:eax = (dividend < 0) ? -dividend : dividend
; = |dividend|
mov [esp+4], eax ; write |dividend| back on stack
mov [esp+8], edx
; determine sign of divisor and compute |divisor|
mov edx, [esp+16]
mov eax, [esp+12] ; edx:eax = divisor
mov ecx, edx
sar ecx, 31 ; ecx = (divisor < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx ; edx:eax = (divisor < 0) ? ~divisor : divisor
sub eax, ecx
sbb edx, ecx ; edx:eax = (divisor < 0) ? -divisor : divisor
; = |divisor|
mov [esp+12], eax ; write |divisor| back on stack
mov [esp+16], edx
xor ecx, ebx ; ecx = (divisor < 0) ^ (dividend < 0) ? -1 : 0
; = (quotient < 0) ? -1 : 0
push ecx ; [esp] = (quotient < 0) ? -1 : 0
push ebx ; [esp] = (remainder < 0) ? -1 : 0
mov ecx, [esp+16] ; ecx = high dword of dividend
cmp [esp+12], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+16] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
xor ebx, ebx ; ebx = high dword of quotient = 0
jmp short next
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
mov ebx, eax ; ebx = high dword of quotient
next:
mov eax, [esp+12] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
mov ecx, edx ; ecx = (low dword of) |remainder|
mov edx, ebx ; edx:eax = |quotient|
;; xor ebx, ebx ; ebx:ecx = |remainder|
if 0
mov ebx, [esp+4] ; ebx = (quotient < 0) ? -1 : 0
xor eax, ebx
xor edx, ebx
sub eax, ebx
sbb edx, ebx ; edx:eax = quotient
pop ebx ; ebx = (remainder < 0) ? -1 : 0
xor ecx, ebx
sub ecx, ebx
sbb ebx, ebx ; ebx:ecx = remainder
else
pop ebx ; ebx = (remainder < 0) ? -1 : 0
xor ecx, ebx
sub ecx, ebx
sbb ebx, ebx ; ebx:ecx = remainder
xor eax, [esp]
xor edx, [esp]
sub eax, [esp]
sbb edx, [esp] ; edx:eax = quotient
endif
add esp, 4
ret 16 ; callee restores stack
; dividend < divisor: quotient = 0, remainder = dividend
trivial:
pop eax ; eax = (remainder < 0) ? -1 : 0
mov ecx, [esp+8]
mov ebx, [esp+12] ; ebx:ecx = |remainder| = |dividend|
xor ecx, eax
xor ebx, eax
sub ecx, eax
sbb ebx, eax ; ebx:ecx = remainder
pop edx ; edx = (quotient < 0) ? -1 : 0
xor eax, eax
xor edx, edx ; edx:eax = quotient = 0
ret 16 ; callee restores stack
; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor = 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
mov ebx, edx ; ebx = divisor'
ifndef JCCLESS
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; division overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JCCLESS
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JCCLESS
mov eax, [esp+16] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
push ebx ; [esp] = quotient"
mov eax, [esp+24] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
imul ebx, [esp+28] ; ebx = high dword of divisor * quotient"
add edx, ebx ; edx:eax = divisor * quotient"
mov ecx, [esp+16]
mov ebx, [esp+20] ; ebx:ecx = dividend
sub ecx, eax
sbb ebx, edx ; ebx:ecx = dividend - divisor * quotient"
; = remainder"
ifndef JCCLESS
pop eax ; eax = quotient"
jnb short @f ; remainder" >= 0?
; (with borrow, it is off by divisor,
; and quotient" is off by 1)
add ecx, [esp+20]
adc ebx, [esp+24] ; ebx:ecx = remainder" + divisor
; = |remainder|
dec eax ; eax = quotient" - 1
; = low dword of |quotient|
@@:
else ; JCCLESS
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
cdq ; edx = (remainder" < 0) ? -1 : 0
add [esp], eax ; [esp] = quotient" - (remainder" < 0)
; = (low dword of) |quotient|
and eax, [esp+24]
and edx, [esp+28] ; edx:eax = (remainder" < 0) ? divisor : 0
add ecx, eax
adc ebx, edx ; ebx:ecx = remainder" + divisor
; = |remainder|
pop eax ; eax = (low dword of) |quotient|
endif ; JCCLESS
;; xor edx, edx ; edx:eax = |quotient|
pop edx ; edx = (remainder < 0) ? -1 : 0
xor ecx, edx
xor ebx, edx
sub ecx, edx
sbb ebx, edx ; ebx:ecx = remainder
pop edx ; edx = (quotient < 0) ? -1 : 0
xor eax, edx
sub eax, edx
sbb edx, edx ; edx:eax = quotient
ret 16 ; callee restores stack
; dividend = divisor = -2**63: quotient = 1, remainder = 0
special:
pop ebx ; ebx = sign of remainder = -1
inc ebx
;; xor ecx, ecx ; ebx:ecx = remainder = 0
pop eax ; eax = sign of quotient = 0
inc eax ; eax = (low dword of) quotient = 1
xor edx, edx ; edx:eax = quotient = 1
ret 16 ; callee restores stack
_alldvrm endp
end
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as alldvrm.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file alldvrm.obj and add it to the existing object library i386.lib:
ML.EXE alldvrm.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib alldvrm.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: alldvrm.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
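Note: the sign handling of the _alldvrm routine presented above can be modelled in C. The following is a minimal sketch, not the routine itself: the names sdivrem_t, magnitude() and model_alldvrm() are made up for illustration, and the conversion of an out-of-range unsigned value to int64_t is assumed to wrap around as on two's complement machines.

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical result type, for illustration only */
typedef struct {
    int64_t quotient;
    int64_t remainder;
} sdivrem_t;

/* |value| as unsigned integer; *mask receives -1 for negative values,
   else 0, like the SAR instruction applied to the high dword,
   and the XOR/SUB/SBB sequence computes ~value + 1 when mask = -1 */
static uint64_t magnitude(int64_t value, uint64_t *mask)
{
    *mask = (value < 0) ? UINT64_MAX : 0;
    return ((uint64_t) value ^ *mask) - *mask;
}

/* divide the magnitudes, then apply the signs;
   the divisor must not be 0 */
static sdivrem_t model_alldvrm(int64_t dividend, int64_t divisor)
{
    uint64_t rsign, dsign, qsign, n, d, q, r;
    sdivrem_t result;

    n = magnitude(dividend, &rsign);    /* remainder has sign of dividend */
    d = magnitude(divisor, &dsign);
    qsign = rsign ^ dsign;              /* quotient < 0 if signs differ */
    q = n / d;
    r = n % d;
    result.quotient = (int64_t) ((q ^ qsign) - qsign);
    result.remainder = (int64_t) ((r ^ rsign) - rsign);
    return result;
}
```

Under these assumptions model_alldvrm(INT64_MIN, -1) yields the quotient INT64_MIN and the remainder 0 without raising an exception, matching the note in the header of the assembler source.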
_allrem Routine
Execute the following 2 command lines to display the content of the assembler source file llrem.asm shipped with the Visual C compiler:
DIR "%source%\intel\llrem.asm"
TYPE "%source%\intel\llrem.asm"
Volume in drive C has no label.
Volume Serial Number is 1957-0427
Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011 03:40 PM 7,067 llrem.asm
1 File(s) 7,067 bytes
0 Dir(s) 9,876,543,210 bytes free
title llrem - signed long remainder routine
;***
;llrem.asm - signed long remainder routine
;
; Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
; defines the signed long remainder routine
; __allrem
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;llrem - signed long remainder
;
;Purpose:
; Does a signed long remainder of the arguments. Arguments are
; not changed.
;
;Entry:
; Arguments are passed on the stack:
; 1st pushed: divisor (QWORD)
; 2nd pushed: dividend (QWORD)
;
;Exit:
; EDX:EAX contains the remainder (dividend%divisor)
; NOTE: this routine removes the parameters from the stack.
;
;Uses:
; ECX
;
;Exceptions:
;
;*******************************************************************************
CODESEG
_allrem PROC NEAR
.FPO (2, 4, 0, 0, 0, 0)
push ebx
push edi
; Set up the local stack and save the index registers. When this is done
; the stack frame will look as follows (assuming that the expression a%b will
; generate a call to lrem(a, b)):
;
; -----------------
; | |
; |---------------|
; | |
; |--divisor (b)--|
; | |
; |---------------|
; | |
; |--dividend (a)-|
; | |
; |---------------|
; | return addr** |
; |---------------|
; | EBX |
; |---------------|
; ESP---->| EDI |
; -----------------
;
DVND equ [esp + 12] ; stack address of dividend (a)
DVSR equ [esp + 20] ; stack address of divisor (b)
; Determine sign of the result (edi = 0 if result is positive, non-zero
; otherwise) and make operands positive.
xor edi,edi ; result sign assumed positive
mov eax,HIWORD(DVND) ; hi word of a
or eax,eax ; test to see if signed
jge short L1 ; skip rest if a is already positive
inc edi ; complement result sign flag bit
mov edx,LOWORD(DVND) ; lo word of a
neg eax ; make a positive
neg edx
sbb eax,0
mov HIWORD(DVND),eax ; save positive value
mov LOWORD(DVND),edx
L1:
mov eax,HIWORD(DVSR) ; hi word of b
or eax,eax ; test to see if signed
jge short L2 ; skip rest if b is already positive
mov edx,LOWORD(DVSR) ; lo word of b
neg eax ; make b positive
neg edx
sbb eax,0
mov HIWORD(DVSR),eax ; save positive value
mov LOWORD(DVSR),edx
L2:
;
; Now do the divide. First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
; NOTE - eax currently contains the high order word of DVSR
;
or eax,eax ; check to see if divisor < 4194304K
jnz short L3 ; nope, gotta do this the hard way
mov ecx,LOWORD(DVSR) ; load divisor
mov eax,HIWORD(DVND) ; load high word of dividend
xor edx,edx
div ecx ; edx <- remainder
mov eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
div ecx ; edx <- final remainder
mov eax,edx ; edx:eax <- remainder
xor edx,edx
dec edi ; check result sign flag
jns short L4 ; negate result, restore stack and return
jmp short L8 ; result sign ok, restore stack and return
;
; Here we do it the hard way. Remember, eax contains the high word of DVSR
;
L3:
mov ebx,eax ; ebx:ecx <- divisor
mov ecx,LOWORD(DVSR)
mov edx,HIWORD(DVND) ; edx:eax <- dividend
mov eax,LOWORD(DVND)
L5:
shr ebx,1 ; shift divisor right one bit
rcr ecx,1
shr edx,1 ; shift dividend right one bit
rcr eax,1
or ebx,ebx
jnz short L5 ; loop until divisor < 4194304K
div ecx ; now divide, ignore remainder
;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;
mov ecx,eax ; save a copy of quotient in ECX
mul dword ptr HIWORD(DVSR)
xchg ecx,eax ; save product, get quotient in EAX
mul dword ptr LOWORD(DVSR)
add edx,ecx ; EDX:EAX = QUOT * DVSR
jc short L6 ; carry means Quotient is off by 1
;
; do long compare here between original dividend and the result of the
; multiply in edx:eax. If original is larger or equal, we are ok, otherwise
; subtract the original divisor from the result.
;
cmp edx,HIWORD(DVND) ; compare hi words of result and original
ja short L6 ; if result > original, do subtract
jb short L7 ; if result < original, we are ok
cmp eax,LOWORD(DVND) ; hi words are equal, compare lo words
jbe short L7 ; if less or equal we are ok, else subtract
L6:
sub eax,LOWORD(DVSR) ; subtract divisor from result
sbb edx,HIWORD(DVSR)
L7:
;
; Calculate remainder by subtracting the result from the original dividend.
; Since the result is already in a register, we will do the subtract in the
; opposite direction and negate the result if necessary.
;
sub eax,LOWORD(DVND) ; subtract dividend from result
sbb edx,HIWORD(DVND)
;
; Now check the result sign flag to see if the result is supposed to be positive
; or negative. It is currently negated (because we subtracted in the 'wrong'
; direction), so if the sign flag is set we are done, otherwise we must negate
; the result to make it positive again.
;
dec edi ; check result sign flag
jns short L8 ; result is ok, restore stack and return
L4:
neg edx ; otherwise, negate the result
neg eax
sbb edx,0
;
; Just the cleanup left to do. edx:eax contains the quotient.
; Restore the saved registers and return.
;
L8:
pop edi
pop ebx
ret 16
_allrem ENDP
end
69 instructions in 178 bytes (plus 14 bytes for alignment).
Note: unlike the IDIV instruction, which raises a divide error (#DE) exception when dividing −2⁶³, the smallest signed 64-bit integer, by −1, this routine returns the (correct) remainder 0, i.e. the only integer smaller in magnitude than the divisor −1!
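Note: the same pitfall exists in C source, where the expression INT64_MIN % -1 has undefined behaviour because the corresponding quotient is not representable. The following minimal sketch shows one way to obtain the remainder 0 regardless of how the compiler implements the % operator; the function name remainder_no_trap() is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* remainder of a signed 64-bit division which can not overflow:
   every integer is a multiple of -1, so the remainder is always 0
   for this divisor; the divisor must not be 0 */
static int64_t remainder_no_trap(int64_t dividend, int64_t divisor)
{
    if (divisor == -1)
        return 0;
    return dividend % divisor;
}
```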
The following implementation, optimised for processors which feature speculative execution if the symbol JCCLESS is defined, else for processors which don’t feature speculative execution, uses 85 instructions in 213 bytes (plus 11 bytes for alignment) respectively 84 instructions in 211 bytes (plus 13 bytes for alignment), including 12 instructions in 33 bytes for the special and trivial cases not covered by Microsoft’s poor implementation:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
; MSC internal _allrem():
; receives arguments on stack, returns remainder in edx:eax
; NOTE: _allrem() can raise 'division by zero' exception; it does
; not raise 'integer overflow' exception on quotient overflow,
; but returns 0 for -2**63 % -1!
_allrem proc public ; sqword _allrem(sqword dividend, sqword divisor)
; determine sign of dividend and compute |dividend|
mov eax, [esp+8]
mov ecx, [esp+4] ; eax:ecx = dividend
cdq ; edx = (dividend < 0) ? -1 : 0
xor ecx, edx
xor eax, edx ; eax:ecx = (dividend < 0) ? ~dividend : dividend
sub ecx, edx
sbb eax, edx ; eax:ecx = (dividend < 0) ? -dividend : dividend
; = |dividend|
mov [esp+4], ecx ; write |dividend| back on stack
mov [esp+8], eax
push edx ; [esp] = (dividend < 0) ? -1 : 0
; determine sign of divisor and compute |divisor|
mov edx, [esp+20]
mov eax, [esp+16] ; edx:eax = divisor
mov ecx, edx
sar ecx, 31 ; ecx = (divisor < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx ; edx:eax = (divisor < 0) ? ~divisor : divisor
sub eax, ecx
sbb edx, ecx ; edx:eax = (divisor < 0) ? -divisor : divisor
; = |divisor|
mov [esp+16], eax ; write |divisor| back on stack
mov [esp+20], edx
mov ecx, [esp+12] ; ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+12] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
jmp short next
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
next:
mov eax, [esp+8] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
mov eax, edx ; eax = (low dword of) |remainder|
;; xor edx, edx ; edx:eax = |remainder|
pop edx ; edx = (remainder < 0) ? -1 : 0
xor eax, edx
sub eax, edx
sbb edx, edx ; edx:eax = remainder
ret 16 ; callee restores stack
; dividend < divisor: remainder = dividend
trivial:
mov eax, [esp+8]
mov edx, [esp+12] ; edx:eax = |remainder| = |dividend|
jmp short remainder
; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor = 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
push ebx
mov ebx, edx ; ebx = divisor'
ifndef JCCLESS
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; division overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JCCLESS
mov edx, [esp+16] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JCCLESS
mov eax, [esp+16] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+20] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
imul ebx, [esp+24] ; ebx = high dword of divisor * quotient"
add edx, ebx ; edx:eax = divisor * quotient"
mov ecx, [esp+12]
mov ebx, [esp+16] ; ebx:ecx = dividend
sub ecx, eax
sbb ebx, edx ; ebx:ecx = dividend - divisor * quotient"
; = remainder"
ifndef JCCLESS
jnb short @f ; remainder" >= 0?
; (with borrow, it is off by divisor,
; and quotient" is off by 1)
add ecx, [esp+20]
adc ebx, [esp+24] ; ebx:ecx = remainder" + divisor
; = |remainder|
@@:
mov eax, ecx
mov edx, ebx ; edx:eax = |remainder|
else ; JCCLESS
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
cdq ; edx = (remainder" < 0) ? -1 : 0
and eax, [esp+20]
and edx, [esp+24] ; edx:eax = (remainder" < 0) ? divisor : 0
add eax, ecx
adc edx, ebx ; edx:eax = remainder" + divisor
; = |remainder|
endif ; JCCLESS
pop ebx
remainder:
pop ecx ; ecx = (remainder < 0) ? -1 : 0
xor eax, ecx
xor edx, ecx
sub eax, ecx
sbb edx, ecx ; edx:eax = remainder
ret 16 ; callee restores stack
; dividend = divisor = -2**63: remainder = 0
special:
pop eax ; eax = sign of remainder = -1
inc eax
xor edx, edx ; edx:eax = remainder = 0
ret 16 ; callee restores stack
_allrem endp
end
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as allrem.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file allrem.obj and add it to the existing object library i386.lib:
ML.EXE allrem.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib allrem.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: allrem.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
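Note: the "extended & adjusted" division performed after label extended: in the _allrem routine presented above can be modelled in C. The following is a minimal sketch under the preconditions stated in the assembler source, i.e. 2**63 >= dividend >= divisor >= 2**32; the names leading_zeros32(), model_extended() and agrees_with_native() are made up for illustration, and the loop in leading_zeros32() only stands in for the BSR and XOR instructions. The final correction is written without a conditional branch, like the code assembled when JCCLESS is defined.

```c
#include <assert.h>
#include <stdint.h>

/* number of leading '0' bits; value must not be 0 */
static unsigned leading_zeros32(uint32_t value)
{
    unsigned count = 0;
    while ((value & 0x80000000u) == 0) {
        value <<= 1;
        count++;
    }
    return count;
}

/* returns the quotient and stores the remainder;
   requires 2**63 >= dividend >= divisor >= 2**32 */
static uint64_t model_extended(uint64_t dividend, uint64_t divisor,
                               uint64_t *remainder)
{
    unsigned lz = leading_zeros32((uint32_t) (divisor >> 32));
    uint32_t divisor1;
    uint64_t quotient, product, borrow;

    if (lz == 0) {                      /* special: divisor = 2**63 */
        *remainder = dividend - divisor;
        return 1;
    }
    /* divisor' = upper 32 bits of the normalised divisor (SHLD) */
    divisor1 = (uint32_t) ((divisor << lz) >> 32);
    /* quotient" = (dividend / divisor') shifted back; it is either
       the quotient or the quotient plus 1 */
    quotient = (dividend / divisor1) >> (32 - lz);
    product = divisor * quotient;
    borrow = dividend < product;        /* 1 if quotient" is off by 1 */
    *remainder = dividend - product + (divisor & (0 - borrow));
    return quotient - borrow;
}

/* compares the model against the native operators */
static int agrees_with_native(uint64_t dividend, uint64_t divisor)
{
    uint64_t remainder;
    uint64_t quotient = model_extended(dividend, divisor, &remainder);
    return quotient == dividend / divisor
        && remainder == dividend % divisor;
}
```

The dividend 0x00000003FFFFFFFC with the divisor 0x00000001FFFFFFFF is an example where quotient" is 2 instead of 1, so the correction is applied.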
_aulldiv Routine
The documentation of the _aulldiv compiler helper routine states:
Divides two ULONGLONG integers. For example, to divide two UInt64 values the compiler might generate a call to _aulldiv Routine.
Remarks
_aulldiv Routine is a helper routine for the C compiler. Whether the compiler calls _aulldiv Routine is completely dependent on the optimization set.
OUCH: contrary to the highlighted statement, the Visual C compiler generates calls to the _aulldiv routine unconditionally, independent from any optimisation, when it encounters a division where at least one of its operands is an unsigned 64-bit integer!
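Note: the following minimal C sketch shows two divisions of the kind described above; the function names divide_u64() and divide_mixed() are made up for illustration. According to the statement above, both functions are expected to contain a call to the _aulldiv routine when compiled for the i386 platform; this can be checked in an assembler listing, for example one generated with the compiler option /FA.

```c
#include <assert.h>
#include <stdint.h>

/* both operands are unsigned 64-bit integers */
uint64_t divide_u64(uint64_t dividend, uint64_t divisor)
{
    return dividend / divisor;
}

/* only the dividend is an unsigned 64-bit integer; the divisor
   is converted to 64 bits before the division */
uint64_t divide_mixed(uint64_t dividend, uint32_t divisor)
{
    return dividend / divisor;
}
```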
Execute the following 2 command lines to display the content of the assembler source file ulldiv.asm shipped with the Visual C compiler:
DIR "%source%\intel\ulldiv.asm"
TYPE "%source%\intel\ulldiv.asm"
Volume in drive C has no label.
Volume Serial Number is 1957-0427
Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011 03:40 PM 5,079 ulldiv.asm
1 File(s) 5,079 bytes
0 Dir(s) 9,876,543,210 bytes free
title ulldiv - unsigned long divide routine
;***
;ulldiv.asm - unsigned long divide routine
;
; Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
; defines the unsigned long divide routine
; __aulldiv
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;ulldiv - unsigned long divide
;
;Purpose:
; Does a unsigned long divide of the arguments. Arguments are
; not changed.
;
;Entry:
; Arguments are passed on the stack:
; 1st pushed: divisor (QWORD)
; 2nd pushed: dividend (QWORD)
;
;Exit:
; EDX:EAX contains the quotient (dividend/divisor)
; NOTE: this routine removes the parameters from the stack.
;
;Uses:
; ECX
;
;Exceptions:
;
;*******************************************************************************
CODESEG
_aulldiv PROC NEAR
.FPO (2, 4, 0, 0, 0, 0)
push ebx
- push esi
; Set up the local stack and save the index registers. When this is done
; the stack frame will look as follows (assuming that the expression a/b will
; generate a call to uldiv(a, b)):
;
; -----------------
; | |
; |---------------|
; | |
; |--divisor (b)--|
; | |
; |---------------|
; | |
; |--dividend (a)-|
; | |
; |---------------|
; | return addr** |
; |---------------|
; | EBX |
; |---------------|
; ESP---->| ESI |
; -----------------
;
- DVND equ [esp + 12] ; stack address of dividend (a)
- DVSR equ [esp + 20] ; stack address of divisor (b)
+ DVND equ [esp + 8]
+ DVSR equ [esp + 16]
;
; Now do the divide. First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
- mov eax,HIWORD(DVSR) ; check to see if divisor < 4194304K
- or eax,eax
+ mov edx,HIWORD(DVSR)
+ test edx,edx
jnz short L1 ; nope, gotta do this the hard way
mov ecx,LOWORD(DVSR) ; load divisor
mov eax,HIWORD(DVND) ; load high word of dividend
- xor edx,edx
div ecx ; get high order bits of quotient
mov ebx,eax ; save high bits of quotient
mov eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
div ecx ; get low order bits of quotient
mov edx,ebx ; edx:eax <- quotient hi:quotient lo
jmp short L2 ; restore stack and return
;
; Here we do it the hard way. Remember, eax contains DVSRHI
;
L1:
- mov ecx,eax ; ecx:ebx <- divisor
+ mov ecx,edx
mov ebx,LOWORD(DVSR)
mov edx,HIWORD(DVND) ; edx:eax <- dividend
mov eax,LOWORD(DVND)
L3:
shr ecx,1 ; shift divisor right one bit; hi bit <- 0
rcr ebx,1
shr edx,1 ; shift dividend right one bit; hi bit <- 0
rcr eax,1
- or ecx,ecx
+ test ecx,ecx
jnz short L3 ; loop until divisor < 4194304K
div ebx ; now divide, ignore remainder
- mov esi,eax ; save quotient
+ mov ebx,eax
;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;
- mul dword ptr HIWORD(DVSR) ; QUOT * HIWORD(DVSR)
- mov ecx,eax
+ mov ecx,HIWORD(DVSR)
+ imul ecx,ebx
mov eax,LOWORD(DVSR)
- mul esi ; QUOT * LOWORD(DVSR)
+ mul ebx
add edx,ecx ; EDX:EAX = QUOT * DVSR
jc short L4 ; carry means Quotient is off by 1
;
; do long compare here between original dividend and the result of the
; multiply in edx:eax. If original is larger or equal, we are ok, otherwise
; subtract one (1) from the quotient.
;
- cmp edx,HIWORD(DVND) ; compare hi words of result and original
- ja short L4 ; if result > original, do subtract
- jb short L5 ; if result < original, we are ok
cmp eax,LOWORD(DVND) ; hi words are equal, compare lo words
+ sbb edx,HIWORD(DVND)
jbe short L5 ; if less or equal we are ok, else subtract
L4:
- dec esi ; subtract 1 from quotient
+ dec ebx
L5:
xor edx,edx ; edx:eax <- quotient
- mov eax,esi
+ mov eax,ebx
;
; Just the cleanup left to do. edx:eax contains the quotient.
; Restore the saved registers and return.
;
L2:
- pop esi
pop ebx
ret 16
_aulldiv ENDP
end
With 43 instructions in 104 bytes (plus 8 bytes for alignment), this routine has several major and minor flaws: 3 major flaws on all kinds of processors, and 1 more only on processors which feature speculative execution!
OOPS¹: instead of the 2 deleted OR instructions which perform superfluous write operations, the 2 inserted TEST instructions should be used.
OOPS²: register EDX should be used instead of register EAX before the conditional branch to label L1:, and the following deleted XOR instruction should be removed.
OOPS³: instead of the deleted first widening MUL instruction and the following deleted MOV instruction, the inserted MOV instruction loading the high part of the divisor into register ECX followed by the inserted faster IMUL instruction should be used.
OUCH¹: register EBX should be used instead of register ESI, saving a pair of PUSH and POP instructions and 2 bytes!
OUCH²: for divisors less than 2³² and a dividend less than 2³²×divisor, i.e. if the quotient is less than 2³², instead of the long alias schoolbook division performed with the 2 highlighted chained DIV instructions – each slower than a mispredicted conditional branch – after the conditional branch to label L1:, a single DIV instruction is sufficient, saving about 40 to 240 processor cycles!
OUCH³: instead of the highlighted (brain)dead slow loop with 2 pairs of SHR and RCR instructions after label L3:, 2 pairs of SHRD and SHR instructions with their shift count determined by a BSR instruction should be used!
Note: this BSR instruction would also replace the deleted OR instruction respectively the inserted TEST instruction before the conditional branch to label L1:.
OUCH⁴: on processors which feature speculative execution, instead of the 3 conditional branches before label L4:, which are slow when mispredicted, and the 2 CMP instructions, a faster instruction sequence with fewer or no conditional branches should be used!
Note: with the modifications shown in the source, this routine has 38 instructions in 96 bytes.
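Note: the shift count which the loop after label L3: determines bit by bit is the index of the most significant '1' bit in the high dword of the divisor plus 1. The following minimal C sketch compares both ways to obtain it; the function names shift_count_loop(), bit_scan_reverse() and shift_count_bsr() are made up for illustration, and the loop in bit_scan_reverse() only stands in for the BSR instruction. Both functions require a divisor of at least 2**32.

```c
#include <assert.h>
#include <stdint.h>

/* like the loop with SHR and RCR instructions: shift the divisor
   right one bit at a time until its high dword is 0 */
static unsigned shift_count_loop(uint64_t divisor)
{
    unsigned count = 0;
    do {
        divisor >>= 1;
        count++;
    } while ((divisor >> 32) != 0);
    return count;
}

/* stand-in for the BSR instruction: index of the most
   significant '1' bit; value must not be 0 */
static unsigned bit_scan_reverse(uint32_t value)
{
    unsigned index = 0;
    while (value >>= 1)
        index++;
    return index;
}

/* shift count determined at once from the high dword of the divisor */
static unsigned shift_count_bsr(uint64_t divisor)
{
    return bit_scan_reverse((uint32_t) (divisor >> 32)) + 1;
}
```

With this count, divisor >> count fits into 32 bits, and the dividend is shifted right by the same count before the single division.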
The following implementation, optimised for processors which feature speculative execution if the symbol JCCLESS is defined, else for processors which don’t feature speculative execution, uses 59 instructions in 147 bytes (plus 13 bytes for alignment) respectively 60 instructions in 148 bytes (plus 12 bytes for alignment), including 12 instructions in 30 bytes for the special and trivial cases not covered by Microsoft’s poor implementation:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
; MSC internal _aulldiv():
; receives arguments on stack, returns quotient in edx:eax
; NOTE: _aulldiv() can raise 'division by zero' exception!
_aulldiv proc public ; qword _aulldiv(qword dividend, qword divisor)
mov ecx, [esp+8] ; ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] ; edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+8] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
mov eax, [esp+4] ; edx:eax = dividend
div ecx ; eax = (low dword of) quotient,
; edx = (low dword of) remainder
xor edx, edx ; edx:eax = quotient
ret 16 ; callee restores stack
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
push eax ; [esp] = high dword of quotient
mov eax, [esp+8] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
pop edx ; edx:eax = quotient
ret 16 ; callee restores stack
; dividend < divisor: quotient = 0
trivial:
xor eax, eax
xor edx, edx ; edx:eax = quotient = 0
ret 16 ; callee restores stack
; dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor >= 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
push ebx
mov ebx, edx ; ebx = divisor'
ifndef JCCLESS
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+12] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; division overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JCCLESS
mov edx, [esp+12] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JCCLESS
mov eax, [esp+12] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+16] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
ifndef JCCLESS
mov ecx, [esp+20] ; ecx = high dword of divisor
imul ecx, ebx ; ecx = high dword of divisor * quotient"
add edx, ecx ; edx:eax = divisor * quotient"
jc short @f ; divisor * quotient" >= 2**64?
mov ecx, [esp+12] ; ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx ; CF = (dividend < divisor * quotient")
; = (remainder" < 0)
@@:
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
add eax, ebx ; eax = quotient" - (remainder" < 0)
; = (low dword of) quotient
xor edx, edx ; edx:eax = quotient
else ; JCCLESS
mov ecx, [esp+12] ; ecx = high dword of dividend
cmp [esp+8], eax
sbb ecx, edx ; ecx:... = dividend
; - low dword of divisor * quotient"
mov eax, [esp+20] ; eax = high dword of divisor
imul eax, ebx ; eax = high dword of divisor * quotient"
if 0
sub ecx, eax ; ecx:... = dividend - divisor * quotient"
; = remainder"
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
add eax, ebx ; eax = quotient" - (remainder" < 0)
; = (low dword of) quotient
xor edx, edx ; edx:eax = quotient
else
xor edx, edx ; edx = high dword of quotient = 0
sub ecx, eax ; ecx:... = dividend - divisor * quotient"
; = remainder"
mov eax, ebx ; eax = quotient"
sbb eax, edx ; eax = quotient" - (remainder" < 0)
; = (low dword of) quotient
endif
endif ; JCCLESS
pop ebx
ret 16 ; callee restores stack
; dividend >= divisor >= 2**63: quotient = 1
special:
xor eax, eax
xor edx, edx
inc eax ; edx:eax = quotient = 1
ret 16 ; callee restores stack
_aulldiv endp
end
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as aulldiv.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file aulldiv.obj and add it to the existing object library i386.lib:
ML.EXE aulldiv.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib aulldiv.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: aulldiv.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
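Note: for divisors less than 2**32, the choice between the trivial case, the normal division with a single DIV instruction and the "long" alias "schoolbook" division with 2 chained DIV instructions made in the _aulldiv routine presented above can be modelled in C. The following is a minimal sketch; the names div64by32() and model_aulldiv_small() are made up for illustration, and the assertion in div64by32() only stands in for the divide error (#DE) exception which the DIV instruction raises when the quotient does not fit into 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* model of the DIV instruction with a 32-bit divisor: divides
   high:low by divisor; requires high < divisor */
static uint32_t div64by32(uint32_t high, uint32_t low, uint32_t divisor,
                          uint32_t *remainder)
{
    uint64_t dividend = ((uint64_t) high << 32) | low;
    assert(divisor != 0 && high < divisor);
    *remainder = (uint32_t) (dividend % divisor);
    return (uint32_t) (dividend / divisor);
}

/* quotient of a 64-bit dividend and a 32-bit divisor <> 0 */
static uint64_t model_aulldiv_small(uint64_t dividend, uint32_t divisor)
{
    uint32_t high = (uint32_t) (dividend >> 32);
    uint32_t low = (uint32_t) dividend;
    uint32_t quotient_high = 0;
    uint32_t remainder;

    if (dividend < divisor)             /* trivial: quotient = 0 */
        return 0;
    if (high >= divisor) {              /* long: divide high dword first */
        quotient_high = div64by32(0, high, divisor, &remainder);
        high = remainder;               /* now high < divisor */
    }
    /* normal: a single division yields the low dword of the quotient */
    return ((uint64_t) quotient_high << 32)
         | div64by32(high, low, divisor, &remainder);
}
```

The second division is only preceded by the first one when the high dword of the dividend is not less than the divisor, i.e. when the quotient needs more than 32 bits.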
_aulldvrm Routine
Execute the following 2 command lines to display the content of the assembler source file ulldvrm.asm shipped with the Visual C compiler:
DIR "%source%\intel\ulldvrm.asm"
TYPE "%source%\intel\ulldvrm.asm"
Volume in drive C has no label.
Volume Serial Number is 1957-0427
Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011 03:40 PM 6,227 ulldvrm.asm
1 File(s) 6,227 bytes
0 Dir(s) 9,876,543,210 bytes free
title ulldvrm - unsigned long divide and remainder routine
;***
;ulldvrm.asm - unsigned long divide and remainder routine
;
; Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
; defines the unsigned long divide and remainder routine
; __aulldvrm
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;ulldvrm - unsigned long divide and remainder
;
;Purpose:
; Does a unsigned long divide and remainder of the arguments. Arguments
; are not changed.
;
;Entry:
; Arguments are passed on the stack:
; 1st pushed: divisor (QWORD)
; 2nd pushed: dividend (QWORD)
;
;Exit:
; EDX:EAX contains the quotient (dividend/divisor)
; EBX:ECX contains the remainder (divided % divisor)
; NOTE: this routine removes the parameters from the stack.
;
;Uses:
; ECX
;
;Exceptions:
;
;*******************************************************************************
CODESEG
_aulldvrm PROC NEAR
.FPO (1, 4, 0, 0, 0, 0)
push esi
; Set up the local stack and save the index registers. When this is done
; the stack frame will look as follows (assuming that the expression a/b will
; generate a call to aulldvrm(a, b)):
;
; -----------------
; | |
; |---------------|
; | |
; |--divisor (b)--|
; | |
; |---------------|
; | |
; |--dividend (a)-|
; | |
; |---------------|
; | return addr** |
; |---------------|
; ESP---->| ESI |
; -----------------
;
DVND equ [esp + 8] ; stack address of dividend (a)
DVSR equ [esp + 16] ; stack address of divisor (b)
;
; Now do the divide. First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
mov eax,HIWORD(DVSR) ; check to see if divisor < 4194304K
or eax,eax
jnz short L1 ; nope, gotta do this the hard way
mov ecx,LOWORD(DVSR) ; load divisor
mov eax,HIWORD(DVND) ; load high word of dividend
xor edx,edx
div ecx ; get high order bits of quotient
mov ebx,eax ; save high bits of quotient
mov eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
div ecx ; get low order bits of quotient
mov esi,eax ; ebx:esi <- quotient
;
; Now we need to do a multiply so that we can compute the remainder.
;
mov eax,ebx ; set up high word of quotient
mul dword ptr LOWORD(DVSR) ; HIWORD(QUOT) * DVSR
mov ecx,eax ; save the result in ecx
mov eax,esi ; set up low word of quotient
mul dword ptr LOWORD(DVSR) ; LOWORD(QUOT) * DVSR
add edx,ecx ; EDX:EAX = QUOT * DVSR
jmp short L2 ; complete remainder calculation
;
; Here we do it the hard way. Remember, eax contains DVSRHI
;
L1:
mov ecx,eax ; ecx:ebx <- divisor
mov ebx,LOWORD(DVSR)
mov edx,HIWORD(DVND) ; edx:eax <- dividend
mov eax,LOWORD(DVND)
L3:
shr ecx,1 ; shift divisor right one bit; hi bit <- 0
rcr ebx,1
shr edx,1 ; shift dividend right one bit; hi bit <- 0
rcr eax,1
or ecx,ecx
jnz short L3 ; loop until divisor < 4194304K
div ebx ; now divide, ignore remainder
mov esi,eax ; save quotient
;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;
mul dword ptr HIWORD(DVSR) ; QUOT * HIWORD(DVSR)
mov ecx,eax
mov eax,LOWORD(DVSR)
mul esi ; QUOT * LOWORD(DVSR)
add edx,ecx ; EDX:EAX = QUOT * DVSR
jc short L4 ; carry means Quotient is off by 1
;
; do long compare here between original dividend and the result of the
; multiply in edx:eax. If original is larger or equal, we are ok, otherwise
; subtract one (1) from the quotient.
;
cmp edx,HIWORD(DVND) ; compare hi words of result and original
ja short L4 ; if result > original, do subtract
jb short L5 ; if result < original, we are ok
cmp eax,LOWORD(DVND) ; hi words are equal, compare lo words
jbe short L5 ; if less or equal we are ok, else subtract
L4:
dec esi ; subtract 1 from quotient
sub eax,LOWORD(DVSR) ; subtract divisor from result
sbb edx,HIWORD(DVSR)
L5:
xor ebx,ebx ; ebx:esi <- quotient
L2:
;
; Calculate remainder by subtracting the result from the original dividend.
; Since the result is already in a register, we will do the subtract in the
; opposite direction and negate the result.
;
sub eax,LOWORD(DVND) ; subtract dividend from result
sbb edx,HIWORD(DVND)
neg edx ; otherwise, negate the result
neg eax
sbb edx,0
;
; Now we need to get the quotient into edx:eax and the remainder into ebx:ecx.
;
mov ecx,edx
mov edx,ebx
mov ebx,ecx
mov ecx,eax
mov eax,esi
;
; Just the cleanup left to do. edx:eax contains the quotient.
; Restore the saved registers and return.
;
pop esi
ret 16
_aulldvrm ENDP
end
58 instructions in 149 bytes (plus 11 bytes for alignment).
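OUCH: the comment "Now we need to do a multiply so that we can compute the remainder." together with the code following it is a remarkable gem – after the second DIV instruction the remainder is already present in register EDX!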
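Note: the point can be illustrated in C. The following minimal sketch is not part of the sources shown here; it assumes a divisor less than 2**32, and the names div32() and udivrem_64_32() are hypothetical, with div32() standing in for the i386 DIV instruction. The remainder delivered by the second division is the final remainder, neither multiplication nor subtraction is needed to obtain it:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the i386 DIV instruction: divides the 64-bit
   value hi:lo by a 32-bit divisor; requires hi < divisor, so that the
   quotient fits into 32 bits. */
static uint32_t div32(uint32_t hi, uint32_t lo, uint32_t divisor,
                      uint32_t *remainder)
{
    assert(divisor != 0 && hi < divisor);
    uint64_t value = ((uint64_t)hi << 32) | lo;
    *remainder = (uint32_t)(value % divisor);
    return (uint32_t)(value / divisor);
}

/* "Long" alias "schoolbook" division of a 64-bit dividend by a 32-bit
   divisor, using two chained 64/32-bit divisions. */
static uint64_t udivrem_64_32(uint64_t dividend, uint32_t divisor,
                              uint32_t *remainder)
{
    uint32_t partial;   /* remainder of the high dword */
    uint32_t high = div32(0, (uint32_t)(dividend >> 32), divisor, &partial);
    /* the remainder of this second division is the final remainder */
    uint32_t low = div32(partial, (uint32_t)dividend, divisor, remainder);
    return ((uint64_t)high << 32) | low;
}
```

A call like udivrem_64_32(dividend, 10, &remainder) returns the 64-bit quotient and stores the remainder, which is always less than the divisor.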
The following implementation, for processors which feature speculative execution when the symbol JCCLESS is defined, else for processors which don’t feature speculative execution, uses 75 instructions in 193 bytes (plus 15 bytes for alignment), respectively 72 instructions in 185 bytes (plus 7 bytes for alignment), including 18 instructions in 50 bytes for the special and trivial cases not covered by Microsoft’s poor implementation:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
; MSC internal _aulldvrm():
; receives arguments on stack, returns quotient in edx:eax and remainder in ebx:ecx
; NOTE: _aulldvrm() can raise 'division by zero' exception!
_aulldvrm proc public ; qword _aulldvrm(qword dividend, qword divisor)
mov ecx, [esp+8] ; ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] ; edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+8] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
mov eax, [esp+4] ; edx:eax = dividend
div ecx ; eax = (low dword of) quotient,
; edx = (low dword of) remainder
mov ecx, edx ; ecx = (low dword of) remainder
xor ebx, ebx ; ebx:ecx = remainder
xor edx, edx ; edx:eax = quotient
ret 16 ; callee restores stack
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
mov ebx, eax ; ebx = high dword of quotient
mov eax, [esp+4] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
mov ecx, edx ; ecx = (low dword of) remainder
mov edx, ebx ; edx:eax = quotient
xor ebx, ebx ; ebx:ecx = remainder
ret 16 ; callee restores stack
; dividend < divisor: quotient = 0, remainder = dividend
trivial:
mov ecx, [esp+4]
mov ebx, [esp+8] ; ebx:ecx = remainder = dividend
xor eax, eax
xor edx, edx ; edx:eax = quotient = 0
ret 16 ; callee restores stack
; dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor >= 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
mov ebx, edx ; ebx = divisor'
ifndef JCCLESS
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+8] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; division overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JCCLESS
mov edx, [esp+8] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JCCLESS
mov eax, [esp+8] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+12] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
mov ecx, [esp+16] ; ecx = high dword of divisor
imul ecx, ebx ; ecx = high dword of divisor * quotient"
push ebx ; [esp] = quotient"
mov ebx, [esp+12] ; ebx = high dword of dividend
sub ebx, ecx ; ebx = high dword of dividend
; - high dword of divisor * quotient"
mov ecx, [esp+8] ; ecx = low dword of dividend
sub ecx, eax
sbb ebx, edx ; ebx:ecx = dividend - divisor * quotient"
; = remainder"
ifndef JCCLESS
pop eax ; eax = quotient"
jnb short @f ; remainder" >= 0?
; (with borrow, it is off by divisor,
; and quotient" is off by 1)
add ecx, [esp+12]
adc ebx, [esp+16] ; ebx:ecx = remainder" + divisor
; = remainder
dec eax ; eax = quotient" - 1
; = (low dword of) quotient
@@:
else ; JCCLESS
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
cdq ; edx = (remainder" < 0) ? -1 : 0
add [esp], eax ; [esp] = quotient" - (remainder" < 0)
; = (low dword of) quotient
and eax, [esp+16]
and edx, [esp+20] ; edx:eax = (remainder" < 0) ? divisor : 0
add ecx, eax
adc ebx, edx ; ebx:ecx = remainder" + divisor
; = remainder
pop eax ; eax = (low dword of) quotient
endif ; JCCLESS
xor edx, edx ; edx:eax = quotient
ret 16 ; callee restores stack
; dividend >= divisor >= 2**63:
; quotient = 1, remainder = dividend - divisor
special:
mov ecx, [esp+4]
mov ebx, [esp+8] ; ebx:ecx = dividend
sub ecx, eax
sbb ebx, edx ; ebx:ecx = dividend - divisor
; = remainder
xor eax, eax
xor edx, edx
inc eax ; edx:eax = quotient = 1
ret 16 ; callee restores stack
_aulldvrm endp
end
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as aulldvrm.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file aulldvrm.obj and add it to the existing object library i386.lib:
ML.EXE aulldvrm.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib aulldvrm.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: aulldvrm.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
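Note: the "extended & adjusted" division which the _aulldvrm routine above performs for dividend >= divisor >= 2**32 can be sketched in C. This is a hedged illustration, not a translation of the assembler source: the names leading_zeros32() and udivrem_extended() are hypothetical, the 64÷32-bit division is left to the compiler, and a negative intermediate remainder is detected by comparison with the divisor instead of the carry flag.

```c
#include <assert.h>
#include <stdint.h>

/* Number of leading '0' bits of a non-zero 32-bit value; portable
   stand-in for BSR followed by XOR with 31. */
static unsigned leading_zeros32(uint32_t value)
{
    unsigned count = 0;
    assert(value != 0);
    if ((value & 0xFFFF0000u) == 0) { count += 16; value <<= 16; }
    if ((value & 0xFF000000u) == 0) { count += 8;  value <<= 8; }
    if ((value & 0xF0000000u) == 0) { count += 4;  value <<= 4; }
    if ((value & 0xC0000000u) == 0) { count += 2;  value <<= 2; }
    if ((value & 0x80000000u) == 0) { count += 1; }
    return count;
}

/* Returns the quotient (which is less than 2**32 here) and stores the
   remainder; requires dividend >= divisor >= 2**32. */
static uint32_t udivrem_extended(uint64_t dividend, uint64_t divisor,
                                 uint64_t *remainder)
{
    assert(divisor >> 32 != 0 && dividend >= divisor);

    unsigned shift = leading_zeros32((uint32_t)(divisor >> 32));
    if (shift == 0) {           /* divisor >= 2**63: quotient = 1 */
        *remainder = dividend - divisor;
        return 1;
    }
    /* upper 32 bits of the normalised divisor, most significant bit set */
    uint32_t scaled_divisor = (uint32_t)((divisor << shift) >> 32);
    /* intermediate quotient, may need 33 bits */
    uint64_t scaled_quotient = dividend / scaled_divisor;
    /* estimate: equal to the quotient, or 1 too large */
    uint32_t quotient = (uint32_t)(scaled_quotient >> (32 - shift));
    /* dividend - divisor * estimate, computed modulo 2**64 */
    uint64_t rest = dividend - divisor * quotient;
    if (rest >= divisor) {      /* "negative": estimate was 1 too large */
        rest += divisor;
        quotient--;
    }
    *remainder = rest;
    return quotient;
}
```

The estimate is never too small and at most 1 too large, so a single conditional correction suffices; no loop is needed.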
_aullrem Routine
The source file ullrem.asm shipped with the Visual C compiler:
DIR "%source%\intel\ullrem.asm"
TYPE "%source%\intel\ullrem.asm"
Volume in drive C has no label.
Volume Serial Number is 1957-0427
Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011 03:40 PM 5,330 ullrem.asm
1 File(s) 5,330 bytes
0 Dir(s) 9,876,543,210 bytes free
title ullrem - unsigned long remainder routine
;***
;ullrem.asm - unsigned long remainder routine
;
; Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
; defines the unsigned long remainder routine
; __aullrem
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;ullrem - unsigned long remainder
;
;Purpose:
; Does a unsigned long remainder of the arguments. Arguments are
; not changed.
;
;Entry:
; Arguments are passed on the stack:
; 1st pushed: divisor (QWORD)
; 2nd pushed: dividend (QWORD)
;
;Exit:
; EDX:EAX contains the remainder (dividend%divisor)
; NOTE: this routine removes the parameters from the stack.
;
;Uses:
; ECX
;
;Exceptions:
;
;*******************************************************************************
CODESEG
_aullrem PROC NEAR
.FPO (1, 4, 0, 0, 0, 0)
push ebx
; Set up the local stack and save the index registers. When this is done
; the stack frame will look as follows (assuming that the expression a%b will
; generate a call to ullrem(a, b)):
;
; -----------------
; | |
; |---------------|
; | |
; |--divisor (b)--|
; | |
; |---------------|
; | |
; |--dividend (a)-|
; | |
; |---------------|
; | return addr** |
; |---------------|
; ESP---->| EBX |
; -----------------
;
DVND equ [esp + 8] ; stack address of dividend (a)
DVSR equ [esp + 16] ; stack address of divisor (b)
; Now do the divide. First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
mov eax,HIWORD(DVSR) ; check to see if divisor < 4194304K
or eax,eax
jnz short L1 ; nope, gotta do this the hard way
mov ecx,LOWORD(DVSR) ; load divisor
mov eax,HIWORD(DVND) ; load high word of dividend
xor edx,edx
div ecx ; edx <- remainder, eax <- quotient
mov eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
div ecx ; edx <- final remainder
mov eax,edx ; edx:eax <- remainder
xor edx,edx
jmp short L2 ; restore stack and return
;
; Here we do it the hard way. Remember, eax contains DVSRHI
;
L1:
mov ecx,eax ; ecx:ebx <- divisor
mov ebx,LOWORD(DVSR)
mov edx,HIWORD(DVND) ; edx:eax <- dividend
mov eax,LOWORD(DVND)
L3:
shr ecx,1 ; shift divisor right one bit; hi bit <- 0
rcr ebx,1
shr edx,1 ; shift dividend right one bit; hi bit <- 0
rcr eax,1
or ecx,ecx
jnz short L3 ; loop until divisor < 4194304K
div ebx ; now divide, ignore remainder
;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;
mov ecx,eax ; save a copy of quotient in ECX
mul dword ptr HIWORD(DVSR)
xchg ecx,eax ; put partial product in ECX, get quotient in EAX
mul dword ptr LOWORD(DVSR)
add edx,ecx ; EDX:EAX = QUOT * DVSR
jc short L4 ; carry means Quotient is off by 1
;
; do long compare here between original dividend and the result of the
; multiply in edx:eax. If original is larger or equal, we're ok, otherwise
; subtract the original divisor from the result.
;
cmp edx,HIWORD(DVND) ; compare hi words of result and original
ja short L4 ; if result > original, do subtract
jb short L5 ; if result < original, we're ok
cmp eax,LOWORD(DVND) ; hi words are equal, compare lo words
jbe short L5 ; if less or equal we're ok, else subtract
L4:
sub eax,LOWORD(DVSR) ; subtract divisor from result
sbb edx,HIWORD(DVSR)
L5:
;
; Calculate remainder by subtracting the result from the original dividend.
; Since the result is already in a register, we will perform the subtract in
; the opposite direction and negate the result to make it positive.
;
sub eax,LOWORD(DVND) ; subtract original dividend from result
sbb edx,HIWORD(DVND)
neg edx ; and negate it
neg eax
sbb edx,0
;
; Just the cleanup left to do. dx:ax contains the remainder.
; Restore the saved registers and return.
;
L2:
pop ebx
ret 16
_aullrem ENDP
end
44 instructions in 117 bytes (plus 11 bytes for alignment).
The following implementation, for processors which feature speculative execution when the symbol JCCLESS is defined, else for processors which don’t feature speculative execution, uses 65 instructions in 173 bytes (plus 3 bytes for alignment), respectively 64 instructions in 171 bytes (plus 5 bytes for alignment), including 14 instructions in 43 bytes for the special and trivial cases not covered by Microsoft’s poor implementation:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
; MSC internal _aullrem():
; receives arguments on stack, returns remainder in edx:eax
; NOTE: _aullrem() can raise 'division by zero' exception!
_aullrem proc public ; qword _aullrem(qword dividend, qword divisor)
mov ecx, [esp+8] ; ecx = high dword of dividend
mov eax, [esp+12]
mov edx, [esp+16] ; edx:eax = divisor
cmp [esp+4], eax
sbb ecx, edx
jb short trivial ; dividend < divisor?
bsr ecx, edx ; ecx = index of most significant '1' bit
; in high dword of divisor
jnz short extended ; high dword of divisor <> 0?
; remainder < divisor < 2**32
mov ecx, eax ; ecx = (low dword of) divisor
mov eax, [esp+8] ; eax = high dword of dividend
cmp eax, ecx
jae short long ; high dword of dividend >= divisor?
; perform normal division
normal:
mov edx, eax ; edx = high dword of dividend
mov eax, [esp+4] ; edx:eax = dividend
div ecx ; eax = (low dword of) quotient,
; edx = (low dword of) remainder
mov eax, edx ; eax = (low dword of) remainder
xor edx, edx ; edx:eax = remainder
ret 16 ; callee restores stack
; perform "long" alias "schoolbook" division
long:
;; xor edx, edx ; edx:eax = high dword of dividend
div ecx ; eax = high dword of quotient,
; edx = high dword of remainder'
mov eax, [esp+4] ; eax = low dword of dividend
div ecx ; eax = low dword of quotient,
; edx = (low dword of) remainder
mov eax, edx ; eax = (low dword of) remainder
xor edx, edx ; edx:eax = remainder
ret 16 ; callee restores stack
; dividend < divisor: remainder = dividend
trivial:
mov eax, [esp+4]
mov edx, [esp+8] ; edx:eax = remainder = dividend
ret 16 ; callee restores stack
; dividend >= divisor >= 2**32: quotient < 2**32
extended:
xor ecx, 31 ; ecx = number of leading '0' bits
; in (high dword of) divisor
jz short special ; divisor >= 2**63?
; perform "extended & adjusted" division
shld edx, eax, cl ; edx = divisor / 2**(index + 1)
; = divisor'
;; shl eax, cl
push ebx
mov ebx, edx ; ebx = divisor'
ifndef JCCLESS
xor eax, eax ; eax = high dword of quotient' = 0
mov edx, [esp+12] ; edx = high dword of dividend
cmp edx, ebx
jb short @f ; high dword of dividend < divisor'?
; high dword of dividend >= divisor':
; subtract divisor' from high dword of dividend to prevent possible
; division overflow and set most significant bit of quotient"
sub edx, ebx ; edx = high dword of dividend - divisor'
; = high dword of dividend'
inc eax ; eax = high dword of quotient' = 1
@@:
push eax ; [esp] = high dword of quotient'
else ; JCCLESS
mov edx, [esp+12] ; edx = high dword of dividend
cmp edx, ebx ; CF = (high dword of dividend < divisor')
sbb eax, eax ; eax = (high dword of dividend < divisor') ? -1 : 0
inc eax ; eax = (high dword of dividend < divisor') ? 0 : 1
; = high dword of quotient'
push eax ; [esp] = high dword of quotient'
if 0
neg eax ; eax = (high dword of dividend < divisor') ? 0 : -1
and eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
imul eax, ebx ; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
sub edx, eax ; edx = high dword of dividend
; - (high dword of dividend < divisor') ? 0 : divisor'
; = high dword of dividend'
endif ; JCCLESS
mov eax, [esp+12] ; eax = low dword of dividend
; = low dword of dividend'
div ebx ; eax = dividend' / divisor'
; = low dword of quotient',
; edx = remainder'
pop ebx ; ebx = high dword of quotient'
shld ebx, eax, cl ; ebx = quotient' / 2**(index + 1)
; = dividend / divisor'
; = quotient"
;; shl eax, cl
mov eax, [esp+16] ; eax = low dword of divisor
mul ebx ; edx:eax = low dword of divisor * quotient"
imul ebx, [esp+20] ; ebx = high dword of divisor * quotient"
mov ecx, [esp+12] ; ecx = high dword of dividend
sub ecx, ebx ; ecx = high dword of dividend
; - high dword of divisor * quotient"
mov ebx, [esp+8] ; ebx = low dword of dividend
sub ebx, eax
sbb ecx, edx ; ecx:ebx = dividend - divisor * quotient"
; = remainder"
ifndef JCCLESS
jnb short @f ; remainder" >= 0?
; (with borrow, it is off by divisor,
; and quotient" is off by 1)
add ebx, [esp+16]
adc ecx, [esp+20] ; ecx:ebx = remainder" + divisor
; = remainder
@@:
mov eax, ebx
mov edx, ecx ; edx:eax = remainder
else ; JCCLESS
sbb eax, eax ; eax = (remainder" < 0) ? -1 : 0
cdq ; edx = (remainder" < 0) ? -1 : 0
and eax, [esp+16]
and edx, [esp+20] ; edx:eax = (remainder" < 0) ? divisor : 0
add eax, ebx
adc edx, ecx ; edx:eax = remainder" + divisor
; = remainder
endif ; JCCLESS
pop ebx
ret 16 ; callee restores stack
; dividend >= divisor >= 2**63: remainder = dividend - divisor
special:
if 0
mov eax, [esp+4]
mov edx, [esp+8] ; edx:eax = dividend
sub eax, [esp+12]
sbb edx, [esp+16] ; edx:eax = dividend - divisor
; = remainder
else
neg edx
neg eax
sbb edx, ecx ; edx:eax = -divisor
add eax, [esp+4]
adc edx, [esp+8] ; edx:eax = dividend - divisor
; = remainder
endif
ret 16 ; callee restores stack
_aullrem endp
end
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as aullrem.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file aullrem.obj and add it to the existing object library i386.lib:
ML.EXE aullrem.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib aullrem.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: aullrem.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
_aullshr Routine
The source file ullshr.asm shipped with the Visual C compiler:
DIR "%source%\intel\ullshr.asm"
TYPE "%source%\intel\ullshr.asm"
Volume in drive C has no label.
Volume Serial Number is 1957-0427
Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011 03:40 PM 1,545 ullshr.asm
1 File(s) 1,545 bytes
0 Dir(s) 9,876,543,210 bytes free
title ullshr - long shift right
;***
;ullshr.asm - long shift right
;
; Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
; define unsigned long shift right routine
; __aullshr
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;ullshr - long shift right
;
;Purpose:
; Does a unsigned Long Shift Right
; Shifts a long right any number of bits.
;
;Entry:
; EDX:EAX - long value to be shifted
; CL - number of bits to shift by
;
;Exit:
; EDX:EAX - shifted value
;
;Uses:
; CL is destroyed.
;
;Exceptions:
;
;*******************************************************************************
CODESEG
_aullshr PROC NEAR
.FPO (0, 0, 0, 0, 0, 0)
;
; Handle shifts of 64 bits or more (if shifting 64 bits or more, the result
; depends only on the high order bit of edx).
;
cmp cl,64
jae short RETZERO
;
; Handle shifts of between 0 and 31 bits
;
cmp cl, 32
jae short MORE32
shrd eax,edx,cl
shr edx,cl
ret
;
; Handle shifts of between 32 and 63 bits
;
MORE32:
mov eax,edx
xor edx,edx
and cl,31
shr eax,cl
ret
;
; return 0 in edx:eax
;
RETZERO:
xor eax,eax
xor edx,edx
ret
_aullshr ENDP
end
15 instructions in 31 bytes (plus 1 byte for alignment).
OUCH: i386 and newer processors perform shift operations modulo the register size; the AND instruction (and cl,31 in the MORE32 branch) is therefore superfluous!
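Note: the modulo behaviour can be modelled in C. The following minimal sketch is an illustration only; the names shr32(), shrd32() and aullshr_sketch() are hypothetical, and the explicit masking with 31 inside the two helper functions stands for what the processor does on its own for 32-bit operands, so the caller needs no mask for counts from 32 to 63:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the i386 SHR instruction for 32-bit operands: the processor
   uses only the low 5 bits of the count. */
static uint32_t shr32(uint32_t value, unsigned count)
{
    return value >> (count & 31);
}

/* Model of the i386 SHRD instruction for 32-bit operands: shifts the
   low dword right and fills it with bits from the high dword. */
static uint32_t shrd32(uint32_t low, uint32_t high, unsigned count)
{
    count &= 31;
    if (count == 0)
        return low;             /* avoid the undefined shift by 32 in C */
    return (low >> count) | (high << (32 - count));
}

/* 64-bit logical right shift built from 32-bit halves; count models CL. */
static uint64_t aullshr_sketch(uint64_t value, unsigned count)
{
    uint32_t low = (uint32_t)value;
    uint32_t high = (uint32_t)(value >> 32);

    assert(count <= 255);       /* CL is an 8-bit register */
    if (count > 63)
        return 0;
    if (count > 31)             /* no explicit "count & 31" needed here */
        return shr32(high, count);
    return ((uint64_t)shr32(high, count) << 32) | shrd32(low, high, count);
}
```

For counts from 32 to 63, shr32() shifts the high dword by count - 32 bits, which is exactly what the separate AND instruction was meant to achieve.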
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
; MSC internal _aullshr():
; receives arguments in edx:eax and cl, returns result in edx:eax
_aullshr proc public ; qword _aullshr(qword value, byte count)
cmp cl, 31
ja short @f ; count > 31?
shrd eax, edx, cl
shr edx, cl ; edx:eax = result
ret
@@:
xor eax, eax ; eax = high dword of result
; = 0
cmp cl, 63
ja short @f ; count > 63?
xchg eax, edx ; eax = high dword of value,
; edx = high dword of result = 0
shr eax, cl ; edx:eax = result
ret
@@:
cdq ; edx:eax = result = 0
ret
_aullshr endp
end
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as aullshr.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file aullshr.obj and add it to the existing object library i386.lib:
ML.EXE aullshr.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib aullshr.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: aullshr.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
_allshl Routine
The source file llshl.asm shipped with the Visual C compiler:
DIR "%source%\intel\llshl.asm"
TYPE "%source%\intel\llshl.asm"
Volume in drive C has no label.
Volume Serial Number is 1957-0427
Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011 03:40 PM 1,493 llshl.asm
1 File(s) 1,493 bytes
0 Dir(s) 9,876,543,210 bytes free
title llshl - long shift left
;***
;llshl.asm - long shift left
;
; Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
; define long shift left routine (signed and unsigned are same)
; __allshl
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;llshl - long shift left
;
;Purpose:
; Does a Long Shift Left (signed and unsigned are identical)
; Shifts a long left any number of bits.
;
;Entry:
; EDX:EAX - long value to be shifted
; CL - number of bits to shift by
;
;Exit:
; EDX:EAX - shifted value
;
;Uses:
; CL is destroyed.
;
;Exceptions:
;
;*******************************************************************************
CODESEG
_allshl PROC NEAR
.FPO (0, 0, 0, 0, 0 ,0)
;
; Handle shifts of 64 or more bits (all get 0)
;
cmp cl, 64
jae short RETZERO
;
; Handle shifts of between 0 and 31 bits
;
cmp cl, 32
jae short MORE32
shld edx,eax,cl
shl eax,cl
ret
;
; Handle shifts of between 32 and 63 bits
;
MORE32:
mov edx,eax
xor eax,eax
and cl,31
shl edx,cl
ret
;
; return 0 in edx:eax
;
RETZERO:
xor eax,eax
xor edx,edx
ret
_allshl ENDP
end
15 instructions in 31 bytes (plus 1 byte for alignment).
OUCH: i386 and newer processors perform shift operations modulo the register size; the AND instruction (and cl,31 in the MORE32 branch) is therefore superfluous!
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
; MSC internal _allshl():
; receives arguments in edx:eax and cl, returns result in edx:eax
_allshl proc public ; sqword _allshl(sqword value, byte count)
cmp cl, 31
ja short @f ; count > 31?
shld edx, eax, cl
shl eax, cl ; edx:eax = result
ret
@@:
mov edx, eax ; edx = low dword of value
xor eax, eax ; eax = low dword of result
; = 0
cmp cl, 63
ja short @f ; count > 63?
shl edx, cl ; edx:eax = result
ret
@@:
cdq ; edx:eax = result = 0
ret
_allshl endp
end
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as allshl.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file allshl.obj and add it to the existing object library i386.lib:
ML.EXE allshl.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib allshl.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: allshl.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
_allshr Routine
The source file llshr.asm shipped with the Visual C compiler:
DIR "%source%\intel\llshr.asm"
TYPE "%source%\intel\llshr.asm"
Volume in drive C has no label.
Volume Serial Number is 1957-0427
Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011 03:40 PM 1,561 llshr.asm
1 File(s) 1,561 bytes
0 Dir(s) 9,876,543,210 bytes free
title llshr - long shift right
;***
;llshr.asm - long shift right
;
; Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
; define signed long shift right routine
; __allshr
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;llshr - long shift right
;
;Purpose:
; Does a signed Long Shift Right
; Shifts a long right any number of bits.
;
;Entry:
; EDX:EAX - long value to be shifted
; CL - number of bits to shift by
;
;Exit:
; EDX:EAX - shifted value
;
;Uses:
; CL is destroyed.
;
;Exceptions:
;
;*******************************************************************************
CODESEG
_allshr PROC NEAR
.FPO (0, 0, 0, 0, 0, 0)
;
; Handle shifts of 64 bits or more (if shifting 64 bits or more, the result
; depends only on the high order bit of edx).
;
cmp cl,64
jae short RETSIGN
;
; Handle shifts of between 0 and 31 bits
;
cmp cl, 32
jae short MORE32
shrd eax,edx,cl
sar edx,cl
ret
;
; Handle shifts of between 32 and 63 bits
;
MORE32:
mov eax,edx
sar edx,31
and cl,31
sar eax,cl
ret
;
; Return double precision 0 or -1, depending on the sign of edx
;
RETSIGN:
sar edx,31
mov eax,edx
ret
_allshr ENDP
end
15 instructions in 33 bytes (plus 15 bytes for alignment).
OUCH: i386 and newer processors perform shift operations modulo the register size; the AND instruction (and cl,31 in the MORE32 branch) is therefore superfluous!
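Note: for the signed shift, the roles of the two dwords can be sketched in C. The following is a hedged illustration, not a translation of any source shown here; the names sar32() and allshr_sketch() are hypothetical, and the signed 64-bit value is handled as its bit pattern in an unsigned type to stay clear of implementation-defined shifts of negative numbers. The low dword is shifted right and receives bits from the high dword (the job of SHRD), the high dword is shifted arithmetically (the job of SAR):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the i386 SAR instruction for 32-bit operands: arithmetic
   shift right, the processor uses only the low 5 bits of the count. */
static uint32_t sar32(uint32_t value, unsigned count)
{
    uint32_t sign = (value & 0x80000000u) ? 0xFFFFFFFFu : 0;
    count &= 31;
    if (count == 0)
        return value;
    return (value >> count) | (sign << (32 - count));
}

/* 64-bit arithmetic right shift built from 32-bit halves; value is the
   bit pattern of the signed operand, count models CL. */
static uint64_t allshr_sketch(uint64_t value, unsigned count)
{
    uint32_t low = (uint32_t)value;
    uint32_t high = (uint32_t)(value >> 32);
    uint32_t fill = sar32(high, 31);    /* 0 or 0xFFFFFFFF */

    assert(count <= 255);               /* CL is an 8-bit register */
    if (count > 63)                     /* only the sign remains */
        return ((uint64_t)fill << 32) | fill;
    if (count > 31)                     /* high dword moves into low dword */
        return ((uint64_t)fill << 32) | sar32(high, count);
    if (count != 0)                     /* low dword takes bits from high dword */
        low = (low >> count) | (high << (32 - count));
    return ((uint64_t)sar32(high, count) << 32) | low;
}
```

Swapping the two dwords in the first case would shift bits of the low dword into the high dword and so compute a different function.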
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
; MSC internal _allshr():
; receives arguments in edx:eax and cl, returns result in edx:eax
_allshr proc public ; sqword _allshr(sqword value, byte count)
cmp cl, 31
ja short @f ; count > 31?
shrd eax, edx, cl
sar edx, cl ; edx:eax = result
ret
@@:
mov eax, edx ; eax = high dword of value
cdq ; edx = (value < 0) ? -1 : 0
; = high dword of result
cmp cl, 63
ja short @f ; count > 63?
sar eax, cl ; edx:eax = result
ret
@@:
mov eax, edx ; edx:eax = (value < 0) ? -1 : 0
; = result
ret
_allshr endp
end
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as allshr.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file allshr.obj and add it to the existing object library i386.lib:
ML.EXE allshr.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib allshr.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: allshr.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
_all* and _aull* Routines in Leaked Source
DIR "%source%\intel\ll*.asm"
DIR "%source%\intel\ull*.asm"
Volume in drive C has no label.
Volume Serial Number is 1957-0427
Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011 03:40 PM 6,670 lldiv.asm
02/18/2011 03:40 PM 8,557 lldvrm.asm
02/18/2011 03:40 PM 2,570 llmul.asm
02/18/2011 03:40 PM 7,067 llrem.asm
02/18/2011 03:40 PM 1,493 llshl.asm
02/18/2011 03:40 PM 1,561 llshr.asm
6 File(s) 27,918 bytes
0 Dir(s) 9,876,543,210 bytes free
Volume in drive C has no label.
Volume Serial Number is 1957-0427
Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011 03:40 PM 5,079 ulldiv.asm
02/18/2011 03:40 PM 6,227 ulldvrm.asm
02/18/2011 03:40 PM 5,330 ullrem.asm
02/18/2011 03:40 PM 1,545 ullshr.asm
4 File(s) 18,181 bytes
0 Dir(s) 9,876,543,210 bytes free
The following table presents the revision history extracted from the i386 assembler source file blcrtasm.asm, but stripped from the 10 individual assembler source files shown above.
Note: on November 19, 1993, 8 (in words: eight) years after Intel® introduced their 80386 processor, and 8 months after they introduced the Pentium® processor, these routines were "modified to work on 64 bit integers", but without taking advantage of these 32-bit processors’ new instructions like BSF, BSR, SHLD and SHRD to replace the (dead)slow loops which shift both operands by just one bit per pass with SHR and RCR instructions.
Note: even the initial versions of the _alldvrm and _aulldvrm routines, created October 6, 1998, almost 3 years after Intel introduced their Pentium® Pro processor, and 17 months after they introduced the Pentium® II processor, failed to rectify (not just) this performance-degrading deficiency.
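To illustrate the difference, the following minimal C sketch (not the Microsoft source) shows only the normalization step of a 64÷64-bit division, where both operands are shifted right until the divisor fits into 32 bits. `normalize_loop()` shifts by one bit per pass, as the SHR and RCR loops do; `normalize_bsr()` determines the shift count once from the position of the divisor's most significant bit, which is the value BSR delivers, and then shifts both operands in a single step, as SHRD allows. All function names, including the helper `bsr32()`, are assumptions made for this illustration.

```c
#include <stdint.h>

#if defined(_MSC_VER)
#include <intrin.h>
/* index of the most significant set bit; x must not be 0 */
static unsigned bsr32(uint32_t x)
{
    unsigned long index;
    _BitScanReverse(&index, x);
    return (unsigned) index;
}
#else
/* GCC and Clang: derive the index from the number of leading zero bits */
static unsigned bsr32(uint32_t x)
{
    return 31U - (unsigned) __builtin_clz(x);
}
#endif

/* shift by one bit per pass until the divisor fits into 32 bits */
static unsigned normalize_loop(uint64_t *dividend, uint64_t *divisor)
{
    unsigned passes = 0;
    while ((*divisor >> 32) != 0) {
        *divisor >>= 1;
        *dividend >>= 1;
        passes++;
    }
    return passes;
}

/* determine the shift count once, then shift both operands in one step */
static unsigned normalize_bsr(uint64_t *dividend, uint64_t *divisor)
{
    uint32_t high = (uint32_t) (*divisor >> 32);
    unsigned count;
    if (high == 0)
        return 0;
    count = bsr32(high) + 1;    /* 1 to 32 */
    *divisor >>= count;
    *dividend >>= count;
    return count;
}

/* returns 1 if both variants yield the same shift count and operands */
static int normalizations_agree(uint64_t dividend, uint64_t divisor)
{
    uint64_t n1 = dividend, d1 = divisor, n2 = dividend, d2 = divisor;
    unsigned c1 = normalize_loop(&n1, &d1);
    unsigned c2 = normalize_bsr(&n2, &d2);
    return c1 == c2 && n1 == n2 && d1 == d2;
}
```

For a divisor with its most significant bit set, the loop needs 32 passes, while the bit-scan variant always needs one scan and one shift per operand.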
Intel Microprocessor Quick Reference Guide - Product Family
Routine(s) | Date | Who | Comment
---|---|---|---
llshl, llshr, ullshr | 1983-11-?? | HS | initial version
lldiv, llmul, llrem, ulldiv, ullrem | 1983-11-29 | DFW | initial version
llshl, llshr, ullshr | 1983-11-30 | DFW | added medium model support
llshl, llshr, ullshr | 1984-03-12 | DFW | broke apart; added long model support
lldiv, llmul, llrem, llshl, llshr, ulldiv, ullrem, ullshr | 1984-06-01 | RN | modified to use cmacros
llmul | 1985-04-17 | TC | ignore signs since they take care of themselves; do a fast multiply if both hiwords of arguments are 0
llmul | 1986-10-10 | MH | slightly faster implementation, for 0 in upper words
lldiv, llrem, ulldiv, ullrem | 1987-10-23 | SKS | fixed off-by-1 error for dividend close to 2**32.
llmul | 1989-03-20 | SKS | Remove redundant "MOV SP,BP" from epilogs
llmul | 1989-05-18 | SKS | Preserve BX
lldiv, llrem, ulldiv, ullrem | 1989-05-18 | SKS | Remove redundant "MOV SP,BP" from epilog
lldiv, llmul, llrem, llshl, llshr, ulldiv, ullrem, ullshr | 1989-11-28 | GJF | Fixed copyright
lldiv, llmul, llrem, llshl, llshr, ulldiv, ullrem, ullshr | 1993-11-19 | SMK | Modified to work on 64 bit integers
lldiv, llmul, llrem, llshl, llshr, ulldiv, ullrem, ullshr | 1994-01-17 | GDF | Minor changes to build with NT's masm386.
llshl, llshr, ullshr | 1994-07-08 | GDF | Faster, fatter version for NT.
llshl, llshr, ullshr | 1994-07-13 | GDF | Further improvements from JonM.
lldiv, llmul, llrem, ulldiv, ullrem | 1994-07-22 | GJF | Use esp-relative addressing for args. Shortened conditional jumps. Also, don't use xchg to do a simple move between regs.
lldvrm, ulldvrm | 1998-10-06 | SMK | Initial version.
_all* and _aull* Routines
With the preprocessor macro CYCLES defined, the following program measures the execution times of signed and unsigned 64÷64-bit divisions as well as 64×64-bit multiplications in processor clock cycles and runs on Windows Vista® and later versions; otherwise it measures the execution times in nano-seconds and runs on all versions of Windows™ NT. It executes each operation on 1 billion pairs of uniformly distributed 64-bit pseudo-random numbers in a first pass, then on 1 billion pairs of uniformly distributed 33 to 64-bit pseudo-random numbers in a second pass:
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
typedef LONGLONG SQWORD;
typedef ULONGLONG QWORD;
#define _(DIVIDEND, DIVISOR) {(DIVIDEND), (DIVISOR), (DIVIDEND) / (DIVISOR), (DIVIDEND) % (DIVISOR)}
const struct _ull
{
QWORD ullDividend, ullDivisor, ullQuotient, ullRemainder;
} ullTable[] = {_(0x0000000000000000ULL, 0x0000000000000001ULL),
_(0x0000000000000001ULL, 0x0000000000000001ULL),
_(0x0000000000000002ULL, 0x0000000000000001ULL),
_(0x0000000000000002ULL, 0x0000000000000002ULL),
_(0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
_(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFFULL),
_(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x0000000000000002ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFDULL),
_(0x000000000FFFFFFFULL, 0x0000000000000001ULL),
_(0x0000000FFFFFFFFFULL, 0x000000000000000FULL),
_(0x0000000FFFFFFFFFULL, 0x0000000000000010ULL),
_(0x0000000000000100ULL, 0x000000000FFFFFFFULL),
_(0x00FFFFFFF0000000ULL, 0x0000000010000000ULL),
_(0x07FFFFFF80000000ULL, 0x0000000080000000ULL),
_(0x7FFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x7FFFFFFEFFFFFFF0ULL, 0x0000FFFFFFFFFFFEULL),
_(0x7FFFFFFEFFFFFFF0ULL, 0x7FFFFFFEFFFFFFF0ULL),
_(0x7FFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
_(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFDULL),
_(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL),
_(0x8000000000000000ULL, 0x0000000000000001ULL),
_(0x8000000000000000ULL, 0x0000000000000002ULL),
_(0x8000000000000000ULL, 0x0000000000000003ULL),
_(0x8000000000000000ULL, 0x00000000FFFFFFFDULL),
_(0x8000000000000000ULL, 0x00000000FFFFFFFEULL),
_(0x8000000000000000ULL, 0x00000000FFFFFFFFULL),
_(0x8000000000000000ULL, 0x0000000100000000ULL),
_(0x8000000000000000ULL, 0x0000000100000001ULL),
_(0x8000000000000000ULL, 0x0000000100000002ULL),
_(0x8000000000000000ULL, 0x0000000100000003ULL),
_(0x8000000000000000ULL, 0xFFFFFFFF00000000ULL),
_(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFDULL),
_(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
_(0x8000000080000000ULL, 0x0000000080000000ULL),
_(0x8000000080000001ULL, 0x0000000080000001ULL),
_(0xFFFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
_(0xFFFFFFFFFFFFFFFCULL, 0x00000000FFFFFFFEULL),
_(0xFFFFFFFFFFFFFFFCULL, 0x0000000100000002ULL),
_(0xFFFFFFFFFFFFFFFEULL, 0x0000000080000000ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000001ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000002ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000003ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFDULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFEULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFFULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000001ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000002ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000003ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x00000001C0000001ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x0000000380000003ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
_(0xFFFFFFFFFFFFFFFFULL, 0x7FFFFFFFFFFFFFFFULL),
_(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
_(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL)};
const struct _ll
{
SQWORD llDividend, llDivisor, llQuotient, llRemainder;
} llTable[] = {_(0x0000000000000000LL, 0x0000000000000001LL), // 0, 1
_(0x0000000000000001LL, 0x0000000000000001LL), // 1, 1
_(0x0000000000000000LL, 0xFFFFFFFFFFFFFFFFLL), // 0, -1
_(0x0000000000000001LL, 0xFFFFFFFFFFFFFFFFLL), // 1, -1
_(0x0000000000000001LL, 0xFFFFFFFFFFFFFFFELL), // 1, -2
_(0x0000000000000002LL, 0xFFFFFFFFFFFFFFFELL), // 2, -2
_(0x000000000FFFFFFFLL, 0x0000000000000001LL),
_(0x0000000FFFFFFFFFLL, 0x000000000000000FLL),
_(0x0000000FFFFFFFFFLL, 0x0000000000000010LL),
_(0x0000000000000100LL, 0x000000000FFFFFFFLL),
_(0x00FFFFFFF0000000LL, 0x0000000010000000LL),
_(0x07FFFFFF80000000LL, 0x0000000080000000LL),
_(0x7FFFFFFEFFFFFFF0LL, 0xFFFFFFFFFFFFFFFELL),
_(0x7FFFFFFEFFFFFFF0LL, 0x0000FFFFFFFFFFFELL),
_(0x7FFFFFFEFFFFFFF0LL, 0x7FFFFFFEFFFFFFF0LL),
_(0x7FFFFFFFFFFFFFFFLL, 0x8000000000000000LL), // llmax, llmin
_(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFDLL), // llmax, -3
_(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFELL), // llmax, -2
_(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFFLL), // llmax, -1
_(0x8000000000000000LL, 0x0000000000000001LL), // llmin, 1
_(0x8000000000000000LL, 0x0000000000000002LL), // llmin, 2
_(0x8000000000000000LL, 0x0000000000000003LL), // llmin, 3
_(0x8000000000000000LL, 0x00000000FFFFFFFELL),
_(0x8000000000000000LL, 0x00000000FFFFFFFFLL),
_(0x8000000000000000LL, 0x0000000100000000LL),
_(0x8000000000000000LL, 0x0000000100000001LL),
_(0x8000000000000000LL, 0x0000000100000002LL),
_(0x8000000000000000LL, 0x8000000000000000LL), // llmin, llmin
_(0x8000000000000000LL, 0xFFFFFFFF00000000LL),
_(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFDLL), // llmin, -3
_(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFELL), // llmin, -2
_(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFFLL), // llmin, -1
_(0x8000000080000000LL, 0x0000000080000000LL),
_(0x8000000080000001LL, 0x0000000080000001LL),
_(0xFFFFFFFEFFFFFFF0LL, 0xFFFFFFFFFFFFFFFELL),
_(0xFFFFFFFFFFFFFFFELL, 0x0000000080000000LL),
_(0xFFFFFFFFFFFFFFFELL, 0x0000000000000001LL), // -2, 1
_(0xFFFFFFFFFFFFFFFELL, 0x0000000000000002LL), // -2, 2
_(0xFFFFFFFFFFFFFFFELL, 0xFFFFFFFFFFFFFFFELL), // -2, -2
_(0xFFFFFFFFFFFFFFFELL, 0xFFFFFFFFFFFFFFFFLL), // -2, -1
_(0xFFFFFFFFFFFFFFFFLL, 0x0000000000000001LL), // -1, 1
_(0xFFFFFFFFFFFFFFFFLL, 0x0000000000000002LL), // -1, 2
_(0xFFFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFELL), // -1, -2
_(0xFFFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFFLL)}; // -1, -1
#undef _
__declspec(naked)
__declspec(noinline)
QWORD WINAPI _aullnop(QWORD ullLeft, QWORD ullRight)
{
__asm ret 16
}
__forceinline // companion for __emulu()
struct
{
DWORD ulQuotient, ulRemainder;
} WINAPI __edivmodu(QWORD ullDividend, DWORD ulDivisor)
{
__asm mov eax, dword ptr ullDividend
__asm mov edx, dword ptr ullDividend+4
__asm div ulDivisor
}
__declspec(safebuffers)
BOOL CDECL PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
CHAR szFormat[1024];
DWORD dwFormat;
DWORD dwOutput;
va_list vaInput;
va_start(vaInput, lpFormat);
dwFormat = wvsprintf(szFormat, lpFormat, vaInput);
va_end(vaInput);
if ((dwFormat == 0UL)
|| !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
return FALSE;
return dwOutput == dwFormat;
}
__declspec(noreturn)
VOID CDECL mainCRTStartup(VOID)
{
DWORD dw, dwCPUID[12];
QWORD qwT0, qwT1, qwT2, qwT3, qwT4, qwT5, qwT6, qwT7, qwT8, qwT9;
QWORD ullQuotient, ullRemainder;
SQWORD llQuotient, llRemainder;
volatile
QWORD qwQuotient, qwRemainder;
QWORD qwDividend, qwDivisor = ~0ULL;
QWORD qwLeft = 0x9E3779B97F4A7C15ULL; // 2**64 / golden ratio
QWORD qwRight = 0x28208A20A08A28ACULL; // bit-vector of prime numbers:
// 2**prime is set for each prime in [0, 63]
HANDLE hThread = GetCurrentThread();
HANDLE hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
if ((hOutput == INVALID_HANDLE_VALUE)
|| (SetThreadIdealProcessor(hThread, 0UL) == -1L)
|| (!SetThreadPriority(hThread, THREAD_PRIORITY_HIGHEST)))
ExitProcess(GetLastError());
__cpuid(dwCPUID, 0x80000000UL);
if (*dwCPUID >= 0x80000004UL)
{
__cpuid(dwCPUID, 0x80000002UL);
__cpuid(dwCPUID + 4, 0x80000003UL);
__cpuid(dwCPUID + 8, 0x80000004UL);
}
else
__movsb(dwCPUID, "unidentified processor", sizeof("unidentified processor"));
PrintFormat(hOutput, "\r\nTesting unsigned 64-bit division...\r\n");
for (dw = 0UL; dw < sizeof(ullTable) / sizeof(*ullTable); dw++)
{
PrintFormat(hOutput, "\r%lu", dw);
ullQuotient = ullTable[dw].ullDividend / ullTable[dw].ullDivisor;
ullRemainder = ullTable[dw].ullDividend % ullTable[dw].ullDivisor;
if (ullQuotient != ullTable[dw].ullQuotient)
PrintFormat(hOutput,
"\t%I64u / %I64u:\a quotient %I64u not equal %I64u\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient, ullTable[dw].ullQuotient);
if (ullQuotient > ullTable[dw].ullDividend)
PrintFormat(hOutput,
"\t%I64u / %I64u:\a quotient %I64u greater dividend\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient);
if (ullRemainder != ullTable[dw].ullRemainder)
PrintFormat(hOutput,
"\t%I64u %% %I64u:\a remainder %I64u not equal %I64u\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);
if (ullRemainder >= ullTable[dw].ullDivisor)
PrintFormat(hOutput,
"\t%I64u %% %I64u:\a remainder %I64u not less divisor\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder);
}
for (dw = 0UL; dw < sizeof(ullTable) / sizeof(*ullTable); dw++)
{
PrintFormat(hOutput, "\r%ld", ~dw);
ullQuotient = ullTable[dw].ullDividend / ullTable[dw].ullDivisor;
if (ullQuotient != ullTable[dw].ullQuotient)
PrintFormat(hOutput,
"\t%I64u / %I64u:\a quotient %I64u not equal %I64u\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient, ullTable[dw].ullQuotient);
if (ullQuotient > ullTable[dw].ullDividend)
PrintFormat(hOutput,
"\t%I64u / %I64u:\a quotient %I64u greater dividend\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient);
ullRemainder = ullTable[dw].ullDividend - ullTable[dw].ullDivisor * ullQuotient;
if (ullRemainder != ullTable[dw].ullRemainder)
PrintFormat(hOutput,
"\t%I64u / %I64u:\a remainder %I64u not equal %I64u\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);
ullRemainder = ullTable[dw].ullDividend % ullTable[dw].ullDivisor;
if (ullRemainder != ullTable[dw].ullRemainder)
PrintFormat(hOutput,
"\t%I64u %% %I64u:\a remainder %I64u not equal %I64u\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);
if (ullRemainder >= ullTable[dw].ullDivisor)
PrintFormat(hOutput,
"\t%I64u %% %I64u:\a remainder %I64u not less divisor\r\n",
ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder);
}
PrintFormat(hOutput, "\r\nTesting signed 64-bit division...\r\n");
for (dw = 0UL; dw < sizeof(llTable) / sizeof(*llTable); dw++)
{
PrintFormat(hOutput, "\r%lu", dw);
llQuotient = llTable[dw].llDividend / llTable[dw].llDivisor;
llRemainder = llTable[dw].llDividend % llTable[dw].llDivisor;
if (llQuotient != llTable[dw].llQuotient)
PrintFormat(hOutput,
"\t%I64d / %I64d:\a quotient %I64d not equal %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient, llTable[dw].llQuotient);
if ((llTable[dw].llDividend < 0LL) && (llQuotient < llTable[dw].llDividend)
|| (llTable[dw].llDividend >= 0LL) && (llQuotient > llTable[dw].llDividend))
PrintFormat(hOutput,
"\t%I64d / %I64d:\a quotient %I64d greater dividend\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient);
if (llRemainder != llTable[dw].llRemainder)
PrintFormat(hOutput,
"\t%I64d %% %I64d:\a remainder %I64d not equal %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);
if ((llTable[dw].llDivisor < 0LL) && (llRemainder <= llTable[dw].llDivisor)
|| (llTable[dw].llDivisor > 0LL) && (llRemainder >= llTable[dw].llDivisor))
PrintFormat(hOutput,
"\t%I64d %% %I64d:\a remainder %I64d not less divisor\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder);
if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
PrintFormat(hOutput,
"\t%I64d %% %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);
llRemainder = llTable[dw].llDividend - llTable[dw].llDivisor * llQuotient;
if (llRemainder != llTable[dw].llRemainder)
PrintFormat(hOutput,
"\t%I64d / %I64d:\a remainder %I64d not equal %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);
if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
PrintFormat(hOutput,
"\t%I64d / %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);
}
for (dw = 0UL; dw < sizeof(llTable) / sizeof(*llTable); dw++)
{
PrintFormat(hOutput, "\r%ld", ~dw);
llQuotient = llTable[dw].llDividend / llTable[dw].llDivisor;
if (llQuotient != llTable[dw].llQuotient)
PrintFormat(hOutput,
"\t%I64d / %I64d:\a quotient %I64d not equal %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient, llTable[dw].llQuotient);
if ((llTable[dw].llDividend < 0LL) && (llQuotient < llTable[dw].llDividend)
|| (llTable[dw].llDividend >= 0LL) && (llQuotient > llTable[dw].llDividend))
PrintFormat(hOutput,
"\t%I64d / %I64d:\a quotient %I64d greater dividend\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient);
llRemainder = llTable[dw].llDividend - llTable[dw].llDivisor * llQuotient;
if (llRemainder != llTable[dw].llRemainder)
PrintFormat(hOutput,
"\t%I64d / %I64d:\a remainder %I64d not equal %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);
if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
PrintFormat(hOutput,
"\t%I64d / %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);
llRemainder = llTable[dw].llDividend % llTable[dw].llDivisor;
if (llRemainder != llTable[dw].llRemainder)
PrintFormat(hOutput,
"\t%I64d %% %I64d:\a remainder %I64d not equal %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);
if ((llTable[dw].llDivisor < 0LL) && (llRemainder <= llTable[dw].llDivisor)
|| (llTable[dw].llDivisor > 0LL) && (llRemainder >= llTable[dw].llDivisor))
PrintFormat(hOutput,
"\t%I64d %% %I64d:\a remainder %I64d not less divisor\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder);
if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
PrintFormat(hOutput,
"\t%I64d %% %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);
}
PrintFormat(hOutput, "\r\nTiming 64-bit division and multiplication on %.48hs\r\n", dwCPUID);
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT0))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT0))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0xAD93D23594C935A9
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwQuotient = _aullnop(qwLeft, qwRight);
// 64-bit linear feedback shift register (Galois form)
// using primitive polynomial 0x2B5926535897936B
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwRemainder = _aullnop(qwLeft, qwRight);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT1))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT1))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwQuotient = qwLeft / qwRight;
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwQuotient = qwLeft / qwRight;
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT2))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT2))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwRemainder = qwLeft % qwRight;
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwRemainder = qwLeft % qwRight;
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT3))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT3))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwQuotient = qwLeft / qwRight;
qwRemainder = qwLeft % qwRight;
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwQuotient = qwLeft / qwRight;
qwRemainder = qwLeft % qwRight;
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT4))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT4))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwQuotient = qwLeft * qwRight;
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwRemainder = qwLeft * qwRight;
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT5))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT5))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwQuotient = (SQWORD) qwLeft / (SQWORD) qwRight;
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwQuotient = (SQWORD) qwLeft / (SQWORD) qwRight;
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT6))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT6))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwRemainder = (SQWORD) qwLeft % (SQWORD) qwRight;
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwRemainder = (SQWORD) qwLeft % (SQWORD) qwRight;
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT7))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT7))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwQuotient = (SQWORD) qwLeft / (SQWORD) qwRight;
qwRemainder = (SQWORD) qwLeft % (SQWORD) qwRight;
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwQuotient = (SQWORD) qwLeft / (SQWORD) qwRight;
qwRemainder = (SQWORD) qwLeft % (SQWORD) qwRight;
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT8))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT8))
#endif
ExitProcess(GetLastError());
qwT9 = qwT8 - qwT0;
qwT8 -= qwT7;
qwT7 -= qwT6;
qwT6 -= qwT5;
qwT5 -= qwT4;
qwT4 -= qwT3;
qwT3 -= qwT2;
qwT2 -= qwT1;
qwT1 -= qwT0;
#ifdef CYCLES
PrintFormat(hOutput,
"\r\n"
"_aullnop() %6lu.%09lu 0\r\n"
"_aulldiv() %6lu.%09lu %6lu.%09lu\r\n"
"_aullrem() %6lu.%09lu %6lu.%09lu\r\n"
"_aulldvrm() %6lu.%09lu %6lu.%09lu\r\n"
"_allmul() %6lu.%09lu %6lu.%09lu\r\n"
"_alldiv() %6lu.%09lu %6lu.%09lu\r\n"
"_allrem() %6lu.%09lu %6lu.%09lu\r\n"
"_alldvrm() %6lu.%09lu %6lu.%09lu\r\n"
" %6lu.%09lu clock cycles\r\n",
__edivmodu(qwT1, 1000000000UL),
__edivmodu(qwT2, 1000000000UL), __edivmodu(qwT2 - qwT1, 1000000000UL),
__edivmodu(qwT3, 1000000000UL), __edivmodu(qwT3 - qwT1, 1000000000UL),
__edivmodu(qwT4, 1000000000UL), __edivmodu(qwT4 - qwT1, 1000000000UL),
__edivmodu(qwT5, 1000000000UL), __edivmodu(qwT5 - qwT1, 1000000000UL),
__edivmodu(qwT6, 1000000000UL), __edivmodu(qwT6 - qwT1, 1000000000UL),
__edivmodu(qwT7, 1000000000UL), __edivmodu(qwT7 - qwT1, 1000000000UL),
__edivmodu(qwT8, 1000000000UL), __edivmodu(qwT8 - qwT1, 1000000000UL),
__edivmodu(qwT9, 1000000000UL));
#else // CYCLES
PrintFormat(hOutput,
"\r\n"
"_aullnop() %6lu.%07lu 0\r\n"
"_aulldiv() %6lu.%07lu %6lu.%07lu\r\n"
"_aullrem() %6lu.%07lu %6lu.%07lu\r\n"
"_aulldvrm() %6lu.%07lu %6lu.%07lu\r\n"
"_allmul() %6lu.%07lu %6lu.%07lu\r\n"
"_alldiv() %6lu.%07lu %6lu.%07lu\r\n"
"_allrem() %6lu.%07lu %6lu.%07lu\r\n"
"_alldvrm() %6lu.%07lu %6lu.%07lu\r\n"
" %6lu.%07lu nano-seconds\r\n",
__edivmodu(qwT1, 10000000UL),
__edivmodu(qwT2, 10000000UL), __edivmodu(qwT2 - qwT1, 10000000UL),
__edivmodu(qwT3, 10000000UL), __edivmodu(qwT3 - qwT1, 10000000UL),
__edivmodu(qwT4, 10000000UL), __edivmodu(qwT4 - qwT1, 10000000UL),
__edivmodu(qwT5, 10000000UL), __edivmodu(qwT5 - qwT1, 10000000UL),
__edivmodu(qwT6, 10000000UL), __edivmodu(qwT6 - qwT1, 10000000UL),
__edivmodu(qwT7, 10000000UL), __edivmodu(qwT7 - qwT1, 10000000UL),
__edivmodu(qwT8, 10000000UL), __edivmodu(qwT8 - qwT1, 10000000UL),
__edivmodu(qwT9, 10000000UL));
#endif // CYCLES
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT0))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT0))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
qwQuotient = _aullnop(qwDividend, qwDivisor);
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
qwRemainder = _aullnop(qwDividend, qwDivisor);
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT1))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT1))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
qwQuotient = qwDividend / qwDivisor;
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
qwQuotient = qwDividend / qwDivisor;
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT2))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT2))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
qwRemainder = qwDividend % qwDivisor;
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
qwRemainder = qwDividend % qwDivisor;
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT3))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT3))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
qwQuotient = qwDividend / qwDivisor;
qwRemainder = qwDividend % qwDivisor;
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
qwQuotient = qwDividend / qwDivisor;
qwRemainder = qwDividend % qwDivisor;
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT4))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT4))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
qwQuotient = qwDividend * qwDivisor;
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
qwRemainder = qwDividend * qwDivisor;
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT5))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT5))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwDividend = __ll_rshift(qwLeft, qwLeft /* & 31 */);
qwQuotient = (SQWORD) qwDividend / (SQWORD) qwDivisor;
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwDivisor = __ll_rshift(qwRight, qwRight /* & 31 */);
qwQuotient = (SQWORD) qwDividend / (SQWORD) qwDivisor;
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT6))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT6))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwDividend = __ll_rshift(qwLeft, qwLeft /* & 31 */);
qwRemainder = (SQWORD) qwDividend % (SQWORD) qwDivisor;
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwDivisor = __ll_rshift(qwRight, qwRight /* & 31 */);
qwRemainder = (SQWORD) qwDividend % (SQWORD) qwDivisor;
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT7))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT7))
#endif
ExitProcess(GetLastError());
for (dw = 500000000UL; dw > 0UL; dw--)
{
qwLeft = (qwLeft << 1)
^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
qwDividend = __ll_rshift(qwLeft, qwLeft /* & 31 */);
qwQuotient = (SQWORD) qwDividend / (SQWORD) qwDivisor;
qwRemainder = (SQWORD) qwDividend % (SQWORD) qwDivisor;
qwRight = (qwRight >> 1)
^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
qwDivisor = __ll_rshift(qwRight, qwRight /* & 31 */);
qwQuotient = (SQWORD) qwDividend / (SQWORD) qwDivisor;
qwRemainder = (SQWORD) qwDividend % (SQWORD) qwDivisor;
}
#ifdef CYCLES
if (!QueryThreadCycleTime(hThread, &qwT8))
#else
if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT8))
#endif
ExitProcess(GetLastError());
qwT9 = qwT8 - qwT0;
qwT8 -= qwT7;
qwT7 -= qwT6;
qwT6 -= qwT5;
qwT5 -= qwT4;
qwT4 -= qwT3;
qwT3 -= qwT2;
qwT2 -= qwT1;
qwT1 -= qwT0;
#ifdef CYCLES
PrintFormat(hOutput,
"\r\n"
"_aullnop() %6lu.%09lu 0\r\n"
"_aulldiv() %6lu.%09lu %6lu.%09lu\r\n"
"_aullrem() %6lu.%09lu %6lu.%09lu\r\n"
"_aulldvrm() %6lu.%09lu %6lu.%09lu\r\n"
"_allmul() %6lu.%09lu %6lu.%09lu\r\n"
"_alldiv() %6lu.%09lu %6lu.%09lu\r\n"
"_allrem() %6lu.%09lu %6lu.%09lu\r\n"
"_alldvrm() %6lu.%09lu %6lu.%09lu\r\n"
" %6lu.%09lu clock cycles\r\n",
__edivmodu(qwT1, 1000000000UL),
__edivmodu(qwT2, 1000000000UL), __edivmodu(qwT2 - qwT1, 1000000000UL),
__edivmodu(qwT3, 1000000000UL), __edivmodu(qwT3 - qwT1, 1000000000UL),
__edivmodu(qwT4, 1000000000UL), __edivmodu(qwT4 - qwT1, 1000000000UL),
__edivmodu(qwT5, 1000000000UL), __edivmodu(qwT5 - qwT1, 1000000000UL),
__edivmodu(qwT6, 1000000000UL), __edivmodu(qwT6 - qwT1, 1000000000UL),
__edivmodu(qwT7, 1000000000UL), __edivmodu(qwT7 - qwT1, 1000000000UL),
__edivmodu(qwT8, 1000000000UL), __edivmodu(qwT8 - qwT1, 1000000000UL),
__edivmodu(qwT9, 1000000000UL));
#else // CYCLES
PrintFormat(hOutput,
"\r\n"
"_aullnop() %6lu.%07lu 0\r\n"
"_aulldiv() %6lu.%07lu %6lu.%07lu\r\n"
"_aullrem() %6lu.%07lu %6lu.%07lu\r\n"
"_aulldvrm() %6lu.%07lu %6lu.%07lu\r\n"
"_allmul() %6lu.%07lu %6lu.%07lu\r\n"
"_alldiv() %6lu.%07lu %6lu.%07lu\r\n"
"_allrem() %6lu.%07lu %6lu.%07lu\r\n"
"_alldvrm() %6lu.%07lu %6lu.%07lu\r\n"
" %6lu.%07lu nano-seconds\r\n",
__edivmodu(qwT1, 10000000UL),
__edivmodu(qwT2, 10000000UL), __edivmodu(qwT2 - qwT1, 10000000UL),
__edivmodu(qwT3, 10000000UL), __edivmodu(qwT3 - qwT1, 10000000UL),
__edivmodu(qwT4, 10000000UL), __edivmodu(qwT4 - qwT1, 10000000UL),
__edivmodu(qwT5, 10000000UL), __edivmodu(qwT5 - qwT1, 10000000UL),
__edivmodu(qwT6, 10000000UL), __edivmodu(qwT6 - qwT1, 10000000UL),
__edivmodu(qwT7, 10000000UL), __edivmodu(qwT7 - qwT1, 10000000UL),
__edivmodu(qwT8, 10000000UL), __edivmodu(qwT8 - qwT1, 10000000UL),
__edivmodu(qwT9, 10000000UL));
#endif // CYCLES
ExitProcess(ERROR_SUCCESS);
}
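The pseudo-random operands of the timing loops in the program above come from two 64-bit linear feedback shift registers in Galois form, one shifting left and one shifting right. The following portable C sketch isolates these two generators; the function names `lfsr_left()` and `lfsr_right()` are assumptions made for this illustration. Unlike the program, which builds the feedback mask of the left-shifting register with an arithmetic right shift of a signed value, the sketch negates the shifted-out bit as an unsigned value, which does not depend on implementation-defined behaviour.

```c
#include <stdint.h>

/* left-shifting Galois LFSR with the feedback polynomial used in the program */
static uint64_t lfsr_left(uint64_t state)
{
    /* 0 - (state >> 63) is all ones if the most significant bit is set, else 0 */
    uint64_t mask = 0ULL - (state >> 63);
    return (state << 1) ^ (mask & 0xAD93D23594C935A9ULL);
}

/* right-shifting Galois LFSR with the feedback polynomial used in the program */
static uint64_t lfsr_right(uint64_t state)
{
    /* 0 - (state & 1) is all ones if the least significant bit is set, else 0 */
    uint64_t mask = 0ULL - (state & 1ULL);
    return (state >> 1) ^ (mask & 0x2B5926535897936BULL);
}
```

A state whose shifted-out bit is clear is simply shifted; a state whose shifted-out bit is set additionally has the polynomial mixed in. Since a non-zero state never becomes zero, the divisor sequence never starts from 0.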
Save the ANSI C source presented above as i386-i64.c in the directory where you created the object library i386.lib before, then execute the following 4 command lines to compile it, link the generated object file i386-i64.obj with the routines from the object library i386.lib, and finally execute the image file i386-i64.exe:
SET CL=/GAFy /Oxy /W4 /Zl
SET LINK=/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
CL.EXE /DCYCLES i386-i64.c i386.lib kernel32.lib user32.lib
.\i386-i64.exe
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-i64.c
i386-i64.c(128) : warning C4100: 'ullRight' : unreferenced formal parameter
i386-i64.c(128) : warning C4100: 'ullLeft' : unreferenced formal parameter
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
/out:i386-i64.exe
i386-i64.obj
i386.lib
kernel32.lib
user32.lib
Testing unsigned 64-bit division... -58
Testing signed 64-bit division... -44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
_aullnop()       5.130620616     0
_aulldiv()      24.916274238    19.785653622
_aullrem()      24.686248015    19.555627399
_aulldvrm()     25.947651690    20.817031074
_allmul()        6.753417214     1.622796598
_alldiv()       27.691010847    22.560390231
_allrem()       27.923880075    22.793259459
_alldvrm()      29.695561448    24.564940832
               172.744664143 clock cycles
_aullnop()       8.388855142     0
_aulldiv()      25.816723410    17.427868268
_aullrem()      25.383319447    16.994464305
_aulldvrm()     26.106060709    17.717205567
_allmul()       10.095017621     1.706162479
_alldiv()       30.421659163    22.032804021
_allrem()       30.961920386    22.573065244
_alldvrm()      32.481759875    24.092904733
               189.655315753 clock cycles
Testing unsigned 64-bit division... -58
Testing signed 64-bit division... -44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
_aullnop()       4.323464204     0
_aulldiv()      20.430453818    16.106989614
_aullrem()      20.833517940    16.510053736
_aulldvrm()     21.894735828    17.571271624
_allmul()        5.704146716     1.380682512
_alldiv()       23.409770870    19.086306666
_allrem()       23.662985522    19.339521318
_alldvrm()      25.053725237    20.730261033
               145.312800135 clock cycles
_aullnop()       7.128891691     0
_aulldiv()      21.592044916    14.463153225
_aullrem()      21.438925969    14.310034278
_aulldvrm()     21.977489828    14.848598137
_allmul()        8.646452477     1.517560786
_alldiv()       25.699076924    18.570185233
_allrem()       26.119234835    18.990343144
_alldvrm()      27.478233473    20.349341782
               160.080350113 clock cycles
Oops: on this Intel® Core™ i7 processor, the division routines for signed 64-bit integers are up to 37 % slower than those for 64-bit unsigned integers.
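The second column of each block appears to be the first column minus the time measured for the empty routine _aullnop(), and the percentages quoted in the text follow from these net figures. The following is a minimal C sketch of that arithmetic, using figures copied from the Core i7-8559U output; the helper names net_cycles(), percent_slower() and check_reported_figures() are made up for illustration and are not part of i386-i64.c:

```c
#include <assert.h>

/* Net cost of a routine: measured time minus the call overhead
   measured with the empty routine (hypothetical helper). */
double net_cycles(double measured, double overhead)
{
    return measured - overhead;
}

/* By how many percent is `slow` slower than `fast`? (hypothetical helper) */
double percent_slower(double slow, double fast)
{
    return (slow / fast - 1.0) * 100.0;
}

/* Checks against the reported figures: the net value of _aulldiv() in the
   first block, and _alldvrm() versus _aulldvrm() in the second block. */
void check_reported_figures(void)
{
    double net = net_cycles(20.430453818, 4.323464204);
    double pct = percent_slower(20.349341782, 14.848598137);

    assert(net > 16.106989 && net < 16.106990);   /* reported: 16.106989614 */
    assert(pct > 37.0 && pct < 37.1);             /* "up to 37 % slower" */
}
```

Comparing the net figures rather than the raw ones keeps the constant call overhead from diluting the ratio.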
Copy i386-i64.exe to systems with other processors and execute it there too:
Testing unsigned 64-bit division... -58
Testing signed 64-bit division... -44
Timing 64-bit division and multiplication on AMD Ryzen 7 5700X 8-Core Processor
_aullnop()       5.122143146     0
_aulldiv()      20.223817270    15.101674124
_aullrem()      20.164726800    15.042583654
_aulldvrm()     21.157922084    16.035778938
_allmul()        7.981533048     2.859389902
_alldiv()       20.836136360    15.713993214
_allrem()       22.052193688    16.930050542
_alldvrm()      21.613653936    16.491510790
               139.152126332 clock cycles
_aullnop()       7.683783760     0
_aulldiv()      21.300318504    13.616534744
_aullrem()      21.362844062    13.679060302
_aulldvrm()     22.265051198    14.581267438
_allmul()        8.956695616     1.272911856
_alldiv()       23.670479992    15.986696232
_allrem()       24.239343886    16.555560126
_alldvrm()      25.328227580    17.644443820
               154.806744598 clock cycles
Testing unsigned 64-bit division... -58
Testing signed 64-bit division... -44
Timing 64-bit division and multiplication on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz
_aullnop()       8.978870978     0
_aulldiv()      41.378538512    32.399667534
_aullrem()      41.459120072    32.480249094
_aulldvrm()     42.496702958    33.517831980
_allmul()       13.102594044     4.123723066
_alldiv()       48.877112646    39.898241668
_allrem()       48.630810323    39.651939345
_alldvrm()      57.155683201    48.176812223
               302.079432734 clock cycles
_aullnop()      13.583675374     0
_aulldiv()      41.220087960    27.636412586
_aullrem()      39.885909615    26.302234241
_aulldvrm()     41.619714690    28.036039316
_allmul()       18.266469971     4.682794597
_alldiv()       52.360017066    38.776341692
_allrem()       52.759947948    39.176272574
_alldvrm()      58.291439368    44.707763994
               317.987261992 clock cycles
Oops: on this 15 year old Intel® Core™2 processor, the division routines for signed 64-bit integers are up to 54 % slower than those for unsigned 64-bit integers.
Generate the import library i386.lib and link the object file i386-i64.obj with it, then execute the image file i386-i64.exe, now calling the (several times slower) 64-bit integer division and multiplication routines of NTDLL.dll:
LINK.EXE /LIB /DEF /EXPORT:_alldiv /EXPORT:_alldvrm /EXPORT:_allmul /EXPORT:_allrem /EXPORT:_allshl /EXPORT:_allshr /EXPORT:_aulldiv /EXPORT:_aulldvrm /EXPORT:_aullrem /EXPORT:_aullshr /MACHINE:I386 /NAME:NTDLL /OUT:i386.lib
LINK.EXE i386-i64.obj
.\i386-i64.exe
Note: the existing object library i386.lib and the image file i386-i64.exe are overwritten!
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
Creating library i386.lib and object i386.exp
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
Testing unsigned 64-bit division... -58
Testing signed 64-bit division... -44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
_aullnop()       5.173417920     0
_aulldiv()     102.145250711    96.971832791
_aullrem()     103.746558934    98.573141014
_aulldvrm()    103.711363598    98.537945678
_allmul()        9.640065030     4.466647110
_alldiv()      108.071904817   102.898486897
_allrem()      109.348691219   104.175273299
_alldvrm()     111.349818320   106.176400400
               653.187070549 clock cycles
_aullnop()       8.391292312     0
_aulldiv()      70.546796218    62.155503906
_aullrem()      72.698974389    64.307682077
_aulldvrm()     72.715990565    64.324698253
_allmul()       12.941190977     4.549898665
_alldiv()       85.064559941    76.673267629
_allrem()       86.951781779    78.560489467
_alldvrm()      89.225890859    80.834598547
               498.536477040 clock cycles
Testing unsigned 64-bit division... -58
Testing signed 64-bit division... -44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
_aullnop()       4.293480684     0
_aulldiv()      88.508826840    84.215346156
_aullrem()      86.771270732    82.477790048
_aulldvrm()     88.330881954    84.037401270
_allmul()        8.473599720     4.180119036
_alldiv()       91.503580475    87.210099791
_allrem()       92.444818641    88.151337957
_alldvrm()      94.478769821    90.185289137
               554.805228867 clock cycles
_aullnop()       7.264421067     0
_aulldiv()      61.477534879    54.213113812
_aullrem()      60.671739257    53.407318190
_aulldvrm()     60.621554727    53.357133660
_allmul()       11.255056723     3.990635656
_alldiv()       71.552476501    64.288055434
_allrem()       72.850638899    65.586217832
_alldvrm()      73.504622168    66.240201101
               419.198044221 clock cycles
OUCH: here Microsoft’s division routines are 3.2 to 5.5 times slower than those presented above, and their multiplication routine is 2.5 to 4.5 times slower!
Also copy this variant of i386-i64.exe to systems with other processors and execute it there:
Testing unsigned 64-bit division... -58
Testing signed 64-bit division... -44
Timing 64-bit division and multiplication on AMD Ryzen 7 5700X 8-Core Processor
_aullnop()       5.121743374     0
_aulldiv()      55.833332662    50.711589288
_aullrem()      56.467057234    51.345313860
_aulldvrm()     58.395325278    53.273581904
_allmul()        7.690866062     2.569122688
_alldiv()       60.314670158    55.192926784
_allrem()       62.087331252    56.965587878
_alldvrm()      63.125034508    58.003291134
               369.035360528 clock cycles
_aullnop()       7.682569518     0
_aulldiv()      36.154348392    28.471778874
_aullrem()      37.041158934    29.358589416
_aulldvrm()     39.300191898    31.617622380
_allmul()       10.278630848     2.596061330
_alldiv()       45.203128180    37.520558662
_allrem()       46.432703432    38.750133914
_alldvrm()      47.750769056    40.068199538
               269.843500258 clock cycles
OUCH: there Microsoft’s division routines are up to 4 times slower than those presented above, and their multiplication routine is up to 2 times slower!
Testing unsigned 64-bit division... -58
Testing signed 64-bit division... -44
Timing 64-bit division and multiplication on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz
_aullnop()       8.952309405     0
_aulldiv()     129.226132812   120.273823407
_aullrem()     134.708512677   125.756203272
_aulldvrm()    143.453665206   134.501355801
_allmul()       17.856662118     8.904352713
_alldiv()      151.624513041   142.672203636
_allrem()      152.423786688   143.471477283
_alldvrm()     171.419909639   162.467600234
               909.665491586 clock cycles
_aullnop()      13.637814045     0
_aulldiv()      97.951782280    84.313968235
_aullrem()     103.122246554    89.484432509
_aulldvrm()    108.556615922    94.918801877
_allmul()       23.786421340    10.148607295
_alldiv()      117.589788695   103.951974650
_allrem()      119.202419933   105.564605888
_alldvrm()     124.590005377   110.952191332
               708.437094146 clock cycles
OUCH: all Microsoft routines are more than 2 times slower than those presented above, their division routines are even up to 4 times slower!
Finally execute the following 10 command lines to recreate the object library i386.lib, but now with the division routines for processors which feature speculative execution, then link the object file i386-i64.obj generated before with it and execute the image file i386-i64.exe:
SET ML=/c /DJCCLESS /safeseh /W3 /X
ML.EXE alldiv.asm
ML.EXE alldvrm.asm
ML.EXE allrem.asm
ML.EXE aulldiv.asm
ML.EXE aulldvrm.asm
ML.EXE aullrem.asm
LINK.EXE /LIB /OUT:i386.lib alldiv.obj alldvrm.obj allmul.obj alloca.obj alloca8.obj alloca16.obj allrem.obj allshl.obj allshr.obj aulldiv.obj aulldvrm.obj aullrem.obj aullshr.obj
LINK.EXE i386-i64.obj i386.lib kernel32.lib user32.lib
.\i386-i64.exe
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: alldiv.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: alldvrm.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: allrem.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: aulldiv.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: aulldvrm.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: aullrem.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
Creating library i386.lib and object i386.exp
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
Testing unsigned 64-bit division... -58
Testing signed 64-bit division... -44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
_aullnop()       5.130523022     0
_aulldiv()      22.148703486    17.018180464
_aullrem()      25.618235017    20.487711995
_aulldvrm()     27.484994154    22.354471132
_allmul()        6.756777960     1.626254938
_alldiv()       30.000776927    24.870253905
_allrem()       31.364631604    26.234108582
_alldvrm()      34.146860035    29.016337013
               182.651502205 clock cycles
_aullnop()       8.384999331     0
_aulldiv()      26.479597369    18.094598038
_aullrem()      27.445690820    19.060691489
_aulldvrm()     28.586813456    20.201814125
_allmul()       10.095121587     1.710122256
_alldiv()       31.576787296    23.191787965
_allrem()       33.327090912    24.942091581
_alldvrm()      35.484606261    27.099606930
               201.380707032 clock cycles
Testing unsigned 64-bit division... -58
Testing signed 64-bit division... -44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
_aullnop()       4.441024919     0
_aulldiv()      19.396569893    14.955544974
_aullrem()      22.313397424    17.872372505
_aulldvrm()     23.100780021    18.659755102
_allmul()        5.659121355     1.218096436
_alldiv()       25.226574313    20.785549394
_allrem()       26.355810792    21.914785873
_alldvrm()      28.902499259    24.461474340
               155.395777976 clock cycles
_aullnop()       7.081996184     0
_aulldiv()      22.156429579    15.074433395
_aullrem()      22.934132959    15.852136775
_aulldvrm()     23.886275722    16.804279538
_allmul()        8.362922762     1.280926578
_alldiv()       26.645217765    19.563221581
_allrem()       27.942975969    20.860979785
_alldvrm()      29.720484454    22.638488270
               168.730435394 clock cycles
Oops: on this Intel® Core™ i7 processor, the branch-avoiding division routine runs for big unsigned integers about 7.5 % faster than its conditionally branching variant, while the others are up to 15 % slower.
Again copy this variant of i386-i64.exe to systems with other processors and execute it there too:
Testing unsigned 64-bit division... -58
Testing signed 64-bit division... -44
Timing 64-bit division and multiplication on AMD Ryzen 7 5700X 8-Core Processor
_aullnop()       5.120327104     0
_aulldiv()      17.308029902    12.187702798
_aullrem()      18.250468490    13.130141386
_aulldvrm()     20.019941914    14.899614810
_allmul()        7.951225720     2.830898616
_alldiv()       22.051554624    16.931227520
_allrem()       23.685382974    18.565055870
_alldvrm()      24.193433346    19.073106242
               138.580364074 clock cycles
_aullnop()       7.679050314     0
_aulldiv()      21.642133458    13.963083144
_aullrem()      21.867652058    14.188601744
_aulldvrm()     23.638699750    15.959649436
_allmul()        8.959283832     1.280233518
_alldiv()       24.276285226    16.597234912
_allrem()       24.838598612    17.159548298
_alldvrm()      26.549134226    18.870083912
               159.450837476 clock cycles
Oops: on this AMD® Ryzen™ processor, the branch-avoiding division routines run up to 20 % faster than their conditionally branching variants for big unsigned integers, but otherwise up to 15 % slower.
Testing unsigned 64-bit division... -58
Testing signed 64-bit division... -44
Timing 64-bit division and multiplication on Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz
_aullnop()       8.956408592     0
_aulldiv()      40.006629475    31.050220883
_aullrem()      45.331733367    36.375324775
_aulldvrm()     52.817756545    43.861347953
_allmul()       13.100714751     4.144306159
_alldiv()       53.679959575    44.723550983
_allrem()       60.075486247    51.119077655
_alldvrm()      73.557314195    64.600905603
               347.526002747 clock cycles
_aullnop()      13.582502272     0
_aulldiv()      43.965933963    30.383431691
_aullrem()      46.175253389    32.592751117
_aulldvrm()     49.844225982    36.261723710
_allmul()       18.252733959     4.670231687
_alldiv()       53.847451160    40.264948888
_allrem()       59.859327003    46.276824731
_alldvrm()      65.922751282    52.340249010
               351.450179010 clock cycles
OOPS: contrary to the expected results, on this 15 year old Intel® Core™2 processor the branch-avoiding division routines run up to 34 % slower than their conditionally branching variants!
_rotl64() and _rotr64() Intrinsic Functions for i386 Platform
The Visual C compiler provides the intrinsic functions _rotl64() and _rotr64() for rotation of 64-bit integers, but with a stupid implementation.
Compiler intrinsics
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
unsigned long long _allrol(unsigned long long value, unsigned int count)
{
return _rotl64(value, count);
}
unsigned long long _allror(unsigned long long value, unsigned int count)
{
return _rotr64(value, count);
}
Save the ANSI C source presented above as i386-rotate.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-rotate.c
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-rotate.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE C:\Users\Stefan\Desktop\i386-rotate.c
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC __allrol
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-rotate.c
_TEXT SEGMENT
_value$ = 8 ; size = 8
_count$ = 16 ; size = 4
__allrol PROC
; 5 : return _rotl64(value, count);
  00000  8a 4c 24 0c      mov cl, BYTE PTR _count$[esp-4]
  00004  56               push esi
  00005  8b 74 24 08      mov esi, DWORD PTR _value$[esp]
  00009  57               push edi
  0000a  8b 7c 24 10      mov edi, DWORD PTR _value$[esp+8]
  0000e  8b c6            mov eax, esi
  00010  8b d7            mov edx, edi
  00012  f6 c1 20         test cl, 32 ; 00000020H
  00015  74 04            je SHORT $LN3@allrol
  00017  8b c7            mov eax, edi
  00019  8b d6            mov edx, esi
$LN3@allrol:
  0001b  80 e1 1f         and cl, 31 ; 0000001fH
  0001e  74 12            je SHORT $LN4@allrol
  00020  8b f0            mov esi, eax
  00022  8b c2            mov eax, edx
  00024  8b d6            mov edx, esi
  00026  0f a5 c2         shld edx, eax, cl
  00029  0f a5 f0         shld eax, esi, cl
  0002c  8b ca            mov ecx, edx
  0002e  8b d0            mov edx, eax
  00030  8b c1            mov eax, ecx
$LN4@allrol:
; 6 : }
  00032  5f               pop edi
  00033  5e               pop esi
  00034  c3               ret 0
__allrol ENDP
_TEXT ENDS
PUBLIC __allror
; Function compile flags: /Ogtpy
_TEXT SEGMENT
_value$ = 8 ; size = 8
_count$ = 16 ; size = 4
__allror PROC
; 10 : return _rotr64(value, count);
  00040  8a 4c 24 0c      mov cl, BYTE PTR _count$[esp-4]
  00044  56               push esi
  00045  8b 74 24 08      mov esi, DWORD PTR _value$[esp]
  00049  57               push edi
  0004a  8b 7c 24 10      mov edi, DWORD PTR _value$[esp+8]
  0004e  8b c6            mov eax, esi
  00050  8b d7            mov edx, edi
  00052  f6 c1 20         test cl, 32 ; 00000020H
  00055  74 04            je SHORT $LN3@allror
  00057  8b c7            mov eax, edi
  00059  8b d6            mov edx, esi
$LN3@allror:
  0005b  80 e1 1f         and cl, 31 ; 0000001fH
  0005e  74 12            je SHORT $LN4@allror
  00060  8b f0            mov esi, eax
  00062  8b c2            mov eax, edx
  00064  8b d6            mov edx, esi
  00066  0f ad c2         shrd edx, eax, cl
  00069  0f ad f0         shrd eax, esi, cl
  0006c  8b ca            mov ecx, edx
  0006e  8b d0            mov edx, eax
  00070  8b c1            mov eax, ecx
$LN4@allror:
; 11 : }
  00072  5f               pop edi
  00073  5e               pop esi
  00074  c3               ret 0
__allror ENDP
_TEXT ENDS
END
With 24 instructions in 53 bytes, each function is as bad as such optimised code can get!
OUCH¹: instead of loading its 64-bit argument into register pair EDX:EAX and swapping the two registers if the shift count÷32 is odd, this stupid code first loads the 64-bit argument into register pair EDI:ESI, copies it from there into register pair EDX:EAX, then reloads the latter in reverse order from register pair EDI:ESI if the shift count÷32 is odd, clobbering registers EDI and ESI without necessity!
OUCH²: instead of loading register ESI from register EDX and then shifting the register pairs EDX:EAX and EAX:ESI, this braindead code swaps registers EAX and EDX through register ESI, shifts register pairs EDX:EAX and EAX:ESI, and finally swaps registers EAX and EDX once more, now through ECX!
Note: the evaluation of the code generated with the compiler options /Oisy is left as an exercise to the reader.
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
_allrol() and _allror() Functions in i386 Assembler
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; Common "cdecl" calling and naming convention for i386 platform:
; - arguments are pushed on stack in reverse order (from right to left),
; 4-byte aligned;
; - 64-bit integer arguments are passed as pair of 32-bit integer arguments,
; low part below high part;
; - 80-bit, 64-bit or 32-bit floating-point result is returned in FPU
; register ST0;
; - 64-bit integer result is returned in registers EAX (low part) and
; EDX (high part);
; - 32-bit integer or pointer result is returned in register EAX;
; - registers EAX, ECX and EDX are volatile and can be clobbered;
; - registers EBX, ESP, EBP, ESI and EDI must be preserved;
; - function names are prefixed with an underscore.
.386
.model flat, C
.code
_allrol proc public ; qword _allrol(qword value, dword count)
mov eax, [esp+4]
mov edx, [esp+8] ; edx:eax = value
mov ecx, [esp+12] ; ecx = count
ifndef SPACE
test cl, 63
jz short return ; count % 64 = 0?
endif
test cl, 32
jz short shift
swap:
xchg eax, edx
shift:
push ebx
mov ebx, edx
shld edx, eax, cl
shld eax, ebx, cl
pop ebx
return:
ret
_allrol endp
_allror proc public ; qword _allror(qword value, dword count)
mov eax, [esp+4]
mov edx, [esp+8] ; edx:eax = value
mov ecx, [esp+12] ; ecx = count
ifndef SPACE
test cl, 63
jz short return ; count % 64 = 0?
endif
test cl, 32
jz short shift
swap:
xchg eax, edx
shift:
push ebx
mov ebx, eax
shrd eax, edx, cl
shrd edx, ebx, cl
pop ebx
return:
ret
_allror endp
end
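The technique of the two assembler routines can be mirrored in portable C, which may help to check it against the plain 64-bit formula. The following is only a sketch under the assumption that the 64-bit value is handled as two 32-bit halves; the names rotl64_halves(), rotr64_halves() and rotate_selfcheck() are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left: swap the halves when bit 5 of the count is set,
   then shift the register pair by count % 32 (the two SHLD). */
uint64_t rotl64_halves(uint64_t value, unsigned count)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);

    if (count & 32) {                   /* XCHG EAX, EDX */
        uint32_t tmp = lo; lo = hi; hi = tmp;
    }
    count &= 31;
    if (count != 0) {                   /* a C shift by 32 would be undefined */
        uint32_t new_hi = (hi << count) | (lo >> (32 - count));
        uint32_t new_lo = (lo << count) | (hi >> (32 - count));
        hi = new_hi;
        lo = new_lo;
    }
    return ((uint64_t)hi << 32) | lo;
}

/* Rotate right: same swap, then the two SHRD. */
uint64_t rotr64_halves(uint64_t value, unsigned count)
{
    uint32_t lo = (uint32_t)value;
    uint32_t hi = (uint32_t)(value >> 32);

    if (count & 32) {
        uint32_t tmp = lo; lo = hi; hi = tmp;
    }
    count &= 31;
    if (count != 0) {
        uint32_t new_lo = (lo >> count) | (hi << (32 - count));
        uint32_t new_hi = (hi >> count) | (lo << (32 - count));
        lo = new_lo;
        hi = new_hi;
    }
    return ((uint64_t)hi << 32) | lo;
}

/* Compares both helpers with the usual 64-bit rotation formula. */
void rotate_selfcheck(void)
{
    const uint64_t v = 0x0123456789ABCDEFULL;
    unsigned c;

    for (c = 0; c < 128; ++c) {
        uint64_t l = (v << (c & 63)) | (v >> ((64 - c) & 63));
        uint64_t r = (v >> (c & 63)) | (v << ((64 - c) & 63));
        assert(rotl64_halves(v, c) == l);
        assert(rotr64_halves(v, c) == r);
    }
}
```

Unlike SHLD and SHRD, which mask their count to 5 bits, the C shift operators are undefined for a count of 32, hence the explicit test for a zero count.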
_abs64() Intrinsic Function for i386 Platform
The intrinsic function _abs64() alias llabs() provided by the Visual C compiler returns the absolute value of its 64-bit integer argument – but even this trivial routine is not fully optimised.
Compiler intrinsics
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
long long _allabs(long long argument)
{
return _abs64(argument);
}
Save the ANSI C source presented above as i386-magnitude.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-magnitude.c
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-magnitude.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE C:\Users\Stefan\Desktop\i386-magnitude.c
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC __allabs
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-magnitude.c
; COMDAT __allabs
_TEXT SEGMENT
_argument$ = 8 ; size = 8
__allabs PROC ; COMDAT
; 5 : return _abs64(argument);
  00000  8b 44 24 08      mov eax, DWORD PTR _argument$[esp]
  00004  8b 4c 24 04      mov ecx, DWORD PTR _argument$[esp-4]
  00008  99               cdq
  00009  33 c2            xor eax, edx
  0000b  33 ca            xor ecx, edx
  0000d  2b ca            sub ecx, edx
  0000f  1b c2            sbb eax, edx
  00011  8b d0            mov edx, eax
  00013  8b c1            mov eax, ecx
; 6 : }
  00015  c3               ret 0
__allabs ENDP
_TEXT ENDS
END
10 instructions in 22 bytes.
Spoiler: if you are curious whether the current version of the Visual C compiler has its flaws fixed, view its output in Compiler Explorer.
_allabs() Function in i386 Assembler
The following code exploits the commutativity of XOR, saving a MOV instruction and 2 bytes:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
_allabs proc public ; sqword _allabs(sqword argument)
mov eax, [esp+8]
mov ecx, [esp+4] ; eax:ecx = argument
cdq ; edx = (argument < 0) ? -1 : 0
add ecx, edx
adc eax, edx ; eax:ecx = (argument < 0) ? argument - 1 : argument
; = (argument < 0) ? ~-argument : argument
xor ecx, edx
xor edx, eax ; edx:ecx = (argument < 0) ? -argument : argument
; = |argument|
mov eax, ecx ; edx:eax = |argument|
ret
_allabs endp
end
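The add-then-XOR sequence of the assembler routine corresponds to a well-known branch-free identity. The following is a minimal C sketch of it; the names abs64_branchless() and abs64_selfcheck() are made up for illustration, and the result is returned as an unsigned integer (an assumption of this sketch) so that the most negative argument does not overflow:

```c
#include <assert.h>
#include <stdint.h>

/* |argument| without a branch: mask is 0 or all ones, like the value
   CDQ produces from the sign of the high dword. */
uint64_t abs64_branchless(int64_t argument)
{
    uint64_t mask = (argument < 0) ? UINT64_MAX : 0;

    /* argument + mask is argument - 1 for negative arguments, and
       inverting all bits of (argument - 1) yields -argument. */
    return ((uint64_t)argument + mask) ^ mask;
}

/* A few checks, including both ends of the value range. */
void abs64_selfcheck(void)
{
    assert(abs64_branchless(0) == 0);
    assert(abs64_branchless(5) == 5);
    assert(abs64_branchless(-5) == 5);
    assert(abs64_branchless(INT64_MAX) == 0x7FFFFFFFFFFFFFFFULL);
    assert(abs64_branchless(INT64_MIN) == 0x8000000000000000ULL);
}
```

The variant generated by the compiler applies the same mask in the opposite order, (argument ^ mask) - mask, which gives the same result.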
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
long long _allneg(long long argument)
{
return -argument;
}
Save the ANSI C source presented above as i386-negate.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-negate.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-negate.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE C:\Users\Stefan\Desktop\i386-negate.c
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC __allneg
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-negate.c
; COMDAT __allneg
_TEXT SEGMENT
_argument$ = 8 ; size = 8
__allneg PROC ; COMDAT
; 5 : return -argument;
  00000  8b 44 24 04      mov eax, DWORD PTR _argument$[esp-4]
  00004  8b 54 24 08      mov edx, DWORD PTR _argument$[esp]
  00008  f7 d8            neg eax
  0000a  83 d2 00         adc edx, 0
  0000d  f7 da            neg edx
; 6 : }
  0000f  c3               ret 0
__allneg ENDP
_TEXT ENDS
END
6 instructions in 16 bytes.
Spoiler: if you are curious whether the current version of the Visual C compiler has its flaws fixed, view its output in Compiler Explorer.
_allneg() Function in i386 Assembler
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
_allneg proc public ; sqword _allneg(sqword argument)
xor eax, eax
cdq ; edx:eax = 0
sub eax, [esp+4]
sbb edx, [esp+8] ; edx:eax = -argument
ret
_allneg endp
end
Note: the following code for inline use performs
the negation in place; it avoids one of the dependencies of the code
generated by the Visual C compiler:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
_allneg proc public ; sqword _allneg(sqword argument)
mov edx, [esp+8]
mov eax, [esp+4] ; edx:eax = argument
not edx
neg eax
sbb edx, -1 ; edx:eax = -argument
ret
_allneg endp
end
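The NOT, NEG, SBB sequence can be followed step by step in C. The following is only a sketch, assuming the 64-bit value is split into two 32-bit halves and the carry flag is modelled as a plain integer; the names neg64_halves() and neg64_selfcheck() are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Two's complement negation on 32-bit halves. */
uint64_t neg64_halves(uint64_t argument)
{
    uint32_t lo = (uint32_t)argument;
    uint32_t hi = (uint32_t)(argument >> 32);
    uint32_t carry;

    hi = ~hi;                   /* NOT EDX                                */
    carry = (lo != 0);          /* NEG EAX sets CF unless EAX was 0       */
    lo = 0u - lo;
    hi = hi + 1u - carry;       /* SBB EDX, -1 computes EDX - (-1) - CF   */

    return ((uint64_t)hi << 32) | lo;
}

/* Compares the helper with the plain 64-bit negation. */
void neg64_selfcheck(void)
{
    static const uint64_t samples[] = {
        0, 1, 5, 0xFFFFFFFFULL, 0x100000000ULL,
        0x8000000000000000ULL, UINT64_MAX
    };
    unsigned i;

    for (i = 0; i < sizeof samples / sizeof samples[0]; ++i)
        assert(neg64_halves(samples[i]) == 0 - samples[i]);
}
```

The high half receives the increment of the two's complement only when the low half is zero, which is exactly the case where NEG leaves the carry flag clear.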
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
void blunder(long long *argument)
{
*argument = -*argument;
}
Save the ANSI C source presented above as i386-blunder.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-blunder.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-blunder.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE C:\Users\Stefan\Desktop\i386-blunder.c
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC _blunder
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-blunder.c
; COMDAT _blunder
_TEXT SEGMENT
_argument$ = 8 ; size = 4
_blunder PROC ; COMDAT
; 5 : *argument = -*argument;
  00000  8b 44 24 04      mov eax, DWORD PTR _argument$[esp-4]
  00004  8b 08            mov ecx, DWORD PTR [eax]
  00006  8b 50 04         mov edx, DWORD PTR [eax+4]
  00009  f7 d9            neg ecx
  0000b  83 d2 00         adc edx, 0
  0000e  f7 da            neg edx
  00010  89 08            mov DWORD PTR [eax], ecx
  00012  89 50 04         mov DWORD PTR [eax+4], edx
; 6 : }
  00015  c3               ret 0
_blunder ENDP
_TEXT ENDS
END
OUCH: this atrocity is a perfect declaration of bankruptcy!
Even a non-optimising compiler should generate the following straightforward code, using 5 instructions in 14 bytes, instead of the 9 instructions in 22 bytes generated by the Visual C compiler:
.code
mov eax, [esp+4] ; eax = address of argument
neg dword ptr [eax]
not dword ptr [eax+4]
sbb dword ptr [eax+4], -1
ret
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
int _allsgn(long long argument)
{
return (argument > 0) - (argument < 0);
}
Save the ANSI C source presented above as i386-signum.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-signum.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-signum.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE C:\Users\Stefan\Desktop\i386-signum.c
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC __allsgn
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-signum.c
; COMDAT __allsgn
_TEXT SEGMENT
_argument$ = 8 ; size = 8
__allsgn PROC ; COMDAT
; 5 : return (argument > 0) - (argument < 0);
  00000  8b 4c 24 08      mov ecx, DWORD PTR _argument$[esp]
  00004  8b 54 24 04      mov edx, DWORD PTR _argument$[esp-4]
  00008  85 c9            test ecx, ecx
  0000a  7c 0d            jl SHORT $LN5@allsgn
  0000c  7f 04            jg SHORT $LN7@allsgn
  0000e  85 d2            test edx, edx
  00010  74 07            je SHORT $LN5@allsgn
$LN7@allsgn:
  00012  b8 01 00 00 00   mov eax, 1
  00017  eb 02            jmp SHORT $LN6@allsgn
$LN5@allsgn:
  00019  33 c0            xor eax, eax
$LN6@allsgn:
  0001b  85 c9            test ecx, ecx
  0001d  7f 0e            jg SHORT $LN3@allsgn
  0001f  7c 04            jl SHORT $LN8@allsgn
  00021  85 d2            test edx, edx
  00023  73 08            jae SHORT $LN3@allsgn
$LN8@allsgn:
  00025  b9 01 00 00 00   mov ecx, 1
  0002a  2b c1            sub eax, ecx
; 6 : }
  0002c  c3               ret 0
$LN3@allsgn:
; 5 : return (argument > 0) - (argument < 0);
  0002d  33 c9            xor ecx, ecx
  0002f  2b c1            sub eax, ecx
; 6 : }
  00031  c3               ret 0
__allsgn ENDP
_TEXT ENDS
END
21 instructions in 50 bytes.
OUCH¹: 6 (in words: six) superfluous conditional branch instructions – the optimiser and code generator of the Visual C compiler are a crime against all i386 processors since the Pentium®Pro and their users!
OUCH²: the first 2 highlighted instructions (MOV ECX, 1 and SUB EAX, ECX) should be replaced with a single DEC instruction, saving 6 bytes.
OUCH³: the last 2 highlighted instructions (XOR ECX, ECX and SUB EAX, ECX) should be replaced with a 4 byte NOP, which can then be removed together with the following RET instruction, saving 5 more bytes.
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
_allsgn() Function in i386 Assembler
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
_allsgn proc public ; int _allsgn(sqword argument)
if 0
xor eax, eax ; eax = 0
cmp eax, [esp+4] ; CF = (low dword of argument != 0)
mov ecx, [esp+8] ; ecx = high dword of argument
cdq ; edx = 0
sbb edx, ecx ; edx:... = 0 - argument
setl al ; eax = (argument > 0) ? 1 : 0
sar ecx, 31 ; ecx = (argument < 0) ? -1 : 0
add eax, ecx ; eax = (argument > 0) - (argument < 0)
; = {1, 0, -1}
elseif 0
mov eax, [esp+8] ; eax = high dword of argument
xor edx, edx ; edx = 0
cmp edx, [esp+4] ; CF = (low dword of argument != 0)
sbb edx, eax ; edx:... = 0 - argument
cdq ; edx = (argument < 0) ? -1 : 0
setl al
movzx eax, al ; eax = (argument > 0) ? 1 : 0
add eax, edx ; eax = (argument > 0) - (argument < 0)
; = {1, 0, -1}
else
xor eax, eax
cmp eax, [esp+4] ; CF = (low dword of argument != 0)
mov edx, [esp+8] ; edx = high dword of argument
sbb eax, edx ; eax:... = 0 - argument
sar edx, 31 ; edx = (argument < 0) ? -1 : 0
shr eax, 31 ; eax = (argument > 0) ? 1 : 0
or eax, edx ; eax = (argument > 0) - (argument < 0)
; = {1, 0, -1}
endif
ret
_allsgn endp
end
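The third (active) variant of the assembler routine derives both terms from sign bits only. The following is a minimal C sketch of that idea; the names sgn64_branchless() and sgn64_selfcheck() are made up for illustration, and the arithmetic is done on unsigned integers (an assumption of this sketch) to avoid overflow in C:

```c
#include <assert.h>
#include <stdint.h>

/* Signum from two sign bits: the sign of -argument tells whether the
   argument is positive, the sign of argument whether it is negative. */
int sgn64_branchless(int64_t argument)
{
    uint64_t negated = 0 - (uint64_t)argument;          /* wraps for INT64_MIN */
    int positive = (int)(negated >> 63);                /* like SHR EAX, 31    */
    int negative = -(int)((uint64_t)argument >> 63);    /* like SAR EDX, 31    */

    /* OR instead of ADD: for INT64_MIN both sign bits are set,
       and 1 | -1 still gives -1. */
    return positive | negative;
}

/* Checks all three results and both ends of the value range. */
void sgn64_selfcheck(void)
{
    assert(sgn64_branchless(0) == 0);
    assert(sgn64_branchless(1) == 1);
    assert(sgn64_branchless(-1) == -1);
    assert(sgn64_branchless(INT64_MAX) == 1);
    assert(sgn64_branchless(INT64_MIN) == -1);
}
```

The most negative argument is the only value whose negation is negative too, which is why the final combination must be an OR.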
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifdef UNSIGNED
int _aullcmp(unsigned long long p, unsigned long long q)
#else
int _allcmp(long long p, long long q)
#endif
{
return (p > q) - (p < q);
}
Save the ANSI C source presented above as i386-compare.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly a first time:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-compare.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-compare.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE C:\Users\Stefan\Desktop\i386-compare.c
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC __allcmp
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-compare.c
; COMDAT __allcmp
_TEXT SEGMENT
_p$ = 8 ; size = 8
_q$ = 16 ; size = 8
__allcmp PROC ; COMDAT
; 9 : return (p > q) - (p < q);
  00000  8b 4c 24 08      mov ecx, DWORD PTR _p$[esp]
  00004  8b 54 24 10      mov edx, DWORD PTR _q$[esp]
  00008  56               push esi
  00009  8b 74 24 10      mov esi, DWORD PTR _q$[esp]
  0000d  57               push edi
  0000e  8b 7c 24 0c      mov edi, DWORD PTR _p$[esp+4]
  00012  3b ca            cmp ecx, edx
  00014  7c 0d            jl SHORT $LN5@allcmp
  00016  7f 04            jg SHORT $LN7@allcmp
  00018  3b fe            cmp edi, esi
  0001a  76 07            jbe SHORT $LN5@allcmp
$LN7@allcmp:
  0001c  b8 01 00 00 00   mov eax, 1
  00021  eb 02            jmp SHORT $LN6@allcmp
$LN5@allcmp:
  00023  33 c0            xor eax, eax
$LN6@allcmp:
  00025  3b ca            cmp ecx, edx
  00027  7f 10            jg SHORT $LN3@allcmp
  00029  7c 04            jl SHORT $LN8@allcmp
  0002b  3b fe            cmp edi, esi
  0002d  73 0a            jae SHORT $LN3@allcmp
$LN8@allcmp:
  0002f  b9 01 00 00 00   mov ecx, 1
  00034  5f               pop edi
  00035  2b c1            sub eax, ecx
  00037  5e               pop esi
; 10 : }
  00038  c3               ret 0
$LN3@allcmp:
; 9 : return (p > q) - (p < q);
  00039  33 c9            xor ecx, ecx
  0003b  5f               pop edi
  0003c  2b c1            sub eax, ecx
  0003e  5e               pop esi
; 10 : }
  0003f  c3               ret 0
__allcmp ENDP
_TEXT ENDS
END
29 instructions in 64 bytes.
OUCH¹: 6 (in words: six) superfluous conditional branch instructions, and 2 registers clobbered without necessity – the optimiser and code generator of the Visual C compiler are a crime against all i386 processors since the Pentium®Pro and their users!
OUCH²: the first 2 highlighted instructions (MOV ECX, 1 and SUB EAX, ECX) should be replaced with a single DEC instruction, saving 6 bytes.
OUCH³: the last 2 highlighted instructions (XOR ECX, ECX and SUB EAX, ECX) should be replaced with two 2-byte NOPs, which can then be removed together with the 2 POP instructions and the following RET instruction, saving 7 more bytes.
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
Compile the source file i386-compare.c a second time, now with the preprocessor macro UNSIGNED defined, and display the generated assembly:
CL.EXE /DUNSIGNED i386-compare.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-compare.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE C:\Users\Stefan\Desktop\i386-compare.c
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC __aullcmp
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-compare.c
; COMDAT __aullcmp
_TEXT SEGMENT
_p$ = 8 ; size = 8
_q$ = 16 ; size = 8
__aullcmp PROC ; COMDAT
; 9 : return (p > q) - (p < q);
  00000  8b 4c 24 08      mov ecx, DWORD PTR _p$[esp]
  00004  8b 54 24 10      mov edx, DWORD PTR _q$[esp]
  00008  56               push esi
  00009  8b 74 24 10      mov esi, DWORD PTR _q$[esp]
  0000d  57               push edi
  0000e  8b 7c 24 0c      mov edi, DWORD PTR _p$[esp+4]
  00012  3b ca            cmp ecx, edx
  00014  72 0d            jb SHORT $LN5@aullcmp
  00016  77 04            ja SHORT $LN7@aullcmp
  00018  3b fe            cmp edi, esi
  0001a  76 07            jbe SHORT $LN5@aullcmp
$LN7@aullcmp:
  0001c  b8 01 00 00 00   mov eax, 1
  00021  eb 02            jmp SHORT $LN6@aullcmp
$LN5@aullcmp:
  00023  33 c0            xor eax, eax
$LN6@aullcmp:
  00025  3b ca            cmp ecx, edx
  00027  77 10            ja SHORT $LN3@aullcmp
  00029  72 04            jb SHORT $LN8@aullcmp
  0002b  3b fe            cmp edi, esi
  0002d  73 0a            jae SHORT $LN3@aullcmp
$LN8@aullcmp:
  0002f  b9 01 00 00 00   mov ecx, 1
  00034  5f               pop edi
  00035  2b c1            sub eax, ecx
  00037  5e               pop esi
; 10 : }
  00038  c3               ret 0
$LN3@aullcmp:
; 9 : return (p > q) - (p < q);
  00039  33 c9            xor ecx, ecx
  0003b  5f               pop edi
  0003c  2b c1            sub eax, ecx
  0003e  5e               pop esi
; 10 : }
  0003f  c3               ret 0
__aullcmp ENDP
_TEXT ENDS
END
Also 29 instructions in 64 bytes.
OUCH¹: again 6 (in words: six) superfluous conditional branch instructions, and 2 registers clobbered without necessity – the optimiser and code generator of the Visual C compiler is a crime against all i386 processors since the Pentium®Pro and their users!
OUCH²: the first 2 highlighted instructions should be replaced with a single DEC instruction, saving 6 bytes.
OUCH³: the last 2 highlighted instructions should be replaced with 2 two-byte NOPs, which can then be removed together with the 2 POP and the following RET instruction, saving 7 more bytes.
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
_allcmp() and _aullcmp() Functions in i386 Assembler
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
_allcmp proc public ; int _allcmp(sqword p, sqword q)
mov ecx, [esp+4]
mov edx, [esp+8] ; edx:ecx = p
cmp ecx, [esp+12]
sbb edx, [esp+16] ; edx:... = p - q,
; SF != OF iff p < q
; NOTE: ZF reflects only the upper dword,
; so SETG and SETA can not be used here!
setl cl ; cl = (p < q)
mov eax, [esp+12]
mov edx, [esp+16] ; edx:eax = q
cmp eax, [esp+4]
sbb edx, [esp+8] ; edx:... = q - p,
; SF != OF iff q < p
setl al ; al = (p > q)
sub al, cl ; al = (p > q) - (p < q)
movsx eax, al ; eax = {1, 0, -1}
ret
_allcmp endp
_aullcmp proc public ; int _aullcmp(qword p, qword q)
xor eax, eax ; eax = 0
mov ecx, [esp+12]
mov edx, [esp+16] ; edx:ecx = q
cmp ecx, [esp+4]
sbb edx, [esp+8] ; edx:... = q - p,
; CF = (q < p)
adc eax, eax ; eax = (p > q)
mov ecx, [esp+4]
mov edx, [esp+8] ; edx:ecx = p
cmp ecx, [esp+12]
sbb edx, [esp+16] ; edx:... = p - q,
; CF = (p < q)
sbb eax, 0 ; eax = (p > q) - (p < q)
; = {1, 0, -1}
ret
_aullcmp endp
end
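For readers who want to check such routines against a reference, here is a minimal C sketch of the three-way comparison. The function names are hypothetical and not part of any runtime library. `borrow_out()` models what a CMP followed by an SBB on the two 32-bit halves leaves in the carry flag; note that after such a pair the zero flag describes only the upper half, so only the carry (unsigned) or sign/overflow (signed) conditions are meaningful for the full 64-bit operands.

```c
#include <stdint.h>

/* Reference results: (p > q) - (p < q) yields 1, 0 or -1. */
int cmp_signed(int64_t p, int64_t q)     { return (p > q) - (p < q); }
int cmp_unsigned(uint64_t p, uint64_t q) { return (p > q) - (p < q); }

/* Model of CMP (low halves) followed by SBB (high halves):
   returns the final borrow, i.e. the carry flag, of a - b,
   which is 1 exactly when a < b as unsigned 64-bit integers. */
int borrow_out(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    int borrow = a_lo < b_lo;           /* CMP a_lo, b_lo */
    /* SBB a_hi, b_hi: borrows when a_hi < b_hi + borrow */
    return (a_hi < b_hi) || (a_hi == b_hi && borrow);
}

/* Three-way unsigned comparison built from two borrows:
   (p > q) is the borrow of q - p, (p < q) the borrow of p - q. */
int cmp_unsigned_halves(uint64_t p, uint64_t q)
{
    return borrow_out(q, p) - borrow_out(p, q);
}
```

Operands which differ only in their lower halves, such as 2 and 1, are the interesting test cases, since there the upper halves subtract to zero.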
*max()
functions, the
Visual C compiler provides a preprocessor macro
__max
:
#define __max(a,b) (((a) > (b)) ? (a) : (b))
Note: the header files shipped with the Windows platform software development kit provide the preprocessor macros min and max, which are (fortunately) not defined when the preprocessor macro NOMINMAX is defined.
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifdef UNSIGNED
unsigned long long _aullmax(unsigned long long p, unsigned long long q)
#else
long long _allmax(long long p, long long q)
#endif
{
return p > q ? p : q;
}
Save the ANSI C source presented above as i386-maximum.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly a first time:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-maximum.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-maximum.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE C:\Users\Stefan\Desktop\i386-maximum.c
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC __allmax
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-maximum.c
; COMDAT __allmax
_TEXT SEGMENT
_p$ = 8 ; size = 8
_q$ = 16 ; size = 8
__allmax PROC ; COMDAT
; 9 : return p > q ? p : q;
00000 8b 4c 24 08 mov ecx, DWORD PTR _p$[esp]
00004 8b 54 24 10 mov edx, DWORD PTR _q$[esp]
00008 56 push esi
00009 8b 74 24 10 mov esi, DWORD PTR _q$[esp]
0000d 3b ca cmp ecx, edx
0000f 7c 0e jl SHORT $LN3@allmax
00011 8b 44 24 08 mov eax, DWORD PTR _p$[esp]
00015 7f 04 jg SHORT $LN5@allmax
00017 3b c6 cmp eax, esi
00019 76 04 jbe SHORT $LN3@allmax
$LN5@allmax:
0001b 8b d1 mov edx, ecx
0001d 5e pop esi
; 10 : }
0001e c3 ret 0
$LN3@allmax:
; 9 : return p > q ? p : q;
0001f 8b c6 mov eax, esi
00021 5e pop esi
; 10 : }
00022 c3 ret 0
__allmax ENDP
_TEXT ENDS
END
16 instructions in 35 bytes.
OUCH: 3 superfluous conditional branch instructions, and 1 register clobbered without necessity – the optimiser and code generator of the Visual C compiler is a crime against all i386 processors since the Pentium®Pro and their users!
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
Compile the source file i386-maximum.c a second time, now with the preprocessor macro UNSIGNED defined, and display the generated assembly:
CL.EXE /DUNSIGNED i386-maximum.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-maximum.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE C:\Users\Stefan\Desktop\i386-maximum.c
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC __aullmax
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-maximum.c
; COMDAT __aullmax
_TEXT SEGMENT
_p$ = 8 ; size = 8
_q$ = 16 ; size = 8
__aullmax PROC ; COMDAT
; 9 : return p > q ? p : q;
00000 8b 4c 24 08 mov ecx, DWORD PTR _p$[esp]
00004 8b 54 24 10 mov edx, DWORD PTR _q$[esp]
00008 56 push esi
00009 8b 74 24 10 mov esi, DWORD PTR _q$[esp]
0000d 3b ca cmp ecx, edx
0000f 72 0e jb SHORT $LN3@aullmax
00011 8b 44 24 08 mov eax, DWORD PTR _p$[esp]
00015 77 04 ja SHORT $LN5@aullmax
00017 3b c6 cmp eax, esi
00019 76 04 jbe SHORT $LN3@aullmax
$LN5@aullmax:
0001b 8b d1 mov edx, ecx
0001d 5e pop esi
; 10 : }
0001e c3 ret 0
$LN3@aullmax:
; 9 : return p > q ? p : q;
0001f 8b c6 mov eax, esi
00021 5e pop esi
; 10 : }
00022 c3 ret 0
__aullmax ENDP
_TEXT ENDS
END
Also 16 instructions in 35 bytes.
OUCH: again 3 superfluous conditional branch instructions, and 1 register clobbered without necessity – the optimiser and code generator of the Visual C compiler is a crime against all i386 processors since the Pentium®Pro and their users!
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
_allmax() and _aullmax() Functions in i386 Assembler
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
_allmax proc public ; sqword _allmax(sqword p, sqword q)
mov ecx, [esp+4]
mov eax, [esp+8] ; eax:ecx = p
sub ecx, [esp+12]
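sbb eax, [esp+16] ; eax:ecx = p - q,
; SF != OF iff p < q
; NOTE: the sign of p - q alone is wrong
; when the subtraction overflows!
setl dl ; dl = (p < q)
movzx edx, dl ; edx = (p < q) ? 1 : 0
dec edx ; edx = (p < q) ? 0 : -1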
and ecx, edx
and edx, eax ; edx:ecx = (p < q) ? 0 : p - q
add ecx, [esp+12]
adc edx, [esp+16] ; edx:ecx = (p < q) ? q : p
mov eax, ecx ; edx:eax = max(p, q)
ret
_allmax endp
_aullmax proc public ; qword _aullmax(qword p, qword q)
mov eax, [esp+4]
mov edx, [esp+8] ; edx:eax = p
sub eax, [esp+12]
sbb edx, [esp+16] ; edx:eax = p - q
cmc ; CF = ~(p < q)
sbb ecx, ecx ; ecx = (p >= q) ? -1 : 0
and eax, ecx
and edx, ecx ; edx:eax = (p >= q) ? p - q : 0
add eax, [esp+12]
adc edx, [esp+16] ; edx:eax = (p >= q) ? p : q
; = max(p, q)
ret
_aullmax endp
end
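The mask-and-add technique used for the unsigned routine can be modelled in C. This is only a sketch under the assumption that the comparison result is turned into an all-ones or all-zeroes mask; the function name is hypothetical. The wrapped difference p - q is either kept or cleared by the mask, and adding q back yields p or q.

```c
#include <stdint.h>

/* Branch-free maximum of two unsigned 64-bit integers. */
uint64_t max_u64_branchless(uint64_t p, uint64_t q)
{
    uint64_t diff = p - q;                  /* wraps modulo 2**64 */
    /* mask = all ones if p >= q, else 0 (the role of CMC + SBB) */
    uint64_t mask = (uint64_t)0 - (uint64_t)(p >= q);
    /* (p >= q) ? q + (p - q) : q + 0 */
    return q + (diff & mask);
}
```

Because all arithmetic is modulo 2**64, the wrap-around of the difference for p < q is harmless: the mask discards it.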
*min()
functions, the
Visual C compiler provides a preprocessor macro
__min
:
#define __min(a,b) (((a) < (b)) ? (a) : (b))
Note: the header files shipped with the Windows platform software development kit provide the preprocessor macros min and max, which are (fortunately) not defined when the preprocessor macro NOMINMAX is defined.
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifdef UNSIGNED
unsigned long long _aullmin(unsigned long long p, unsigned long long q)
#else
long long _allmin(long long p, long long q)
#endif
{
return p < q ? p : q;
}
Save the ANSI C source presented above as i386-minimum.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly a first time:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-minimum.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-minimum.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE C:\Users\Stefan\Desktop\i386-minimum.c
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC __allmin
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-minimum.c
; COMDAT __allmin
_TEXT SEGMENT
_p$ = 8 ; size = 8
_q$ = 16 ; size = 8
__allmin PROC ; COMDAT
; 9 : return p < q ? p : q;
00000 8b 4c 24 08 mov ecx, DWORD PTR _p$[esp]
00004 8b 54 24 10 mov edx, DWORD PTR _q$[esp]
00008 56 push esi
00009 8b 74 24 10 mov esi, DWORD PTR _q$[esp]
0000d 3b ca cmp ecx, edx
0000f 7f 0e jg SHORT $LN3@allmin
00011 8b 44 24 08 mov eax, DWORD PTR _p$[esp]
00015 7c 04 jl SHORT $LN5@allmin
00017 3b c6 cmp eax, esi
00019 73 04 jae SHORT $LN3@allmin
$LN5@allmin:
0001b 8b d1 mov edx, ecx
0001d 5e pop esi
; 10 : }
0001e c3 ret 0
$LN3@allmin:
; 9 : return p < q ? p : q;
0001f 8b c6 mov eax, esi
00021 5e pop esi
; 10 : }
00022 c3 ret 0
__allmin ENDP
_TEXT ENDS
END
16 instructions in 35 bytes.
OUCH: 3 superfluous conditional branch instructions, and 1 register clobbered without necessity – the optimiser and code generator of the Visual C compiler is a crime against all i386 processors since the Pentium®Pro and their users!
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
Compile the source file i386-minimum.c a second time, now with the preprocessor macro UNSIGNED defined, and display the generated assembly:
CL.EXE /DUNSIGNED i386-minimum.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-minimum.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
TITLE C:\Users\Stefan\Desktop\i386-minimum.c
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC __aullmin
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-minimum.c
; COMDAT __aullmin
_TEXT SEGMENT
_p$ = 8 ; size = 8
_q$ = 16 ; size = 8
__aullmin PROC ; COMDAT
; 9 : return p < q ? p : q;
00000 8b 4c 24 08 mov ecx, DWORD PTR _p$[esp]
00004 8b 54 24 10 mov edx, DWORD PTR _q$[esp]
00008 56 push esi
00009 8b 74 24 10 mov esi, DWORD PTR _q$[esp]
0000d 3b ca cmp ecx, edx
0000f 77 0e ja SHORT $LN3@aullmin
00011 8b 44 24 08 mov eax, DWORD PTR _p$[esp]
00015 72 04 jb SHORT $LN5@aullmin
00017 3b c6 cmp eax, esi
00019 73 04 jae SHORT $LN3@aullmin
$LN5@aullmin:
0001b 8b d1 mov edx, ecx
0001d 5e pop esi
; 10 : }
0001e c3 ret 0
$LN3@aullmin:
; 9 : return p < q ? p : q;
0001f 8b c6 mov eax, esi
00021 5e pop esi
; 10 : }
00022 c3 ret 0
__aullmin ENDP
_TEXT ENDS
END
Also 16 instructions in 35 bytes.
OUCH: again 3 superfluous conditional branch instructions, and 1 register clobbered without necessity – the optimiser and code generator of the Visual C compiler is a crime against all i386 processors since the Pentium®Pro and their users!
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
_allmin() and _aullmin() Functions in i386 Assembler
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
_allmin proc public ; sqword _allmin(sqword p, sqword q)
mov ecx, [esp+4]
mov eax, [esp+8] ; eax:ecx = p
sub ecx, [esp+12]
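sbb eax, [esp+16] ; eax:ecx = p - q,
; SF != OF iff p < q
; NOTE: the sign of p - q alone is wrong
; when the subtraction overflows!
setl dl ; dl = (p < q)
movzx edx, dl ; edx = (p < q) ? 1 : 0
neg edx ; edx = (p < q) ? -1 : 0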
and ecx, edx
and edx, eax ; edx:ecx = (p < q) ? p - q : 0
add ecx, [esp+12]
adc edx, [esp+16] ; edx:ecx = (p < q) ? p : q
mov eax, ecx ; edx:eax = min(p, q)
ret
_allmin endp
_aullmin proc public ; qword _aullmin(qword p, qword q)
mov eax, [esp+4]
mov edx, [esp+8] ; edx:eax = p
sub eax, [esp+12]
sbb edx, [esp+16] ; edx:eax = p - q
sbb ecx, ecx ; ecx = (p < q) ? -1 : 0
and eax, ecx
and edx, ecx ; edx:eax = (p < q) ? p - q : 0
add eax, [esp+12]
adc edx, [esp+16] ; edx:eax = (p < q) ? p : q
; = min(p, q)
ret
_aullmin endp
end
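For the signed case, a C model has to derive "p < q" from both the sign and the overflow of the wrapped difference, which is the condition a SETL or JL instruction tests (SF != OF). The following sketch does this with plain bit operations. The function name is hypothetical, and the final conversion back to `int64_t` assumes the usual two's complement behaviour of common compilers.

```c
#include <stdint.h>

/* Branch-free minimum of two signed 64-bit integers. */
int64_t min_i64_branchless(int64_t p, int64_t q)
{
    uint64_t up = (uint64_t)p, uq = (uint64_t)q;
    uint64_t diff = up - uq;                     /* wraps modulo 2**64 */
    /* bit 63 is set when p - q overflows as a signed subtraction:
       the operands have different signs and the sign of the
       difference differs from the sign of p */
    uint64_t overflow = (up ^ uq) & (up ^ diff);
    /* p < q iff sign flag != overflow flag */
    uint64_t less = (diff ^ overflow) >> 63;
    uint64_t mask = (uint64_t)0 - less;          /* all ones if p < q */
    /* (p < q) ? q + (p - q) : q + 0 */
    return (int64_t)(uq + (diff & mask));
}
```

The extreme operands INT64_MAX and INT64_MIN are the cases where the sign of the difference alone gives the wrong answer, so they make good test inputs.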
acos(), asin(), atan(), atan2(), cos(), cosh(), exp(), fmod(), log(), log10(), pow(), sin(), sinh(), sqrt(), tan() and tanh() Standard Functions for i386 Platform
The following math library functions have intrinsic forms on all architectures:
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
double blunder(double x)
{
x = acos(x);
x = asin(x);
x = atan(x);
x = atan2(x, x);
x = cos(x);
x = cosh(x);
x = exp(x);
x = fmod(x, x);
x = log(x);
x = log10(x);
x = pow(x, x);
x = sin(x);
x = sinh(x);
x = sqrt(x);
x = tan(x);
x = tanh(x);
return x;
}
Save the ANSI C source presented above as i386-blunder.c in an arbitrary, preferably empty directory, then execute the following 3 command lines to compile and (attempt to) link it a first time:
SET CL=/Oi /W4
SET LINK=/ENTRY:blunder /MACHINE:I386 /SUBSYSTEM:CONSOLE
CL.EXE /MD i386-blunder.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-blunder.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/ENTRY:blunder /MACHINE:I386 /SUBSYSTEM:CONSOLE
/out:i386-blunder.exe
i386-blunder.obj
i386-blunder.obj : error LNK2019: unresolved external symbol __CItanh referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CItan referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIsinh referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIlog10 referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIfmod referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIexp referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIcosh referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIatan2 referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIatan referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIasin referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIacos referenced in function _fault
i386-blunder.exe : fatal error LNK1120: 11 unresolved externals
OUCH¹: most obviously nobody at Microsoft ever built an application for the i386 platform which uses one of the floating-point functions acos(), asin(), atan(), atan2(), cosh(), exp(), fmod(), log10(), sinh(), tan() or tanh() with msvcrt.lib!
Repeat the first trial without using the intrinsic functions:
CL.EXE /fp:strict /FImath.h /MD i386-blunder.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-blunder.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/ENTRY:blunder /MACHINE:I386 /SUBSYSTEM:CONSOLE
/out:i386-blunder.exe
i386-blunder.obj
i386-blunder.obj : error LNK2019: unresolved external symbol _tanh referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _tan referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _sqrt referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _sinh referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _sin referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _pow referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _log10 referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _log referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _fmod referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _exp referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _cosh referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _cos referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _atan2 referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _atan referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _asin referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _acos referenced in function _fault
i386-blunder.exe : fatal error LNK1120: 16 unresolved externals
OUCH²: the steaming pile of crap got even
worse!
Execute the following 2 command lines to compile and (attempt to) link i386-blunder.c a third time, now as DLL and with the static runtime library libcmt.lib:
SET LINK=/MACHINE:I386 /NOENTRY
CL.EXE /LD i386-blunder.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-blunder.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/MACHINE:I386 /NOENTRY
/out:i386-blunder.dll
/dll
/implib:i386-blunder.lib
i386-blunder.obj
LIBCMT.lib(crt0.obj) : error LNK2019: unresolved external symbol _main referenced in function ___tmainCRTStartup
i386-blunder.dll : fatal error LNK1120: 1 unresolved externals
OUCH³: despite building a DLL, the (intrinsic) floating-point functions reference an undocumented internal (startup) routine __tmainCRTStartup() for console applications, which in turn references a main() function – most obviously also nobody at Microsoft ever tried to build a DLL which uses floating-point functions!
Repeat the third trial without using the intrinsic functions:
CL.EXE /fp:strict /FImath.h /LD i386-blunder.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-blunder.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/MACHINE:I386 /NOENTRY
/out:i386-blunder.dll
/dll
/implib:i386-blunder.lib
i386-blunder.obj
LIBCMT.lib(crt0.obj) : error LNK2019: unresolved external symbol _main referenced in function ___tmainCRTStartup
i386-blunder.dll : fatal error LNK1120: 1 unresolved externals
OUCH⁴: like before!
Note: a repetition of the last 2 trials in the 64-bit build environment is left as an exercise to the reader!
_CI* and _ftol* Routines
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.686
.model flat, C
single record sign:1, exponent:8, mantissa:23
bias equ 1 shl (width exponent - 1) - 1
.const
public _fltused
_fltused dword 9876h
.code
; MSC internal intrinsic _CIacos():
; receives argument in FPU st(0), returns result in FPU st(0)
; NOTE: _CIacos() returns correct result for ±0.0 and ±1.0
_CIacos proc public
fld st(0) ; st(0) = st(1) = argument
fmul st(0), st(0) ; st(0) = argument**2,
; st(1) = argument
fld1 ; st(0) = 1.0,
; st(1) = argument**2,
; st(2) = argument
fsubrp st(1), st(0) ; st(0) = 1.0 - argument**2,
; st(1) = argument
fsqrt ; st(0) = square root of (1.0 - argument**2),
; st(1) = argument
fxch st(1) ; st(0) = argument,
; st(1) = square root of (1.0 - argument**2)
fpatan ; st(0) = inverse circular cosine of argument
ret
_CIacos endp
; MSC internal intrinsic _CIasin():
; receives argument in FPU st(0), returns result in FPU st(0)
; NOTE: _CIasin() returns correct result for ±0.0 and ±1.0
_CIasin proc public
fld st(0) ; st(0) = st(1) = argument
fmul st(0), st(0) ; st(0) = argument**2,
; st(1) = argument
fld1 ; st(0) = 1.0,
; st(1) = argument**2,
; st(2) = argument
fsubrp st(1), st(0) ; st(0) = 1.0 - argument**2,
; st(1) = argument
fsqrt ; st(0) = square root of (1.0 - argument**2),
; st(1) = argument
fpatan ; st(0) = inverse circular sine of argument
ret
_CIasin endp
; MSC internal intrinsic _CIatan():
; receives argument in FPU st(0), returns result in FPU st(0)
; NOTE: _CIatan() returns correct result for ±0.0 and ±INFINITY
_CIatan proc public
fld1 ; st(0) = 1.0,
; st(1) = argument
fpatan ; st(0) = inverse circular tangent of (argument / 1.0)
ret
_CIatan endp
; MSC internal intrinsic _CIatan2():
; receives arguments in FPU st(0) and st(1), returns result in FPU st(0)
; NOTE: _CIatan2() returns correct result for ±0.0 and ±INFINITY
_CIatan2 proc public
fxch st(1) ; st(0) = denominator,
; st(1) = numerator
fpatan ; st(0) = inverse circular tangent of (numerator / denominator)
ret
_CIatan2 endp
; MSC internal intrinsic _CIcos():
; receives argument in FPU st(0), returns result in FPU st(0)
_CIcos proc public
fcos ; st(0) = cosine of argument
fstsw ax ; ax = FPU status word,
; ah = B:C3:T:O:P:C2:C1:C0
sahf ; SF:ZF:0:AF:0:PF:1:CF = ah
jnp short return ; |argument| < 2**63?
fldpi ; st(0) = pi,
; st(1) = argument
fadd st(0), st(0) ; st(0) = 2.0 * pi,
; st(1) = argument
fxch st(1) ; st(0) = argument,
; st(1) = 2.0 * pi
reduce:
fprem1 ; st(0) = argument modulo (2.0 * pi),
; st(1) = 2.0 * pi
fstsw ax ; ax = FPU status word,
; ah = B:C3:T:O:P:C2:C1:C0
sahf ; SF:ZF:0:AF:0:PF:1:CF = ah
jp short reduce
fstp st(1) ; st(0) = argument modulo (2.0 * pi)
fcos ; st(0) = cosine of argument modulo (2.0 * pi)
return:
ret
_CIcos endp
; MSC internal intrinsic _CIcosh():
; receives argument in FPU st(0), returns result in FPU st(0)
_CIcosh proc public
call _CIexp ; st(0) = e**argument
fld1 ; st(0) = 1.0,
; st(1) = e**argument
fdiv st(0), st(1) ; st(0) = 1.0 / e**argument = e**-argument,
; st(1) = e**argument
faddp st(1), st(0) ; st(0) = e**argument + e**-argument
push (bias - 1) shl width mantissa
; [esp] = 0x3F000000
; = 0.5F
fmul real4 ptr [esp] ; st(0) = hyperbolic cosine of argument
pop eax
ret
_CIcosh endp
; MSC internal intrinsic _CIexp():
; receives argument in FPU st(0), returns result in FPU st(0)
; NOTE: _CIexp() returns correct result for ±INFINITY
_CIexp proc public
fldl2e ; st(0) = log2(e),
; st(1) = exponent
fmulp st(1), st(0) ; st(0) = exponent * log2(e)
if 0
fld1 ; st(0) = 1.0,
; st(1) = exponent * log2(e)
fld st(1) ; st(0) = exponent * log2(e),
; st(1) = 1.0,
; st(2) = exponent * log2(e)
fprem ; st(0) = (exponent * log2(e)) modulo 1.0,
; st(1) = 1.0,
; st(2) = exponent * log2(e)
f2xm1 ; st(0) = 2.0**((exponent * log2(e)) modulo 1.0) - 1.0,
; st(1) = 1.0,
; st(2) = exponent * log2(e)
faddp st(1), st(0) ; st(0) = 2.0**((exponent * log2(e)) modulo 1.0),
; st(1) = exponent * log2(e)
fscale ; st(0) = e**exponent,
; st(1) = exponent * log2(e)
else
fld st(0) ; st(0) = st(1) = exponent * log2(e)
frndint ; st(0) = integer(exponent * log2(e)),
; st(1) = exponent * log2(e)
fsub st(1), st(0) ; st(0) = integer(exponent * log2(e)),
; st(1) = fraction(exponent * log2(e))
fxch st(1) ; st(0) = fraction(exponent * log2(e)),
; st(1) = integer(exponent * log2(e))
f2xm1 ; st(0) = 2.0**fraction(exponent * log2(e)) - 1.0,
; st(1) = integer(exponent * log2(e))
fld1 ; st(0) = 1.0,
; st(1) = 2.0**fraction(exponent * log2(e)) - 1.0,
; st(2) = integer(exponent * log2(e))
faddp st(1), st(0) ; st(0) = 2.0**fraction(exponent * log2(e)),
; st(1) = integer(exponent * log2(e))
fscale ; st(0) = e**exponent,
; st(1) = integer(exponent * log2(e))
endif
fstp st(1) ; st(0) = e**exponent
ret
_CIexp endp
; MSC internal intrinsic _CIfmod():
; receives arguments in FPU st(0) and st(1), returns result in FPU st(0)
_CIfmod proc public
reduce:
fprem ; st(0) = remainder,
; st(1) = divisor
fstsw ax ; ax = FPU status word,
; ah = B:C3:T:O:P:C2:C1:C0
sahf ; SF:ZF:0:AF:0:PF:1:CF = ah
jp short reduce
fstp st(1) ; st(0) = remainder
ret
_CIfmod endp
; MSC internal intrinsic _CIlog():
; receives argument in FPU st(0), returns result in FPU st(0)
_CIlog proc public
fldln2 ; st(0) = ln(2.0),
; st(1) = argument
fxch st(1) ; st(0) = argument,
; st(1) = ln(2.0)
fyl2x ; st(0) = natural logarithm of argument
ret
_CIlog endp
; MSC internal intrinsic _CIlog10():
; receives argument in FPU st(0), returns result in FPU st(0)
_CIlog10 proc public
fldlg2 ; st(0) = log10(2.0),
; st(1) = argument
fxch st(1) ; st(0) = argument,
; st(1) = log10(2.0)
fyl2x ; st(0) = logarithm to base 10 of argument
ret
_CIlog10 endp
; MSC internal intrinsic _CIpow():
; receives arguments in FPU st(0) and st(1), returns result in FPU st(0)
_CIpow proc public
fxch st(1) ; st(0) = base,
; st(1) = exponent
fyl2x ; st(0) = exponent * log2(base)
fld st(0) ; st(0) = st(1) = exponent * log2(base)
frndint ; st(0) = integer(exponent * log2(base)),
; st(1) = exponent * log2(base)
fsub st(1), st(0) ; st(0) = integer(exponent * log2(base)),
; st(1) = fraction(exponent * log2(base))
fxch st(1) ; st(0) = fraction(exponent * log2(base)),
; st(1) = integer(exponent * log2(base))
f2xm1 ; st(0) = 2.0**fraction(exponent * log2(base)) - 1.0,
; st(1) = integer(exponent * log2(base))
fld1 ; st(0) = 1.0,
; st(1) = 2.0**fraction(exponent * log2(base)) - 1.0,
; st(2) = integer(exponent * log2(base))
faddp st(1), st(0) ; st(0) = 2.0**fraction(exponent * log2(base)),
; st(1) = integer(exponent * log2(base))
fscale ; st(0) = base**exponent,
; st(1) = integer(exponent * log2(base))
fstp st(1) ; st(0) = base**exponent
ret
_CIpow endp
; MSC internal intrinsic _CIsin():
; receives argument in FPU st(0), returns result in FPU st(0)
_CIsin proc public
fsin ; st(0) = sine of argument
fstsw ax ; ax = FPU status word,
; ah = B:C3:T:O:P:C2:C1:C0
sahf ; SF:ZF:0:AF:0:PF:1:CF = ah
jnp short return ; |argument| < 2**63?
fldpi ; st(0) = pi,
; st(1) = argument
fadd st(0), st(0) ; st(0) = 2.0 * pi,
; st(1) = argument
fxch st(1) ; st(0) = argument,
; st(1) = 2.0 * pi
reduce:
fprem1 ; st(0) = argument modulo (2.0 * pi),
; st(1) = 2.0 * pi
fstsw ax ; ax = FPU status word,
; ah = B:C3:T:O:P:C2:C1:C0
sahf ; SF:ZF:0:AF:0:PF:1:CF = ah
jp short reduce
fstp st(1) ; st(0) = argument modulo (2.0 * pi)
fsin ; st(0) = sine of argument modulo (2.0 * pi)
return:
ret
_CIsin endp
; MSC internal intrinsic _CIsinh():
; receives argument in FPU st(0), returns result in FPU st(0)
_CIsinh proc public
call _CIexp ; st(0) = e**argument
fld1 ; st(0) = 1.0,
; st(1) = e**argument
fdiv st(0), st(1) ; st(0) = 1.0 / e**argument = e**-argument,
; st(1) = e**argument
fsubp st(1), st(0) ; st(0) = e**argument - e**-argument
push (bias - 1) shl width mantissa
; [esp] = 0x3F000000
; = 0.5F
fmul real4 ptr [esp] ; st(0) = hyperbolic sine of argument
pop eax
ret
_CIsinh endp
; MSC internal intrinsic _CIsqrt():
; receives argument in FPU st(0), returns result in FPU st(0)
_CIsqrt proc public
fsqrt ; st(0) = square root of radicand
ret
_CIsqrt endp
; MSC internal intrinsic _CItan():
; receives argument in FPU st(0), returns result in FPU st(0)
_CItan proc public
fptan ; st(0) = 1.0,
; st(1) = tangent of argument
fstsw ax ; ax = FPU status word,
; ah = B:C3:T:O:P:C2:C1:C0
sahf ; SF:ZF:0:AF:0:PF:1:CF = ah
jnp short return ; |argument| < 2**63?
fldpi ; st(0) = pi,
; st(1) = argument
fadd st(0), st(0) ; st(0) = 2.0 * pi,
; st(1) = argument
fxch st(1) ; st(0) = argument,
; st(1) = 2.0 * pi
reduce:
fprem1 ; st(0) = argument modulo (2.0 * pi),
; st(1) = 2.0 * pi
fstsw ax ; ax = FPU status word,
; ah = B:C3:T:O:P:C2:C1:C0
sahf ; SF:ZF:0:AF:0:PF:1:CF = ah
jp short reduce
fstp st(1) ; st(0) = argument modulo (2.0 * pi)
fptan ; st(0) = 1.0,
; st(1) = tangent of argument modulo (2.0 * pi)
return:
fstp st(0) ; st(0) = tangent of argument
ret
_CItan endp
; MSC internal intrinsic _CItanh():
; receives argument in FPU st(0), returns result in FPU st(0)
_CItanh proc public
call _CIexp ; st(0) = e**argument
fmul st(0), st(0) ; st(0) = e**argument * e**argument
; = e**(argument + argument)
fld1 ; st(0) = 1.0,
; st(1) = e**(argument + argument)
fadd st(1), st(0) ; st(0) = 1.0,
; st(1) = e**(argument + argument) + 1.0
fadd st(0), st(0) ; st(0) = 2.0,
; st(1) = e**(argument + argument) + 1.0
fdivrp st(1), st(0) ; st(0) = 2.0 / (e**(argument + argument) + 1.0)
fld1 ; st(0) = 1.0,
; st(1) = 2.0 / (e**(argument + argument) + 1.0)
fsubrp st(1), st(0) ; st(0) = 1.0 - 2.0 / (e**(argument + argument) + 1.0)
; = hyperbolic tangent of argument
ret
_CItanh endp
; MSC internal intrinsic _ftol():
; receives argument in FPU st(0), returns result in eax
; NOTE: fistp rounds to nearest (even) integer!
_ftol proc public
push eax
fistp dword ptr [esp] ; [esp] = integer(argument)
pop eax ; eax = integer(argument)
ret
_ftol endp
; MSC internal intrinsic _ftol2():
; receives argument in FPU st(0), returns result in edx:eax
; NOTE: fistp rounds to nearest (even) integer!
_ftol2 proc public
push edx
push eax
fistp qword ptr [esp] ; [esp] = integer(argument)
pop eax
pop edx ; edx:eax = integer(argument)
ret
_ftol2 endp
; MSC internal intrinsic _ftol2_sse():
; receives argument in FPU st(0), returns result in edx:eax
; NOTE: fisttp truncates, i.e. rounds towards ±0!
_ftol2_sse proc public
push edx
push eax
fisttp qword ptr [esp] ; [esp] = integer(argument)
pop eax
pop edx ; edx:eax = integer(argument)
ret
_ftol2_sse endp
end
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as i386-fpu.asm in the directory where you created the object library i386.lib before, then execute the following 3 command lines to generate the object file i386-fpu.obj and add it to the existing object library i386.lib:
SET ML=/c /Gy /safeseh /W3 /X
ML.EXE i386-fpu.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-fpu.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: if the /Gy option to package every function in its own, separately linkable COMDAT section is not available with the version of the macro assembler ML.EXE you use, split the i386 assembler source into multiple pieces, with one function per source file.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 14.16.27023.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: i386-fpu.asm
Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation. All rights reserved.
memchr() Standard Function for i386 Platform
memchr() function is neither a compiler helper nor an intrinsic function, it is included here for entertainment due to its
DIR "%source%\intel\mem*.asm"
TYPE "%source%\intel\memchr.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             4,097 memccpy.asm
02/18/2011  03:40 PM             5,003 memchr.asm
02/18/2011  03:40 PM            22,486 memcpy.asm
02/18/2011  03:40 PM               475 memmove.asm
02/18/2011  03:40 PM             4,426 memset.asm
               5 File(s)         36,307 bytes
               0 Dir(s)   9,876,543,210 bytes free
With 76 instructions in 173 bytes, this routine is yet another true gem!
Note: in the following listing, lines marked - are deleted, lines marked + are inserted.
  page ,132
  title memchr - search memory for a given character
  ;***
  ;memchr.asm - search block of memory for a given character
  ;
  ; Copyright (c) Microsoft Corporation. All rights reserved.
  ;
  ;Purpose:
  ; defines memchr() - search memory until a character is
  ; found or a limit is reached.
  ;
  ;*******************************************************************************
  .xlist
  include cruntime.inc
  .list
  page
  ;***
  ;char *memchr(buf, chr, cnt) - search memory for given character.
  ;
  ;Purpose:
  ; Searched at buf for the given character, stopping when chr is
  ; first found or cnt bytes have been searched through.
  ;
  ; Algorithm:
  ; char *
  ; memchr (buf, chr, cnt)
  ; char *buf;
  ; int chr;
  ; unsigned cnt;
  ; {
  ; while (cnt && *buf++ != c)
  ; cnt--;
  ; return(cnt ? --buf : NULL);
  ; }
  ;
  ;Entry:
  ; char *buf - memory buffer to be searched
  ; char chr - character to search for
  ; unsigned cnt - max number of bytes to search
  ;
  ;Exit:
  ; returns pointer to first occurence of chr in buf
  ; returns NULL if chr not found in the first cnt bytes
  ;
  ;Uses:
  ;
  ;Exceptions:
  ;
  ;*******************************************************************************
  CODESEG
  public memchr
  memchr proc \
  buf:ptr byte, \
  chr:byte, \
  cnt:dword
  OPTION PROLOGUE:NONE, EPILOGUE:NONE
  .FPO ( 0, 1, 0, 0, 0, 0 )
  mov eax,[esp+0ch] ; eax = count
  push ebx ; Preserve ebx
  test eax,eax ; check if count=0
  jz short retnull ; if count=0, leave
  mov edx,[esp+8] ; edx = buffer
- xor ebx,ebx
- mov bl,[esp+0ch] ; bl = search char
+ movzx ebx,byte ptr [esp+12]
- test edx,3 ; test if string is aligned on 32 bits
+ test dl,3
  jz short main_loop_start
  str_misaligned: ; simple byte loop until string is aligned
  mov cl,byte ptr [edx]
- add edx,1
- xor cl,bl
+ inc edx
+ cmp cl,bl
  je short found
- sub eax,1 ; counter--
+ dec eax
  jz short retnull
- test edx,3 ; already aligned ?
+ test dl,3
  jne short str_misaligned
  main_loop_start:
  sub eax,4
  jb short tail_less_then_4
  ; set all 4 bytes of ebx to [value]
  push edi ; Preserve edi
- mov edi,ebx ; edi=0/0/0/char
- shl ebx,8 ; ebx=0/0/char/0
- add ebx,edi ; ebx=0/0/char/char
- mov edi,ebx ; edi=0/0/char/char
- shl ebx,10h ; ebx=char/char/0/0
- add ebx,edi ; ebx = all 4 bytes = [search char]
+ imul ebx,01010101h
  jmp short main_loop_entry ; ecx >=0
  return_from_main:
  pop edi
  tail_less_then_4:
  add eax,4
  jz retnull
  tail_loop: ; 0 < eax < 4
  mov cl,byte ptr [edx]
- add edx,1
- xor cl,bl
+ inc edx
+ cmp cl,bl
  je short found
- sub eax,1
+ dec eax
  jnz short tail_loop
  retnull:
  pop ebx
  ret ; _cdecl return
  main_loop:
  sub eax,4
  jb short return_from_main
  main_loop_entry:
  mov ecx,dword ptr [edx] ; read 4 bytes
  xor ecx,ebx ; ebx is byte\byte\byte\byte
- mov edi,7efefeffh
- add edi,ecx
- xor ecx,-1
- xor ecx,edi
  add edx,4
+ lea edi,[ecx-01010101h]
+ not ecx
+ and ecx,80808080h
+ and ecx,edi
- and ecx,81010100h
  je short main_loop
  ; found zero byte in the loop?
  char_is_found:
+ bsf ecx,ecx
+ shr ecx,3
+ lea eax,[edx+ecx-4]
+ pop edi
+ pop ebx
+ ret
- mov ecx,[edx - 4]
- xor cl,bl ; is it byte 0
- je short byte_0
- xor ch,bl ; is it byte 1
- je short byte_1
- shr ecx,10h ; is it byte 2
- xor cl,bl
- je short byte_2
- xor ch,bl ; is it byte 3
- je short byte_3
- jmp short main_loop ; taken if bits 24-30 are clear and bit
- ; 31 is set
- byte_3:
- pop edi ; restore edi
  found:
  lea eax,[edx - 1]
  pop ebx ; restore ebx
  ret ; _cdecl return
- byte_2:
- lea eax,[edx - 2]
- pop edi
- pop ebx
- ret ; _cdecl return
- byte_1:
- lea eax,[edx - 3]
- pop edi
- pop ebx
- ret ; _cdecl return
- byte_0:
- lea eax,[edx - 4]
- pop edi ; restore edi
- pop ebx ; restore ebx
- ret ; _cdecl return
  memchr endp
  end
Oops¹: the deleted XOR instruction followed by the deleted MOV instruction should be replaced with the inserted MOVZX instruction.
Oops²: the deleted TEST instructions with immediate value 3 should be replaced with the inserted shorter ones, saving 6 bytes.
OUCH¹: the 4 deleted ADD and SUB instructions, which increment or decrement by 1 and occupy 3 bytes each, should be replaced with the inserted INC or DEC instructions of 1 byte each, saving 8 bytes!
OUCH²: instead of the 6 deleted XOR instructions, which perform superfluous partial register write operations, the inserted CMP instructions should be used!
OUCH³: instead of the 6 deleted instructions which copy the byte from register BL into the upper 3 bytes of register EBX, the single inserted IMUL instruction should be used, saving 8 bytes!
OUCH⁴: instead of the deleted XOR instruction with immediate operand -1, the inserted shorter NOT instruction should be used, saving 1 byte!
OUCH⁵: when the 5 deleted instructions after label main_loop_entry: are replaced with the 4 inserted instructions, the 24 (in words: twenty-four) deleted instructions after label char_is_found: can be replaced with the 6 faster and shorter instructions inserted there; both replacements together turn 61 bytes into 29 bytes, saving 32 (in words: thirty-two) bytes!
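The multiplication named in OUCH³ can be illustrated in C: multiplying a byte value by 01010101h replicates it into all 4 bytes of a 32-bit word. This is only a minimal sketch; the function name broadcast_byte is made up for the illustration and is not part of any runtime library.

```c
#include <stdint.h>

/* Hypothetical helper: replicate the low 8 bits of 'character'
   into all 4 bytes of a 32-bit word, like IMUL EBX, 01010101h
   does for a zero-extended byte in EBX. */
uint32_t broadcast_byte(int character)
{
    /* 255 * 0x01010101 = 0xFFFFFFFF, so the product never overflows 32 bits */
    return (uint32_t) (unsigned char) character * UINT32_C(0x01010101);
}
```

For the character '$' (24h) the result is 24242424h.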
Note: Alan Mycroft posted the better, faster and shorter method on April 8, 1987 to the USENET news group comp.lang.c:
You might be interested to know that such detection of null bytes in words
can be done in 3 or 4 instructions on almost any hardware (nay even in C).
(Code that follows relies on x being a 32 bit unsigned (or 2's complement
int with overflow ignored)...)
#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
Then if e is an expression without side effects (e.g. variable)
has_nullbyte_(e)
is nonzero iff the value of e has a null byte.
(One can view this as explicit programming of the Manchester carry chain
present in many processors which is hidden from the instruction level).
Note: see Bit Twiddling Hacks – Determine if a word has a zero byte for a comparison of both methods and more details.
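How the zero-byte test combines with the multiplied character can be sketched in portable C. This is an illustration under stated assumptions, not the routine shown in the listing: the names has_zero_byte and memchr_words are hypothetical, words are loaded with memcpy() instead of aligned reads, and the position inside a word is found with a plain byte loop instead of BSF.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mycroft's test: nonzero iff at least one byte of x is zero. */
uint32_t has_zero_byte(uint32_t x)
{
    return (x - UINT32_C(0x01010101)) & ~x & UINT32_C(0x80808080);
}

/* Hypothetical word-at-a-time search with the semantics of memchr(). */
void *memchr_words(void const *buffer, int character, size_t count)
{
    unsigned char const *mem = (unsigned char const *) buffer;
    uint32_t pattern = (uint32_t) (unsigned char) character * UINT32_C(0x01010101);

    while (count >= 4)
    {
        uint32_t word;

        memcpy(&word, mem, 4);           /* read 4 bytes without alignment assumptions */
        if (has_zero_byte(word ^ pattern))
            break;                       /* some byte of this word matches */
        mem += 4, count -= 4;
    }
    for (; count; mem++, --count)        /* locate the match, or scan the tail */
        if (*mem == (unsigned char) character)
            return (void *) mem;
    return NULL;
}
```

XORing the word with the multiplied character turns every matching byte into a zero byte, so the test for a zero byte becomes a test for the character.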
Note: with the modifications shown in the source, this routine has 51 instructions in 118 bytes, i.e. two thirds of the original’s instructions and bytes.
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
memchr proc public ; void *memchr(void const *buffer, int character, size_t count)
mov eax, [esp+12] ; eax = count
test eax, eax
jz short return ; count = 0?
movzx ecx, byte ptr [esp+8]
mov edx, [esp+4] ; edx = address of buffer
head:
test dl, 3
jz short aligned ; address % 4 = 0?
unaligned:
cmp cl, [edx]
je short break
inc edx
dec eax
jnz short head ; count > 0?
ret
aligned:
sub eax, 4
jb short tail ; count < 4?
push edi
push esi
imul ecx, 01010101h ; ecx = character
; | character << 8
; | character << 16
; | character << 24
next:
mov edi, [edx] ; edi = next 4 aligned bytes
xor edi, ecx
lea esi, [edi-01010101h]
not edi
and edi, 80808080h
and edi, esi
jnz short match
add edx, 4
sub eax, 4
jae short next ; count >= 4?
pop esi
pop edi
tail:
add eax, 4
jz short return ; count = 0?
@@:
cmp cl, [edx]
je short break
inc edx
dec eax
jnz short @b ; count > 0?
return:
ret
break:
mov eax, edx ; eax = address of character
ret
match:
bsf eax, edi ; eax = offset of character * 8 + 7
; = {7, 15, 23, 31}
shr eax, 3 ; eax = offset of character
; = {0, 1, 2, 3}
add eax, edx ; eax = address of character
pop esi
pop edi
ret
memchr endp
end
Save the i386 assembler source presented above as i386-memchr.asm and the ANSI C source presented below as i386-memchr.c, then execute the 6 command lines following the ANSI C source to assemble i386-memchr.asm, compile i386-memchr.c, link the generated object files i386-memchr.obj and i386-memchr.tmp, and finally execute the image file i386-memchr.exe to demonstrate the correct operation:
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
__declspec(safebuffers)
BOOL CDECL PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
CHAR szFormat[1024];
DWORD dwFormat;
DWORD dwOutput;
va_list vaInput;
va_start(vaInput, lpFormat);
dwFormat = wvsprintf(szFormat, lpFormat, vaInput);
va_end(vaInput);
if ((dwFormat == 0UL)
|| !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
return FALSE;
return dwOutput == dwFormat;
}
const CHAR szString[] = "^^9876543210$$";
const LPCSTR szFormat[] = {"0x%p: memchr(\"%hs\", \'$\', %lu) = NULL\r\n",
"0x%p: memchr(\"%hs\", \'$\', %lu) = 0x%p = \"%hs\"\r\n"};
__declspec(noreturn)
VOID CDECL mainCRTStartup(VOID)
{
LPCSTR lp;
LPCSTR lpString = szString + sizeof(szString);
DWORD dwError = ERROR_SUCCESS;
HANDLE hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
if (hOutput == INVALID_HANDLE_VALUE)
dwError = GetLastError();
else
while (--lpString >= szString)
{
lp = memchr(lpString, '$', szString + sizeof(szString) - lpString);
PrintFormat(hOutput,
szFormat[lp != NULL],
lpString, lpString, szString + sizeof(szString) - lpString, lp, lp);
}
ExitProcess(dwError);
}
SET ML=/c /safeseh /W3 /X
ML.EXE i386-memchr.asm
SET CL=/GAFy /Oy /W4 /Zl
SET LINK=/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
CL.EXE /Foi386-memchr.tmp i386-memchr.obj i386-memchr.c kernel32.lib user32.lib
.\i386-memchr.exe
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.

 Assembling: i386-memchr.asm
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.

i386-memchr.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.

/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
/out:i386-memchr.exe
i386-memchr.obj
i386-memchr.tmp
kernel32.lib
user32.lib

0x002F2082: memchr("", '$', 1) = NULL
0x002F2081: memchr("$", '$', 2) = 0x002F2081 = "$"
0x002F2080: memchr("$$", '$', 3) = 0x002F2080 = "$$"
0x002F207F: memchr("0$$", '$', 4) = 0x002F2080 = "$$"
0x002F207E: memchr("10$$", '$', 5) = 0x002F2080 = "$$"
0x002F207D: memchr("210$$", '$', 6) = 0x002F2080 = "$$"
0x002F207C: memchr("3210$$", '$', 7) = 0x002F2080 = "$$"
0x002F207B: memchr("43210$$", '$', 8) = 0x002F2080 = "$$"
0x002F207A: memchr("543210$$", '$', 9) = 0x002F2080 = "$$"
0x002F2079: memchr("6543210$$", '$', 10) = 0x002F2080 = "$$"
0x002F2078: memchr("76543210$$", '$', 11) = 0x002F2080 = "$$"
0x002F2077: memchr("876543210$$", '$', 12) = 0x002F2080 = "$$"
0x002F2076: memchr("9876543210$$", '$', 13) = 0x002F2080 = "$$"
0x002F2075: memchr("^9876543210$$", '$', 14) = 0x002F2080 = "$$"
0x002F2074: memchr("^^9876543210$$", '$', 15) = 0x002F2080 = "$$"
SmartImplementation in i386 Assembler
A smart implementation without loops for head and tail needs only 40 instructions in 102 bytes, i.e. about half the instructions of Microsoft’s poor implementation; the corresponding smart implementation of the missing memrchr() function is of almost the same size:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
memchr proc public ; void *memchr(void const *buffer, int character, size_t count)
xor eax, eax ; eax = 0
cmp eax, [esp+12]
je short return ; count = 0?
cdq ; edx = 0
mov dl, [esp+8] ; edx = character
imul edx, 01010101h ; edx = character
; | character << 8
; | character << 16
; | character << 24
mov ecx, [esp+4] ; ecx = address of buffer
add [esp+12], ecx ; count = address after buffer
push ebx
mov ebx, ecx ; ebx = address of buffer
and ecx, 3 ; ecx = address of buffer % 4
; = 4 - number of unaligned bytes
jz short aligned ; address of buffer % 4 = 0?
unaligned:
sub ebx, ecx ; ebx = aligned address before buffer
shl ecx, 3 ; ecx = (4 - number of unaligned bytes) * 8
; = 32 - number of unaligned bits
dec eax ; eax = ~0
shl eax, cl ; eax = ~0 for unaligned bytes, 0 elsewhere
not eax ; eax = 0 for unaligned bytes, ~0 elsewhere
mov ecx, [ebx] ; ecx = unaligned bytes
xor ecx, edx ; ecx = unaligned bytes ^ multiplied character
or eax, ecx ; eax = '\0' for matching bytes
jmp short mycroft
next:
add ebx, 4 ; ebx = address of next 4 aligned bytes
cmp ebx, [esp+16]
jae short null ; address after buffer?
aligned:
mov eax, [ebx] ; eax = next 4 aligned bytes
xor eax, edx ; eax = next 4 aligned bytes ^ multiplied character
; = '\0' for matching bytes
mycroft:
mov ecx, eax
sub eax, 01010101h
not ecx
and eax, 80808080h
and eax, ecx ; eax = '\200' for matching bytes, '\0' elsewhere
jz short next ; no match in any byte?
match:
bsf eax, eax ; eax = offset of matching byte * 8 + 7
; = {7, 15, 23, 31}
shr eax, 3 ; eax = offset of matching byte
; = {0, 1, 2, 3}
add eax, ebx ; eax = address of matching byte
cmp eax, [esp+16] ; CF = (address inside buffer)
sbb ecx, ecx ; ecx = (address inside buffer) ? -1 : 0
and eax, ecx ; eax = address of character
null:
pop ebx
return:
ret
memchr endp
memrchr proc public ; void *memrchr(void const *buffer, int character, size_t count)
xor eax, eax ; eax = 0
cmp eax, [esp+12]
je short return ; count = 0?
cdq ; edx = 0
mov dl, [esp+8] ; edx = character
imul edx, 01010101h ; edx = character
; | character << 8
; | character << 16
; | character << 24
mov ecx, [esp+4] ; ecx = address of buffer
add ecx, [esp+12] ; ecx = address after buffer
push ebx
mov ebx, ecx ; ebx = address after buffer
and ecx, 3 ; ecx = address after buffer % 4
; = number of tail bytes
jz short previous ; address after buffer % 4 = 0?
unaligned:
sub ebx, ecx ; ebx = aligned address of tail bytes
shl ecx, 3 ; ecx = number of tail bytes * 8
; = number of tail bits
dec eax ; eax = ~0
shl eax, cl ; eax = 0 for tail bytes, ~0 elsewhere
mov ecx, [ebx] ; ecx = tail bytes
xor ecx, edx ; ecx = tail bytes ^ multiplied character
or eax, ecx ; eax = '\0' for matching tail bytes
jmp short mycroft
previous:
cmp ebx, [esp+8]
jbe short null ; first byte of buffer already processed?
sub ebx, 4 ; ebx = address of previous 4 aligned bytes
aligned:
mov eax, [ebx] ; eax = previous 4 aligned bytes
xor eax, edx ; eax = previous 4 aligned bytes ^ multiplied character
; = '\0' for matching bytes
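mycroft: ; exact test: BSR must not see the false positives
; which Mycroft's expression yields above a matching byte
mov ecx, eax
or eax, 80808080h ; eax = bit 7 set in every byte, no borrow possible
sub eax, 01010101h ; eax = bit 7 clear where lower 7 bits were 0
or eax, ecx ; eax = bit 7 clear only for matching bytes
not eax
and eax, 80808080h ; eax = '\200' for matching bytes, '\0' elsewhere
jz short previous ; no match in any byte?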
match:
bsr eax, eax ; eax = offset of matching byte * 8 + 7
; = {31, 23, 15, 7}
shr eax, 3 ; eax = offset of matching byte
; = {3, 2, 1, 0}
add eax, ebx ; eax = address of matching byte
cmp eax, [esp+8] ; CF = (address of matching byte < address of buffer)
cmc ; CF = (address of matching byte >= address of buffer)
sbb ecx, ecx ; ecx = (address inside buffer) ? -1 : 0
and eax, ecx ; eax = address of character
null:
pop ebx
return:
ret
memrchr endp
end
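The head handling of memchr() above reads the aligned word which contains the first byte of the buffer and forces the bytes in front of the buffer to a value that can never look like a match. The arithmetic can be sketched in C on plain values; little-endian byte order as on x86 is assumed, and the function names are made up for this illustration.

```c
#include <stdint.h>

/* Mycroft's test: nonzero iff at least one byte of x is zero. */
uint32_t has_zero_byte(uint32_t x)
{
    return (x - UINT32_C(0x01010101)) & ~x & UINT32_C(0x80808080);
}

/* x is an aligned word XORed with the multiplied character, i.e. matching
   bytes are 0; the 'skip' (0 to 3) low-order bytes lie in front of the
   buffer and are set to 0xFF, so they can not match. */
uint32_t ignore_leading_bytes(uint32_t x, unsigned skip)
{
    uint32_t mask = ~(~UINT32_C(0) << (8 * skip)); /* 0xFF for each skipped byte */

    return x | mask;
}
```

With 2 bytes in front of the buffer, the word 11220000h becomes 1122FFFFh and no longer reports a match, while a match in one of the upper 2 bytes is still detected.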
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.xmm
.model flat, C
.code
memchr proc public ; void *memchr(void const *buffer, int character, size_t count)
xor eax, eax ; eax = 0
cmp eax, [esp+12]
je short null ; count = 0?
if 0
movd xmm0, dword ptr [esp+8]
punpcklbw xmm0, xmm0
punpcklwd xmm0, xmm0
else
mov al, [esp+8] ; eax = character
imul eax, 01010101h ; eax = character
; | character << 8
; | character << 16
; | character << 24
movd xmm0, eax
endif
pshufd xmm0, xmm0, 0 ; xmm0 = multiplied character
mov ecx, [esp+4] ; ecx = address of buffer
add [esp+12], ecx ; count = address after buffer
mov edx, ecx
and ecx, 15 ; ecx = address of buffer % 16
; = 16 - number of unaligned bytes
jz short aligned
unaligned:
sub edx, ecx ; edx = aligned address before buffer
movdqa xmm1, [edx] ; xmm1 = chunk of 16 bytes
pcmpeqb xmm1, xmm0 ; xmm1 = '\377' for each matching byte in chunk
pmovmskb eax, xmm1 ; eax = bitmask for matching bytes in chunk
shr eax, cl
shl eax, cl ; eax = bitmask for matching bytes in buffer
jnz short match
next:
add edx, 16 ; edx = address of next chunk of aligned bytes
cmp edx, [esp+12]
jae short null ; address after buffer?
aligned:
movdqa xmm1, [edx] ; xmm1 = chunk of 16 bytes
pcmpeqb xmm1, xmm0 ; xmm1 = '\377' for each matching byte in chunk
pmovmskb eax, xmm1 ; eax = bitmask for matching bytes in chunk
test eax, eax
jz short next ; no matching byte in chunk?
match:
bsf eax, eax ; eax = offset of matching byte in chunk
add eax, edx ; eax = address of matching byte
cmp eax, [esp+12] ; CF = (address inside buffer)
sbb ecx, ecx ; ecx = (address inside buffer) ? -1 : 0
and eax, ecx ; eax = address of character
null:
ret
memchr endp
memrchr proc public ; void *memrchr(void const *buffer, int character, size_t count)
xor eax, eax ; eax = 0
cmp eax, [esp+12]
je short null ; count = 0?
if 0
movd xmm0, dword ptr [esp+8]
punpcklbw xmm0, xmm0
punpcklwd xmm0, xmm0
else
mov al, [esp+8] ; eax = character
imul eax, 01010101h ; eax = character
; | character << 8
; | character << 16
; | character << 24
movd xmm0, eax
endif
pshufd xmm0, xmm0, 0 ; xmm0 = multiplied character
mov ecx, [esp+4] ; ecx = address of buffer
add ecx, [esp+12] ; ecx = address after buffer
mov edx, ecx
and ecx, 15 ; ecx = address after buffer % 16
; = number of tail bytes
jz short previous ; address after buffer % 16 = 0?
unaligned:
sub edx, ecx ; edx = aligned address of tail bytes
movdqa xmm1, [edx] ; xmm1 = chunk of 16 bytes
pcmpeqb xmm1, xmm0 ; xmm1 = '\377' for each matching byte in chunk
pmovmskb eax, xmm1 ; eax = bitmask for matching bytes in chunk
neg ecx ; ecx = -number of tail bytes
shl eax, cl
shr eax, cl ; eax = bitmask for matching bytes in buffer
jnz short match
previous:
cmp edx, [esp+4]
jbe short null ; first byte of buffer already processed?
sub edx, 16 ; edx = address of previous chunk of aligned bytes
aligned:
movdqa xmm1, [edx] ; xmm1 = chunk of 16 bytes
pcmpeqb xmm1, xmm0 ; xmm1 = '\377' for each matching byte in chunk
pmovmskb eax, xmm1 ; eax = bitmask for matching bytes in chunk
test eax, eax
jz short previous ; no matching byte in chunk?
match:
bsr eax, eax ; eax = offset of matching byte in chunk
add eax, edx ; eax = address of matching byte
cmp eax, [esp+4] ; CF = (address of matching byte < address of buffer)
cmc ; CF = (address of matching byte >= address of buffer)
sbb ecx, ecx ; ecx = (address inside buffer) ? -1 : 0
and eax, ecx ; eax = address of character
null:
ret
memrchr endp
end
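The SIMD variants reduce each chunk to a bitmask with one bit per byte (PMOVMSKB) and then trim the bits of bytes outside the buffer with a pair of shifts. A small C sketch of this trimming on plain integers follows; the function names are hypothetical, and the loops merely stand in for the BSF and BSR instructions.

```c
#include <stdint.h>

/* Clear the 'skip' (0 to 31) lowest bits: bytes in front of the buffer,
   as done with SHR/SHL in memchr(). */
uint32_t drop_low_bits(uint32_t mask, unsigned skip)
{
    return mask >> skip << skip;
}

/* Keep only the 'keep' (1 to 31) lowest bits: bytes up to the end of the
   buffer, as done with NEG/SHL/SHR in memrchr(). */
uint32_t keep_low_bits(uint32_t mask, unsigned keep)
{
    return mask << (32 - keep) >> (32 - keep);
}

/* Index of the lowest set bit of a nonzero mask (what BSF yields). */
unsigned lowest_bit(uint32_t mask)
{
    unsigned index = 0;

    while (!(mask & 1))
        mask >>= 1, index++;
    return index;
}

/* Index of the highest set bit of a nonzero mask (what BSR yields). */
unsigned highest_bit(uint32_t mask)
{
    unsigned index = 0;

    while (mask >>= 1)
        index++;
    return index;
}
```

For the mask 8005h and 3 bytes outside the buffer, the forward search keeps 8000h (first match at offset 15), the backward search keeps 0005h (last match at offset 2).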
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.xmm
.model flat, C
.code
memchr proc public ; void *memchr(void const *buffer, int character, size_t count)
xor eax, eax ; eax = 0
cmp eax, [esp+12]
je short null ; count = 0?
pxor xmm0, xmm0 ; xmm0 = 0
movd xmm1, dword ptr [esp+8]
; xmm1 = character
pshufb xmm1, xmm0 ; xmm1 = multiplied character
mov ecx, [esp+4] ; ecx = address of buffer
add [esp+12], ecx ; count = address after buffer
mov edx, ecx
and ecx, 15 ; ecx = address of buffer % 16
; = 16 - number of unaligned bytes
jz short aligned
unaligned:
sub edx, ecx ; edx = aligned address before buffer
movdqa xmm0, [edx] ; xmm0 = chunk of 16 bytes
pcmpeqb xmm0, xmm1 ; xmm0 = '\377' for each matching byte in chunk
pmovmskb eax, xmm0 ; eax = bitmask for matching bytes in chunk
shr eax, cl
shl eax, cl ; eax = bitmask for matching bytes in buffer
jnz short match
next:
add edx, 16 ; edx = address of next chunk of aligned bytes
cmp edx, [esp+12]
jae short null ; address after buffer?
aligned:
movdqa xmm0, [edx] ; xmm0 = chunk of 16 bytes
pcmpeqb xmm0, xmm1 ; xmm0 = '\377' for each matching byte in chunk
pmovmskb eax, xmm0 ; eax = bitmask for matching bytes in chunk
test eax, eax
jz short next ; no matching byte in chunk?
match:
bsf eax, eax ; eax = offset of matching byte in chunk
add eax, edx ; eax = address of matching byte
cmp eax, [esp+12] ; CF = (address inside buffer)
sbb ecx, ecx ; ecx = (address inside buffer) ? -1 : 0
and eax, ecx ; eax = address of character
null:
ret
memchr endp
memrchr proc public ; void *memrchr(void const *buffer, int character, size_t count)
xor eax, eax ; eax = 0
cmp eax, [esp+12]
je short null ; count = 0?
pxor xmm0, xmm0 ; xmm0 = 0
movd xmm1, dword ptr [esp+8]
; xmm1 = character
pshufb xmm1, xmm0 ; xmm1 = multiplied character
mov ecx, [esp+4] ; ecx = address of buffer
add ecx, [esp+12] ; ecx = address after buffer
mov edx, ecx
and ecx, 15 ; ecx = address after buffer % 16
; = number of tail bytes
jz short previous ; address after buffer % 16 = 0?
unaligned:
sub edx, ecx ; edx = aligned address of tail bytes
movdqa xmm0, [edx] ; xmm0 = chunk of 16 bytes
pcmpeqb xmm0, xmm1 ; xmm0 = '\377' for each matching byte in chunk
pmovmskb eax, xmm0 ; eax = bitmask for matching bytes in chunk
neg ecx ; ecx = -number of tail bytes
shl eax, cl
shr eax, cl ; eax = bitmask for matching bytes in buffer
jnz short match
previous:
cmp edx, [esp+4]
jbe short null ; first byte of buffer already processed?
sub edx, 16 ; edx = address of previous chunk of aligned bytes
aligned:
movdqa xmm0, [edx] ; xmm0 = chunk of 16 bytes
pcmpeqb xmm0, xmm1 ; xmm0 = '\377' for each matching byte in chunk
pmovmskb eax, xmm0 ; eax = bitmask for matching bytes in chunk
test eax, eax
jz short previous ; no matching byte in chunk?
match:
bsr eax, eax ; eax = offset of matching byte in chunk
add eax, edx ; eax = address of matching byte
cmp eax, [esp+4] ; CF = (address of matching byte < address of buffer)
cmc ; CF = (address of matching byte >= address of buffer)
sbb ecx, ecx ; ecx = (address inside buffer) ? -1 : 0
and eax, ecx ; eax = address of character
null:
ret
memrchr endp
end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.xmm
.model flat, C
.code
memchr proc public ; void *memchr(void const *buffer, int character, size_t count)
xor eax, eax ; eax = 0
cmp eax, [esp+12]
je short null ; count = 0?
vpxor xmm0, xmm0, xmm0; xmm0 = 0
vmovd xmm1, dword ptr [esp+8]
; xmm1 = character
vpshufb xmm1, xmm1, xmm0; xmm1 = multiplied character
mov ecx, [esp+4] ; ecx = address of buffer
add [esp+12], ecx ; count = address after buffer
mov edx, ecx
and ecx, 15 ; ecx = address of buffer % 16
; = 16 - number of unaligned bytes
jz short aligned
unaligned:
sub edx, ecx ; edx = aligned address before buffer
vpcmpeqb xmm0, xmm1, [edx]
; xmm0 = '\377' for each matching byte in chunk
vpmovmskb eax, xmm0 ; eax = bitmask for matching bytes in chunk
shr eax, cl
shl eax, cl ; eax = bitmask for matching bytes in buffer
jnz short match
next:
add edx, 16 ; edx = address of next chunk of aligned bytes
cmp edx, [esp+12]
jae short null ; address after buffer?
aligned:
vpcmpeqb xmm0, xmm1, [edx]
; xmm0 = '\377' for each matching byte in chunk
vpmovmskb eax, xmm0 ; eax = bitmask for matching bytes in chunk
test eax, eax
jz short next ; no matching byte in chunk?
match:
bsf eax, eax ; eax = offset of matching byte in chunk
add eax, edx ; eax = address of matching byte
cmp eax, [esp+12] ; CF = (address inside buffer)
sbb ecx, ecx ; ecx = (address inside buffer) ? -1 : 0
and eax, ecx ; eax = address of character
null:
ret
memchr endp
memrchr proc public ; void *memrchr(void const *buffer, int character, size_t count)
xor eax, eax ; eax = 0
cmp eax, [esp+12]
je short null ; count = 0?
vpxor xmm0, xmm0, xmm0; xmm0 = 0
vmovd xmm1, dword ptr [esp+8]
; xmm1 = character
vpshufb xmm1, xmm1, xmm0; xmm1 = multiplied character
mov ecx, [esp+4] ; ecx = address of buffer
add ecx, [esp+12] ; ecx = address after buffer
mov edx, ecx
and ecx, 15 ; ecx = address after buffer % 16
; = number of tail bytes
jz short previous ; address after buffer % 16 = 0?
unaligned:
sub edx, ecx ; edx = aligned address of tail bytes
vpcmpeqb xmm0, xmm1, [edx]
; xmm0 = '\377' for each matching byte in chunk
vpmovmskb eax, xmm0 ; eax = bitmask for matching bytes in chunk
neg ecx ; ecx = -number of tail bytes
shl eax, cl
shr eax, cl ; eax = bitmask for matching bytes in buffer
jnz short match
previous:
cmp edx, [esp+4]
jbe short null ; first byte of buffer already processed?
sub edx, 16 ; edx = address of previous chunk of aligned bytes
aligned:
vpcmpeqb xmm0, xmm1, [edx]
; xmm0 = '\377' for each matching byte in chunk
vpmovmskb eax, xmm0 ; eax = bitmask for matching bytes in chunk
test eax, eax
jz short previous ; no matching byte in chunk?
match:
bsr eax, eax ; eax = offset of matching byte in chunk
add eax, edx ; eax = address of matching byte
cmp eax, [esp+4] ; CF = (address of matching byte < address of buffer)
cmc ; CF = (address of matching byte >= address of buffer)
sbb ecx, ecx ; ecx = (address inside buffer) ? -1 : 0
and eax, ecx ; eax = address of character
null:
ret
memrchr endp
end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.ymm
.model flat, C
.code
memchr proc public ; void *memchr(void const *buffer, int character, size_t count)
xor eax, eax ; eax = 0
cmp eax, [esp+12]
je short null ; count = 0?
vpbroadcastb ymm0, byte ptr [esp+8]
; ymm0 = multiplied character
mov ecx, [esp+4] ; ecx = address of buffer
add [esp+12], ecx ; count = address after buffer
mov edx, ecx
and ecx, 31 ; ecx = address of buffer % 32
; = 32 - number of unaligned bytes
jz short aligned
unaligned:
sub edx, ecx ; edx = aligned address before buffer
vpcmpeqb ymm1, ymm0, [edx]
; ymm1 = '\377' for each matching byte in chunk
vpmovmskb eax, ymm1 ; eax = bitmask for matching bytes in chunk
shr eax, cl
shl eax, cl ; eax = bitmask for matching bytes in buffer
jnz short match
next:
add edx, 32 ; edx = address of next chunk of aligned bytes
cmp edx, [esp+12]
jae short null ; address after buffer?
aligned:
vpcmpeqb ymm1, ymm0, [edx]
; ymm1 = '\377' for each matching byte in chunk
vpmovmskb eax, ymm1 ; eax = bitmask for matching bytes in chunk
test eax, eax
jz short next ; no matching byte in chunk?
match:
bsf eax, eax ; eax = offset of matching byte in chunk
add eax, edx ; eax = address of matching byte
cmp eax, [esp+12] ; CF = (address inside buffer)
sbb ecx, ecx ; ecx = (address inside buffer) ? -1 : 0
and eax, ecx ; eax = address of character
null:
ret
memchr endp
memrchr proc public ; void *memrchr(void const *buffer, int character, size_t count)
xor eax, eax ; eax = 0
cmp eax, [esp+12]
je short null ; count = 0?
vpbroadcastb ymm0, byte ptr [esp+8]
; ymm0 = multiplied character
mov ecx, [esp+4] ; ecx = address of buffer
add ecx, [esp+12] ; ecx = address after buffer
mov edx, ecx
and ecx, 31 ; ecx = address after buffer % 32
; = number of tail bytes
jz short previous ; address after buffer % 32 = 0?
unaligned:
sub edx, ecx ; edx = aligned address of tail bytes
vpcmpeqb ymm1, ymm0, [edx]
; ymm1 = '\377' for each matching byte in chunk
vpmovmskb eax, ymm1 ; eax = bitmask for matching bytes in chunk
neg ecx ; ecx = -number of tail bytes
shl eax, cl
shr eax, cl ; eax = bitmask for matching bytes in buffer
jnz short match
previous:
cmp edx, [esp+4]
jbe short null ; first byte of buffer already processed?
sub edx, 32 ; edx = address of previous chunk of aligned bytes
aligned:
vpcmpeqb ymm1, ymm0, [edx]
; ymm1 = '\377' for each matching byte in chunk
vpmovmskb eax, ymm1 ; eax = bitmask for matching bytes in chunk
test eax, eax
jz short previous ; no matching byte in chunk?
match:
bsr eax, eax ; eax = offset of matching byte in chunk
add eax, edx ; eax = address of matching byte
cmp eax, [esp+4] ; CF = (address of matching byte < address of buffer)
cmc ; CF = (address of matching byte >= address of buffer)
sbb ecx, ecx ; ecx = (address inside buffer) ? -1 : 0
and eax, ecx ; eax = address of character
null:
ret
memrchr endp
end
SmartImplementation in AMD64 Assembler
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.code
memchr proc public ; void *memchr(void const *buffer, int character, size_t count)
xor eax, eax ; rax = 0
test r8, r8
jz short null ; count = 0?
mov r10, 0101010101010101h
if 0
mov r11, 8080808080808080h
elseif 0
imul r11, r10, 128 ; r11 = 0x8080808080808080
else
mov r11, r10
ror r11, 1 ; r11 = 0x8080808080808080
endif
movzx edx, dl ; rdx = character & 255
imul rdx, r10 ; rdx = character
; | character << 8
; | character << 16
; | character << 24
; | character << 32
; | character << 40
; | character << 48
; | character << 56
add r8, rcx ; r8 = address after buffer
mov r9, rcx ; r9 = address of buffer
and ecx, 7 ; rcx = address of buffer % 8
; = 8 - number of unaligned bytes
jz short aligned ; address of buffer % 8 = 0?
unaligned:
sub r9, rcx ; r9 = aligned address before buffer
shl ecx, 3 ; rcx = (8 - number of unaligned bytes) * 8
; = 64 - number of unaligned bits
dec rax ; rax = ~0
shl rax, cl ; rax = ~0 for unaligned bytes, 0 elsewhere
not rax ; rax = 0 for unaligned bytes, ~0 elsewhere
mov rcx, [r9] ; rcx = unaligned bytes
xor rcx, rdx ; rcx = unaligned bytes ^ multiplied character
or rcx, rax ; rcx = '\0' for matching bytes
jmp short mycroft
next:
add r9, 8 ; r9 = address of next 8 aligned bytes
cmp r9, r8
jae short null ; address after buffer?
aligned:
mov rcx, [r9] ; rcx = next 8 aligned bytes
xor rcx, rdx ; rcx = next 8 aligned bytes ^ multiplied character
; = '\0' for matching bytes
mycroft:
mov rax, rcx
sub rcx, r10
not rax
and rcx, r11
and rax, rcx ; rax = '\200' for matching bytes, '\0' elsewhere
jz short next ; no match in any byte?
match:
bsf rax, rax ; rax = offset of matching byte * 8 + 7
; = {7, 15, 23, 31, 39, 47, 55, 63}
shr eax, 3 ; rax = offset of matching byte
; = {0, 1, 2, 3, 4, 5, 6, 7}
add rax, r9 ; rax = address of matching byte
if 0
cmp rax, r8 ; CF = (address inside buffer)
sbb rcx, rcx ; rcx = (address inside buffer) ? -1 : 0
and rax, rcx ; rax = address of character
else
xor ecx, ecx ; rcx = 0
cmp rax, r8 ; CF = (address inside buffer)
cmovnb rax, rcx ; rax = address of character
endif
null:
ret
memchr endp
memrchr proc public ; void *memrchr(void const *buffer, int character, size_t count)
xor eax, eax ; rax = 0
test r8, r8
jz short null ; count = 0?
mov r10, 0101010101010101h
if 0
mov r11, 8080808080808080h
elseif 0
imul r11, r10, 128 ; r11 = 0x8080808080808080
else
mov r11, r10
ror r11, 1 ; r11 = 0x8080808080808080
endif
movzx edx, dl ; rdx = character & 255
imul rdx, r10 ; rdx = character
; | character << 8
; | character << 16
; | character << 24
; | character << 32
; | character << 40
; | character << 48
; | character << 56
add r8, rcx ; r8 = address after buffer
mov r9, rcx ; r9 = address of buffer
mov rcx, r8
and ecx, 7 ; rcx = address after buffer % 8
; = number of tail bytes
jz short previous ; address after buffer % 8 = 0?
unaligned:
sub r8, rcx ; r8 = aligned address of tail bytes
shl ecx, 3 ; rcx = number of tail bytes * 8
; = number of tail bits
dec rax ; rax = ~0
shl rax, cl ; rax = '\0' for tail bytes, ~0 elsewhere
mov rcx, [r8] ; rcx = tail bytes
xor rcx, rdx ; rcx = tail bytes ^ multiplied character
or rcx, rax ; rcx = '\0' for matching tail bytes
jmp short mycroft
previous:
cmp r8, r9
jbe short null ; first byte of buffer already processed?
sub r8, 8 ; r8 = address of previous 8 aligned bytes
aligned:
mov rcx, [r8] ; rcx = previous 8 aligned bytes
xor rcx, rdx ; rcx = previous 8 aligned bytes ^ multiplied character
; = '\0' for matching bytes
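mycroft: ; exact test: BSR must not see the false positives
; which Mycroft's expression yields above a matching byte
mov rax, rcx
or rax, r11 ; rax = bit 7 set in every byte, no borrow possible
sub rax, r10 ; rax = bit 7 clear where lower 7 bits were 0
or rax, rcx ; rax = bit 7 clear only for matching bytes
not rax
and rax, r11 ; rax = '\200' for matching bytes, '\0' elsewhere
jz short previous ; no match in any byte?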
match:
bsr rax, rax ; rax = offset of matching byte * 8 + 7
; = {63, 55, 47, 39, 31, 23, 15, 7}
shr eax, 3 ; rax = offset of matching byte
; = {7, 6, 5, 4, 3, 2, 1, 0}
add rax, r8 ; rax = address of matching byte
if 0
cmp rax, r9 ; CF = (address of matching byte < address of buffer)
cmc ; CF = (address of matching byte >= address of buffer)
sbb rcx, rcx ; rcx = (address inside buffer) ? -1 : 0
and rax, rcx ; rax = address of character
else
xor ecx, ecx ; rcx = 0
cmp rax, r9 ; CF = (address of matching byte < address of buffer)
cmovb rax, rcx ; rax = address of character
endif
null:
ret
memrchr endp
end
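One property of Mycroft's expression matters as soon as the highest instead of the lowest set bit is evaluated: the subtraction borrows out of a zero byte, so a byte with value 1 directly above a zero byte is flagged too. Only the lowest set bit is guaranteed to mark a zero byte. The following C sketch contrasts the fast expression with an exact alternative for 64-bit words; the function names are made up, and the exact variant is one well-known possibility, shown here as an illustration only.

```c
#include <stdint.h>

#define ONES  UINT64_C(0x0101010101010101)
#define HIGHS UINT64_C(0x8080808080808080)

/* Mycroft's expression: bit 7 of the LOWEST zero byte is reliable,
   bits above it may be set spuriously. */
uint64_t zero_bytes_fast(uint64_t x)
{
    return (x - ONES) & ~x & HIGHS;
}

/* Exact variant: bit 7 is set for zero bytes and for nothing else.
   Setting bit 7 of every byte first keeps the subtraction from
   borrowing across byte boundaries. */
uint64_t zero_bytes_exact(uint64_t x)
{
    return ~(((x | HIGHS) - ONES) | x) & HIGHS;
}
```

For the value 1111111111110100h, whose lowest byte is 0 and whose second byte is 1, the fast expression yields 8080h while the exact one yields 80h; a backward search which picks the highest set bit therefore needs the exact result.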
mem*() Standard Functions
memcpy() and memset() are intrinsic functions, but the Visual C compiler provides no inline implementation and generates calls to external routines instead. Proper implementations of these plus the memchr(), memcmp(), memmem(), memmove() and memrchr() functions for the i386 and the AMD64 platform follow with build instructions.
Note: the memmem() function is like the strstr() function and uses the same algorithm!
Note: both memmem() and memrchr() are not provided by the Visual C compiler or its runtime libraries!
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define NULL ((void *) 0)
#ifndef _WIN64
typedef unsigned int size_t;
#endif
#pragma function(memcmp, memcpy, memset)
void *memchr(void const *destination, int character, size_t count)
{
unsigned char const *mem = (unsigned char const *) destination;
while (count)
{
if (*mem == (unsigned char) character)
return (void *) mem;
mem++, --count;
}
return NULL;
}
int memcmp(void const *source, void const *destination, size_t count)
{
unsigned char const *dst = (unsigned char const *) destination;
unsigned char const *src = (unsigned char const *) source;
if (count && source != destination)
do
if (*src - *dst)
#if 0
return *src - *dst;
#else
return (*src > *dst) - (*src < *dst);
#endif
while (src++, dst++, --count);
return 0;
}
void *memcpy(void *destination, void const *source, size_t count)
{
unsigned char *dst = (unsigned char *) destination;
unsigned char const *src = (unsigned char const *) source;
while (count)
*dst++ = *src++, --count;
return destination;
}
void *memmem(void const *haystack, size_t count, void const *needle, size_t length)
{
unsigned char const *mem;
unsigned char const *hay = (unsigned char const *) haystack;
unsigned char const *pin = (unsigned char const *) needle;
if (!count || length > count)
return NULL;
if (!length)
return (void *) haystack;
if (!--length) // needle is a single character?
return memchr(haystack, *pin, count);
count -= length; // maximum number of characters to scan in haystack
while (mem = hay, hay = (unsigned char const *) memchr(hay, *pin, count), hay)
{ // *hay is first character of pin; compare
// last character of pin first, then proceed
if (hay[length] == pin[length]
#if 0
&& (length == 1 || !memcmp(hay + 1, pin + 1, length - 1)))
#else
&& !memcmp(hay, pin, length))
#endif
return (void *) hay;
// skip character in haystack,
// adjust number of characters left in haystack
count -= ++hay - mem;
if (!count)
break;
}
return NULL;
}
void *memmove(void *destination, void const *source, size_t count)
{
unsigned char *dst = (unsigned char *) destination;
unsigned char const *src = (unsigned char const *) source;
if (dst < src || dst - src >= count)
while (count)
*dst++ = *src++, --count;
else if (dst > src)
{ // overlapping buffers
dst += count;
src += count;
while (count)
*--dst = *--src, --count;
}
return destination;
}
void *memrchr(void const *destination, int character, size_t count)
{
unsigned char const *mem = (unsigned char const *) destination + count;
while (count)
{
if (*--mem == (unsigned char) character)
return (void *) mem;
--count;
}
return NULL;
}
void *memset(void *destination, int character, size_t count)
{
unsigned char *dst = (unsigned char *) destination;
while (count)
*dst++ = (unsigned char) character, --count;
return destination;
}
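The C standard specifies memchr() and memcmp() in terms of unsigned char; where plain char is a signed type, bytes above 0x7F would otherwise be ordered or matched wrongly. The following sketch contrasts both interpretations. The names compare_signed(), compare_unsigned() and signedness_matters() are made up for this illustration and are not part of the routines presented here.

```c
#include <assert.h>
#include <stddef.h>

// hypothetical helper: compares bytes through 'signed char', which is
// what a plain 'char' amounts to where it is a signed type
static int compare_signed(void const *left, void const *right, size_t count)
{
    signed char const *lhs = (signed char const *) left;
    signed char const *rhs = (signed char const *) right;

    for (; count; lhs++, rhs++, --count)
        if (*lhs != *rhs)
            return (*lhs > *rhs) - (*lhs < *rhs);
    return 0;
}

// hypothetical helper: compares bytes as 'unsigned char', like memcmp()
static int compare_unsigned(void const *left, void const *right, size_t count)
{
    unsigned char const *lhs = (unsigned char const *) left;
    unsigned char const *rhs = (unsigned char const *) right;

    for (; count; lhs++, rhs++, --count)
        if (*lhs != *rhs)
            return (*lhs > *rhs) - (*lhs < *rhs);
    return 0;
}

// returns 1 when the two helpers disagree for the bytes 0x80 and 0x7F
static int signedness_matters(void)
{
    unsigned char const high[1] = {0x80};
    unsigned char const low[1] = {0x7F};

    assert(compare_unsigned(high, low, 1) == 1);    // 128 > 127
    return compare_signed(high, low, 1) != compare_unsigned(high, low, 1);
}
```

On a two's complement platform the byte 0x80 reads as -128 through a signed char, so the signed variant reports the opposite order.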
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
memchr proc public ; void *memchr(void const *destination, int character, size_t count)
mov edx, edi
mov edi, [esp+4] ; edi = address of destination
mov eax, [esp+8] ; eax = character
mov ecx, [esp+12] ; ecx = count
test esp, esp ; ZF = 0 (required when count is 0)
repne scasb ; ZF = ([edi-1] = character);
; ecx can be 0 for a match in the last character too!
setne al
movzx eax, al ; eax = ([edi-1] = character) ? 0 : 1
dec eax ; eax = ([edi-1] = character) ? -1 : 0
dec edi ; edi = address of character
and eax, edi ; eax = ([edi] = character) ? address of character : 0
mov edi, edx
ret
memchr endp
memcmp proc public ; int memcmp(void const *source, void const *destination, size_t count)
mov eax, [esp+4] ; eax = address of source
mov edx, [esp+8] ; edx = address of destination
cmp edx, eax
je short equal ; address of destination = address of source?
mov ecx, [esp+12] ; ecx = count
if 0
jecxz short equal ; count = 0?
else
cmp ebx, ebx ; CF = 0,
; ZF = 1 (required when count is 0)
endif
xchg esi, eax ; esi = address of source
xchg edi, edx ; edi = address of destination
repe cmpsb
mov edi, edx
mov esi, eax
seta al
movzx eax, al ; eax = (*source > *destination)
sbb eax, 0 ; eax = (*source > *destination)
; - (*source < *destination)
; = {1, 0, -1}
ret
equal:
xor eax, eax
ret
memcmp endp
memcpy proc public ; void *memcpy(void *destination, void const *source, size_t count)
mov eax, [esp+4] ; eax = address of destination
mov edx, [esp+8] ; edx = address of source
cmp edx, eax
je short return ; address of source = address of destination?
mov ecx, [esp+12] ; ecx = count
;; jecxz short return ; count = 0?
xchg esi, edx ; esi = address of source
xchg edi, eax ; edi = address of destination
if 1
rep movsb
else
shr ecx, 1 ; ecx = count / 2
jnc short @f ; count % 2 = 0?
movsb
@@:
shr ecx, 1 ; ecx = count / 4
jnc short @f ; count % 4 = 0?
movsw
@@:
rep movsd
endif
mov esi, edx
mov edi, eax
mov eax, [esp+4] ; eax = address of destination
return:
ret
memcpy endp
memmem proc public ; void *memmem(void const *haystack, size_t count,
; void const *needle, size_t length)
xor eax, eax ; eax = address of needle in haystack = 0
mov ecx, [esp+8] ; ecx = length of haystack
test ecx, ecx
jz short empty ; length of haystack = 0?
mov edx, [esp+16] ; edx = length of needle
cmp edx, ecx
ja short empty ; length of needle > length of haystack?
mov eax, [esp+4] ; eax = address of haystack
test edx, edx
jz short empty ; length of needle = 0?
push ebx
push edi
mov edi, eax ; edi = address of haystack
push esi
search:
mov esi, [esp+24] ; esi = address of needle
mov al, [esi] ; al = first character of needle
; edi = current address in haystack
repne scasb ; edi = next address in haystack,
; ecx = current length of haystack
jne short break ; (first character of) needle not in haystack?
lea ebx, [ecx+1] ; ebx = length of haystack from matching character
cmp ebx, edx
jb short break ; length of haystack < length of needle?
mov al, [esi+edx-1] ; al = last character of needle
cmp al, [edi+edx-2]
jne short continue ; last character of needle not in haystack?
compare:
mov eax, edi ; eax = next address in haystack
mov ebx, ecx ; ebx = next length of haystack
if 0
dec edi ; edi = current address in haystack
; = address of matching character
; esi = address of needle
mov ecx, edx ; ecx = length of needle
else
; edi = next address in haystack
inc esi ; esi = address of needle + 1
mov ecx, edx
dec ecx ; ecx = length of needle - 1,
; ZF = (ecx = 0)
;; jz short match ; needle in haystack?
endif
repe cmpsb
je short match ; needle in haystack?
mov edi, eax ; edi = current address in haystack
mov ecx, ebx ; ecx = current length of haystack
continue:
cmp ecx, edx
jae short search ; length of haystack >= length of needle?
break:
xor eax, eax
pop esi
pop edi
pop ebx
empty:
ret
match:
dec eax ; eax = address of needle in haystack
pop esi
pop edi
pop ebx
ret
memmem endp
memmove proc public ; void *memmove(void *destination, void const *source, size_t count)
mov eax, [esp+4] ; eax = address of destination
mov edx, [esp+8] ; edx = address of source
cmp edx, eax
je short return ; address of source = address of destination?
mov ecx, [esp+12] ; ecx = count
;; jecxz short return ; count = 0?
xchg esi, edx ; esi = address of source
xchg edi, eax ; edi = address of destination
ja short default ; address of source > address of destination?
overlap:
lea edi, [edi+ecx-1]
lea esi, [esi+ecx-1]
std
default:
rep movsb
cld
mov esi, edx
mov edi, eax
mov eax, [esp+4] ; eax = address of destination
return:
ret
memmove endp
memrchr proc public ; void *memrchr(void const *destination, int character, size_t count)
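mov edx, edi
mov edi, [esp+4] ; edi = address of destination
mov eax, [esp+8] ; eax = character
mov ecx, [esp+12] ; ecx = count
lea edi, [edi+ecx-1] ; edi = address of destination + count - 1
std
test esp, esp ; ZF = 0 (required when count is 0)
repne scasb ; ZF = ([edi+1] = character);
; ecx can be 0 for a match in the first character too!
cld
setne al
movzx eax, al ; eax = ([edi+1] = character) ? 0 : 1
dec eax ; eax = ([edi+1] = character) ? -1 : 0
inc edi ; edi = address of character
and eax, edi ; eax = ([edi] = character) ? address of character : 0
mov edi, edx
ret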
memrchr endp
memset proc public ; void *memset(void *destination, int character, size_t count)
mov edx, edi
mov edi, [esp+4] ; edi = address of destination
mov eax, [esp+8] ; eax = character
mov ecx, [esp+12] ; ecx = count
;; jecxz short @f ; count = 0?
rep stosb
@@:
mov eax, [esp+4] ; eax = address of destination
mov edi, edx
ret
memset endp
end
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as i386-mem.asm in the directory where you created the object library i386.lib before, then execute the following 3 command lines to generate the object file i386-mem.obj and add it to the existing object library i386.lib:
SET ML=/c /Gy /safeseh /W3 /X
ML.EXE i386-mem.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-mem.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: if the /Gy option to package every function in its own, separately linkable COMDAT section is not available with the version of the macro assembler ML.EXE you use, split the i386 assembler source into multiple pieces, with one function per source file.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler Version 14.16.27023.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: i386-mem.asm
Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation. All rights reserved.
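Once the object library is built, a few edge cases are worth exercising against whatever mem*() routines the program is linked with: a match in the very last byte, a count of 0, bytes above 0x7F and overlapping moves in both directions. The following is a minimal sketch that uses only functions declared in the standard header string.h; check_mem() and require_mem() are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <string.h>

// hypothetical self-check: returns the number of failed edge cases for
// the memchr(), memcmp(), memcpy(), memmove() and memset() linked in
static int check_mem(void)
{
    static char const text[] = "abcabc";
    char buffer[8] = "0123456";
    int failed = 0;

    // a match in the very last byte of the range must be reported
    failed += memchr(text, 'c', 3) != text + 2;
    // a count of 0 never matches
    failed += memchr(text, 'a', 0) != NULL;
    // bytes compare as unsigned char: 0x80 is greater than 0x7F
    failed += memcmp("\x80", "\x7F", 1) <= 0;
    // equal ranges and empty ranges compare equal
    failed += memcmp(text, text + 3, 3) != 0;
    failed += memcmp(text, "x", 0) != 0;
    // overlapping move towards higher addresses
    memmove(buffer + 2, buffer, 5);
    failed += memcmp(buffer, "0101234", 8) != 0;
    // overlapping move towards lower addresses
    memmove(buffer, buffer + 2, 5);
    failed += memcmp(buffer, "0123434", 8) != 0;
    // memset() stores the character converted to unsigned char
    memset(buffer, 0x141, 3);
    failed += memcmp(buffer, "AAA3434", 8) != 0;
    // memcpy() returns its destination
    failed += memcpy(buffer, text, 6) != (void *) buffer;
    failed += memcmp(buffer, "abcabc4", 8) != 0;
    return failed;
}

// hypothetical wrapper: aborts a debug build when any edge case fails
static void require_mem(void)
{
    assert(check_mem() == 0);
}
```

Calling require_mem() once at program start is one way to notice a broken replacement routine early; an optimising compiler may expand some of these calls inline, so the check is best built without optimisation.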
; Copyright © 2009-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; Microsoft calling convention for AMD64 platform:
; - first 4 integer or pointer arguments (from left to right) are passed
; in registers RCX/R1, RDX/R2, R8 and R9;
; - arguments larger than 8 bytes are passed by reference;
; - surplus arguments are pushed on stack in reverse order (from right
; to left), 8-byte aligned;
; - caller allocates memory for return value larger than 8 bytes and
; passes pointer to it as (hidden) first argument, thus shifting
; all other arguments;
; - caller always allocates "home space" for 4 arguments on stack,
; even when less than 4 arguments are passed, but does not need to push
; first 4 arguments;
; - callee can spill first 4 arguments from registers to "home space";
; - callee can clobber "home space";
; - stack is 16-byte aligned: callee must decrement RSP by 8+n*16 bytes
; when it calls other functions (CALL instruction pushes 8 bytes);
; - 64-bit integer or pointer result is returned in register RAX/R0;
; - 32-bit integer result is returned in register EAX;
; - registers RAX/R0, RCX/R1, RDX/R2, R8, R9, R10 and R11 are volatile
; and can be clobbered;
; - registers RBX/R3, RSP/R4, RBP/R5, RSI/R6, RDI/R7, R12, R13, R14 and
; R15 must be preserved.
.code
memchr proc public ; void *memchr(void const *destination, int character, size_t count)
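mov r9, rdi
mov rdi, rcx ; rdi = address of destination
mov rcx, r8 ; rcx = count
mov eax, edx ; rax = character
test rsp, rsp ; ZF = 0 (required when count is 0)
repne scasb ; ZF = ([rdi-1] = character);
; rcx can be 0 for a match in the last character too!
lea rax, [rdi-1] ; rax = address of character
cmovne rax, rcx ; rax = ([rdi-1] = character) ? address of character : 0
mov rdi, r9
ret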
memchr endp
memcmp proc public ; int memcmp(void const *source, void const *destination, size_t count)
;; xor eax, eax ; rax = 0
;; test r8, r8
;; jz short equal ; count = 0?
;; cmp rcx, rdx
;; je short equal ; address of source = address of destination?
mov r9, rsi
mov rsi, rcx ; rsi = address of source
mov rcx, r8 ; rcx = count
mov r8, rdi
mov rdi, rdx ; rdi = address of destination
xor eax, eax ; rax = 0,
; CF = 0,
; ZF = 1 (required when count is 0)
repe cmpsb
seta al ; rax = (*source > *destination)
sbb rax, 0 ; rax = (*source > *destination)
; - (*source < *destination)
; = {1, 0, -1}
mov rdi, r8
mov rsi, r9
equal:
ret
memcmp endp
memcpy proc public ; void *memcpy(void *destination, void const *source, size_t count)
mov rax, rcx ; rax = address of destination
;; test r8, r8
;; jz short return ; count = 0?
;; cmp rcx, rdx
;; je short return ; address of destination = address of source?
mov r9, rdi
mov rdi, rcx ; rdi = address of destination
mov rcx, r8 ; rcx = count
mov r8, rsi
mov rsi, rdx ; rsi = address of source
if 1
rep movsb
else
shr rcx, 1 ; rcx = count / 2
jnc short @f ; count % 2 = 0?
movsb
@@:
shr rcx, 1 ; rcx = count / 4
jnc short @f ; count % 4 = 0?
movsw
@@:
shr rcx, 1 ; rcx = count / 8
jnc short @f ; count % 8 = 0?
movsd
@@:
rep movsq
endif
mov rdi, r9
mov rsi, r8
return:
ret
memcpy endp
memmem proc public ; void *memmem(void const *haystack, size_t count,
; void const *needle, size_t length)
xor eax, eax ; rax = address of needle in haystack = 0
test rdx, rdx
jz short empty ; length of haystack = 0?
cmp rdx, r9
jb short empty ; length of haystack < length of needle?
mov rax, rcx ; rax = address of haystack
test r9, r9
jz short empty ; length of needle = 0?
mov r10, rdi
mov rdi, rcx ; rdi = address of haystack
mov rcx, rdx ; rcx = length of haystack
mov r11, rsi
search:
mov al, [r8] ; al = first character of needle
; rdi = current address in haystack
repne scasb ; rdi = next address in haystack,
; rcx = current length of haystack
jne short break ; (first character of) needle not in haystack?
lea rdx, [rcx+1] ; rdx = length of haystack from matching character
cmp rdx, r9
jb short break ; length of haystack < length of needle?
mov al, [r8+r9-1] ; al = last character of needle
cmp al, [rdi+r9-2]
jne short continue ; last character of needle not in haystack?
compare:
mov rax, rdi ; rax = next address in haystack
mov rdx, rcx ; rdx = next length of haystack
if 0
dec rdi ; rdi = current address in haystack
; = address of matching character
mov rsi, r8 ; rsi = address of needle
mov rcx, r9 ; rcx = length of needle
else
; rdi = next address in haystack
mov rsi, r8
inc rsi ; rsi = address of needle + 1
mov rcx, r9
dec rcx ; rcx = length of needle - 1,
; ZF = (rcx = 0)
;; jz short match ; needle in haystack?
endif
repe cmpsb
je short match ; needle in haystack?
mov rdi, rax ; rdi = current address in haystack
mov rcx, rdx ; rcx = current length of haystack
continue:
cmp rcx, r9
jae short search ; length of haystack >= length of needle?
break:
xor eax, eax
mov rdi, r10
mov rsi, r11
empty:
ret
match:
dec rax ; rax = address of needle in haystack
mov rdi, r10
mov rsi, r11
ret
memmem endp
memmove proc public ; void *memmove(void *destination, void const *source, size_t count)
mov rax, rcx ; rax = address of destination
;; test r8, r8
;; jz short return ; count = 0?
cmp rcx, rdx
je short return ; address of destination = address of source?
mov r9, rdi
mov rdi, rcx ; rdi = address of destination
mov rcx, r8 ; rcx = count
mov r8, rsi
mov rsi, rdx ; rsi = address of source
jb short default ; address of destination < address of source?
add rdx, rcx ; rdx = address of source + count
cmp rdi, rdx
jae short default ; address of destination >= address of source + count?
overlap:
lea rdi, [rdi+rcx-1]
lea rsi, [rsi+rcx-1]
std
default:
rep movsb
cld
mov rdi, r9
mov rsi, r8
return:
ret
memmove endp
memrchr proc public ; void *memrchr(void const *destination, int character, size_t count)
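mov r9, rdi
lea rdi, [rcx+r8-1] ; rdi = address of destination + count - 1
mov eax, edx ; rax = character
mov rcx, r8 ; rcx = count
std
test rsp, rsp ; ZF = 0 (required when count is 0)
repne scasb ; ZF = ([rdi+1] = character);
; rcx can be 0 for a match in the first character too!
cld
lea rax, [rdi+1] ; rax = address of character
cmovne rax, rcx ; rax = ([rdi+1] = character) ? address of character : 0
mov rdi, r9
ret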
memrchr endp
memset proc public ; void *memset(void *destination, int character, size_t count)
mov r9, rcx ; r9 = address of destination
mov rcx, r8 ; rcx = count
;; jrcxz short @f ; count = 0?
mov r8, rdi
mov rdi, r9 ; rdi = address of destination
mov eax, edx ; rax = character
rep stosb
mov rdi, r8
@@:
mov rax, r9 ; rax = address of destination
ret
memset endp
end
Microsoft Macro Assembler Reference
Save the AMD64 assembler source presented above as amd64-mem.asm in the directory where you created the object library amd64.lib before, then execute the following 3 command lines to generate the object file amd64-mem.obj and add it to the existing object library amd64.lib:
SET ML=/c /Gy /W3 /X
ML64.EXE amd64-mem.asm
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-mem.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: if the /Gy option to package every function in its own, separately linkable COMDAT section is not available with the version of the macro assembler ML64.EXE you use, split the AMD64 assembler source into multiple pieces, with one function per source file.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler (x64) Version 14.16.27023.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: amd64-mem.asm
Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation. All rights reserved.
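REPNE SCASB decrements the count register for every byte it compares, including the matching one, before it tests the zero flag. The remaining count alone therefore can't tell a match in the last byte from no match at all, while the zero flag can. The following sketch models the instruction in C under that reading of the processor manual; struct scan, repne_scasb() and model_memchr() are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>

// hypothetical model of the register state after REPNE SCASB
struct scan
{
    unsigned char const *next;  // (E/R)DI: address after last byte compared
    size_t count;               // (E/R)CX: number of bytes not compared
    int zero;                   // ZF: 1 if the last byte compared matched
};

// models REPNE SCASB with the direction flag clear; 'zero' is the state
// of ZF before the instruction, which is kept when count is 0
static struct scan repne_scasb(void const *memory, int character, size_t count, int zero)
{
    struct scan state;

    state.next = (unsigned char const *) memory;
    state.count = count;
    state.zero = zero;
    while (state.count)
    {
        state.zero = *state.next++ == (unsigned char) character;
        --state.count;          // decremented before ZF is tested
        if (state.zero)
            break;
    }
    return state;
}

// memchr() built on the model: the result depends on ZF, not on the count
static void *model_memchr(void const *memory, int character, size_t count)
{
    struct scan state = repne_scasb(memory, character, count, 0);

    // no match implies that all bytes were compared
    assert(state.zero || state.count == 0);
    return state.zero ? (void *) (state.next - 1) : NULL;
}
```

Searching "abc" for 'c' and for 'x' with a count of 3 leaves the modelled count at 0 in both cases; only the modelled zero flag differs.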
memcpy() and memset() with Intrinsic Functions
The source below implements the memcpy() function as __movsb() alias REP MOVSB and the memset() function as __stosb() alias REP STOSB, but without shuffling as many registers as the external functions:
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef _WIN64
typedef unsigned int size_t;
#endif
#pragma function(memcpy, memset)
#pragma intrinsic(__movsb, __stosb)
__inline
void *memcpy(void *destination, void const *source, size_t count)
{
__movsb((unsigned char *) destination, (unsigned char const *) source, count);
return destination;
}
__inline
void *memset(void *destination, int character, size_t count)
{
__stosb((unsigned char *) destination, (unsigned char) character, count);
return destination;
}
intrinsic pragma
function pragma
strchr() Standard Function for i386 Platform
The strchr() function is not a compiler helper function; like the memchr() function, it is included here for entertainment due to its implementation:
DIR "%source%\intel\strchr.asm"
TYPE "%source%\intel\strchr.asm"
Volume in drive C has no label.
Volume Serial Number is 1957-0427
Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011 03:40 PM 5,904 strchr.asm
1 File(s) 5,904 bytes
0 Dir(s) 9,876,543,210 bytes free
page ,132
title strchr - search string for given character
;***
;strchr.asm - search a string for a given character
;
; Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
; defines strchr() - search a string for a character
;
;*******************************************************************************
.xlist
include cruntime.inc
.list
page
;***
;char *strchr(string, chr) - search a string for a character
;
;Purpose:
; Searches a string for a given character, which may be the
; null character '\0'.
;
; Algorithm:
; char *
; strchr (string, chr)
; char *string, chr;
; {
; while (*string && *string != chr)
; string++;
; if (*string == chr)
; return(string);
; return((char *)0);
; }
;
;Entry:
; char *string - string to search in
; char chr - character to search for
;
;Exit:
; returns pointer to the first occurence of c in string
; returns NULL if chr does not occur in string
;
;Uses:
;
;Exceptions:
;
;*******************************************************************************
CODESEG
found_bx:
lea eax,[edx - 1]
pop ebx ; restore ebx
ret ; _cdecl return
align 16
public strchr, __from_strstr_to_strchr
strchr proc \
string:ptr byte, \
chr:byte
OPTION PROLOGUE:NONE, EPILOGUE:NONE
.FPO ( 0, 2, 0, 0, 0, 0 )
xor eax,eax
mov al,[esp + 8] ; al = chr (search char)
__from_strstr_to_strchr label proc
push ebx ; PRESERVE EBX
mov ebx,eax ; ebx = 0/0/0/chr
shl eax,8 ; eax = 0/0/chr/0
mov edx,[esp + 8] ; edx = buffer
test edx,3 ; test if string is aligned on 32 bits
jz short main_loop_start
str_misaligned: ; simple byte loop until string is aligned
mov cl,[edx]
add edx,1
cmp cl,bl
je short found_bx
test cl,cl
jz short retnull_bx
test edx,3 ; now aligned ?
jne short str_misaligned
main_loop_start: ; set all 4 bytes of ebx to [chr]
or ebx,eax ; ebx = 0/0/chr/chr
push edi ; PRESERVE EDI
mov eax,ebx ; eax = 0/0/chr/chr
shl ebx,10h ; ebx = chr/chr/0/0
push esi ; PRESERVE ESI
or ebx,eax ; ebx = all 4 bytes = [chr]
; in the main loop (below), we are looking for chr or for EOS (end of string)
main_loop:
mov ecx,[edx] ; read dword (4 bytes)
mov edi,7efefeffh ; work with edi & ecx for looking for chr
mov eax,ecx ; eax = dword
mov esi,edi ; work with esi & eax for looking for EOS
xor ecx,ebx ; eax = dword xor chr/chr/chr/chr
add esi,eax
add edi,ecx
xor ecx,-1
xor eax,-1
xor ecx,edi
xor eax,esi
add edx,4
and ecx,81010100h ; test for chr
jnz short chr_is_found ; chr probably has been found
; chr was not found, check for EOS
and eax,81010100h ; is any flag set ??
jz short main_loop ; EOS was not found, go get another dword
and eax,01010100h ; is it in high byte?
jnz short retnull ; no, definitely found EOS, return failure
and esi,80000000h ; check was high byte 0 or 80h
jnz short main_loop ; it just was 80h in high byte, go get
; another dword
retnull:
pop esi
pop edi
retnull_bx:
pop ebx
xor eax,eax
ret ; _cdecl return
chr_is_found:
mov eax,[edx - 4] ; let's look one more time on this dword
cmp al,bl ; is chr in byte 0?
je short byte_0
test al,al ; test if low byte is 0
je retnull
cmp ah,bl ; is it byte 1
je short byte_1
test ah,ah ; found EOS ?
je retnull
shr eax,10h ; is it byte 2
cmp al,bl
je short byte_2
test al,al ; if in al some bits were set, bl!=bh
je retnull
cmp ah,bl
je short byte_3
test ah,ah
jz retnull
jmp short main_loop ; neither chr nor EOS found, go get
; another dword
byte_3:
pop esi
pop edi
lea eax,[edx - 1]
pop ebx ; restore ebx
ret ; _cdecl return
byte_2:
lea eax,[edx - 2]
pop esi
pop edi
pop ebx
ret ; _cdecl return
byte_1:
lea eax,[edx - 3]
pop esi
pop edi
pop ebx
ret ; _cdecl return
byte_0:
lea eax,[edx - 4]
pop esi ; restore esi
pop edi ; restore edi
pop ebx ; restore ebx
ret ; _cdecl return
strchr endp
end
With 89 instructions in 206 bytes, this implementation is even worse than that of the memchr() routine shown and dissected above!
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
strchr proc public ; char *strchr(unsigned char const *string, int character)
xor eax, eax ; eax = 0
cdq ; edx = 0
mov dl, [esp+8] ; edx = character
imul edx, 01010101h ; edx = character
; | character << 8
; | character << 16
; | character << 24
mov [esp+8], edx
mov ecx, [esp+4] ; ecx = address of string
push ebx
mov ebx, ecx
and ecx, 3 ; ecx = address of string % 4
; = 4 - number of unaligned characters
jz short aligned
unaligned:
sub ebx, ecx ; ebx = aligned address before string
shl ecx, 3 ; ecx = (4 - number of unaligned characters) * 8
; = 32 - number of unaligned bits
dec eax ; eax = ~0
shl eax, cl ; eax = ~0 for unaligned characters, 0 elsewhere
not eax ; eax = 0 for unaligned characters, ~0 elsewhere
mov ecx, [ebx] ; ecx = unaligned characters
xor edx, ecx ; edx = unaligned characters ^ multiplied character
or edx, eax ; edx = '\0' for matching characters
or eax, ecx ; eax = unaligned characters, ~0 elsewhere
jmp mycroft
next:
add ebx, 4 ; ebx = address of next 4 aligned characters
aligned:
mov edx, [esp+12] ; edx = multiplied character
mov eax, [ebx] ; eax = next 4 aligned characters
xor edx, eax ; edx = next 4 aligned characters ^ multiplied character
; = '\0' for matching characters
mycroft:
mov ecx, eax
sub eax, 01010101h
not ecx
and ecx, eax
mov eax, edx
sub eax, 01010101h
not edx
and eax, edx
or eax, ecx
and eax, 80808080h ; eax = '\200' for '\0' or matching characters
jz short next ; neither '\0' nor any matching character?
match:
bsf eax, eax ; eax = offset of '\0' or matching character * 8 + 7
; = {7, 15, 23, 31}
shr eax, 3 ; eax = offset of '\0' or matching character
; = {0, 1, 2, 3}
cdq ; edx = 0
add eax, ebx ; eax = address of '\0' or matching character
mov dl, [esp+12] ; edx = character
cmp dl, [eax] ; ZF = (character = matching character)
setne dl ; edx = (character = matching character) ? 0 : 1
dec edx ; edx = (character = matching character) ? -1 : 0
and eax, edx ; eax = address of matching character
pop ebx
ret
strchr endp
end
Save the i386 assembler source presented above as i386-strchr.asm and the ANSI C source presented below as i386-strchr.c, then execute the 6 command lines following the ANSI C source to assemble i386-strchr.asm, compile i386-strchr.c, link the generated object files i386-strchr.obj and i386-strchr.tmp, and finally execute the image file i386-strchr.exe to demonstrate the correct operation:
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
__declspec(safebuffers)
BOOL CDECL PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
CHAR szFormat[1024];
DWORD dwFormat;
DWORD dwOutput;
va_list vaInput;
va_start(vaInput, lpFormat);
dwFormat = wvsprintf(szFormat, lpFormat, vaInput);
va_end(vaInput);
if ((dwFormat == 0UL)
|| !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
return FALSE;
return dwOutput == dwFormat;
}
const CHAR szString[] = "01234567890";
__declspec(noreturn)
VOID CDECL mainCRTStartup(VOID)
{
LPCSTR lpString = szString + sizeof(szString);
DWORD dwError = ERROR_SUCCESS;
HANDLE hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
if (hOutput == INVALID_HANDLE_VALUE)
dwError = GetLastError();
else
while (--lpString >= szString)
{
PrintFormat(hOutput,
"0x%p: strchr(\"%hs\", '0') = 0x%p\r\n",
lpString, lpString, strchr(lpString, '0'));
PrintFormat(hOutput,
"0x%p: strchr(\"%hs\", '%hc') = 0x%p\r\n",
lpString, lpString, *lpString, strchr(lpString, *lpString));
PrintFormat(hOutput,
"0x%p: strchr(\"%hs\", '\\0') = 0x%p\r\n",
lpString, lpString, strchr(lpString, '\0'));
}
ExitProcess(dwError);
}
SET ML=/c /safeseh /W3 /X
ML.EXE i386-strchr.asm
SET CL=/GAFy /Oy /W4 /Zl
SET LINK=/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
CL.EXE /Foi386-strchr.tmp i386-strchr.obj i386-strchr.c kernel32.lib user32.lib
.\i386-strchr.exe
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: i386-strchr.asm
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-strchr.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
/out:i386-strchr.exe
i386-strchr.obj
i386-strchr.tmp
kernel32.lib
user32.lib
0x00922027: strchr("", '0') = 0x00000000
0x00922027: strchr("", '▯') = 0x00922027
0x00922027: strchr("", '\0') = 0x00922027
0x00922026: strchr("0", '0') = 0x00922026
0x00922026: strchr("0", '0') = 0x00922026
0x00922026: strchr("0", '\0') = 0x00922027
0x00922025: strchr("90", '0') = 0x00922026
0x00922025: strchr("90", '9') = 0x00922025
0x00922025: strchr("90", '\0') = 0x00922027
0x00922024: strchr("890", '0') = 0x00922026
0x00922024: strchr("890", '8') = 0x00922024
0x00922024: strchr("890", '\0') = 0x00922027
0x00922023: strchr("7890", '0') = 0x00922026
0x00922023: strchr("7890", '7') = 0x00922023
0x00922023: strchr("7890", '\0') = 0x00922027
0x00922022: strchr("67890", '0') = 0x00922026
0x00922022: strchr("67890", '6') = 0x00922022
0x00922022: strchr("67890", '\0') = 0x00922027
0x00922021: strchr("567890", '0') = 0x00922026
0x00922021: strchr("567890", '5') = 0x00922021
0x00922021: strchr("567890", '\0') = 0x00922027
0x00922020: strchr("4567890", '0') = 0x00922026
0x00922020: strchr("4567890", '4') = 0x00922020
0x00922020: strchr("4567890", '\0') = 0x00922027
0x0092201F: strchr("34567890", '0') = 0x00922026
0x0092201F: strchr("34567890", '3') = 0x0092201F
0x0092201F: strchr("34567890", '\0') = 0x00922027
0x0092201E: strchr("234567890", '0') = 0x00922026
0x0092201E: strchr("234567890", '2') = 0x0092201E
0x0092201E: strchr("234567890", '\0') = 0x00922027
0x0092201D: strchr("1234567890", '0') = 0x00922026
0x0092201D: strchr("1234567890", '1') = 0x0092201D
0x0092201D: strchr("1234567890", '\0') = 0x00922027
0x0092201C: strchr("01234567890", '0') = 0x0092201C
0x0092201C: strchr("01234567890", '0') = 0x0092201C
0x0092201C: strchr("01234567890", '\0') = 0x00922027
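The i386 routine above tests 4 characters at once with the expression (x - 0x01010101) & ~x & 0x80808080, applied both to the characters themselves and to the characters XORed with the multiplied search character. The following C sketch models that test on a 32-bit word in little-endian byte order; has_zero() and first_hit() are hypothetical names for this illustration, and the loop in first_hit() only stands in for the BSF and SHR instructions.

```c
#include <assert.h>
#include <stdint.h>

// nonzero if any of the 4 bytes of 'word' is 0; bit 7 of the lowest
// zero byte is always set, bits above it can be false positives
static uint32_t has_zero(uint32_t word)
{
    return (word - 0x01010101u) & ~word & 0x80808080u;
}

// hypothetical helper: index (0 = least significant) of the lowest byte of
// 'word' that is 0 or equals 'character', or 4 if there is none
static unsigned first_hit(uint32_t word, unsigned char character)
{
    uint32_t mask = has_zero(word) | has_zero(word ^ character * 0x01010101u);
    unsigned index = 0;

    if (!mask)
        return 4;
    while (!(mask & 0x80u))     // models BSF followed by SHR by 3
        mask >>= 8, index++;
    assert(index < 4);
    return index;
}
```

Since false positives can only appear above a genuine hit, taking the lowest set bit always yields a real '\0' or a real match; the caller then still has to check which of the two it found.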
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.xmm
.model flat, C
.code
strchr proc public ; char *strchr(unsigned char const *string, int character)
if 0
xor eax, eax ; eax = 0
mov al, [esp+8] ; eax = character
imul eax, 01010101h ; eax = character
; | character << 8
; | character << 16
; | character << 24
movd xmm0, eax
else
movd xmm0, dword ptr [esp+8]
punpcklbw xmm0, xmm0
punpcklwd xmm0, xmm0
endif
pshufd xmm0, xmm0, 0 ; xmm0 = multiplied character
mov ecx, [esp+4] ; ecx = address of string
mov edx, ecx
and ecx, 15 ; ecx = address of string % 16
; = 16 - number of unaligned characters
jz short aligned
unaligned:
sub edx, ecx ; edx = aligned address before string
movdqa xmm1, [edx] ; xmm1 = chunk of 16 characters
pxor xmm2, xmm2 ; xmm2 = 0
pcmpeqb xmm2, xmm1 ; xmm2 = '\377' for each '\0' in chunk
pcmpeqb xmm1, xmm0 ; xmm1 = '\377' for each matching character in chunk
por xmm1, xmm2 ; xmm1 = '\377' for each '\0' or matching character in chunk
pmovmskb eax, xmm1 ; eax = bitmask for '\0' or matching characters in chunk
shr eax, cl
shl eax, cl ; eax = bitmask for '\0' or matching characters in string
jnz short match
next:
add edx, 16 ; edx = address of next chunk of aligned characters
aligned:
movdqa xmm1, [edx] ; xmm1 = chunk of 16 characters
pxor xmm2, xmm2 ; xmm2 = 0
pcmpeqb xmm2, xmm1 ; xmm2 = '\377' for each '\0' in chunk
pcmpeqb xmm1, xmm0 ; xmm1 = '\377' for each matching character in chunk
por xmm1, xmm2 ; xmm1 = '\377' for each '\0' or matching character in chunk
pmovmskb eax, xmm1 ; eax = bitmask for '\0' or matching characters in chunk
test eax, eax
jz short next ; no '\0' or matching character in chunk?
match:
bsf eax, eax ; eax = offset of '\0' or matching character in chunk
add eax, edx ; eax = address of '\0' or matching character
mov cl, [esp+8] ; cl = character
cmp cl, [eax] ; ZF = (character = matching character)
setne cl ; ecx = (character = matching character) ? 0 : 1
dec ecx ; ecx = (character = matching character) ? -1 : 0
and eax, ecx ; eax = address of matching character
ret
strchr endp
end
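The SSE2 routine above reads the aligned 16-byte chunk that contains the start of the string, builds a bitmask of '\0' and matching characters, and shifts the mask right and left again to discard the characters in front of the string. The following portable C sketch models these steps without SSE2 instructions. chunk_mask() and model_strchr() are hypothetical names, and the caller is assumed to supply a buffer whose size is a multiple of 16 bytes and which holds the terminating '\0'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// models PCMPEQB twice, POR and PMOVMSKB: bit n of the result is set if
// byte n of the 16-byte chunk is '\0' or equals 'character'
static unsigned chunk_mask(unsigned char const *chunk, unsigned char character)
{
    unsigned mask = 0;
    unsigned index;

    for (index = 0; index < 16; index++)
        if (chunk[index] == '\0' || chunk[index] == character)
            mask |= 1u << index;
    return mask;
}

// hypothetical model of the search: 'memory' stands for a 16-byte aligned
// buffer, the string starts 'offset' bytes into it
static char *model_strchr(unsigned char const *memory, size_t offset, int character)
{
    unsigned char const *chunk = memory + (offset & ~(size_t) 15);
    unsigned skip = (unsigned) (offset & 15);
    // SHR and SHL by 'skip' clear the bits of the bytes before the string
    unsigned mask = chunk_mask(chunk, (unsigned char) character) >> skip << skip;
    unsigned index = 0;

    while (!mask)
        mask = chunk_mask(chunk += 16, (unsigned char) character);
    while (!(mask & 1u))        // models BSF
        mask >>= 1, index++;
    assert(index < 16);
    return chunk[index] == (unsigned char) character ? (char *) (chunk + index) : NULL;
}
```

The final comparison distinguishes a matching character from the terminating '\0', just as the CMP, SETNE, DEC and AND sequence does at the end of the assembler routine.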
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.xmm
.model flat, C
.code
strchr proc public ; char *strchr(unsigned char const *string, int character)
pxor xmm0, xmm0 ; xmm0 = 0
movd xmm1, dword ptr [esp+8]
; xmm1 = character
pshufb xmm1, xmm0 ; xmm1 = multiplied character
mov ecx, [esp+4] ; ecx = address of string
mov edx, ecx
and ecx, 15 ; ecx = address of string % 16
; = 16 - number of unaligned characters
jz short aligned
unaligned:
sub edx, ecx ; edx = aligned address before string
movdqa xmm2, [edx] ; xmm2 = chunk of 16 characters
;; pxor xmm0, xmm0 ; xmm0 = 0
pcmpeqb xmm0, xmm2 ; xmm0 = '\377' for each '\0' in chunk
pcmpeqb xmm2, xmm1 ; xmm2 = '\377' for each matching character in chunk
por xmm0, xmm2 ; xmm0 = '\377' for each '\0' or matching character in chunk
pmovmskb eax, xmm0 ; eax = bitmask for '\0' or matching characters in chunk
shr eax, cl
shl eax, cl ; eax = bitmask for '\0' or matching characters in string
jnz short match
next:
add edx, 16 ; edx = address of next chunk of aligned characters
aligned:
movdqa xmm2, [edx] ; xmm2 = chunk of 16 characters
pxor xmm0, xmm0 ; xmm0 = 0
pcmpeqb xmm0, xmm2 ; xmm0 = '\377' for each '\0' in chunk
pcmpeqb xmm2, xmm1 ; xmm2 = '\377' for each matching character in chunk
por xmm0, xmm2 ; xmm0 = '\377' for each '\0' or matching character in chunk
pmovmskb eax, xmm0 ; eax = bitmask for '\0' or matching characters in chunk
test eax, eax
jz short next ; no '\0' or matching character in chunk?
match:
bsf eax, eax ; eax = offset of '\0' or matching character in chunk
add eax, edx ; eax = address of '\0' or matching character
mov cl, [esp+8] ; cl = character
cmp cl, [eax] ; ZF = (character = matching character)
setne cl ; ecx = (character = matching character) ? 0 : 1
dec ecx ; ecx = (character = matching character) ? -1 : 0
and eax, ecx ; eax = address of matching character
ret
strchr endp
end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.xmm
.model flat, C
.code
strchr proc public ; char *strchr(unsigned char const *string, int character)
vpxor xmm0, xmm0, xmm0; xmm0 = 0
vmovd xmm1, dword ptr [esp+8]
; xmm1 = character
vpshufb xmm1, xmm1, xmm0; xmm1 = multiplied character
mov ecx, [esp+4] ; ecx = address of string
mov edx, ecx
and ecx, 15 ; ecx = address of string % 16
; = 16 - number of unaligned characters
jz short aligned
unaligned:
sub edx, ecx ; edx = aligned address before string
vmovdqa xmm2, xmmword ptr [edx]
; xmm2 = chunk of 16 characters
vpcmpeqb xmm3, xmm2, xmm1
; xmm3 = '\377' for each matching character in chunk
vpcmpeqb xmm2, xmm2, xmm0
; xmm2 = '\377' for each '\0' in chunk
vpor xmm2, xmm2, xmm3; xmm2 = '\377' for each '\0' or matching character in chunk
vpmovmskb eax, xmm2 ; eax = bitmask for '\0' or matching characters in chunk
shr eax, cl
shl eax, cl ; eax = bitmask for '\0' or matching characters in string
jnz short match
next:
add edx, 16 ; edx = address of next chunk of aligned characters
aligned:
vmovdqa xmm2, xmmword ptr [edx]
; xmm2 = chunk of 16 characters
vpcmpeqb xmm3, xmm2, xmm1
; xmm3 = '\377' for each matching character in chunk
vpcmpeqb xmm2, xmm2, xmm0
; xmm2 = '\377' for each '\0' in chunk
vpor xmm2, xmm2, xmm3; xmm2 = '\377' for each '\0' or matching character in chunk
vpmovmskb eax, xmm2 ; eax = bitmask for '\0' or matching characters in chunk
test eax, eax
jz short next ; no '\0' or matching character in chunk?
match:
bsf eax, eax ; eax = offset of '\0' or matching character in chunk
add eax, edx ; eax = address of '\0' or matching character
mov cl, [esp+8] ; cl = character
cmp cl, [eax] ; ZF = (character = matching character)
setne cl ; ecx = (character = matching character) ? 0 : 1
dec ecx ; ecx = (character = matching character) ? -1 : 0
and eax, ecx ; eax = address of matching character
ret
strchr endp
end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.ymm
.model flat, C
.code
strchr proc public ; char *strchr(unsigned char const *string, int character)
vpxor ymm0, ymm0, ymm0; ymm0 = 0
vpbroadcastb ymm1, byte ptr [esp+8]
; ymm1 = multiplied character
mov ecx, [esp+4] ; ecx = address of string
mov edx, ecx
and ecx, 31 ; ecx = address of string % 32
; = 32 - number of unaligned characters
jz short aligned
unaligned:
sub edx, ecx ; edx = aligned address before string
vmovdqa ymm2, ymmword ptr [edx]
; ymm2 = chunk of 32 characters
vpcmpeqb ymm3, ymm2, ymm1
; ymm3 = '\377' for each matching character in chunk
vpcmpeqb ymm2, ymm2, ymm0
; ymm2 = '\377' for each '\0' in chunk
vpor ymm2, ymm2, ymm3; ymm2 = '\377' for each '\0' or matching character in chunk
vpmovmskb eax, ymm2 ; eax = bitmask for '\0' or matching characters in chunk
shr eax, cl
shl eax, cl ; eax = bitmask for '\0' or matching characters in string
jnz short match
next:
add edx, 32 ; edx = address of next chunk of aligned characters
aligned:
vmovdqa ymm2, ymmword ptr [edx]
; ymm2 = chunk of 32 characters
vpcmpeqb ymm3, ymm2, ymm1
; ymm3 = '\377' for each matching character in chunk
vpcmpeqb ymm2, ymm2, ymm0
; ymm2 = '\377' for each '\0' in chunk
vpor ymm2, ymm2, ymm3; ymm2 = '\377' for each '\0' or matching character in chunk
vpmovmskb eax, ymm2 ; eax = bitmask for '\0' or matching characters in chunk
test eax, eax
jz short next ; no '\0' or matching character in chunk?
match:
bsf eax, eax ; eax = offset of '\0' or matching character in chunk
add eax, edx ; eax = address of '\0' or matching character
mov cl, [esp+8] ; cl = character
cmp cl, [eax] ; ZF = (character = matching character)
setne cl ; ecx = (character = matching character) ? 0 : 1
dec ecx ; ecx = (character = matching character) ? -1 : 0
and eax, ecx ; eax = address of matching character
ret
strchr endp
end
strlen() Standard Function for i386 Platform
The strlen() function is not a compiler helper function; like the memchr() function, it is included here for entertainment due to its implementation:
DIR "%source%\intel\strlen.asm"
TYPE "%source%\intel\strlen.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel

02/18/2011  03:40 PM             3,208 strlen.asm
               1 File(s)          3,208 bytes
               0 Dir(s)   9,876,543,210 bytes free

In the listing, lines marked - are deleted original instructions and lines marked + are the inserted replacements.

        page    ,132
        title   strlen - return the length of a null-terminated string
;***
;strlen.asm - contains strlen() routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       strlen returns the length of a null-terminated string,
;       not including the null byte itself.
;
;*******************************************************************************

        .xlist
        include cruntime.inc
        .list

page
;***
;strlen - return the length of a null-terminated string
;
;Purpose:
;       Finds the length in bytes of the given string, not including
;       the final null character.
;
;       Algorithm:
;       int strlen (const char * str)
;       {
;           int length = 0;
;
;           while( *str++ )
;                   ++length;
;
;           return( length );
;       }
;
;Entry:
;       const char * str - string whose length is to be computed
;
;Exit:
;       EAX = length of the string "str", exclusive of the final null byte
;
;Uses:
;       EAX, ECX, EDX
;
;Exceptions:
;
;*******************************************************************************

        CODESEG

        public  strlen

strlen  proc \
        buf:ptr byte

        OPTION PROLOGUE:NONE, EPILOGUE:NONE

        .FPO    ( 0, 1, 0, 0, 0, 0 )

string  equ     [esp + 4]

        mov     ecx,string              ; ecx -> string
-       test    ecx,3                   ; test if string is aligned on 32 bits
+       test    cl,3
        je      short main_loop

str_misaligned:
        ; simple byte loop until string is aligned
        mov     al,byte ptr [ecx]
-       add     ecx,1
+       inc     ecx
        test    al,al
        je      short byte_3
-       test    ecx,3
+       test    cl,3
        jne     short str_misaligned
        jmp     short main_loop
+byte_3:
+       lea     eax,[ecx - 1]
+       mov     ecx,string
+       sub     eax,ecx
+       ret
-       add     eax,dword ptr 0         ; 5 byte nop to align label below

        align   16                      ; should be redundant

main_loop:
        mov     eax,dword ptr [ecx]     ; read 4 bytes
-       mov     edx,7efefeffh
-       add     edx,eax
-       xor     eax,-1
-       xor     eax,edx
        add     ecx,4
-       test    eax,81010100h
+       lea     edx,[eax-01010101h]
+       not     eax
+       and     eax,edx
+       and     eax,80808080h
        je      short main_loop
        ; found zero byte in the loop
+       bsf     eax,eax
+       shr     eax,3
+       lea     eax,[eax+ecx-4]
+       mov     ecx,string
+       sub     eax,ecx
+       ret
-       mov     eax,[ecx - 4]
-       test    al,al                   ; is it byte 0
-       je      short byte_0
-       test    ah,ah                   ; is it byte 1
-       je      short byte_1
-       test    eax,00ff0000h           ; is it byte 2
-       je      short byte_2
-       test    eax,0ff000000h          ; is it byte 3
-       je      short byte_3
-       jmp     short main_loop         ; taken if bits 24-30 are clear and bit
-                                       ; 31 is set
-byte_3:
-       lea     eax,[ecx - 1]
-       mov     ecx,string
-       sub     eax,ecx
-       ret
-byte_2:
-       lea     eax,[ecx - 2]
-       mov     ecx,string
-       sub     eax,ecx
-       ret
-byte_1:
-       lea     eax,[ecx - 3]
-       mov     ecx,string
-       sub     eax,ecx
-       ret
-byte_0:
-       lea     eax,[ecx - 4]
-       mov     ecx,string
-       sub     eax,ecx
-       ret

strlen  endp

        end

With 44 instructions in 139 bytes, this routine is a real gem too, which nobody of sound mind should even consider using!
CAVEAT: Intel’s current Optimization Reference Manual: Volume 1, published January 2024, presents this dumb implementation as Example 14-3!
OOPS: the deleted TEST instructions with immediate value 3 should be replaced with the inserted shorter ones, saving 6 bytes.
OUCH¹: the deleted ADD instruction, which increments by 1, should be replaced with the inserted shorter INC instruction, saving 1 byte.
Note: the 7 saved bytes allow moving the 4 instructions after label byte_3: before the label main_loop:, (ab)using them to align the loop.
OUCH²: instead of the deleted XOR instruction with immediate operand -1, the inserted shorter NOT instruction should be used, saving 1 byte!
OUCH³: when the 5 deleted instructions after label main_loop: are replaced with the 4 instructions inserted there, the 22 (in words: twenty-two) deleted instructions at the end of the function can be replaced with the 6 faster and shorter instructions inserted there, saving 42 (in words: forty-two) bytes!
Note: Alan Mycroft posted the better, faster and shorter method to the USENET news group comp.lang.c on April 8, 1987:
You might be interested to know that such detection of null bytes in words
can be done in 3 or 4 instructions on almost any hardware (nay even in C).
(Code that follows relies on x being a 32 bit unsigned (or 2's complement
int with overflow ignored)...)
#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
Then if e is an expression without side effects (e.g. variable)
has_nullbyte_(e)
is nonzero iff the value of e has a null byte.
(One can view this as explicit programming of the Manchester carry chain
present in many processors which is hidden from the instruction level).
Note: see Bit Twiddling Hacks – Determine if a word has a zero byte for a comparison of both methods and more details.
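Mycroft's test is easy to try out in plain C. The following is a minimal sketch, assuming 32-bit words; the function names has_nullbyte() and lowest_nullbyte() are illustrative and do not appear in any of the sources discussed here. The test can set additional flag bits in bytes above a real zero byte (a borrow ripples upwards through bytes holding 0x01), but the lowest flag bit set always belongs to a real zero byte, which is why the assembler routines can locate the first '\0' with a single BSF instruction.

```c
#include <stdint.h>

/* Mycroft's test: the result is nonzero if and only if at least one
   of the four bytes of x is zero. */
uint32_t has_nullbyte(uint32_t x)
{
    return (x - UINT32_C(0x01010101)) & ~x & UINT32_C(0x80808080);
}

/* Illustrative helper: index (0 = least significant byte) of the
   lowest zero byte in x, or -1 if x contains no zero byte. */
int lowest_nullbyte(uint32_t x)
{
    uint32_t mask = has_nullbyte(x);
    int index = 0;

    if (mask == 0)
        return -1;

    /* the lowest flag bit set marks the lowest zero byte */
    while ((mask & 0x80) == 0) {
        mask >>= 8;
        index++;
    }

    return index;
}
```

On a little-endian processor such as the i386, byte 0 of a word loaded from memory is the character at the lowest address, so the index returned here is the offset of the first '\0' within the word.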
Note: Microsoft’s strcat.asm, strchr.asm, strncat.asm and strncpy.asm sources suffer from the same plus some more deficiencies and flaws!
Note: with the modifications shown in the source, this routine has 27 instructions in 87 bytes, i.e. less than two thirds of the original’s instructions and bytes.
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
strlen proc public ; size_t strlen(unsigned char const *string)
mov edx, [esp+4] ; edx = address of string
mov ecx, edx
and ecx, 3 ; ecx = address of string % 4
; = 4 - number of unaligned characters
jz short aligned ; address of string % 4 = 0?
unaligned:
sub edx, ecx ; edx = aligned address before string
shl ecx, 3 ; ecx = (4 - number of unaligned characters) * 8
; = 32 - number of unaligned bits
if 0
xor eax, eax
dec eax ; eax = ~0
else
or eax, -1 ; eax = ~0
endif
shl eax, cl ; eax = ~0 for unaligned characters, 0 elsewhere
not eax ; eax = 0 for unaligned characters, ~0 elsewhere
or eax, [edx] ; eax = unaligned characters
jmp short mycroft
next:
add edx, 4 ; edx = address of next 4 aligned characters
aligned:
mov eax, [edx] ; eax = next 4 aligned characters
mycroft:
mov ecx, eax
sub eax, 01010101h
not ecx
and eax, 80808080h
and eax, ecx ; eax = '\200' for matching characters, '\0' elsewhere
jz short next ; no '\0' in any character?
match:
bsf eax, eax ; eax = offset of '\0' * 8 + 7
; = {7, 15, 23, 31}
shr eax, 3 ; eax = offset of '\0'
; = {0, 1, 2, 3}
add eax, edx ; eax = address of '\0'
sub eax, [esp+4] ; eax = length of string
ret
strlen endp
end
Save the i386 assembler source presented above as i386-strlen.asm and the ANSI C source presented below as i386-strlen.c, then execute the 6 command lines following the ANSI C source to assemble i386-strlen.asm, compile i386-strlen.c, link the generated object files i386-strlen.obj and i386-strlen.tmp, and finally execute the image file i386-strlen.exe to demonstrate the correct operation:
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#pragma function(strlen)
__declspec(safebuffers)
BOOL CDECL PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
CHAR szFormat[1024];
DWORD dwFormat;
DWORD dwOutput;
va_list vaInput;
va_start(vaInput, lpFormat);
dwFormat = wvsprintf(szFormat, lpFormat, vaInput);
va_end(vaInput);
if ((dwFormat == 0UL)
|| !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
return FALSE;
return dwOutput == dwFormat;
}
const CHAR szString[] = "987654321";
__declspec(noreturn)
VOID CDECL mainCRTStartup(VOID)
{
LPCSTR lpString = szString + sizeof(szString);
DWORD dwError = ERROR_SUCCESS;
HANDLE hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
if (hOutput == INVALID_HANDLE_VALUE)
dwError = GetLastError();
else
while (--lpString >= szString)
PrintFormat(hOutput,
"0x%p: strlen(\"%hs\") = %lu\r\n",
lpString, lpString, strlen(lpString));
ExitProcess(dwError);
}
SET ML=/c /safeseh /W3 /X
ML.EXE i386-strlen.asm
SET CL=/GAFy /Oy /W4 /Zl
SET LINK=/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
CL.EXE /Foi386-strlen.tmp i386-strlen.obj i386-strlen.c kernel32.lib user32.lib
.\i386-strlen.exe
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: i386-strlen.asm
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.

i386-strlen.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.

/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
/out:i386-strlen.exe
i386-strlen.obj
i386-strlen.tmp
kernel32.lib
user32.lib
0x01202025: strlen("") = 0
0x01202024: strlen("1") = 1
0x01202023: strlen("21") = 2
0x01202022: strlen("321") = 3
0x01202021: strlen("4321") = 4
0x01202020: strlen("54321") = 5
0x0120201F: strlen("654321") = 6
0x0120201E: strlen("7654321") = 7
0x0120201D: strlen("87654321") = 8
0x0120201C: strlen("987654321") = 9
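The i386 routine above needs no byte loop for a misaligned string: it reads the aligned word that contains the start of the string and forces the bytes in front of the string to a non-zero value before applying Mycroft's test. The following portable C sketch illustrates the idea under stated assumptions: strlen_words() is a hypothetical name, the caller passes the start of a 4-byte aligned buffer plus an offset instead of a raw pointer (so no memory in front of the object is touched), and the buffer must be readable up to the end of the aligned word which holds the terminating '\0'.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: length of the string that starts at base[offset], scanned in
   aligned chunks of 4 bytes. Assumption: base is the start of a buffer
   whose size is a multiple of 4 and which contains the terminator. */
size_t strlen_words(const unsigned char *base, size_t offset)
{
    size_t skip = offset % 4;           /* bytes in front of the string */
    size_t position = offset - skip;    /* aligned position before string */
    unsigned char chunk[4];
    uint32_t word;
    size_t index;

    memcpy(chunk, base + position, 4);
    memset(chunk, 0xFF, skip);          /* bytes before the string can't match */

    for (;;) {
        memcpy(&word, chunk, 4);
        /* Mycroft's test: nonzero iff the chunk contains a zero byte */
        if ((word - UINT32_C(0x01010101)) & ~word & UINT32_C(0x80808080))
            break;
        position += 4;
        memcpy(chunk, base + position, 4);
    }

    /* locate the first zero byte inside the final chunk */
    for (index = 0; chunk[index] != 0; index++)
        ;

    return position + index - offset;
}
```

The assembler routine builds the same mask with SHL and NOT in a register and locates the zero byte with BSF; the byte-wise memset() and the final short loop are used here only to keep the sketch independent of byte order.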
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.xmm
.model flat, C
.code
strlen proc public ; size_t strlen(unsigned char const *string)
mov ecx, [esp+4] ; ecx = address of string
mov edx, ecx
and ecx, 15 ; ecx = address of string % 16
; = 16 - number of unaligned characters
jz short aligned
unaligned:
sub edx, ecx ; edx = aligned address before string
pxor xmm0, xmm0 ; xmm0 = 0
pcmpeqb xmm0, [edx] ; xmm0 = '\377' for each '\0' in chunk of characters
pmovmskb eax, xmm0 ; eax = bitmask for '\0' in chunk of characters
shr eax, cl
shl eax, cl ; eax = bitmask for '\0' in string
jnz short match
next:
add edx, 16 ; edx = address of next chunk of aligned characters
aligned:
pxor xmm0, xmm0 ; xmm0 = 0
pcmpeqb xmm0, [edx] ; xmm0 = '\377' for each '\0' in chunk of characters
pmovmskb eax, xmm0 ; eax = bitmask for '\0' in chunk of characters
test eax, eax
jz short next ; no '\0' in chunk?
match:
bsf eax, eax ; eax = offset of '\0' in chunk of characters
add eax, edx ; eax = address of '\0'
sub eax, [esp+4] ; eax = length of string
ret
strlen endp
end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.ymm
.model flat, C
.code
strlen proc public ; size_t strlen(unsigned char const *string)
vpxor xmm0, xmm0, xmm0; xmm0 = 0
mov ecx, [esp+4] ; ecx = address of string
mov edx, ecx
and ecx, 15 ; ecx = address of string % 16
; = 16 - number of unaligned characters
jz short aligned
unaligned:
sub edx, ecx ; edx = aligned address before string
vpcmpeqb xmm1, xmm0, [edx]
; xmm1 = '\377' for each '\0' in chunk of characters
vpmovmskb eax, xmm1 ; eax = bitmask for '\0' in chunk of characters
shr eax, cl
shl eax, cl ; eax = bitmask for '\0' in string
jnz short match
next:
add edx, 16 ; edx = address of next chunk of aligned characters
aligned:
vpcmpeqb xmm1, xmm0, [edx]
; xmm1 = '\377' for each '\0' in chunk of characters
vpmovmskb eax, xmm1 ; eax = bitmask for '\0' in chunk of characters
test eax, eax
jz short next ; no '\0' in chunk?
match:
bsf eax, eax ; eax = offset of '\0' in chunk of characters
add eax, edx ; eax = address of '\0'
sub eax, [esp+4] ; eax = length of string
ret
strlen endp
end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.ymm
.model flat, C
.code
strlen proc public ; size_t strlen(unsigned char const *string)
vpxor ymm0, ymm0, ymm0; ymm0 = 0
mov ecx, [esp+4] ; ecx = address of string
mov edx, ecx
and ecx, 31 ; ecx = address of string % 32
; = 32 - number of unaligned characters
jz short aligned
unaligned:
sub edx, ecx ; edx = aligned address before string
vpcmpeqb ymm1, ymm0, [edx]
; ymm1 = '\377' for each '\0' in chunk of characters
vpmovmskb eax, ymm1 ; eax = bitmask for '\0' in chunk of characters
shr eax, cl
shl eax, cl ; eax = bitmask for '\0' in string
jnz short match
next:
add edx, 32 ; edx = address of next chunk of aligned characters
aligned:
vpcmpeqb ymm1, ymm0, [edx]
; ymm1 = '\377' for each '\0' in chunk of characters
vpmovmskb eax, ymm1 ; eax = bitmask for '\0' in chunk of characters
test eax, eax
jz short next ; no '\0' in chunk?
match:
bsf eax, eax ; eax = offset of '\0' in chunk of characters
add eax, edx ; eax = address of '\0'
sub eax, [esp+4] ; eax = length of string
ret
strlen endp
end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.code
strlen proc public ; size_t strlen(unsigned char const *string)
mov r8, 0101010101010101h
if 0
mov r9, 8080808080808080h
elseif 0
imul r9, r8, 128 ; r9 = 0x8080808080808080
else
mov r9, r8
ror r9, 1 ; r9 = 0x8080808080808080
endif
mov r10, rcx
mov rdx, rcx ; rdx = address of string
and rcx, 7 ; rcx = address of string % 8
; = 8 - number of unaligned characters
jz short aligned ; address of string % 8 = 0?
unaligned:
sub rdx, rcx ; rdx = aligned address before string
shl ecx, 3 ; rcx = (8 - number of unaligned characters) * 8
; = 64 - number of unaligned bits
ifdef AMD
stc
sbb rax, rax ; rax = ~0
else
or rax, -1 ; rax = ~0
endif
shl rax, cl ; rax = ~0 for unaligned characters, 0 elsewhere
not rax ; rax = 0 for unaligned characters, ~0 elsewhere
or rax, [rdx] ; rax = unaligned characters
jmp short mycroft
next:
add rdx, 8 ; rdx = address of next 8 aligned characters
aligned:
mov rax, [rdx] ; rax = next 8 aligned characters
mycroft:
mov rcx, rax
sub rax, r8
not rcx
and rax, r9
and rcx, rax ; rax = '\200' for matching characters, '\0' elsewhere
jz short next ; no '\0' in any character?
match:
bsf rax, rcx ; rax = offset of '\0' * 8 + 7
; = {7, 15, 23, 31, 39, 47, 55, 63}
shr eax, 3 ; rax = offset of '\0'
; = {0, 1, 2, 3, 4, 5, 6, 7}
add rax, rdx ; rax = address of '\0'
sub rax, r10 ; rax = length of string
ret
strlen endp
end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.code
strlen proc public ; size_t strlen(unsigned char const *string)
mov rdx, rcx ; rdx = address of string
mov r8, rcx
and ecx, 15 ; rcx = address of string % 16
; = 16 - number of unaligned characters
jz short aligned
unaligned:
sub rdx, rcx ; rdx = aligned address before string
pxor xmm0, xmm0 ; xmm0 = 0
pcmpeqb xmm0, [rdx] ; xmm0 = '\377' for each '\0' in chunk of characters
pmovmskb eax, xmm0 ; rax = bitmask for '\0' in chunk of characters
shr eax, cl
shl eax, cl ; rax = bitmask for '\0' in string
jnz short match
next:
add rdx, 16 ; rdx = address of next chunk of aligned characters
aligned:
pxor xmm0, xmm0 ; xmm0 = 0
pcmpeqb xmm0, [rdx] ; xmm0 = '\377' for each '\0' in chunk of characters
pmovmskb eax, xmm0 ; rax = bitmask for '\0' in chunk of characters
test eax, eax
jz short next ; no '\0' in chunk?
match:
bsf eax, eax ; rax = offset of '\0' in chunk of characters
add rax, rdx ; rax = address of '\0'
sub rax, r8 ; rax = length of string
ret
strlen endp
end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.code
strlen proc public ; size_t strlen(unsigned char const *string)
vpxor xmm0, xmm0, xmm0; xmm0 = 0
mov rdx, rcx ; rdx = address of string
mov r8, rcx
and ecx, 15 ; rcx = address of string % 16
; = 16 - number of unaligned characters
jz short aligned
unaligned:
sub rdx, rcx ; rdx = aligned address before string
vpcmpeqb xmm1, xmm0, [rdx]
; xmm1 = '\377' for each '\0' in chunk of characters
vpmovmskb eax, xmm1 ; rax = bitmask for '\0' in chunk of characters
shr eax, cl
shl eax, cl ; rax = bitmask for '\0' in string
jnz short match
next:
add rdx, 16 ; rdx = address of next chunk of aligned characters
aligned:
vpcmpeqb xmm1, xmm0, [rdx]
; xmm1 = '\377' for each '\0' in chunk of characters
vpmovmskb eax, xmm1 ; rax = bitmask for '\0' in chunk of characters
test eax, eax
jz short next ; no '\0' in chunk?
match:
bsf eax, eax ; rax = offset of '\0' in chunk of characters
add rax, rdx ; rax = address of '\0'
sub rax, r8 ; rax = length of string
ret
strlen endp
end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.code
strlen proc public ; size_t strlen(unsigned char const *string)
vpxor ymm0, ymm0, ymm0; ymm0 = 0
mov rdx, rcx ; rdx = address of string
mov r8, rcx
and ecx, 31 ; rcx = address of string % 32
; = 32 - number of unaligned characters
jz short aligned
unaligned:
sub rdx, rcx ; rdx = aligned address before string
vpcmpeqb ymm1, ymm0, [rdx]
; ymm1 = '\377' for each '\0' in chunk of characters
vpmovmskb eax, ymm1 ; rax = bitmask for '\0' in chunk of characters
shr eax, cl
shl eax, cl ; rax = bitmask for '\0' in string
jnz short match
next:
add rdx, 32 ; rdx = address of next chunk of aligned characters
aligned:
vpcmpeqb ymm1, ymm0, [rdx]
; ymm1 = '\377' for each '\0' in chunk of characters
vpmovmskb eax, ymm1 ; rax = bitmask for '\0' in chunk of characters
test eax, eax
jz short next ; no '\0' in chunk?
match:
bsf eax, eax ; rax = offset of '\0' in chunk of characters
add rax, rdx ; rax = address of '\0'
sub rax, r8 ; rax = length of string
ret
strlen endp
end
strrchr() and strstr() Standard Functions for i386 Platform
The strrchr() and strstr() functions are not compiler helper functions; like the memchr() function, they are included here for entertainment due to their extraordinary implementation:
DIR "%source%\str*.c"
TYPE "%source%\strrchr.c"
TYPE "%source%\strstr.c"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427

 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src

02/18/2011  03:40 PM             1,998 strcat.c
02/18/2011  03:40 PM               541 strcat_s.c
02/18/2011  03:40 PM             1,102 strchr.c
02/18/2011  03:40 PM             1,566 strcmp.c
02/18/2011  03:40 PM             2,532 strcoll.c
02/18/2011  03:40 PM               479 strcpy_s.c
02/18/2011  03:40 PM               337 strcspn.c
02/18/2011  03:40 PM             3,227 strdate.c
02/18/2011  03:40 PM             1,895 strdup.c
02/18/2011  03:40 PM             4,085 stream.c
02/18/2011  03:40 PM             4,414 strerror.c
02/18/2011  03:40 PM            42,150 strftime.c
02/18/2011  03:40 PM             2,757 stricmp.c
02/18/2011  03:40 PM             2,570 stricoll.c
02/18/2011  03:40 PM             1,009 strlen.c
02/18/2011  03:40 PM             1,276 strlen_s.c
02/18/2011  03:40 PM             5,994 strlwr.c
02/18/2011  03:40 PM             1,496 strncat.c
02/18/2011  03:40 PM               564 strncat_s.c
02/18/2011  03:40 PM             2,546 strncmp.c
02/18/2011  03:40 PM             1,250 strncnt.c
02/18/2011  03:40 PM             3,108 strncoll.c
02/18/2011  03:40 PM             1,464 strncpy.c
02/18/2011  03:40 PM               536 strncpy_s.c
02/18/2011  03:40 PM             3,628 strnicmp.c
02/18/2011  03:40 PM             2,988 strnicol.c
02/18/2011  03:40 PM             1,243 strnset.c
02/18/2011  03:40 PM               580 strnset_s.c
02/18/2011  03:40 PM               337 strpbrk.c
02/18/2011  03:40 PM             1,460 strrchr.c
02/18/2011  03:40 PM             1,204 strrev.c
02/18/2011  03:40 PM             1,204 strset.c
02/18/2011  03:40 PM               519 strset_s.c
02/18/2011  03:40 PM             4,922 strspn.c
02/18/2011  03:40 PM             1,371 strstr.c
02/18/2011  03:40 PM             3,226 strtime.c
02/18/2011  03:40 PM             3,500 strtod.c
02/18/2011  03:40 PM             4,167 strtok.c
02/18/2011  03:40 PM               450 strtok_s.c
02/18/2011  03:40 PM             8,862 strtol.c
02/18/2011  03:40 PM             7,726 strtoq.c
02/18/2011  03:40 PM             6,094 strupr.c
02/18/2011  03:40 PM             4,739 strxfrm.c
              43 File(s)        147,116 bytes
               0 Dir(s)   9,876,543,210 bytes free

/***
*strrchr.c - find last occurrence of character in string
*
*       Copyright (c) Microsoft Corporation. All rights reserved.
*
*Purpose:
*       defines strrchr() - find the last occurrence of a given character
*       in a string.
*
*******************************************************************************/

#include <cruntime.h>
#include <string.h>

/***
*char *strrchr(string, ch) - find last occurrence of ch in string
*
*Purpose:
*       Finds the last occurrence of ch in string.  The terminating
*       null character is used as part of the search.
*
*Entry:
*       char *string - string to search in
*       char ch - character to search for
*
*Exit:
*       returns a pointer to the last occurrence of ch in the given
*       string
*       returns NULL if ch does not occurr in the string
*
*Exceptions:
*
*******************************************************************************/

char * __cdecl strrchr (
        const char * string,
        int ch
        )
{
        char *start = (char *)string;

        while (*string++)                       /* find end of string */
                ;
                                                /* search towards front */
        while (--string != start && *string != (char)ch)
                ;

        if (*string == (char)ch)                /* char found ? */
                return( (char *)string );

        return(NULL);
}

/***
*strstr.c - search for one string inside another
*
*       Copyright (c) Microsoft Corporation. All rights reserved.
*
*Purpose:
*       defines strstr() - search for one string inside another
*
*******************************************************************************/

#include <cruntime.h>
#include <string.h>

/***
*char *strstr(string1, string2) - search for string2 in string1
*
*Purpose:
*       finds the first occurrence of string2 in string1
*
*Entry:
*       char *string1 - string to search in
*       char *string2 - string to search for
*
*Exit:
*       returns a pointer to the first occurrence of string2 in
*       string1, or NULL if string2 does not occur in string1
*
*Uses:
*
*Exceptions:
*
*******************************************************************************/

char * __cdecl strstr (
        const char * str1,
        const char * str2
        )
{
        char *cp = (char *) str1;
        char *s1, *s2;

        if ( !*str2 )
            return((char *)str1);

        while (*cp)
        {
                s1 = cp;
                s2 = (char *) str2;

                while ( *s1 && *s2 && !(*s1-*s2) )
                        s1++, s2++;

                if (!*s2)
                        return(cp);

                cp++;
        }

        return(NULL);
}

OUCH¹: the strrchr() function needlessly traverses its input string twice!
OUCH²: the strstr() function has quadratic, i.e. 𝒪(n²) runtime, a real shame!
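The cost of the double loop in strstr() can be made visible by counting character comparisons. The following sketch mirrors the structure of the routine quoted above but returns the number of comparisons instead of a pointer; naive_strstr_comparisons() is a hypothetical name introduced only for this illustration. For a haystack of n characters and a needle of m characters the count can reach roughly (n - m + 1) × m, which becomes quadratic in n when m grows in proportion to n.

```c
#include <stddef.h>

/* Sketch: number of character comparisons performed by a naive
   double-loop substring search until the first match (or the end
   of the haystack) is reached. */
size_t naive_strstr_comparisons(const char *haystack, const char *needle)
{
    size_t comparisons = 0;
    const char *cp, *s1, *s2;

    for (cp = haystack; *cp; cp++) {
        s1 = cp;
        s2 = needle;

        while (*s1 && *s2) {
            comparisons++;
            if (*s1 != *s2)     /* mismatch: restart one character later */
                break;
            s1++, s2++;
        }

        if (!*s2)               /* whole needle matched */
            break;
    }

    return comparisons;
}
```

With the haystack "aaaaaaab" and the needle "aaab", each of the 5 candidate positions costs 4 comparisons, for a total of 20, although the haystack has only 8 characters.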
The strrchr() function for processors which support the Streaming SIMD Extensions 4.2 alias Nehalem New Instructions, introduced November 11, 2008 with the Core™ i* line of processors, needs only 14 instructions in just 42 bytes:
; Copyright © 2009-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.xmm
.model flat, C
.code
strrchr proc public ; char *strrchr(unsigned char const *string, int character)
xor eax, eax ; eax = 0
mov edx, [esp+4] ; edx = address of string
and edx, -16 ; edx = aligned address before string
movzx ecx, byte ptr [esp+8]
movd xmm0, ecx ; xmm0 = prototype string "‹character›"
@@:
pcmpistri xmm0, [edx], 40h
; CF = ('‹character›' in chunk of characters),
; ZF = ('\0' in chunk of characters),
; ecx = ('\0' or '‹character›' in chunk of characters)
; ? index of '\0' or last matching '‹character›' : 16
lea ecx, [ecx+edx]
cmovc eax, ecx ; eax = address of last matching '‹character›'
lea edx, [edx+16]
jnz short @b ; no '\0' in chunk of characters?
xor ecx, ecx ; ecx = 0
cmp eax, [esp+4]
cmovb eax, ecx ; eax = (address of '‹character›' < address of string) ? 0
ret
strrchr endp
end
str*() Standard Functions
Implementations of the strcat(), strchr(), strcmp(), strcoll(), strcpy(), strcspn(), strlen(), strncat(), strncmp(), strncpy(), strnlen(), strnset(), strpbrk(), strrchr(), strrev(), strset(), _strset(), strspn(), strstr(), strtok(), strtok_s() and strtok_r() functions for the i386 and the AMD64 platform follow with build instructions.
Note: only strcat(), strcmp(), strcpy(), strlen() and strset() are available as intrinsic functions.
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define NULL (void *) 0
#ifndef _WIN64
typedef unsigned int size_t;
#endif
#pragma function(strcat, strcmp, strcpy, strlen, strset)
#pragma intrinsic(memcmp)
void *memchr(void const *memory, int character, size_t count);
int memcmp(void const *source, void const *destination, size_t count);
size_t strlen(unsigned char const *string);
char *strstr(unsigned char const *haystack, unsigned char const *needle)
{
#if 0
if (*needle == '\0') // needle is an empty string?
return (char *) haystack;
if (*haystack == '\0') // haystack is an empty string?
return NULL;
return (char *) memmem(haystack, strlen(haystack), needle, strlen(needle));
#else
unsigned char const *string;
size_t length = strlen(needle);
size_t count = strlen(haystack);
if (!length) // needle is an empty string?
return (char *) haystack;
if (!count || length > count)
return NULL;
if (!--length) // needle is a single character?
return memchr(haystack, *needle, count);
count -= length; // maximum number of characters to scan in haystack
while (string = haystack,
haystack = (unsigned char const *) memchr(haystack, *needle, count),
haystack) // *haystack is first character of needle; compare
{ // last character of needle first, then proceed
if (haystack[length] == needle[length]
#if 0
&& length == 1 || !memcmp(haystack + 1, needle + 1, length - 1))
#else
&& !memcmp(haystack, needle, length))
#endif
return (char *) haystack;
// skip character in haystack,
// adjust number of characters left in haystack
count -= ++haystack - string;
if (!count)
break;
}
return NULL;
#endif
}
char *strrchr(unsigned char const *string, int character)
{
char *address = NULL;
do
if (*string == (unsigned char) character)
address = (char *) string;
while (*string++);
return address;
}
char *strchr(unsigned char const *string, int character)
{
do
if (*string == (unsigned char) character)
return (char *) string;
while (*string++);
return NULL;
}
char *strcat(unsigned char *destination, unsigned char const *source)
{
char *string = (char *) destination;
#if 0
destination += strlen(destination);
#else
while (*destination)
destination++;
#endif
while (*source)
*destination++ = *source++;
*destination = '\0'; // terminate the concatenated string
return string;
}
char *strncat(unsigned char *destination, unsigned char const *source, size_t count)
{
char *string = (char *) destination;
#if 0
destination += strlen(destination);
#else
while (*destination)
destination++;
#endif
while (count && *source)
*destination++ = *source++, --count;
*destination = '\0';
return string;
}
char *strcpy(unsigned char *destination, unsigned char const *source)
{
char *string = (char *) destination;
while (*source)
*destination++ = *source++;
*destination = '\0'; // terminate the copied string
return string;
}
char *strncpy(unsigned char *destination, unsigned char const *source, size_t count)
{
char *string = (char *) destination;
while (count && *source)
*destination++ = *source++, --count;
while (count)
*destination++ = '\0', --count;
return string;
}
int strcmp(unsigned char const *source, unsigned char const *destination)
{
if (source != destination)
do
if (*source - *destination)
#if 0
return *source - *destination;
#else
return (*source > *destination) - (*source < *destination);
#endif
while (destination++, *source++);
return 0;
}
int strncmp(unsigned char const *source, unsigned char const *destination, size_t count)
{
if (count && source != destination)
do
if (*source - *destination)
#if 0
return *source - *destination;
#else
return (*source > *destination) - (*source < *destination);
#endif
while (source++, *destination++ && --count);
return 0;
}
size_t strlen(unsigned char const *string)
{
#if 0
unsigned char *source = string;
while (*source)
source++;
return source - string;
#else
return (unsigned char *) memchr(string, '\0', ~(size_t) 0) - string;
#endif
}
size_t strnlen(unsigned char const *string, size_t count)
{
unsigned char *nul = memchr(string, '\0', count);
return nul ? nul - string : count;
}
__declspec(safebuffers)
size_t strspn(unsigned char const *string, unsigned char const *delimiter)
{
// yield number of leading characters in array 'string'
// which are equal to any character in array 'delimiter'
size_t bitmap[256 / (8 * sizeof(size_t))] = {0};
if (!*delimiter)
return 0;
if (!*string)
return 0;
do // build bitmap
bitmap[*delimiter / (8 * sizeof(size_t))] |= (size_t) 1 << *delimiter % (8 * sizeof(size_t));
while (*++delimiter);
delimiter = string;
while (bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t))))
string++;
return string - delimiter;
}
__declspec(safebuffers)
size_t strcspn(unsigned char const *string, unsigned char const *delimiter)
{
// yield number of leading characters in array 'string'
// which differ from each character in array 'delimiter'
size_t bitmap[256 / (8 * sizeof(size_t))] = {1};
if (!*delimiter)
return strlen(string);
if (!*string)
return 0;
do // build bitmap
bitmap[*delimiter / (8 * sizeof(size_t))] |= (size_t) 1 << *delimiter % (8 * sizeof(size_t));
while (*++delimiter);
delimiter = string;
while (!(bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t)))))
string++;
return string - delimiter;
}
__declspec(safebuffers)
char *strpbrk(unsigned char const *string, unsigned char const *delimiter)
{
// yield pointer to first character in array 'string'
// which is equal to any character in array 'delimiter'
#if 0
string += strcspn(string, delimiter);
return *string ? (char *) string : NULL;
#else
size_t bitmap[256 / (8 * sizeof(size_t))] = {0};
if (!*delimiter)
return NULL;
if (!*string)
return NULL;
do // build bitmap
bitmap[*delimiter / (8 * sizeof(size_t))] |= (size_t) 1 << *delimiter % (8 * sizeof(size_t));
while (*++delimiter);
do
if (bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t))))
return (char *) string;
while (*++string);
return NULL;
#endif
}
char *strset(char *string, int character)
{
char *destination = string;
while (*destination)
*destination++ = (char) character;
return string;
}
__declspec(safebuffers)
char *strtok_r(unsigned char *string, unsigned char const *delimiter, char **next)
{
#if 0
if (!string)
string = (unsigned char *) *next;
if (!string || !*string)
return *next = NULL;
// skip leading delimiters
string += strspn(string, delimiter);
if (!*string) // no characters left?
return *next = NULL;
// skip token, i.e. non-delimiters,
// and save its address
*next = (char *) string + strcspn(string, delimiter);
if (!**next) // no characters left?
*next = NULL;
else // terminate token
*(*next)++ = '\0';
return (char *) string;
#else
size_t bitmap[256 / (8 * sizeof(size_t))] = {0};
if (!string)
string = (unsigned char *) *next;
if (!string || !*string)
return *next = NULL;
if (!*delimiter)
return *next = NULL, (char *) string;
do // build bitmap
bitmap[*delimiter / (8 * sizeof(size_t))] |= (size_t) 1 << *delimiter % (8 * sizeof(size_t));
while (*++delimiter);
// skip leading delimiters
while (bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t))))
string++;
if (!*string) // no characters left?
return *next = NULL;
delimiter = string; // save (address of) token
*bitmap |= 1; // add '\0' as delimiter
do // skip token, i.e. non-delimiters
string++;
while (!(bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t)))));
if (!*string) // no characters left?
string = NULL;
else // terminate token
*string++ = '\0';
*next = (char *) string;// save (address of) next character
return (char *) delimiter;
#endif
}
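The strspn(), strcspn(), strpbrk() and strtok_r() functions above all rely on the same device: a 256-bit bitmap with one bit per possible character value, so that testing whether a character belongs to the delimiter set takes one index operation, one shift and one mask. The following is a minimal stand-alone sketch of that device; the names charset_add(), charset_has() and span_sketch() are hypothetical and chosen only for this illustration.

```c
#include <stddef.h>

#define WORD_BITS (8 * sizeof(size_t))  /* bits per bitmap word */
#define MAP_WORDS (256 / WORD_BITS)     /* words needed for 256 bits */

/* Set the bit which represents character c. */
void charset_add(size_t map[], unsigned char c)
{
    map[c / WORD_BITS] |= (size_t) 1 << (c % WORD_BITS);
}

/* Yield 1 if the bit for character c is set, else 0. */
int charset_has(const size_t map[], unsigned char c)
{
    return (int) ((map[c / WORD_BITS] >> (c % WORD_BITS)) & 1);
}

/* Number of leading characters of string which occur in accept;
   same contract as the standard strspn() function. */
size_t span_sketch(const char *string, const char *accept)
{
    size_t map[MAP_WORDS] = {0};
    const unsigned char *s = (const unsigned char *) string;
    const unsigned char *a = (const unsigned char *) accept;
    size_t count = 0;

    while (*a)
        charset_add(map, *a++);

    /* '\0' is never added to the map, so the scan stops at the terminator */
    while (charset_has(map, s[count]))
        count++;

    return count;
}
```

Building the bitmap costs one pass over the delimiter set, after which every character of the string is classified in constant time instead of being compared against each delimiter in turn.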
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: characters are unsigned!
.386
.model flat, C
.code
strcat proc public ; char *strcat(char *destination, char const *source)
mov edx, edi
mov edi, [esp+8] ; edi = address of source string
xor eax, eax ; eax = '\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasb
not ecx ; ecx = length of source string (including '\0')
push ecx
mov edi, [esp+8] ; edi = address of destination string
;; xor eax, eax ; eax = '\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasb
dec edi ; edi = address of '\0'
; = end of destination string
mov eax, esi
mov esi, [esp+12] ; esi = address of source string
pop ecx ; ecx = length of source string (including '\0')
rep movsb
mov edi, edx
mov esi, eax
mov eax, [esp+4] ; eax = address of destination string
ret
strcat endp
strchr proc public ; char *strchr(char const *string, int character)
mov edx, edi
mov edi, [esp+4] ; edi = address of string
xor eax, eax ; eax = '\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasb
not ecx ; ecx = length of string (including '\0')
sub edi, ecx ; edi = address of string
mov eax, [esp+8] ; eax = character
repne scasb ; ZF = (character found)
lea eax, [edi-1] ; eax = address of character
je short found ; character found (also when it is the terminating '\0')?
xor eax, eax ; eax = 0
found:
mov edi, edx
ret
strchr endp
strcmp proc public ; int strcmp(char const *source, char const *destination)
push esi
push edi
xor eax, eax ; eax = 0
mov esi, [esp+12] ; esi = address of source string
mov edi, [esp+16] ; edi = address of destination string
cmp edi, esi
je short equal ; address of destination string = address of source string?
;; xor eax, eax ; eax = '\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasb
not ecx ; ecx = length of destination string (including '\0')
sub edi, ecx ; edi = address of destination string
;; xor eax, eax ; eax = 0
repe cmpsb
seta al ; eax = (*source > *destination)
sbb eax, 0 ; eax = (*source > *destination)
; - (*source < *destination)
; = {1, 0, -1}
equal:
pop edi
pop esi
ret
strcmp endp
; NOTE: strcoll() is another implementation of strcmp()!
strcoll proc public ; int strcoll(char const *source, char const *destination)
mov ecx, [esp+4] ; ecx = address of source string
mov edx, [esp+8] ; edx = address of destination string
sub edx, ecx
jz short equal ; address of destination string = address of source string?
compare:
mov al, [ecx]
cmp al, [ecx+edx]
jne short different
inc ecx
test al, al
jnz short compare ; *source <> '\0'?
equal:
xor eax, eax ; eax = 0
ret
different:
sbb eax, eax ; eax = (*source < *destination) ? -1 : 0
or eax, 1 ; eax = (*source > *destination)
; - (*source < *destination)
; = {1, -1}
ret
strcoll endp
strcpy proc public ; char *strcpy(char *destination, char const *source)
mov edx, edi
mov edi, [esp+8] ; edi = address of source string
xor eax, eax ; eax = '\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasb
not ecx ; ecx = length of source string (including '\0')
sub edi, ecx ; edi = address of source string
mov eax, esi
mov esi, edi ; esi = address of source string
mov edi, [esp+4] ; edi = address of destination string
rep movsb
mov edi, edx
mov esi, eax
mov eax, [esp+4] ; eax = address of destination string
ret
strcpy endp
strcspn proc public ; size_t strcspn(char const *string, char const *delimiter)
mov eax, [esp+4] ; eax = address of string
mov edx, [esp+8] ; edx = address of delimiter
xor ecx, ecx ; ecx = 0
cmp cl, [edx]
je short empty ; delimiter[0] = '\0'?
push ecx
push ecx
push ecx
push ecx
push ecx
push ecx
push ecx
push ecx ; bitmap[0..255] = 0,
; esp = address of bitmap
setup:
bts [esp], ecx ; bitmap[ecx] = 1
mov cl, [edx] ; ecx = *delimiter
inc edx ; edx = ++delimiter
cmp cl, ch
jne short setup ; ecx <> '\0'?
mov edx, eax ; edx = address of string
skip:
mov cl, [eax] ; ecx = *string
inc eax ; eax = ++string
bt [esp], ecx
jnc short skip ; bitmap[ecx] = 0 (no match)?
stop:
sbb eax, edx ; eax = number of non-matching characters
add esp, 32
ret
empty:
mov edx, eax ; edx = address of string
count:
inc eax ; eax = ++string
cmp cl, [eax-1]
jne short count
if 1
dec eax ; eax = --string
sub eax, edx
else
stc
sbb eax, edx ; eax = number of characters
endif
ret
strcspn endp
strlen proc public ; size_t strlen(char const *string)
mov edx, edi
mov edi, [esp+4] ; edi = address of string
xor eax, eax ; eax = '\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasb ; ecx = -1 - (address of '\0' + 1 - address of string)
; = -1 - (length of string + 1)
; = -2 - length of string
if 0
mov eax, -2
sub eax, ecx ; eax = -2 + 2 + length of string
; = length of string
else
mov eax, ecx ; eax = -1 - (length of string + 1)
not eax ; eax = length of string + 1
dec eax ; eax = length of string
endif
mov edi, edx
ret
strlen endp
strncat proc public ; char *strncat(char *destination, char const *source, size_t count)
push esi
push edi
mov esi, [esp+16] ; esi = address of source string
mov edx, [esp+20] ; edx = count
mov edi, esi ; edi = address of source string
mov ecx, edx ; ecx = count
xor eax, eax ; eax = '\0'
repne scasb
sub edx, ecx ; edx = length of source string (including '\0')
mov edi, [esp+12] ; edi = address of destination string
;; xor eax, eax ; eax = '\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasb
dec edi ; edi = address of '\0'
; = end of destination string
mov ecx, edx ; ecx = length of source string (including '\0')
rep movsb
;; xor eax, eax ; eax = '\0'
stosb
mov eax, [esp+12] ; eax = address of destination string
pop edi
pop esi
ret
strncat endp
strncmp proc public ; int strncmp(char const *source, char const *destination, size_t count)
push esi
push edi
xor eax, eax ; eax = 0
mov esi, [esp+12] ; esi = address of source string
mov edi, [esp+16] ; edi = address of destination string
cmp edi, esi
je short equal ; address of destination string = address of source string?
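mov ecx, [esp+20] ; ecx = count
test ecx, ecx
jz short equal ; count = 0?
mov edx, ecx ; edx = count
;; xor eax, eax ; eax = '\0'
repne scasb ; ecx = count - number of characters scanned
sub edx, ecx ; edx = min(count, length of destination string (including '\0'))
mov edi, [esp+16] ; edi = address of destination string
mov ecx, edx ; ecx = number of characters to compare
repe cmpsb ; comparison stops at the latest with the '\0' of the destination string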
seta al ; eax = (*source > *destination)
sbb eax, 0 ; eax = (*source > *destination)
; - (*source < *destination)
; = {1, 0, -1}
equal:
pop edi
pop esi
ret
strncmp endp
strncpy proc public ; char *strncpy(char *destination, char const *source, size_t count)
push esi
push edi
mov esi, [esp+16] ; esi = address of source string
mov edx, [esp+20] ; edx = count
mov edi, esi ; edi = address of source string
mov ecx, edx ; ecx = count
xor eax, eax ; eax = '\0'
repne scasb
sub ecx, edx
neg ecx ; ecx = length of source string (including '\0')
sub edx, ecx ; edx = count - length of source string (including '\0')
mov edi, [esp+12] ; edi = address of destination string
rep movsb
mov ecx, edx ; ecx = count - length of source string (including '\0')
;; xor eax, eax ; eax = '\0'
rep stosb
mov eax, [esp+12] ; eax = address of destination string
pop edi
pop esi
ret
strncpy endp
strnlen proc public ; size_t strnlen(char const *string, size_t count)
mov ecx, [esp+8] ; ecx = count
test ecx, ecx
jz short empty ; count = 0?
xor eax, eax ; eax = '\0'
mov edx, edi
mov edi, [esp+4] ; edi = address of string
;; test edi, edi ; ZF = 0 (required when count is 0)
repne scasb ; ecx = count - number of characters scanned,
; ZF = ('\0' found)
sete al ; eax = ('\0' found) ? 1 : 0
add ecx, eax ; ecx = (length of string < count)
; ? count - length of string : 0
neg ecx
add ecx, [esp+8] ; ecx = (length of string < count)
; ? length of string : count
mov edi, edx
empty:
mov eax, ecx ; eax = (length of string < count)
; ? length of string : count
ret
strnlen endp
strnset proc public ; char *strnset(char *string, int character, size_t count)
mov edx, [esp+4] ; edx = address of string
mov ecx, [esp+12] ; ecx = count
test ecx, ecx
jz short zero ; count = 0?
xor eax, eax ; eax = '\0'
push edi
mov edi, edx ; edi = address of string
;; test edi, edi ; ZF = 0 (required when count is 0)
repne scasb ; ecx = count - number of characters scanned,
; ZF = ('\0' found)
mov edi, edx ; edi = address of string
sete al ; eax = ('\0' found) ? 1 : 0
add ecx, eax ; ecx = (length of string < count)
; ? count - length of string : 0
neg ecx
add ecx, [esp+16] ; ecx = (length of string < count)
; ? length of string : count
mov eax, [esp+12] ; eax = character
rep stosb
pop edi
zero:
mov eax, edx ; eax = address of string
ret
strnset endp
strpbrk proc public ; char *strpbrk(char const *string, char const *delimiter)
mov eax, [esp+4] ; eax = address of string
mov edx, [esp+8] ; edx = address of delimiter
xor ecx, ecx ; ecx = 0
cmp cl, [edx]
je short empty ; delimiter[0] = '\0'?
push ecx
push ecx
push ecx
push ecx
push ecx
push ecx
push ecx
push ecx ; bitmap[0..255] = 0,
; esp = address of bitmap
setup:
bts [esp], ecx ; bitmap[ecx] = 1
mov cl, [edx] ; ecx = *delimiter
inc edx ; edx = ++delimiter
cmp cl, ch
jne short setup ; ecx <> '\0'?
skip:
mov cl, [eax] ; ecx = *string
inc eax ; eax = ++string
bt [esp], ecx
jnc short skip ; bitmap[ecx] = 0 (no match)?
stop:
dec eax ; eax = --string
neg ecx
sbb ecx, ecx ; ecx = (*string = '\0') ? 0 : -1
and eax, ecx ; eax = (*string = '\0') ? 0 : address of string
add esp, 32
ret
empty:
xor eax, eax
ret
strpbrk endp
strrchr proc public ; char *strrchr(char const *string, int character)
mov edx, edi
mov edi, [esp+4] ; edi = address of string
xor eax, eax ; eax = '\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasb
not ecx ; ecx = length of string (including '\0')
dec edi ; edi = address of '\0'
; = end of string
mov eax, [esp+8] ; eax = character
std
repne scasb
cld
lea eax, [edi+1] ; eax = address of character
je short found ; character found (also when it is the first character)?
xor eax, eax ; eax = 0
found:
mov edi, edx
ret
strrchr endp
strrev proc public ; char *strrev(char *string)
mov edx, edi
mov edi, [esp+4] ; edi = address of string
xor eax, eax ; eax = '\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasb
add ecx, edi ; ecx = address of string - 1
dec edi ; edi = address of '\0'
; = end of string
jmp short continue
reverse:
mov al, [ecx]
mov ah, [edi]
mov [ecx], ah
mov [edi], al
continue:
inc ecx
dec edi
cmp edi, ecx
ja short reverse
mov edi, edx
mov eax, [esp+4] ; eax = address of string
ret
strrev endp
strset proc public ; char *strset(char *string, int character)
mov edx, edi
mov edi, [esp+4] ; edi = address of string
xor eax, eax ; eax = '\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasb
not ecx ; ecx = length of string (including '\0')
sub edi, ecx ; edi = address of string
dec ecx ; ecx = length of string
mov eax, [esp+8] ; eax = character
rep stosb
mov edi, edx
mov eax, [esp+4] ; eax = address of string
ret
strset endp
strspn proc public ; size_t strspn(char const *string, char const *delimiter)
mov eax, [esp+4] ; eax = address of string
mov edx, [esp+8] ; edx = address of delimiter
xor ecx, ecx ; ecx = 0
cmp cl, [edx]
je short empty ; delimiter[0] = '\0'?
push ecx
push ecx
push ecx
push ecx
push ecx
push ecx
push ecx
push ecx ; bitmap[0..255] = 0,
; esp = address of bitmap
mov cl, [edx] ; ecx = *delimiter
inc edx ; edx = ++delimiter
setup:
bts [esp], ecx ; bitmap[ecx] = 1
mov cl, [edx] ; ecx = *delimiter
inc edx ; edx = ++delimiter
cmp cl, ch
jne short setup ; ecx <> '\0'?
mov edx, eax ; edx = address of string
skip:
mov cl, [eax] ; ecx = *string
inc eax ; eax = ++string
bt [esp], ecx
jc short skip ; bitmap[ecx] = 1 (match)?
if 1
dec eax ; eax = --string
sub eax, edx
else
stc
sbb eax, edx ; eax = number of matching characters
endif
add esp, 32
ret
empty:
xor eax, eax
ret
strspn endp
strstr proc public ; char *strstr(char const *haystack, char const *needle)
push edi
mov edi, [esp+12] ; edi = address of needle
xor eax, eax ; eax = '\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasb
not ecx ; ecx = length of needle (including '\0')
dec ecx ; ecx = length of needle
mov eax, [esp+8] ; eax = address of haystack
jz short empty ; length of needle = 0?
mov edx, ecx ; edx = length of needle
ifdef SIMPLE
push esi
compare:
mov esi, eax ; esi = current address in haystack
mov edi, [esp+16] ; edi = address of needle
mov ecx, edx ; ecx = length of needle
repe cmpsb
je short match ; needle in haystack?
inc eax ; eax = next address in haystack
cmp byte ptr [esi-1], 0
jne short compare ; non-matching character in haystack <> '\0'?
xor eax, eax
match:
else ; SIMPLE
mov edi, eax ; edi = address of haystack
xor eax, eax ; eax = '\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasb
not ecx ; ecx = length of haystack (including '\0')
sub edi, ecx ; edi = address of haystack
dec ecx ; ecx = length of haystack
jz short empty ; length of haystack = 0?
cmp ecx, edx
jb short empty ; length of haystack < length of needle?
push esi
push ebx
search:
mov esi, [esp+20] ; esi = address of needle
mov al, [esi] ; al = first character of needle
; edi = current address in haystack
repne scasb ; edi = next address in haystack,
; ecx = current length of haystack
jne short break ; (first character of) needle not in haystack?
dec ecx ; ecx = next length of haystack
mov al, [esi+edx-1] ; al = last character of needle
cmp al, [edi+edx-2]
jne short continue ; last character of needle not in haystack?
compare:
mov eax, edi ; eax = next address in haystack
mov ebx, ecx ; ebx = next length of haystack
if 0
dec edi ; edi = current address in haystack
; = address of matching character
; esi = address of needle
mov ecx, edx ; ecx = length of needle
else
; edi = next address in haystack
inc esi ; esi = address of needle + 1
mov ecx, edx
dec ecx ; ecx = length of needle - 1,
; ZF = (ecx = 0)
;; jz short match ; needle in haystack?
endif
repe cmpsb
je short match ; needle in haystack?
mov edi, eax ; edi = current address in haystack
mov ecx, ebx ; ecx = current length of haystack
continue:
cmp ecx, edx
jae short search ; length of haystack >= length of needle?
break:
xor eax, eax
pop ebx
pop esi
pop edi
ret
match:
dec eax ; eax = address of needle in haystack
pop ebx
endif ; SIMPLE
pop esi
empty:
pop edi
ret
strstr endp
strtok_r proc public ; char *strtok_r(char *string, char const *delimiter, char **next)
mov ecx, [esp+4] ; ecx = address of string
mov eax, [esp+8] ; eax = address of delimiter
mov edx, [esp+12] ; edx = address of address of next
test ecx, ecx
jnz short start ; address of string <> 0?
or ecx, [edx] ; ecx = address of next
jz short null ; address of next = 0 = address of string?
start:
cmp byte ptr [ecx], 0
je short null ; string[0] = '\0'?
cmp byte ptr [eax], 0
je short empty ; delimiter[0] = '\0'?
push ebx
xor ebx, ebx ; ebx = 0
push ebx
push ebx
push ebx
push ebx
push ebx
push ebx
push ebx
push ebx ; bitmap[0..255] = 0,
; esp = address of bitmap
mov bl, [eax] ; ebx = *delimiter
inc eax ; eax = ++delimiter
setup:
bts [esp], ebx ; bitmap[ebx] = 1
mov bl, [eax] ; ebx = *delimiter
inc eax ; eax = ++delimiter
cmp bl, bh
jne short setup ; ebx <> '\0'?
skip:
mov bl, [ecx] ; ebx = *string
inc ecx ; ecx = ++string
bt [esp], ebx
jc short skip ; bitmap[ebx] = 1 (ebx is a delimiter)?
cmp bl, bh
je short none ; ebx = '\0'?
mov bl, bh ; ebx = 0
bts [esp], ebx ; bitmap['\0'] = 1
mov eax, ecx
dec eax ; eax = address of token
token:
mov bl, [ecx] ; ebx = *string
inc ecx ; ecx = ++string
bt [esp], ebx
jnc short token ; bitmap[ebx] = 0 (ebx is not a delimiter)?
cmp bl, bh
je short last ; ebx = '\0'?
mov [ecx-1], bh ; string[-1] = '\0' (terminate token)
mov [edx], ecx ; *next = address of string
add esp, 32
pop ebx
ret
none:
mov eax, ebx ; eax = 0
last:
mov [edx], ebx ; *next = 0
add esp, 32
pop ebx
ret
null:
xor eax, eax ; eax = 0
mov [edx], eax ; *next = 0
ret
empty:
mov eax, ecx ; eax = address of string
xor ecx, ecx
mov [edx], ecx ; *next = 0
ret
strtok_r endp
end
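Most routines above derive a string length from the count register after repne scasb: the register starts at -1, is decremented once per character compared (including the terminating '\0'), and is then inverted. The following C model of that arithmetic is a sketch for illustration only; the function name scasb_strlen is hypothetical, and the variables are named after the registers they stand for.

```c
#include <stdint.h>

/* Hypothetical C model of the instruction sequence
       xor eax, eax / xor ecx, ecx / dec ecx / repne scasb / not ecx / dec ecx */
uint32_t scasb_strlen(char const *string)
{
    unsigned char const *edi = (unsigned char const *) string;
    unsigned char al = 0;           /* eax = '\0' */
    uint32_t ecx = UINT32_MAX;      /* ecx = -1 */
    int found;

    do {                            /* repne scasb */
        found = (*edi++ == al);     /* compare, then advance the pointer */
        ecx--;                      /* one decrement per comparison */
    } while (!found && ecx != 0);

    /* here: ecx = -1 - (length + 1) = -2 - length */
    ecx = ~ecx;                     /* not ecx: length + 1 */
    return ecx - 1;                 /* dec ecx: length */
}
```

After the scan the pointer addresses the character behind the '\0', which is why the listings subtract the count (or decrement the pointer) before reusing it.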
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as i386-str.asm in the directory where you created the object library i386.lib before, then execute the following 3 command lines to generate the object file i386-str.obj and add it to the existing object library i386.lib:
SET ML=/c /Gy /safeseh /W3 /X
ML.EXE i386-str.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-str.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: if the /Gy option to package every function in its own, separately linkable COMDAT section is not available with the version of the macro assembler ML.EXE you use, split the i386 assembler source into multiple pieces, with one function per source file.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 14.16.27023.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: i386-str.asm
Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation. All rights reserved.
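The strcmp() and strncmp() routines above turn the flags left by repe cmpsb into their result with seta al followed by sbb eax, 0, i.e. they compute (source > destination) - (source < destination) on unsigned characters. The same expression can be written in C; the function name compare_bytes is hypothetical and the loop is a plain sketch, not a translation of the listing.

```c
/* Hypothetical name; same contract as strcmp(), restricted to the
   results -1, 0 and 1. Characters are compared as unsigned values,
   matching the "characters are unsigned" note of the listing. */
int compare_bytes(char const *source, char const *destination)
{
    unsigned char const *s = (unsigned char const *) source;
    unsigned char const *d = (unsigned char const *) destination;

    while (*s == *d && *s != '\0')  /* advance while equal and not at the end */
        s++, d++;

    /* seta al:     (*s > *d) yields 1 or 0
       sbb eax, 0:  subtracts the borrow, i.e. (*s < *d) */
    return (*s > *d) - (*s < *d);
}
```

Because both relational expressions yield 0 or 1, the difference needs no branch and cannot overflow, unlike a subtraction of the two character values in a narrower type.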
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: characters are unsigned!
.code
strcat proc public ; char *strcat(char *destination, char const *source)
mov r9, rcx ; r9 = address of destination string
mov r10, rdi
ifdef VARIANT
mov rdi, rcx ; rdi = address of destination string
xor eax, eax ; rax = '\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasb
dec rdi ; rdi = address of '\0'
; = end of destination string
mov r11, rsi
mov rsi, rdi ; rsi = end of destination string
mov rdi, rdx ; rdi = address of source string
;; xor eax, eax
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasb
not rcx ; rcx = length of source string (including '\0')
mov rdi, rsi ; rdi = end of destination string
mov rsi, rdx ; rsi = address of source string
else ; VARIANT
mov rdi, rdx ; rdi = address of source string
xor eax, eax ; rax = '\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasb
not rcx ; rcx = length of source string (including '\0')
mov r11, rsi
mov rsi, rdx ; rsi = address of source string
mov rdx, rcx
mov rdi, r9 ; rdi = address of destination string
;; xor eax, eax ; rax = '\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasb
dec rdi ; rdi = address of '\0'
; = end of destination string
mov rcx, rdx ; rcx = length of source string (including '\0')
endif ; VARIANT
rep movsb
mov rax, r9 ; rax = address of destination string
mov rdi, r10
mov rsi, r11
ret
strcat endp
strchr proc public ; char *strchr(char const *string, int character)
mov r10, rdi
mov rdi, rcx ; rdi = address of string
xor eax, eax ; rax = '\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasb
not rcx ; rcx = length of string (including '\0')
mov rax, rdx ; rax = character
sub rdi, rcx ; rdi = address of string
repne scasb
lea rax, [rdi-1] ; rax = address of character
cmovne rax, rcx ; rax = (rcx = 0) ? 0 : address of character
mov rdi, r10
ret
strchr endp
strcmp proc public ; ssize_t strcmp(char const *source, char const *destination)
xor eax, eax ; rax = 0
cmp rcx, rdx
je short equal ; address of source string = address of destination string?
mov r11, rsi
mov rsi, rcx ; rsi = address of source string
mov r10, rdi
mov rdi, rdx ; rdi = address of destination string
;; xor eax, eax ; rax = '\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasb
not rcx ; rcx = length of destination string (including '\0')
mov rdi, rdx ; rdi = address of destination string
;; xor eax, eax ; rax = 0
repe cmpsb
seta al ; rax = (*source > *destination)
sbb rax, 0 ; rax = (*source > *destination)
; - (*source < *destination)
; = {1, 0, -1}
mov rdi, r10
mov rsi, r11
equal:
ret
strcmp endp
; NOTE: strcoll() is another implementation of strcmp()!
strcoll proc public ; ssize_t strcoll(char const *source, char const *destination)
sub rdx, rcx
jz short equal ; address of destination string = address of source string?
compare:
mov al, [rcx]
cmp al, [rcx+rdx]
jne short different
inc rcx
test al, al
jnz short compare ; *source <> '\0'?
equal:
xor eax, eax ; rax = 0
ret
different:
sbb rax, rax ; rax = (*source < *destination) ? -1 : 0
or rax, 1 ; rax = (*source > *destination)
; - (*source < *destination)
; = {1, -1}
ret
strcoll endp
strcpy proc public ; char *strcpy(char *destination, char const *source)
mov r9, rcx ; r9 = address of destination string
mov r10, rdi
mov rdi, rdx ; rdi = address of source string
xor eax, eax ; rax = '\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasb
not rcx ; rcx = length of source string (including '\0')
mov rdi, r9 ; rdi = address of destination string
mov r11, rsi
mov rsi, rdx ; rsi = address of source string
rep movsb
mov rax, r9 ; rax = address of destination string
mov rdi, r10
mov rsi, r11
ret
strcpy endp
strcspn proc public ; size_t strcspn(char const *string, char const *delimiter)
xor eax, eax ; rax = 0
cmp al, [rdx]
je short empty ; delimiter[0] = '\0'?
mov [rsp+32], rax
mov [rsp+24], rax
mov [rsp+16], rax
mov [rsp+8], rax ; bitmap[0..255] = 0
setup:
bts [rsp+8], rax ; bitmap[rax] = 1
mov al, [rdx] ; rax = *delimiter
inc rdx ; rdx = ++delimiter
cmp al, ah
jne short setup ; rax <> '\0'?
mov rdx, rcx ; rdx = address of string
skip:
mov al, [rcx] ; rax = *string
inc rcx ; rcx = ++string
bt [rsp+8], rax
jnc short skip ; bitmap[rax] = 0 (no match)?
stop:
sbb rcx, rdx ; rcx = number of non-matching characters
mov rax, rcx
ret
empty:
mov rdx, rcx ; rdx = address of string
count:
cmp al, [rcx]
lea rcx, [rcx+1] ; rcx = ++string
jne short count ; *string <> '\0'?
stc
sbb rcx, rdx ; rcx = number of characters
mov rax, rcx
ret
strcspn endp
strlen proc public ; size_t strlen(char const *string)
mov rdx, rdi
mov rdi, rcx ; rdi = address of string
xor eax, eax ; rax = '\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasb
if 0
not rcx ; rcx = length of string (including '\0')
dec rcx
mov rax, rcx ; rax = length of string
else
mov rax, -2
sub rax, rcx ; rax = length of string
endif
mov rdi, rdx
ret
strlen endp
strncat proc private ; char *strncat(char *destination, char const *source, size_t count)
ud2
strncat endp
strncmp proc private ; int strncmp(char const *source, char const *destination, size_t count)
ud2
strncmp endp
strncpy proc private ; char *strncpy(char *destination, char const *source, size_t count)
ud2
strncpy endp
strnlen proc private ; size_t strnlen(char const *string, size_t count)
ud2
strnlen endp
strnset proc private ; char *strnset(char *string, int character, size_t count)
ud2
strnset endp
strpbrk proc public ; char *strpbrk(char const *string, char const *delimiter)
xor eax, eax ; rax = 0
cmp al, [rdx]
je short empty ; delimiter[0] = '\0'?
mov [rsp+32], rax
mov [rsp+24], rax
mov [rsp+16], rax
mov [rsp+8], rax ; bitmap[0..255] = 0
setup:
bts [rsp+8], rax ; bitmap[rax] = 1
mov al, [rdx] ; rax = *delimiter
inc rdx ; rdx = ++delimiter
cmp al, ah
jne short setup ; rax <> '\0'?
skip:
mov al, [rcx] ; rax = *string
inc rcx ; rcx = ++string
bt [rsp+8], rax
jnc short skip ; bitmap[rax] = 0 (no match)?
stop:
dec rcx ; rcx = --string
neg eax
sbb rax, rax ; rax = (*string = '\0') ? 0 : -1
and rax, rcx ; rax = (*string = '\0') ? 0 : address of string
empty:
ret
strpbrk endp
strrchr proc public ; char *strrchr(char const *string, int character)
mov r10, rdi
mov rdi, rcx ; rdi = address of string
xor eax, eax ; rax = '\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasb
not rcx ; rcx = length of string (including '\0')
mov rax, rdx ; rax = character
dec rdi ; rdi = address of '\0'
; = end of string
std
repne scasb
cld
lea rax, [rdi+1] ; rax = address of character
cmovne rax, rcx ; rax = (rcx = 0) ? 0 : address of character
mov rdi, r10
ret
strrchr endp
strrev proc private ; char *strrev(char *string)
ud2
strrev endp
strset proc public ; char *strset(char *string, int character)
mov r9, rcx ; r9 = address of string
mov r10, rdi
mov rdi, rcx ; rdi = address of string
xor eax, eax ; rax = '\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasb
not rcx ; rcx = length of string (including '\0')
dec rcx
mov rdi, r9 ; rdi = address of string
mov rax, rdx ; rax = character
rep stosb
mov rax, r9 ; rax = address of string
mov rdi, r10
ret
strset endp
strspn proc public ; size_t strspn(char const *string, char const *delimiter)
xor eax, eax ; rax = 0
cmp al, [rdx]
je short empty ; delimiter[0] = '\0'?
mov [rsp+32], rax
mov [rsp+24], rax
mov [rsp+16], rax
mov [rsp+8], rax ; bitmap[0..255] = 0
mov al, [rdx] ; rax = *delimiter
inc rdx ; rdx = ++delimiter
setup:
bts [rsp+8], rax ; bitmap[rax] = 1
mov al, [rdx] ; rax = *delimiter
inc rdx ; rdx = ++delimiter
cmp al, ah
jne short setup ; rax <> '\0'?
mov rdx, rcx ; rdx = address of string
skip:
mov al, [rcx] ; rax = *string
inc rcx ; rcx = ++string
bt [rsp+8], rax
jc short skip ; bitmap[rax] = 1 (match)?
if 0
dec rcx ; rcx = --string
sub rcx, rdx
else
stc
sbb rcx, rdx ; rcx = number of matching characters
endif
mov rax, rcx
empty:
ret
strspn endp
strstr proc public ; char *strstr(char const *haystack, char const *needle)
mov r8, rcx ; r8 = address of haystack
mov r10, rdi
mov rdi, rdx ; rdi = address of needle
xor eax, eax ; rax = '\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasb
not rcx ; rcx = length of needle (including '\0')
dec rcx ; rcx = length of needle
mov rax, r8 ; rax = address of haystack
jz short empty ; length of needle = 0?
mov r9, rcx ; r9 = length of needle
mov rdi, r8 ; rdi = address of haystack
xor eax, eax ; rax = '\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasb
not rcx ; rcx = length of haystack (including '\0')
sub rdi, rcx ; rdi = address of haystack
dec rcx ; rcx = length of haystack
jz short empty ; length of haystack = 0?
cmp rcx, r9
jb short empty ; length of haystack < length of needle?
mov r11, rsi
search:
mov al, [rdx] ; al = first character of needle
; rdi = current address in haystack
repne scasb ; rdi = next address in haystack,
; rcx = current length of haystack
jne short break ; (first character of) needle not in haystack?
dec rcx ; rcx = next length of haystack
mov al, [rdx+r9-1] ; al = last character of needle
cmp al, [rdi+r9-2]
jne short continue ; last character of needle not in haystack?
compare:
mov rax, rdi ; rax = next address in haystack
mov r8, rcx ; r8 = next length of haystack
if 0
dec rdi ; rdi = current address in haystack
; = address of matching character
mov rsi, rdx ; rsi = address of needle
mov rcx, r9 ; rcx = length of needle
else
; rdi = next address in haystack
mov rsi, rdx
inc rsi ; rsi = address of needle + 1
mov rcx, r9
dec rcx ; rcx = length of needle - 1,
; ZF = (rcx = 0)
;; jz short match ; needle in haystack?
endif
repe cmpsb
je short match ; needle in haystack?
mov rdi, rax ; rdi = current address in haystack
mov rcx, r8 ; rcx = current length of haystack
continue:
cmp rcx, r9
jae short search ; length of haystack >= length of needle?
break:
xor eax, eax
mov rsi, r11
empty:
mov rdi, r10 ; restore rdi, which is modified before every jump to this label
ret
match:
dec rax ; rax = address of needle in haystack
mov rdi, r10
mov rsi, r11
ret
strstr endp
strtok_r proc public ; char *strtok_r(char *string, char const *delimiter, char **next)
xor eax, eax ; rax = 0
test rcx, rcx
jnz short start ; string <> 0?
or rcx, [r8] ; rcx = next
jz short null ; address of next = 0 = address of string?
start:
cmp al, [rcx]
je short null ; string[0] = '\0'?
cmp al, [rdx]
je short empty ; *delimiter = '\0'?
mov [rsp+32], rax
mov [rsp+24], rax
mov [rsp+16], rax
mov [rsp+8], rax ; bitmap[0..255] = 0
mov al, [rdx] ; rax = *delimiter
inc rdx ; rdx = ++delimiter
setup:
bts [rsp+8], rax ; bitmap[rax] = 1
mov al, [rdx] ; rax = *delimiter
inc rdx ; rdx = ++delimiter
cmp al, ah
jne short setup ; rax <> '\0'?
skip:
mov al, [rcx] ; rax = *string
inc rcx ; rcx = ++string
bt [rsp+8], rax
jc short skip ; bitmap[rax] = 1 (rax is a delimiter)?
cmp al, ah
je short none ; rax = '\0'?
mov al, ah ; rax = 0
bts [rsp+8], rax ; bitmap['\0'] = 1
lea rdx, [rcx-1] ; rdx = address of token
token:
mov al, [rcx] ; rax = *string
inc rcx ; rcx = ++string
bt [rsp+8], rax
jnc short token ; bitmap[rax] = 0 (rax is not a delimiter)?
cmp al, ah
je short last ; rax = '\0'?
mov [rcx-1], ah ; string[-1] = '\0' (terminate token)
mov [r8], rcx ; *next = address of string
mov rax, rdx ; rax = address of token
ret
last:
mov [r8], rax ; *next = 0
mov rax, rdx ; rax = address of token
ret
empty:
mov [r8], rax ; *next = 0
mov rax, rcx ; rax = address of token
ret
null:
none:
mov [r8], rax ; *next = 0
ret
strtok_r endp
end
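The strstr() routine above rejects most candidate positions cheaply: it searches for the first character of the needle, then checks the last character of the needle at the matching distance, and only then compares the characters in between. A minimal C sketch of this filter follows; the function name filtered_strstr is hypothetical, and unlike the listing it compares only the characters strictly between the first and the last one.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical name; same contract as strstr(). */
char *filtered_strstr(char const *haystack, char const *needle)
{
    size_t n = strlen(needle);
    size_t h = strlen(haystack);
    size_t i;

    if (n == 0)                     /* empty needle matches at the start */
        return (char *) haystack;
    if (h < n)                      /* needle longer than haystack */
        return NULL;

    for (i = 0; i + n <= h; i++) {
        if (haystack[i] != needle[0])
            continue;               /* first character differs */
        if (haystack[i + n - 1] != needle[n - 1])
            continue;               /* last character differs */
        /* first and last character match: compare the middle part */
        if (n <= 2 || memcmp(haystack + i + 1, needle + 1, n - 2) == 0)
            return (char *) (haystack + i);
    }
    return NULL;
}
```

The worst case remains quadratic; the filter only reduces the number of full comparisons for typical inputs.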
Microsoft Macro Assembler Reference
Save the AMD64 assembler source presented above as amd64-str.asm in the directory where you created the object library amd64.lib before, then execute the following 3 command lines to generate the object file amd64-str.obj and add it to the existing object library amd64.lib:
SET ML=/c /Gy /W3 /X
ML64.EXE amd64-str.asm
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-str.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: if the /Gy option to package every function in its own, separately linkable COMDAT section is not available with the version of the macro assembler ML64.EXE you use, split the AMD64 assembler source into multiple pieces, with one function per source file.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler (x64) Version 14.16.27023.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: amd64-str.asm
Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation. All rights reserved.
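The strtok_r() routines above keep their only state in the caller-supplied next pointer, so several tokenisations can be interleaved. The following sketch shows the calling pattern with a small stand-in that follows the disabled strspn()/strcspn() variant of the C source; the name token_r is hypothetical and chosen to avoid a clash with a strtok_r() that the C library in use may already provide.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in with the contract of strtok_r(): returns the
   next token or NULL, and stores the resume position in *next. */
char *token_r(char *string, char const *delimiter, char **next)
{
    if (string == NULL)             /* continue a previous tokenisation */
        string = *next;
    if (string == NULL)             /* nothing left from the previous call */
        return NULL;

    string += strspn(string, delimiter);    /* skip leading delimiters */
    if (*string == '\0') {                  /* no characters left */
        *next = NULL;
        return NULL;
    }

    *next = string + strcspn(string, delimiter);    /* end of token */
    if (**next == '\0')             /* token ends with the string */
        *next = NULL;
    else                            /* terminate token, resume behind it */
        *(*next)++ = '\0';
    return string;
}
```

The first call passes the string to split, every further call passes NULL together with the same next pointer; the input string is modified in place.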
wcs*() Standard Functions
The wcscat(), wcschr(), wcscmp(), wcscoll(), wcscpy(), wcslen(), wcsncat(), wcsncmp(), wcsncpy(), wcsnlen(), wcsnset(), wcsrchr(), wcsrev(), wcsset() and wcsstr() functions for the i386 and the AMD64 platform follow with build instructions.
_wcsset()
Note: only wcscat(), wcscmp(), wcscpy() and wcslen() are available as intrinsic functions.
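All counts and lengths of the wcs*() functions are numbers of wide characters, not bytes. A minimal C sketch of wcslen() makes the distinction visible; the function name wide_length is hypothetical. On Windows wchar_t is 16 bits wide, which the scasw and movsw instructions of the assembler versions rely on; other platforms commonly use 32 bits, and the sketch works with either.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical name; same contract as wcslen(): returns the number of
   wide characters before the terminating L'\0'. */
size_t wide_length(wchar_t const *string)
{
    wchar_t const *end = string;

    while (*end != L'\0')           /* advance one wide character per step */
        end++;

    /* the pointer difference counts wchar_t units, not bytes */
    return (size_t) (end - string);
}
```

The size in bytes of a wide string without its terminator is therefore wide_length(string) * sizeof(wchar_t).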
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: counts and lengths are numbers of wide characters, not bytes!
.386
.model flat, C
.code
wcscat proc public ; wchar_t *wcscat(wchar_t *destination, wchar_t const *source)
mov edx, edi
mov edi, [esp+8] ; edi = address of source string
xor eax, eax ; eax = L'\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasw
not ecx ; ecx = length of source string (including L'\0')
push ecx
mov edi, [esp+8] ; edi = address of destination string
;; xor eax, eax ; eax = L'\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasw
dec edi
dec edi ; edi = address of L'\0'
; = end of destination string
mov eax, esi
mov esi, [esp+12] ; esi = address of source string
pop ecx ; ecx = length of source string (including L'\0')
rep movsw
mov edi, edx
mov esi, eax
mov eax, [esp+4] ; eax = address of destination string
ret
wcscat endp
wcschr proc public ; wchar_t *wcschr(wchar_t const *string, wchar_t character)
mov edx, edi
mov edi, [esp+4] ; edi = address of string
xor eax, eax ; eax = L'\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasw
not ecx ; ecx = length of string (including L'\0')
sub edi, ecx
sub edi, ecx ; edi = address of string
mov eax, [esp+8] ; eax = wide character
repne scasw ; ZF = (wide character found)
lea eax, [edi-2] ; eax = address of wide character
je short found ; wide character found (also when it is the terminating L'\0')?
xor eax, eax ; eax = 0
found:
mov edi, edx
ret
wcschr endp
wcscmp proc public ; int wcscmp(wchar_t const *source, wchar_t const *destination)
push esi
push edi
xor eax, eax ; eax = 0
mov esi, [esp+12] ; esi = address of source string
mov edi, [esp+16] ; edi = address of destination string
cmp edi, esi
je short equal ; address of destination string = address of source string?
;; xor eax, eax ; eax = L'\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasw
not ecx ; ecx = length of destination string (including L'\0')
sub edi, ecx
sub edi, ecx ; edi = address of destination string
;; xor eax, eax ; eax = 0
repe cmpsw
seta al ; eax = (*source > *destination)
sbb eax, 0 ; eax = (*source > *destination)
; - (*source < *destination)
; = {1, 0, -1}
equal:
pop edi
pop esi
ret
wcscmp endp
; NOTE: wcscoll() is another implementation of wcscmp()!
wcscoll proc public ; int wcscoll(wchar_t const *source, wchar_t const *destination)
mov ecx, [esp+4] ; ecx = address of source string
mov edx, [esp+8] ; edx = address of destination string
sub edx, ecx
jz short equal ; address of destination string = address of source string?
compare:
mov ax, [ecx]
cmp ax, [ecx+edx]
jne short different
inc ecx
inc ecx
test ax, ax
jnz short compare ; *source <> L'\0'?
equal:
xor eax, eax ; eax = 0
ret
different:
sbb eax, eax ; eax = (*source < *destination) ? -1 : 0
or eax, 1 ; eax = (*source > *destination)
; - (*source < *destination)
; = {1, -1}
ret
wcscoll endp
wcscpy proc public ; wchar_t *wcscpy(wchar_t *destination, wchar_t const *source)
mov edx, edi
mov edi, [esp+8] ; edi = address of source string
xor eax, eax ; eax = L'\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasw
not ecx ; ecx = length of source string (including L'\0')
mov eax, esi
mov esi, [esp+8] ; esi = address of source string
mov edi, [esp+4] ; edi = address of destination string
rep movsw
mov edi, edx
mov esi, eax
mov eax, [esp+4] ; eax = address of destination string
ret
wcscpy endp
wcslen proc public ; size_t wcslen(wchar_t const *string)
mov edx, edi
mov edi, [esp+4] ; edi = address of string
xor eax, eax ; eax = L'\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasw ; ecx = -1 - (address of L'\0' + 2 - address of string)
; = -1 - (length of string + 1)
; = -2 - length of string
if 0
mov eax, -2
sub eax, ecx ; eax = -2 + 2 + length of string
; = length of string
else
mov eax, ecx ; eax = -1 - (length of string + 1)
not eax ; eax = length of string + 1
dec eax ; eax = length of string
endif
mov edi, edx
ret
wcslen endp
wcsncat proc public ; wchar_t *wcsncat(wchar_t *destination, wchar_t const *source, size_t count)
push esi
push edi
mov esi, [esp+16] ; esi = address of source string
mov edx, [esp+20] ; edx = count
mov edi, esi ; edi = address of source string
mov ecx, edx ; ecx = count
xor eax, eax ; eax = L'\0'
repne scasw
sub edx, ecx ; edx = length of source string (including L'\0')
mov edi, [esp+12] ; edi = address of destination string
;; xor eax, eax ; eax = L'\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasw
dec edi
dec edi ; edi = address of L'\0'
; = end of destination string
mov ecx, edx ; ecx = length of source string (including L'\0')
rep movsw
;; xor eax, eax ; eax = L'\0'
stosw
mov eax, [esp+12] ; eax = address of destination string
pop edi
pop esi
ret
wcsncat endp
wcsncmp proc public ; int wcsncmp(wchar_t const *source, wchar_t const *destination, size_t count)
push esi
push edi
xor eax, eax ; eax = 0
mov esi, [esp+12] ; esi = address of source string
mov edi, [esp+16] ; edi = address of destination string
cmp edi, esi
je short equal ; address of destination string = address of source string?
mov ecx, [esp+20] ; ecx = count
test ecx, ecx
jz short equal ; count = 0?
;; xor eax, eax ; eax = 0,
;; ; CF = 0,
; ZF = 1 (required when count is 0)
repe cmpsw
seta al ; eax = (*source > *destination)
sbb eax, 0 ; eax = (*source > *destination)
; - (*source < *destination)
; = {1, 0, -1}
equal:
pop edi
pop esi
ret
wcsncmp endp
wcsncpy proc public ; wchar_t *wcsncpy(wchar_t *destination, wchar_t const *source, size_t count)
push esi
push edi
mov esi, [esp+16] ; esi = address of source string
mov edx, [esp+20] ; edx = count
mov edi, esi ; edi = address of source string
mov ecx, edx ; ecx = count
xor eax, eax ; eax = L'\0'
repne scasw
sub ecx, edx
neg ecx ; ecx = length of source string (including L'\0')
sub edx, ecx ; edx = count - length of source string (including L'\0')
mov edi, [esp+12] ; edi = address of destination string
rep movsw
mov ecx, edx ; ecx = count - length of source string (including L'\0')
;; xor eax, eax ; eax = L'\0'
rep stosw
mov eax, [esp+12] ; eax = address of destination string
pop edi
pop esi
ret
wcsncpy endp
wcsnlen proc public ; size_t wcsnlen(wchar_t const *string, size_t count)
mov ecx, [esp+8] ; ecx = count
test ecx, ecx
jz short empty ; count = 0?
xor eax, eax ; eax = L'\0'
mov edx, edi
mov edi, [esp+4] ; edi = address of string
;; test edi, edi ; ZF = 0 (required when count is 0)
repne scasw ; ecx = (length of string < count)
; ? count - (length of string + 1) : 0
neg ecx ; CF = (ecx <> 0)
; = ([edi] = L'\0')
; = (length of string < count),
; ecx = (length of string < count)
; ? length of string + 1 - count : 0
sbb ecx, eax ; ecx = (length of string < count)
; ? length of string - count : 0
add ecx, [esp+8] ; ecx = (length of string < count)
; ? length of string : count
mov edi, edx
empty:
mov eax, ecx ; eax = (length of string < count)
; ? length of string : count
ret
wcsnlen endp
wcsnset proc public ; wchar_t *wcsnset(wchar_t *string, wchar_t character, size_t count)
mov edx, [esp+4] ; edx = address of string
mov ecx, [esp+12] ; ecx = count
test ecx, ecx
jz short zero ; count = 0?
xor eax, eax ; eax = L'\0'
push edi
mov edi, edx ; edi = address of string
;; test edi, edi ; ZF = 0 (required when count is 0)
repne scasw ; ecx = (length of string < count)
; ? count - (length of string + 1) : 0
mov edi, edx ; edi = address of string
neg ecx ; CF = (ecx <> 0)
; = ([edi] = L'\0')
; = (length of string < count),
; ecx = (length of string < count)
; ? length of string + 1 - count : 0
sbb ecx, eax ; ecx = (length of string < count)
; ? length of string - count : 0
add ecx, [esp+16] ; ecx = (length of string < count)
; ? length of string : count
mov eax, [esp+12] ; eax = wide character
rep stosw
pop edi
zero:
mov eax, edx ; eax = address of string
ret
wcsnset endp
wcsrchr proc public ; wchar_t *wcsrchr(wchar_t const *string, wchar_t character)
mov edx, edi
mov edi, [esp+4] ; edi = address of string
xor eax, eax ; eax = L'\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasw
not ecx ; ecx = length of string (including L'\0')
dec edi
dec edi ; edi = address of L'\0'
; = end of string
mov eax, [esp+8] ; eax = wide character
std
repne scasw
cld
inc edi
inc edi ; edi = address of wide character
neg ecx ; CF = (ecx <> 0)
; = ([edi] = wide character)
sbb eax, eax ; eax = (ecx = 0) ? 0 : -1
and eax, edi ; eax = (ecx = 0) ? 0 : address of wide character
mov edi, edx
ret
wcsrchr endp
wcsrev proc private ; wchar_t *wcsrev(wchar_t *string)
ud2
wcsrev endp
wcsset proc public ; wchar_t *wcsset(wchar_t *string, wchar_t character)
mov edx, edi
mov edi, [esp+4] ; edi = address of string
xor eax, eax ; eax = L'\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasw
not ecx ; ecx = length of string (including L'\0')
sub edi, ecx
sub edi, ecx ; edi = address of string
dec ecx ; ecx = length of string
mov eax, [esp+8] ; eax = wide character
rep stosw
mov edi, edx
mov eax, [esp+4] ; eax = address of string
ret
wcsset endp
wcsstr proc public ; wchar_t *wcsstr(wchar_t const *haystack, wchar_t const *needle)
push edi
mov edi, [esp+12] ; edi = address of needle
xor eax, eax ; eax = L'\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasw
not ecx ; ecx = length of needle (including L'\0')
dec ecx ; ecx = length of needle
mov eax, [esp+8] ; eax = address of haystack
jz short empty ; length of needle = 0?
mov edx, ecx ; edx = length of needle
ifdef SIMPLE
push esi
compare:
mov esi, eax ; esi = current address in haystack
mov edi, [esp+16] ; edi = address of needle
mov ecx, edx ; ecx = length of needle
repe cmpsw
je short match ; needle in haystack?
inc eax
inc eax ; eax = next address in haystack
cmp word ptr [esi-2], 0
jne short compare ; non-matching wide character in haystack <> L'\0'?
xor eax, eax
match:
else ; SIMPLE
mov edi, eax ; edi = address of haystack
xor eax, eax ; eax = L'\0'
xor ecx, ecx
dec ecx ; ecx = -1
repne scasw
not ecx ; ecx = length of haystack (including L'\0')
sub edi, ecx
sub edi, ecx ; edi = address of haystack
dec ecx ; ecx = length of haystack
jz short empty ; length of haystack = 0?
cmp ecx, edx
jb short empty ; length of haystack < length of needle?
push esi
push ebx
search:
mov esi, [esp+20] ; esi = address of needle
mov ax, [esi] ; ax = first wide character of needle
; edi = current address in haystack
repne scasw ; edi = next address in haystack,
; ecx = current length of haystack
jne short break ; (first wide character of) needle not in haystack?
dec ecx ; ecx = next length of haystack
mov ax, [esi+edx*2-2]
; ax = last wide character of needle
cmp ax, [edi+edx*2-4]
jne short continue ; last wide character of needle not in haystack?
compare:
mov eax, edi ; eax = next address in haystack
mov ebx, ecx ; ebx = next length of haystack
if 0
dec edi
dec edi ; edi = current address in haystack
; = address of matching wide character
; esi = address of needle
mov ecx, edx ; ecx = length of needle
else
; edi = next address in haystack
inc esi
inc esi ; esi = address of needle + 2
mov ecx, edx
dec ecx ; ecx = length of needle - 1,
; ZF = (ecx = 0)
;; jz short match ; needle in haystack?
endif
repe cmpsw
je short match ; needle in haystack?
mov edi, eax ; edi = current address in haystack
mov ecx, ebx ; ecx = current length of haystack
continue:
cmp ecx, edx
jae short search ; length of haystack >= length of needle?
break:
xor eax, eax
pop ebx
pop esi
pop edi
ret
match:
dec eax
dec eax ; eax = address of needle in haystack
pop ebx
endif ; SIMPLE
pop esi
empty:
pop edi
ret
wcsstr endp
end
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as
i386-wcs.asm
in the directory where you created the
object library i386.lib
before, then execute the
following 3 command lines to generate the object file
i386-wcs.obj
and add it to the existing object library
i386.lib
:
SET ML=/c /Gy /safeseh /W3 /X
ML.EXE i386-wcs.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-wcs.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: if the
/Gy
option
to package every function in its own, separately linkable
COMDAT
section is not available with the version of the macro assembler
ML.EXE
you use,
split the i386 assembler source into multiple pieces,
with one function per source file.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler Version 14.16.27023.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: i386-wcs.asm
Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation. All rights reserved.
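Note: the search strategy of the non-SIMPLE variant of wcsstr() presented above can be modelled in C, for example to cross-check the results of the assembler routine. The following is a minimal sketch, not the author's implementation; the name wcsstr_model() is hypothetical. Like the assembler code, it looks for the first wide character of the needle, checks the last wide character, and only then compares the remaining wide characters.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical reference model of the non-SIMPLE wcsstr() strategy:
   find the first wide character of the needle, check the last one,
   then compare the remaining wide characters. */
wchar_t *wcsstr_model(const wchar_t *haystack, const wchar_t *needle)
{
    assert(haystack != NULL && needle != NULL);

    size_t n = wcslen(needle);          /* length of needle */
    if (n == 0)
        return (wchar_t *) haystack;    /* empty needle matches at the start */

    size_t h = wcslen(haystack);        /* length of haystack */
    for (size_t i = 0; i + n <= h; i++) {
        if (haystack[i] != needle[0])
            continue;                   /* first wide character differs */
        if (haystack[i + n - 1] != needle[n - 1])
            continue;                   /* last wide character differs */
        if (wmemcmp(haystack + i + 1, needle + 1, n - 1) == 0)
            return (wchar_t *) (haystack + i);
    }
    return NULL;                        /* needle not in haystack */
}
```

Called as wcsstr_model(L"hello world", L"world"), the sketch returns the address of the 7th wide character of the haystack; for a needle that does not occur it returns NULL.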
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: counts and lengths are numbers of wide characters, not bytes!
.code
wcscat proc public ; wchar_t *wcscat(wchar_t *destination, wchar_t const *source)
mov r9, rcx ; r9 = address of destination string
mov r10, rdi
ifdef VARIANT
mov rdi, rcx ; rdi = address of destination string
xor eax, eax ; rax = L'\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasw
dec rdi
dec rdi ; rdi = address of L'\0'
; = end of destination string
mov r11, rsi
mov rsi, rdi ; rsi = end of destination string
mov rdi, rdx ; rdi = address of source string
;; xor eax, eax
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasw
not rcx ; rcx = length of source string (including L'\0')
mov rdi, rsi ; rdi = end of destination string
mov rsi, rdx ; rsi = address of source string
else ; VARIANT
mov rdi, rdx ; rdi = address of source string
xor eax, eax ; rax = L'\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasw
not rcx ; rcx = length of source string (including L'\0')
mov r11, rsi
mov rsi, rdx ; rsi = address of source string
mov rdx, rcx
mov rdi, r9 ; rdi = address of destination string
;; xor eax, eax ; rax = L'\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasw
dec rdi
dec rdi ; rdi = address of L'\0'
; = end of destination string
mov rcx, rdx ; rcx = length of source string (including L'\0')
endif ; VARIANT
rep movsw
mov rax, r9 ; rax = address of destination string
mov rdi, r10
mov rsi, r11
ret
wcscat endp
wcschr proc public ; wchar_t *wcschr(wchar_t const *string, wchar_t character)
mov r10, rdi
mov rdi, rcx ; rdi = address of string
xor eax, eax ; rax = L'\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasw
not rcx ; rcx = length of string (including L'\0')
mov rax, rdx ; rax = wide character
sub rdi, rcx
sub rdi, rcx ; rdi = address of string
repne scasw
lea rax, [rdi-2] ; rax = address of wide character
cmovne rax, rcx ; rax = (rcx = 0) ? 0 : address of wide character
mov rdi, r10
ret
wcschr endp
wcscmp proc public ; int wcscmp(wchar_t const *source, wchar_t const *destination)
xor eax, eax ; rax = 0
cmp rcx, rdx
je short equal ; address of source string = address of destination string?
mov r11, rsi
mov rsi, rcx ; rsi = address of source string
mov r10, rdi
mov rdi, rdx ; rdi = address of destination string
;; xor eax, eax ; rax = L'\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasw
not rcx ; rcx = length of destination string (including L'\0')
mov rdi, rdx ; rdi = address of destination string
;; xor eax, eax ; rax = 0
repe cmpsw
seta al ; rax = (*source > *destination)
sbb rax, 0 ; rax = (*source > *destination)
; - (*source < *destination)
; = {1, 0, -1}
mov rdi, r10
mov rsi, r11
equal:
ret
wcscmp endp
; NOTE: wcscoll() is another implementation of wcscmp()!
wcscoll proc public ; int wcscoll(wchar_t const *source, wchar_t const *destination)
sub rdx, rcx
jz short equal ; address of destination string = address of source string?
compare:
mov ax, [rcx]
cmp ax, [rcx+rdx]
jne short different
lea rcx, [rcx+2]
test ax, ax
jnz short compare ; *source <> L'\0'?
equal:
xor eax, eax ; rax = 0
ret
different:
sbb rax, rax ; rax = (*source < *destination) ? -1 : 0
or rax, 1 ; rax = (*source > *destination)
; - (*source < *destination)
; = {1, -1}
ret
wcscoll endp
wcscpy proc public ; wchar_t *wcscpy(wchar_t *destination, wchar_t const *source)
mov r9, rcx ; r9 = address of destination string
mov r10, rdi
mov rdi, rdx ; rdi = address of source string
xor eax, eax ; rax = L'\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasw
not rcx ; rcx = length of source string (including L'\0')
mov rdi, r9 ; rdi = address of destination string
mov r11, rsi
mov rsi, rdx ; rsi = address of source string
rep movsw
mov rax, r9 ; rax = address of destination string
mov rdi, r10
mov rsi, r11
ret
wcscpy endp
wcslen proc public ; size_t wcslen(wchar_t const *string)
mov rdx, rdi
mov rdi, rcx ; rdi = address of string
xor eax, eax ; rax = L'\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasw
not rcx ; rcx = length of string (including L'\0')
dec rcx ; rcx = length of string
mov rax, rcx ; rax = length of string
mov rdi, rdx
ret
wcslen endp
wcsncat proc private ; wchar_t *wcsncat(wchar_t *destination, wchar_t const *source, size_t count)
ud2
wcsncat endp
wcsncmp proc private ; int wcsncmp(wchar_t const *source, wchar_t const *destination, size_t count)
ud2
wcsncmp endp
wcsncpy proc private ; wchar_t *wcsncpy(wchar_t *destination, wchar_t const *source, size_t count)
ud2
wcsncpy endp
wcsnlen proc private ; size_t wcsnlen(wchar_t const *string, size_t count)
ud2
wcsnlen endp
wcsnset proc private ; wchar_t *wcsnset(wchar_t *string, wchar_t character, size_t count)
ud2
wcsnset endp
wcsrchr proc public ; wchar_t *wcsrchr(wchar_t const *string, wchar_t character)
mov r10, rdi
mov rdi, rcx ; rdi = address of string
xor eax, eax ; rax = L'\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasw
not rcx ; rcx = length of string (including L'\0')
mov rax, rdx ; rax = wide character
lea rdi, [rdi-2] ; rdi = address of L'\0'
; = end of string
std
repne scasw
cld
lea rax, [rdi+2] ; rax = address of wide character
cmovne rax, rcx ; rax = (rcx = 0) ? 0 : address of wide character
mov rdi, r10
ret
wcsrchr endp
wcsrev proc private ; wchar_t *wcsrev(wchar_t *string)
ud2
wcsrev endp
wcsset proc public ; wchar_t *wcsset(wchar_t *string, wchar_t character)
mov r9, rcx ; r9 = address of string
mov r10, rdi
mov rdi, rcx ; rdi = address of string
xor eax, eax ; rax = L'\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasw
not rcx ; rcx = length of string (including L'\0')
dec rcx
mov rdi, r9 ; rdi = address of string
mov rax, rdx ; rax = wide character
rep stosw
mov rax, r9 ; rax = address of string
mov rdi, r10
ret
wcsset endp
wcsstr proc public ; wchar_t *wcsstr(wchar_t const *haystack, wchar_t const *needle)
mov r8, rcx ; r8 = address of haystack
mov r10, rdi
mov rdi, rdx ; rdi = address of needle
xor eax, eax ; rax = L'\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasw
not rcx ; rcx = length of needle (including L'\0')
dec rcx ; rcx = length of needle
mov rax, r8 ; rax = address of haystack
jz short empty ; length of needle = 0?
mov r9, rcx ; r9 = length of needle
mov rdi, r8 ; rdi = address of haystack
xor eax, eax ; rax = L'\0'
ifdef AMD
stc
sbb rcx, rcx ; rcx = -1
else
or rcx, -1 ; rcx = -1
endif
repne scasw
not rcx ; rcx = length of haystack (including L'\0')
mov rdi, r8 ; rdi = address of haystack
dec rcx ; rcx = length of haystack
jz short empty ; length of haystack = 0?
cmp rcx, r9
jb short empty ; length of haystack < length of needle?
mov r11, rsi
search:
mov ax, [rdx] ; ax = first wide character of needle
; rdi = current address in haystack
repne scasw ; rdi = next address in haystack,
; rcx = current length of haystack
jne short break ; (first wide character of) needle not in haystack?
dec rcx ; rcx = next length of haystack
mov ax, [rdx+r9*2-2]
; ax = last wide character of needle
cmp ax, [rdi+r9*2-4]
jne short continue ; last wide character of needle not in haystack?
compare:
mov rax, rdi ; rax = next address in haystack
mov r8, rcx ; r8 = next length of haystack
if 0
lea rdi, [rdi-2] ; rdi = current address in haystack
; = address of matching character
mov rsi, rdx ; rsi = address of needle
mov rcx, r9 ; rcx = length of needle
else
; rdi = next address in haystack
lea rsi, [rdx+2] ; rsi = address of needle + 2
mov rcx, r9
dec rcx ; rcx = length of needle - 1,
; ZF = (rcx = 0)
;; jz short match ; needle in haystack?
endif
repe cmpsw
je short match ; needle in haystack?
mov rdi, rax ; rdi = current address in haystack
mov rcx, r8 ; rcx = current length of haystack
continue:
cmp rcx, r9
jae short search ; length of haystack >= length of needle?
break:
xor eax, eax
mov rdi, r10
mov rsi, r11
empty:
mov rdi, r10 ; rdi is nonvolatile, but was clobbered by the scans above
ret
match:
lea rax, [rax-2] ; rax = address of needle in haystack
mov rdi, r10
mov rsi, r11
ret
wcsstr endp
end
Microsoft Macro Assembler Reference
Save the AMD64 assembler source presented above as
amd64-wcs.asm
in the directory where you created the
object library amd64.lib
before, then execute the
following 3 command lines to generate the object file
amd64-wcs.obj
and add it to the existing object library
amd64.lib
:
SET ML=/c /Gy /W3 /X
ML64.EXE amd64-wcs.asm
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-wcs.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: if the
/Gy
option
to package every function in its own, separately linkable
COMDAT
section is not available with the version of the macro assembler
ML64.EXE
you use, split the AMD64 assembler source into
multiple pieces, with one function per source file.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler (x64) Version 14.16.27023.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: amd64-wcs.asm
Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation. All rights reserved.
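Note: the branch-free result computed by the SETA and SBB instructions at the end of wcscmp() corresponds to the C expression (a > b) - (a < b). The following is a minimal sketch of that idea, not the author's implementation; the name wcscmp_model() is hypothetical, and the wide characters are compared as unsigned values, as the assembler code does.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical model of the three-way comparison result that
   wcscmp() above derives from the flags with SETA and SBB. */
int wcscmp_model(const wchar_t *source, const wchar_t *destination)
{
    assert(source != NULL && destination != NULL);

    /* advance while both strings agree and are not terminated */
    while (*source == *destination && *source != L'\0') {
        source++;
        destination++;
    }

    /* compare the first differing wide characters as unsigned values */
    unsigned long s = (unsigned long) *source;
    unsigned long d = (unsigned long) *destination;

    return (s > d) - (s < d);           /* 1, 0 or -1 */
}
```

The expression yields 1 when the source string is greater, -1 when it is less, and 0 when both strings are equal, without any conditional branch in the final step.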
Thread Local Storage (TLS) is the method by which each thread in a given multithreaded process can allocate locations in which to store thread-specific data. Dynamically bound (run-time) thread-specific data is supported by way of the TLS API (TlsAlloc, TlsGetValue, TlsSetValue, TlsFree). Win32 and the Microsoft C++ compiler now support statically bound (load-time) per-thread data in addition to the existing API implementation.
__declspec
Rules and Limitations for TLS
Under the heading
[…]
Visual C also provides a Microsoft-specific attribute, thread, as extended storage class modifier. Use the __declspec keyword to declare a thread variable. For example, the following code declares an integer thread local variable and initializes it with a value:
__declspec( thread ) int tls_i = 1;
[…]
- On Windows operating systems before Windows Vista, __declspec( thread ) has some limitations. If a DLL declares any data or object as __declspec( thread ), it can cause a protection fault if dynamically loaded. After the DLL is loaded with LoadLibrary, it causes system failure whenever the code references the __declspec( thread ) data. Because the global variable space for a thread is allocated at run time, the size of this space is based on a calculation of the requirements of the application plus the requirements of all the DLLs that are statically linked. When you use LoadLibrary, you can't extend this space to allow for the thread local variables declared with __declspec( thread ). Use the TLS APIs, such as TlsAlloc, in your DLL to allocate TLS if the DLL might be loaded with LoadLibrary.
The .tls section, the specification of the PE Format states:
The .tls section provides direct PE and COFF support for static thread local storage (TLS). […] a static TLS variable can be defined as follows, without using the Windows API:
__declspec (thread) int tlsFlag = 1;
To support this programming construct, the PE and COFF .tls section specifies the following information: initialization data, callback routines for per-thread initialization and termination, and the TLS index, which are explained in the following discussion.
Note
Statically declared TLS data objects can be used only in statically loaded image files. This fact makes it unreliable to use static TLS data in a DLL unless you know that the DLL, or anything statically linked with it, will never be loaded dynamically with the LoadLibrary API function.
Executable code accesses a static TLS data object through the following steps:
At link time, the linker sets the Address of Index field of the TLS directory. This field points to a location where the program expects to receive the TLS index.
The Microsoft run-time library facilitates this process by defining a memory image of the TLS directory and giving it the special name "__tls_used" (Intel x86 platforms) or "_tls_used" (other platforms). The linker looks for this memory image and uses the data there to create the TLS directory. Other compilers that support TLS and work with the Microsoft linker must use this same technique.
When a thread is created, the loader communicates the address of the thread's TLS array by placing the address of the thread environment block (TEB) in the FS register. A pointer to the TLS array is at the offset of 0x2C from the beginning of TEB. This behavior is Intel x86-specific.
The loader assigns the value of the TLS index to the place that was indicated by the Address of Index field.
The executable code retrieves the TLS index and also the location of the TLS array.
The code uses the TLS index and the TLS array location (multiplying the index by 4 and using it as an offset to the array) to get the address of the TLS data area for the given program and module. Each thread has its own TLS data area, but this is transparent to the program, which does not need to know how data is allocated for individual threads.
An individual TLS data object is accessed as some fixed offset into the TLS data area.
Ouch: even the very first (highlighted) sentence is wrong – the
IMAGE_TLS_DIRECTORY
provides the
TLS support.
Note: the .tls
section is required
only when TLS data
is initialised, it is not needed when data is just declared.
Ouch: the initial note is obsolete and wrong – Windows Vista and later versions of Windows NT support static TLS data in dynamically loaded DLLs!
Note: the multiplier 4 is of course only correct for 32-bit platforms; 64-bit platforms require the multiplier 8.
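Note: the last steps quoted above can be modelled in plain C without any Windows API. The following is a minimal sketch under stated assumptions: the name tls_address() is hypothetical, tls_array stands in for the per-thread TLS array that the TEB points to, and tls_index for the index assigned by the module loader. Indexing an array of pointers scales the index by the pointer size, i.e. by 4 on 32-bit platforms and by 8 on 64-bit platforms.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the address calculation for a static TLS
   data object: select the module's TLS data area of the current
   thread via the TLS index, then add the object's fixed offset. */
void *tls_address(void **tls_array, unsigned long tls_index, size_t offset)
{
    assert(tls_array != NULL);

    /* tls_array[tls_index] multiplies the index by sizeof(void *):
       4 on 32-bit platforms, 8 on 64-bit platforms */
    char *data_area = (char *) tls_array[tls_index];

    return data_area + offset;          /* address of the TLS data object */
}
```

With two simulated data areas stored in an array of two pointers, tls_address(array, 1, 4) yields the address 4 bytes into the second data area.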
The documentation misses the following step for the x64 alias AMD64 processor architecture, and corresponding steps for other processor architectures as well:
When a thread is created, the loader communicates the address of the thread's TLS array by placing the address of the thread environment block (TEB) in the GS register. A pointer to the TLS array is at the offset of 0x58 from the beginning of the TEB. This behavior is Intel x64-specific.
Note: despite the fixed value of this offset, the Visual C compiler references the address of the external symbol
__tls_array
on the i386
alias x86 platform.
The specification of the PE Format continues:
The TLS directory has the following format:
Offset (PE32/PE32+) | Size (PE32/PE32+) | Field | Description
0 | 4/8 | Raw Data Start VA | The starting address of the TLS template. The template is a block of data that is used to initialize TLS data. The system copies all of this data each time a thread is created, so it must not be corrupted. Note that this address is not an RVA; it is an address for which there should be a base relocation in the .reloc section.
4/8 | 4/8 | Raw Data End VA | The address of the last byte of the TLS, except for the zero fill. As with the Raw Data Start VA field, this is a VA, not an RVA.
8/16 | 4/8 | Address of Index | The location to receive the TLS index, which the loader assigns. This location is in an ordinary data section, so it can be given a symbolic name that is accessible to the program.
12/24 | 4/8 | Address of Callbacks | The pointer to an array of TLS callback functions. The array is null-terminated, so if no callback function is supported, this field points to 4 bytes set to zero. For information about the prototype for these functions, see TLS Callback Functions.
16/32 | 4 | Size of Zero Fill | The size in bytes of the template, beyond the initialized data delimited by the Raw Data Start VA and Raw Data End VA fields. The total template size should be the same as the total size of TLS data in the image file. The zero fill is the amount of data that comes after the initialized nonzero data.
20/36 | 4 | Characteristics | The four bits [23:20] describe alignment info. Possible values are those defined as IMAGE_SCN_ALIGN_*, which are also used to describe alignment of section in object files. The other 28 bits are reserved for future use.
Note: the documentation lacks the information that the Visual C compiler puts all data for the TLS template in COFF sections
.tls$‹suffix›
– which it declares
writable instead of read-only, i.e. it fails to protect the template data against corruption, an easily avoidable safety hazard!
OOPS: the
Raw Data End VA
field contains the address of the first byte after
the TLS template!
OUCH: the Size of Zero Fill
field is
not supported at all!
Note: if the size of the initialised data of the
.tls
section in the image file is less than the section
size, the module loader fills the additional uninitialised data with
zeroes, i.e. the Size of Zero Fill
field is superfluous.
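Note: the offsets and sizes the specification lists for the TLS directory can be cross-checked with a few lines of C. The following is a minimal sketch; the structure and function names are hypothetical stand-ins, not the definitions from the Windows SDK headers. The alignment decoding assumes the IMAGE_SCN_ALIGN_* encoding, where the value n in bits [23:20] denotes an alignment of 2 to the power of (n - 1) bytes, and 0 denotes no alignment information.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the PE32 layout of the TLS directory */
typedef struct {
    uint32_t raw_data_start_va;         /* offset  0 */
    uint32_t raw_data_end_va;           /* offset  4 */
    uint32_t address_of_index;          /* offset  8 */
    uint32_t address_of_callbacks;      /* offset 12 */
    uint32_t size_of_zero_fill;         /* offset 16 */
    uint32_t characteristics;           /* offset 20 */
} tls_directory32;

/* Hypothetical stand-in for the PE32+ layout of the TLS directory */
typedef struct {
    uint64_t raw_data_start_va;         /* offset  0 */
    uint64_t raw_data_end_va;           /* offset  8 */
    uint64_t address_of_index;          /* offset 16 */
    uint64_t address_of_callbacks;      /* offset 24 */
    uint32_t size_of_zero_fill;         /* offset 32 */
    uint32_t characteristics;           /* offset 36 */
} tls_directory64;

/* Decode the alignment in bytes from bits [23:20] of the
   Characteristics field; returns 0 if no alignment is encoded. */
uint32_t tls_alignment(uint32_t characteristics)
{
    uint32_t field = (characteristics >> 20) & 0xF;

    return field == 0 ? 0 : (uint32_t) 1 << (field - 1);
}
```

For example, tls_alignment(0x00500000) yields 16, matching IMAGE_SCN_ALIGN_16BYTES, and offsetof(tls_directory64, characteristics) yields 36.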
Under the heading
TLS Callback Functions
,
the specification of the
PE Format
states:
The program can provide one or more TLS callback functions […]
The prototype for a callback function (pointed to by a pointer of type PIMAGE_TLS_CALLBACK) has the same parameters as a DLL entry-point function:
typedef VOID (NTAPI *PIMAGE_TLS_CALLBACK) ( PVOID DllHandle, DWORD Reason, PVOID Reserved );
The Reserved parameter should be set to zero. The Reason parameter can take the following values:
Setting | Value | Description
DLL_PROCESS_ATTACH | 1 | A new process has started, including the first thread.
DLL_THREAD_ATTACH | 2 | A new thread has been created. This notification sent for all but the first thread.
DLL_THREAD_DETACH | 3 | A thread is about to be terminated. This notification sent for all but the first thread.
DLL_PROCESS_DETACH | 0 | A process is about to terminate, including the original thread.
mainCRTStartup()
and
_DllMainCRTStartup()
of both its components, and uses a
TLS callback
function to log the thread’s progress:
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
__declspec(thread)
DWORD dwTLS = 'MSVC'; // placed in writable ".tls$" section by the compiler
#ifndef LIBRARY
#pragma data_seg(".tls")
DWORD _tls_begin = 'JUNK'; // placed before all TLS template data by the linker
#pragma data_seg(".tls$~~~")
DWORD _tls_end = 'JUNK'; // placed after all TLS template data by the linker
#pragma data_seg()
#pragma bss_seg(".bss$T")
DWORD _tls_index; // assigned by the module loader
#pragma bss_seg()
#else
extern const DWORD _tls_index;
#endif // LIBRARY
__declspec(safebuffers)
BOOL CDECL PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
CHAR szFormat[1024];
DWORD dwFormat;
DWORD dwOutput;
va_list vaInput;
va_start(vaInput, lpFormat);
dwFormat = wvsprintf(szFormat, lpFormat, vaInput);
va_end(vaInput);
if ((dwFormat == 0UL)
|| !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
return FALSE;
return dwOutput == dwFormat;
}
const LPCSTR szReason[4] = {"process detach",
"process attach",
"thread attach",
"thread detach"};
__declspec(safebuffers)
VOID WINAPI TLSCallback(LPVOID hModule, DWORD dwReason, LPVOID lpUnused)
{
HANDLE hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
HMODULE hCaller;
DWORD dwCaller;
CHAR szCaller[MAX_PATH];
CHAR szModule[MAX_PATH];
DWORD dwModule = GetModuleFileName(hModule, szModule, sizeof(szModule));
if (hOutput == INVALID_HANDLE_VALUE)
return;
if (dwModule < sizeof(szModule))
szModule[dwModule] = '\0';
if (!GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
_ReturnAddress(),
&hCaller))
szCaller[0] = '\0';
else
{
dwCaller = GetModuleFileName(hCaller, szCaller, sizeof(szCaller));
if (dwCaller < sizeof(szCaller))
szCaller[dwCaller] = '\0';
}
PrintFormat(hOutput,
"\r\n"
__FUNCTION__ "() function @ 0x%p\r\n"
"\tCalled module @ 0x%p = %hs\r\n"
"\tCalling module @ 0x%p = %hs\r\n"
"\tReturn address @ 0x%p = 0x%p\r\n"
"\tArguments:\r\n"
"\t\tModule = 0x%p\r\n"
"\t\tReason = %lu (%hs)\r\n"
"\t\tUnused = 0x%p\r\n"
"\tThread id = %lu\r\n",
TLSCallback,
hModule, szModule,
hCaller, szCaller,
_AddressOfReturnAddress(), _ReturnAddress(),
hModule, dwReason, szReason[dwReason], lpUnused,
GetCurrentThreadId());
}
#ifndef LIBRARY
const PIMAGE_TLS_CALLBACK _tls_callbacks[] = {TLSCallback, NULL};
const IMAGE_TLS_DIRECTORY _tls_used = {&_tls_begin,
&_tls_end + 1, // first byte after _tls_end (pointer arithmetic scales by sizeof(_tls_end))
&_tls_index,
_tls_callbacks,
'VOID',
0UL};
#else
extern IMAGE_TLS_DIRECTORY _tls_used;
#pragma const_seg(".ptr$") // added to ".ptr" section by the linker
//const PIMAGE_TLS_CALLBACK _tls_callback = TLSCallback;
#pragma const_seg() // place more pointers to callback routines above here
__declspec(allocate(".ptr$")) // added to ".ptr" section by the linker
const PIMAGE_TLS_CALLBACK _tls_callback = TLSCallback;
#endif // LIBRARY
extern IMAGE_DOS_HEADER __ImageBase;
#ifdef _DLL
__declspec(dllexport)
__declspec(safebuffers)
DWORD WINAPI ThreadProc(LPVOID lpParameter)
{
HMODULE hCaller;
DWORD dwCaller;
CHAR szCaller[MAX_PATH];
CHAR szModule[MAX_PATH];
DWORD dwModule = GetModuleFileName((HMODULE) &__ImageBase, szModule, sizeof(szModule));
if (!GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
_ReturnAddress(),
&hCaller))
szCaller[0] = '\0';
else
{
dwCaller = GetModuleFileName(hCaller, szCaller, sizeof(szCaller));
if (dwCaller < sizeof(szCaller))
szCaller[dwCaller] = '\0';
}
if (dwModule < sizeof(szModule))
szModule[dwModule] = '\0';
PrintFormat(lpParameter,
"\r\n"
__FUNCTION__ "() function @ 0x%p\r\n"
"\tCalled module @ 0x%p = %hs\r\n"
"\tCalling module @ 0x%p = %hs\r\n"
"\tReturn address @ 0x%p = 0x%p\r\n"
"\tParameter = 0x%p\r\n"
"\tThread id = %lu\r\n",
ThreadProc,
&__ImageBase, szModule,
hCaller, szCaller,
_AddressOfReturnAddress(), _ReturnAddress(),
lpParameter,
GetCurrentThreadId());
return GetLastError();
}
__declspec(safebuffers)
BOOL WINAPI _DllMainCRTStartup(HMODULE hModule, DWORD dwReason, CONTEXT *lpContext)
{
DWORD dwThreadId = GetCurrentThreadId();
HANDLE hThread;
HANDLE hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
HMODULE hCaller;
DWORD dwCaller;
CHAR szCaller[MAX_PATH];
CHAR szModule[MAX_PATH];
DWORD dwModule = GetModuleFileName(hModule, szModule, sizeof(szModule));
if (hOutput == INVALID_HANDLE_VALUE)
return FALSE;
if (!GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
_ReturnAddress(),
&hCaller))
szCaller[0] = '\0';
else
{
dwCaller = GetModuleFileName(hCaller, szCaller, sizeof(szCaller));
if (dwCaller < sizeof(szCaller))
szCaller[dwCaller] = '\0';
}
if (dwModule < sizeof(szModule))
szModule[dwModule] = '\0';
PrintFormat(hOutput,
"\r\n"
__FUNCTION__ "() function @ 0x%p\r\n"
"\tCalled module @ 0x%p = %hs\r\n"
"\tCalling module @ 0x%p = %hs\r\n"
"\tReturn address @ 0x%p = 0x%p\r\n"
"\tArguments:\r\n"
"\t\tModule = 0x%p\r\n"
"\t\tReason = %lu (%hs)\r\n"
"\t\tUnused = 0x%p\r\n"
"\tThread id = %lu\r\n",
_DllMainCRTStartup,
hModule, szModule,
hCaller, szCaller,
_AddressOfReturnAddress(), _ReturnAddress(),
hModule, dwReason, szReason[dwReason], lpContext,
dwThreadId);
if (dwReason != DLL_PROCESS_ATTACH)
return FALSE;
PrintFormat(hOutput,
"\a"
"\tTLS index = %lu\r\n"
"\tTLS value = 0x%p\r\n"
"\tTLS array @ 0x%p\r\n"
"\tTLS block @ 0x%p\r\n"
"\tTLS dword @ 0x%p = \"%.4hs\"\r\n"
"\tTLS directory @ 0x%p\r\n"
"\t\tStart @ 0x%p\r\n"
"\t\tEnd @ 0x%p\r\n"
"\t\tIndex @ 0x%p\r\n"
"\t\tCallbacks @ 0x%p\r\n"
"\t\tZerofill = 0x%08lX = \"%.4hs\"\r\n"
"\t\tAlignment = 0x%08lX\r\n" + (dwTLS == 'MSVC'),
_tls_index,
TlsGetValue(_tls_index),
#ifdef _M_IX86
__readfsdword(44),
((LPVOID *) __readfsdword(44))[_tls_index],
#elif _M_AMD64
__readgsqword(88),
((LPVOID *) __readgsqword(88))[_tls_index],
#else
#error Only I386 and AMD64 supported!
#endif
&dwTLS, &dwTLS,
&_tls_used,
_tls_used.StartAddressOfRawData,
_tls_used.EndAddressOfRawData,
_tls_used.AddressOfIndex,
_tls_used.AddressOfCallBacks,
_tls_used.SizeOfZeroFill, &_tls_used.SizeOfZeroFill,
_tls_used.Characteristics);
hThread = CreateThread((LPSECURITY_ATTRIBUTES) NULL,
(SIZE_T) 65536,
ThreadProc,
hOutput,
0,
&dwThreadId);
if (hThread == NULL)
PrintFormat(hOutput,
"CreateThread() returned error %lu\r\n",
GetLastError());
else
{
PrintFormat(hOutput,
"\r\n"
"Thread %lu created and started\r\n",
dwThreadId);
if (!CloseHandle(hThread))
PrintFormat(hOutput,
"CloseHandle() returned error %lu\r\n",
GetLastError());
}
return TRUE;
}
#else // _DLL
__declspec(dllimport)
DWORD WINAPI ThreadProc(LPVOID lpParameter);
DWORD CDECL mainCRTStartup(VOID)
{
DWORD dwError = ERROR_SUCCESS;
DWORD dwThreadId = GetCurrentThreadId();
DWORD dwThread;
HANDLE hThread;
HANDLE hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
HMODULE hCaller;
DWORD dwCaller;
CHAR szCaller[MAX_PATH];
CHAR szModule[MAX_PATH];
DWORD dwModule = GetModuleFileName((HMODULE) &__ImageBase, szModule, sizeof(szModule));
if (hOutput == INVALID_HANDLE_VALUE)
return GetLastError();
if (!GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
_ReturnAddress(),
&hCaller))
szCaller[0] = '\0';
else
{
dwCaller = GetModuleFileName(hCaller, szCaller, sizeof(szCaller));
if (dwCaller < sizeof(szCaller))
szCaller[dwCaller] = '\0';
}
if (dwModule < sizeof(szModule))
szModule[dwModule] = '\0';
PrintFormat(hOutput,
"\a\r\n"
__FUNCTION__ "() function @ 0x%p\r\n"
"\tCalled module @ 0x%p = %hs\r\n"
"\tCalling module @ 0x%p = %hs\r\n"
"\tReturn address @ 0x%p = 0x%p\r\n"
"\tThread id = %lu\r\n"
"\tTLS index = %ld\r\n"
"\tTLS value = 0x%p\r\n"
"\tTLS array @ 0x%p\r\n"
"\tTLS block @ 0x%p\r\n"
"\tTLS dword @ 0x%p = \"%.4hs\"\r\n"
"\tTLS directory @ 0x%p\r\n"
"\t\tStart @ 0x%p\r\n"
"\t\tEnd @ 0x%p\r\n"
"\t\tIndex @ 0x%p\r\n"
"\t\tCallbacks @ 0x%p\r\n"
"\t\tZerofill = 0x%08lX = \"%.4hs\"\r\n"
"\t\tAlignment = 0x%08lX\r\n" + (dwTLS == 'MSVC'),
mainCRTStartup,
&__ImageBase, szModule,
hCaller, szCaller,
_AddressOfReturnAddress(), _ReturnAddress(),
dwThreadId,
_tls_index,
TlsGetValue(_tls_index),
#ifdef _M_IX86
__readfsdword(44),
((LPVOID *) __readfsdword(44))[_tls_index],
#elif _M_AMD64
__readgsqword(88),
((LPVOID *) __readgsqword(88))[_tls_index],
#else
#error Only I386 and AMD64 supported!
#endif
&dwTLS, &dwTLS,
&_tls_used,
_tls_used.StartAddressOfRawData,
_tls_used.EndAddressOfRawData,
_tls_used.AddressOfIndex,
_tls_used.AddressOfCallBacks,
_tls_used.SizeOfZeroFill, &_tls_used.SizeOfZeroFill,
_tls_used.Characteristics);
hThread = CreateThread((LPSECURITY_ATTRIBUTES) NULL,
(SIZE_T) 65536,
ThreadProc,
hOutput,
0UL,
&dwThreadId);
if (hThread == NULL)
PrintFormat(hOutput,
"CreateThread() returned error %lu\r\n",
dwError = GetLastError());
else
{
PrintFormat(hOutput,
"\r\n"
"Thread %lu created and started\r\n",
dwThreadId);
if (WaitForSingleObject(hThread, INFINITE) == WAIT_FAILED)
PrintFormat(hOutput,
"WaitForSingleObject() returned error %lu\r\n",
dwError = GetLastError());
if (!GetExitCodeThread(hThread, &dwThread))
PrintFormat(hOutput,
"GetExitCodeThread() returned error %lu\r\n",
dwError = GetLastError());
else
PrintFormat(hOutput,
"\r\n"
"Thread %lu exited with code %lu\r\n",
dwThreadId, dwThread);
if (!CloseHandle(hThread))
PrintFormat(hOutput,
"CloseHandle() returned error %lu\r\n",
GetLastError());
}
return dwError;
}
#endif // _DLL
Save the
ANSI C
source presented above as tls-demo.c
in an arbitrary,
preferably empty directory, then execute the following 6 command
lines to compile and link it a first time to generate the
DLL
tls-demo.dll
and its import library
tls-demo.lib
for the AMD64 platform, to
compile and link it a second time to generate the image file
tls-demo.exe
for the AMD64 platform too,
and finally execute the latter:
SET CL=/GAFy /Oisy /W4 /Zl
SET LINK=/NODEFAULTLIB /SECTION:.tls,!w
CL.EXE /LD /MD tls-demo.c kernel32.lib user32.lib
SET LINK=/ENTRY:mainCRTStartup /NODEFAULTLIB /SECTION:.tls,!w /SUBSYSTEM:CONSOLE
CL.EXE tls-demo.c tls-demo.lib kernel32.lib user32.lib
.\tls-demo.exe
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64 Copyright (C) Microsoft Corporation. All rights reserved. tls-demo.c tls-demo.c(108) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *' tls-demo.c(109) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *' tls-demo.c(110) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *' tls-demo.c(111) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'const PIMAGE_TLS_CALLBACK *' Microsoft (R) Incremental Linker Version 10.00.40219.386 Copyright (C) Microsoft Corporation. All rights reserved. /NODEFAULTLIB /SECTION:.tls,!w /out:tls-demo.dll /dll /implib:tls-demo.lib tls-demo.obj kernel32.lib user32.lib Creating library tls-demo.lib and object tls-demo.exp Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64 Copyright (C) Microsoft Corporation. All rights reserved. tls-demo.c tls-demo.c(108) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *' tls-demo.c(109) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *' tls-demo.c(110) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *' tls-demo.c(111) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'const PIMAGE_TLS_CALLBACK *' Microsoft (R) Incremental Linker Version 10.00.40219.386 Copyright (C) Microsoft Corporation. All rights reserved. 
/ENTRY:mainCRTStartup /NODEFAULTLIB /SECTION:.tls,!w /SUBSYSTEM:CONSOLE /out:tls-demo.exe tls-demo.obj tls-demo.lib kernel32.lib user32.lib TLSCallback() function @ 0x000007FEFACA10D0 Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x000000000017F038 = 0x0000000077845078 Arguments: Module = 0x000007FEFACA0000 Reason = 1 (process attach) Unused = 0x0000000000000000 Thread id = 7544 _DllMainCRTStartup() function @ 0x000007FEFACA1384 Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x000000000017F0A8 = 0x0000000077837C3E Arguments: Module = 0x000007FEFACA0000 Reason = 1 (process attach) Unused = 0x000000000017F830 Thread id = 7544 TLS index = 1 TLS value = 0x0000000000000000 TLS array @ 0x00000000002C3280 TLS block @ 0x00000000002EA590 TLS dword @ 0x00000000002C32D4 = "CVSM" TLS directory @ 0x000007FEFACA20E0 Start @ 0x000007FEFACA5000 End @ 0x000007FEFACA5018 Index @ 0x000007FEFACA3000 Callbacks @ 0x000007FEFACA20D0 Zerofill = 0x564F4944 = "DIOV" Alignment = 0x00000000 Thread 11820 created and started TLSCallback() function @ 0x000000013F8910D0 Called module @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x000000000017F038 = 0x0000000077845078 Arguments: Module = 0x000000013F890000 Reason = 1 (process attach) Unused = 0x0000000000000000 Thread id = 7544 mainCRTStartup() function @ 0x000000013F891258 Called module @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe Calling module @ 0x00000000776E0000 = C:\Windows\system32\kernel32.dll Return address @ 0x000000000017FC88 = 0x00000000776F556D Thread id = 7544 TLS index = 0 TLS value = 0x0000000000000000 TLS array @ 0x00000000002C3280 TLS block @ 0x00000000002C32D0 TLS dword @ 0x00000000002C32D4 = 
"CVSM" TLS directory @ 0x000000013F892100 Start @ 0x000000013F895000 End @ 0x000000013F895018 Index @ 0x000000013F893000 Callbacks @ 0x000000013F8920F0 Zerofill = 0x564F4944 = "DIOV" Alignment = 0x00000000 Thread 11888 created and started TLSCallback() function @ 0x000007FEFACA10D0 Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x000000000201F458 = 0x0000000077845078 Arguments: Module = 0x000007FEFACA0000 Reason = 2 (thread attach) Unused = 0x0000000000000000 Thread id = 11888 _DllMainCRTStartup() function @ 0x000007FEFACA1384 Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x000000000201F4C8 = 0x00000000778383CC Arguments: Module = 0x000007FEFACA0000 Reason = 2 (thread attach) Unused = 0x0000000000000000 Thread id = 11888 TLSCallback() function @ 0x000000013F8910D0 Called module @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x000000000201F458 = 0x0000000077845078 Arguments: Module = 0x000000013F890000 Reason = 2 (thread attach) Unused = 0x0000000000000000 Thread id = 11888 ThreadProc() function @ 0x000007FEFACA1258 Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x00000000776E0000 = C:\Windows\system32\kernel32.dll Return address @ 0x000000000201FAE8 = 0x00000000776F556D Parameter = 0x0000000000000070 Thread id = 11888 TLSCallback() function @ 0x000007FEFACA10D0 Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x000000000201F688 = 0x0000000077845078 Arguments: Module = 0x000007FEFACA0000 Reason = 3 (thread detach) Unused = 0x0000000000000000 Thread id = 11888 _DllMainCRTStartup() function @ 
0x000007FEFACA1384 Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x000000000201F6F8 = 0x0000000077838785 Arguments: Module = 0x000007FEFACA0000 Reason = 3 (thread detach) Unused = 0x0000000000000000 Thread id = 11888 TLSCallback() function @ 0x000000013F8910D0 Called module @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x000000000201F688 = 0x0000000077845078 Arguments: Module = 0x000000013F890000 Reason = 3 (thread detach) Unused = 0x0000000000000000 Thread id = 11888 Thread 11888 exited with code 0 TLSCallback() function @ 0x000007FEFACA10D0 Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x0000000001E0F2A8 = 0x0000000077845078 Arguments: Module = 0x000007FEFACA0000 Reason = 2 (thread attach) Unused = 0x0000000000000000 Thread id = 11820 _DllMainCRTStartup() function @ 0x000007FEFACA1384 Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x0000000001E0F318 = 0x00000000778383CC Arguments: Module = 0x000007FEFACA0000 Reason = 2 (thread attach) Unused = 0x0000000000000000 Thread id = 11820 TLSCallback() function @ 0x000000013F8910D0 Called module @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x0000000001E0F2A8 = 0x0000000077845078 Arguments: Module = 0x000000013F890000 Reason = 2 (thread attach) Unused = 0x0000000000000000 Thread id = 11820 TLSCallback() function @ 0x000007FEFACA10D0 Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 
0x000000000017F758 = 0x0000000077845078 Arguments: Module = 0x000007FEFACA0000 Reason = 3 (thread detach) Unused = 0x0000000000000000 Thread id = 7544 _DllMainCRTStartup() function @ 0x000007FEFACA1384 Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x000000000017F7C8 = 0x0000000077838785 Arguments: Module = 0x000007FEFACA0000 Reason = 3 (thread detach) Unused = 0x0000000000000000 Thread id = 7544 TLSCallback() function @ 0x000000013F8910D0 Called module @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x000000000017F758 = 0x0000000077845078 Arguments: Module = 0x000000013F890000 Reason = 3 (thread detach) Unused = 0x0000000000000000 Thread id = 7544 ThreadProc() function @ 0x000007FEFACA1258 Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x00000000776E0000 = C:\Windows\system32\kernel32.dll Return address @ 0x0000000001E0F938 = 0x00000000776F556D Parameter = 0x0000000000000070 Thread id = 11820 TLSCallback() function @ 0x000007FEFACA10D0 Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x0000000001E0F488 = 0x0000000077845078 Arguments: Module = 0x000007FEFACA0000 Reason = 0 (process detach) Unused = 0x0000000000000000 Thread id = 11820 _DllMainCRTStartup() function @ 0x000007FEFACA1384 Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x0000000001E0F4F8 = 0x000000007783775B Arguments: Module = 0x000007FEFACA0000 Reason = 0 (process detach) Unused = 0x0000000000000001 Thread id = 11820 TLSCallback() function @ 0x000000013F8910D0 Called module @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe 
Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll Return address @ 0x0000000001E0F488 = 0x0000000077845078 Arguments: Module = 0x000000013F890000 Reason = 0 (process detach) Unused = 0x0000000000000000 Thread id = 11820
The program works as documented and intended – the variable
dwTLS
is initialised with the
ASCII
text CVSM
, the TLSCallback()
function runs
on the secondary thread 11820 and the tertiary thread 11888
before its
ThreadProc()
function as well as after the latter returns, and
it runs also on the primary thread 7544 before the
entry point functions of both the
DLL and the program
as well as after the latter returns from its entry
point function.
Note: the primary thread 7544 exits before the
secondary thread 11820 here – as documented in the
MSDN article
Terminating a Process,
the program terminates with its last thread.
ExitProcess()
Note: the
MSDN article
Terminating a Thread
specifies that threads are terminated upon return of their
ThreadProc()
callback function.
ExitThread()
Now (attempt to) build this application for the i386 platform, using the same command lines as before:
SET CL=/GAFy /Oisy /W4 /Zl
SET LINK=/NODEFAULTLIB /SECTION:.tls,!w
CL.EXE /LD /MD tls-demo.c kernel32.lib user32.lib
[…]
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
tls-demo.c
tls-demo.c(106) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD *'
tls-demo.c(107) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD *'
tls-demo.c(108) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD *'
tls-demo.c(109) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'const PIMAGE_TLS_CALLBACK *'
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/NODEFAULTLIB /SECTION:.tls,!w
/out:tls-demo.dll
/dll
/implib:tls-demo.lib
tls-demo.obj
kernel32.lib
user32.lib
Creating library tls-demo.lib and object tls-demo.exp
tls-demo.obj : error LNK2019: unresolved external symbol __tls_array referenced in function __DllMainCRTStartup@12
tls-demo.dll : fatal error LNK1120: 1 unresolved externals
OUCH: due to the braindead
behaviour of the Visual C compiler for the
i386 platform, which references the symbol
__tls_array
in the generated machine code instead of
using its fixed value 44, this build fails!
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.model flat, C
public _tls_array
_tls_array equ 44 ; offset of 'ThreadLocalStoragePointer' member in TEB;
; symbol referenced in code generated by the compiler!
_tls_32 struct 4
dword offset _tls_begin
dword offset _tls_end
dword offset _tls_index
dword offset _tls_start
dword 'VOID' ; BUG: the module loader does NOT support the 'SizeOfZeroFill' member!
dword 0
_tls_32 ends
_tls_bss segment alias(".bss$T") dword read write 'BSS'
public _tls_index
_tls_index dword ? ; assigned by the module loader!
_tls_bss ends
_tls_note segment alias(".comment") discard info read 'INFO'
byte "(C)opyright 2004-2025, Stefan Kanthak"
_tls_note ends
_tls_info segment alias(".drectve") discard info read 'INFO'
byte "/MERGE:.ptr=.rdata /SECTION:.tls,!w"
_tls_info ends
_tls_start segment alias(".ptr") dword read 'CONST'
_tls_start ends
_tls_stop segment alias(".ptr$~~~") dword read 'CONST'
dword 0 ; callback function array terminator
_tls_stop ends
_tls segment alias(".rdata$T") dword read 'CONST'
public _tls_used
_tls_used _tls_32 <> ; IMAGE_TLS_DIRECTORY32
_tls ends
_tls_begin segment alias(".tls") para read write 'DATA'
_tls_begin ends
_tls_end segment alias(".tls$~~~") byte read write 'DATA'
_tls_end ends
end
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as
i386-tls.asm
in the directory where you created the
object library i386.lib
before, then execute the
following 3 command lines to generate the object file
i386-tls.obj
and add it to the existing object library
i386.lib
:
SET ML=/c /safeseh /W3 /X
ML.EXE i386-tls.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-tls.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: i386-tls.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
Move the ANSI C source file
tls-demo.c
created before into the current
directory, then execute the following 6 command lines to compile and
link it a first time with the
TLS support module
from the object library i386.lib
to generate the
DLL
tls-demo.dll
and its import library
tls-demo.lib
for the i386 platform, to
compile and link it a second time to generate the image file
tls-demo.exe
for the i386 platform too,
and finally execute the latter:
SET CL=/c /DLIBRARY /GAFy /Oisy /W4 /Zl
SET LINK=/NODEFAULTLIB
CL.EXE /LD /MD tls-demo.c i386.lib kernel32.lib user32.lib
SET LINK=/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
CL.EXE tls-demo.c tls-demo.lib i386.lib kernel32.lib user32.lib
.\tls-demo.exe
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86 Copyright (C) Microsoft Corporation. All rights reserved. tls-demo.c Microsoft (R) Incremental Linker Version 10.00.40219.386 Copyright (C) Microsoft Corporation. All rights reserved. /NODEFAULTLIB /out:tls-demo.dll /dll /implib:tls-demo.lib tls-demo.obj i386.lib kernel32.lib user32.lib Creating library tls-demo.lib and object tls-demo.exp Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86 Copyright (C) Microsoft Corporation. All rights reserved. tls-demo.c Microsoft (R) Incremental Linker Version 10.00.40219.386 Copyright (C) Microsoft Corporation. All rights reserved. /ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE /out:tls-demo.exe tls-demo.obj tls-demo.lib i386.lib kernel32.lib user32.lib TLSCallback() function @ 0x70351078 Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x004AF390 = 0x779F9280 Arguments: Module = 0x70350000 Reason = 1 (process attach) Unused = 0x00000000 Thread id = 1724 _DllMainCRTStartup() function @ 0x70351240 Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x004AF3CC = 0x779F9280 Arguments: Module = 0x70350000 Reason = 1 (process attach) Unused = 0x004AF6D0 Thread id = 1724 TLS index = 1 TLS value = 0x00000000 TLS array @ 0x007F20D0 TLS block @ 0x0080B728 TLS dword @ 0x007F4FC8 = "CVSM" TLS directory @ 0x70352468 Start @ 0x70354000 End @ 0x70354004 Index @ 0x70353000 Callbacks @ 0x70352088 Zerofill = 0x564F4944 = "DIOV" Alignment = 0x00000000 Thread 2716 created and started TLSCallback() function @ 0x00331078 Called module @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x004AF390 = 0x779F9280 Arguments: Module = 0x00330000 Reason = 1 (process attach) Unused = 
0x00000000 Thread id = 1724 TLSCallback() function @ 0x70351078 Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x007CF720 = 0x779F9280 Arguments: Module = 0x70350000 Reason = 2 (thread attach) Unused = 0x00000000 Thread id = 2716 _DllMainCRTStartup() function @ 0x70351240 Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x007CF75C = 0x779F9280 Arguments: Module = 0x70350000 Reason = 2 (thread attach) Unused = 0x00000000 Thread id = 2716 TLSCallback() function @ 0x00331078 Called module @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x007CF720 = 0x779F9280 Arguments: Module = 0x00330000 Reason = 2 (thread attach) Unused = 0x00000000 Thread id = 2716 mainCRTStartup() function @ 0x0033116B Called module @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe Calling module @ 0x774D0000 = C:\Windows\syswow64\kernel32.dll Return address @ 0x004AF938 = 0x774E343D Thread id = 1724 TLS index = 0 TLS value = 0x00000000 TLS array @ 0x007F20D0 TLS block @ 0x007F4FC8 TLS dword @ 0x007F4FC8 = "CVSM" TLS directory @ 0x00332400 Start @ 0x00334000 End @ 0x00334004 Index @ 0x00333000 Callbacks @ 0x00332098 Zerofill = 0x564F4944 = "DIOV" Alignment = 0x00000000 ThreadProc() function @ 0x7035116B Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x774D0000 = C:\Windows\syswow64\kernel32.dll Return address @ 0x007CFAF0 = 0x774E343D Parameter = 0x00000074 Thread id = 2716 TLSCallback() function @ 0x70351078 Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x007CF7B4 = 0x779F9280 Arguments: Module = 0x70350000 Reason = 3 (thread detach) Unused = 0x00000000 Thread id = 2716 _DllMainCRTStartup() 
function @ 0x70351240 Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x007CF7F0 = 0x779F9280 Arguments: Module = 0x70350000 Reason = 3 (thread detach) Unused = 0x00000000 Thread id = 2716 TLSCallback() function @ 0x00331078 Called module @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x007CF7B4 = 0x779F9280 Arguments: Module = 0x00330000 Reason = 3 (thread detach) Unused = 0x00000000 Thread id = 2716 Thread 11748 created and started TLSCallback() function @ 0x70351078 Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x021AFBA4 = 0x779F9280 Arguments: Module = 0x70350000 Reason = 2 (thread attach) Unused = 0x00000000 Thread id = 11748 _DllMainCRTStartup() function @ 0x70351240 Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x021AFBE0 = 0x779F9280 Arguments: Module = 0x70350000 Reason = 2 (thread attach) Unused = 0x00000000 Thread id = 11748 TLSCallback() function @ 0x00331078 Called module @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x021AFBA4 = 0x779F9280 Arguments: Module = 0x00330000 Reason = 2 (thread attach) Unused = 0x00000000 Thread id = 11748 ThreadProc() function @ 0x7035116B Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x774D0000 = C:\Windows\syswow64\kernel32.dll Return address @ 0x021AFF74 = 0x774E343D Parameter = 0x00000074 Thread id = 11748 TLSCallback() function @ 0x70351078 Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x021AFC38 = 0x779F9280 Arguments: Module = 0x70350000 
Reason = 3 (thread detach) Unused = 0x00000000 Thread id = 11748 _DllMainCRTStartup() function @ 0x70351240 Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x021AFC74 = 0x779F9280 Arguments: Module = 0x70350000 Reason = 3 (thread detach) Unused = 0x00000000 Thread id = 11748 TLSCallback() function @ 0x00331078 Called module @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x021AFC38 = 0x779F9280 Arguments: Module = 0x00330000 Reason = 3 (thread detach) Unused = 0x00000000 Thread id = 11748 Thread 11748 exited with code 0 TLSCallback() function @ 0x70351078 Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x004AF5CC = 0x779F9280 Arguments: Module = 0x70350000 Reason = 0 (process detach) Unused = 0x00000000 Thread id = 1724 _DllMainCRTStartup() function @ 0x70351240 Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x004AF608 = 0x779F9280 Arguments: Module = 0x70350000 Reason = 0 (process detach) Unused = 0x00000001 Thread id = 1724 TLSCallback() function @ 0x00331078 Called module @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll Return address @ 0x004AF5CC = 0x779F9280 Arguments: Module = 0x00330000 Reason = 0 (process detach) Unused = 0x00000000 Thread id = 1724
With the object module
i386-tls.obj
, program and
DLL
work as documented and intended now, exhibiting the insignificant
difference that the program terminates with the primary thread 1724
here.
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
_tls_64 struct 8
qword offset _tls_begin
qword offset _tls_end
qword offset _tls_index
qword offset _tls_start
dword 'VOID' ; BUG: the module loader does NOT support the 'SizeOfZeroFill' member!
dword 0
_tls_64 ends
_bss segment alias(".bss$T") dword read write 'BSS'
public _tls_index
_tls_index dword ? ; assigned by the module loader!
_bss ends
_note segment alias(".comment") discard info read 'INFO'
byte "(C)opyright 2004-2025, Stefan Kanthak"
_note ends
_linker segment alias(".drectve") discard info read 'INFO'
byte "/MERGE:.ptr=.rdata /SECTION:.tls,!w"
_linker ends
_start segment alias(".ptr") para read 'CONST'
_tls_start equ $
_start ends
_stop segment alias(".ptr$~~~") read 'CONST'
qword 0 ; callback function array terminator
_stop ends
_const segment alias(".rdata$T") para read 'CONST'
public _tls_used
_tls_used _tls_64 <> ; IMAGE_TLS_DIRECTORY64
_const ends
_begin segment alias(".tls") para read write 'DATA'
_tls_begin equ $
_begin ends
_end segment alias(".tls$~~~") byte read write 'DATA'
_tls_end equ $
_end ends
end
Microsoft Macro Assembler Reference
Save the AMD64 assembler source presented above as
amd64-tls.asm
in the directory where you created the
object library amd64.lib
before, then execute the
following 3 command lines to generate the object file
amd64-tls.obj
and add it to the existing object library
amd64.lib
:
SET ML=/c /W3 /X
ML64.EXE amd64-tls.asm
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-tls.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: amd64-tls.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
Move the ANSI C source file
tls-demo.c
created before into the current
directory, then execute the following 6 command lines to compile and
link it a first time with the
TLS support module
from the object library amd64.lib
to generate the
DLL
tls-demo.dll
and its import library
tls-demo.lib
for the AMD64 platform, to
compile and link it a second time to generate the image file
tls-demo.exe
for the AMD64 platform too,
and finally execute the latter:
SET CL=/c /DLIBRARY /GAFy /Oisy /W4 /Zl
SET LINK=/NODEFAULTLIB
CL.EXE /LD /MD tls-demo.c amd64.lib kernel32.lib user32.lib
SET LINK=/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
CL.EXE tls-demo.c tls-demo.lib amd64.lib kernel32.lib user32.lib
.\tls-demo.exe
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
tls-demo.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/NODEFAULTLIB
/out:tls-demo.dll
/dll
/implib:tls-demo.lib
tls-demo.obj
amd64.lib
kernel32.lib
user32.lib
 Creating library tls-demo.lib and object tls-demo.exp
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
tls-demo.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
/out:tls-demo.exe
tls-demo.obj
tls-demo.lib
amd64.lib
kernel32.lib
user32.lib
[…]
_load_config_used and __security_check_cookie() Function (/GS Support)
In the section The Load Configuration Structure (Image Only), the specification of the PE Format states:
The load configuration structure (IMAGE_LOAD_CONFIG_DIRECTORY) was formerly used in very limited cases in the Windows NT operating system itself to describe various features too difficult or too large to describe in the file header or optional header of the image. Current versions of the Microsoft linker and Windows XP and later versions of Windows use a new version of this structure for 32-bit x86-based systems that include reserved SEH technology.
[…]
The Microsoft linker automatically provides a default load configuration structure to include the reserved SEH data. If the user code already provides a load configuration structure, it must include the new reserved SEH fields. Otherwise, the linker cannot include the reserved SEH data and the image is not marked as containing reserved SEH.
OUCH¹: the statement that the linker automatically provides a default load configuration structure is wrong: LINK.EXE neither provides an IMAGE_LOAD_CONFIG_DIRECTORY structure nor reports its omission with an error message!
The documentation of the /SAFESEH linker option gives proper information:
If you link with /NODEFAULTLIB and you want a table of safe exception handlers, you need to supply a load config struct (…) that contains all the entries defined for Visual C++. For example:
#include <windows.h>
extern DWORD_PTR __security_cookie; /* /GS security cookie */
/*
 * The following two names are automatically created by the linker for any
 * image that has the safe exception table present.
 */
extern PVOID __safe_se_handler_table[]; /* base of safe handler entry table */
extern BYTE __safe_se_handler_count; /* absolute symbol whose address is the count of table entries */
const IMAGE_LOAD_CONFIG_DIRECTORY32 _load_config_used = {
    sizeof(IMAGE_LOAD_CONFIG_DIRECTORY32),
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    &__security_cookie,
    __safe_se_handler_table,
    (DWORD)(DWORD_PTR) &__safe_se_handler_count
};
The specification of the PE format continues with the following disinformation:
Load Configuration Layout
The load configuration structure has the following layout for 32-bit and 64-bit PE files:
Offset  Size  Field            Description
0       4     Characteristics  Flags that indicate attributes of the file, currently unused.
[…]
54/78   2     Reserved         Must be zero.
OUCH²: the documentation for the IMAGE_LOAD_CONFIG_DIRECTORY structure, however, states that the field at offset 0 stores the size of the structure, and that the field at offset 54 (for 32-bit images) or 78 (for 64-bit images) stores the value of the /DEPENDENTLOADFLAG linker option!
Caveat: only with the GuardFlags
member present in the
IMAGE_LOAD_CONFIG_DIRECTORY
structure, i.e. if its Size
member is at least 92 on
32-bit platforms and 148 on 64-bit platforms, the module loader
honors the
/DEPENDENTLOADFLAG
on Windows 10 1607 alias
Anniversary Update, codenamed
Redstone 1, and later versions of
Windows NT!
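The offsets 54 and 78 and the minimum sizes 92 and 148 quoted above can be cross-checked with a few lines of portable C. The following is only a sketch: the types lcd32 and lcd64 and all function names are made up for this illustration, and the fixed-width types of <stdint.h> stand in for the DWORD, WORD and DWORD64 types of the Windows SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Leading members of IMAGE_LOAD_CONFIG_DIRECTORY32, up to GuardFlags.
   Hypothetical stand-in type; the real definition lives in <winnt.h>. */
typedef struct lcd32 {
    uint32_t Size, TimeDateStamp;
    uint16_t MajorVersion, MinorVersion;
    uint32_t GlobalFlagsClear, GlobalFlagsSet, CriticalSectionDefaultTimeout;
    uint32_t DeCommitFreeBlockThreshold, DeCommitTotalFreeThreshold;
    uint32_t LockPrefixTable, MaximumAllocationSize, VirtualMemoryThreshold;
    uint32_t ProcessHeapFlags, ProcessAffinityMask;
    uint16_t CSDVersion, DependentLoadFlags;
    uint32_t EditList, SecurityCookie, SEHandlerTable, SEHandlerCount;
    uint32_t GuardCFCheckFunctionPointer, GuardCFDispatchFunctionPointer;
    uint32_t GuardCFFunctionTable, GuardCFFunctionCount;
    uint32_t GuardFlags;
} lcd32;

/* Leading members of IMAGE_LOAD_CONFIG_DIRECTORY64, up to GuardFlags. */
typedef struct lcd64 {
    uint32_t Size, TimeDateStamp;
    uint16_t MajorVersion, MinorVersion;
    uint32_t GlobalFlagsClear, GlobalFlagsSet, CriticalSectionDefaultTimeout;
    uint64_t DeCommitFreeBlockThreshold, DeCommitTotalFreeThreshold;
    uint64_t LockPrefixTable, MaximumAllocationSize, VirtualMemoryThreshold;
    uint64_t ProcessAffinityMask;
    uint32_t ProcessHeapFlags;
    uint16_t CSDVersion, DependentLoadFlags;
    uint64_t EditList, SecurityCookie, SEHandlerTable, SEHandlerCount;
    uint64_t GuardCFCheckFunctionPointer, GuardCFDispatchFunctionPointer;
    uint64_t GuardCFFunctionTable, GuardCFFunctionCount;
    uint32_t GuardFlags;
} lcd64;

/* Offset of the DependentLoadFlags member in both variants. */
size_t dlf_offset32(void) { return offsetof(lcd32, DependentLoadFlags); }
size_t dlf_offset64(void) { return offsetof(lcd64, DependentLoadFlags); }

/* Smallest value of the Size member that still covers GuardFlags. */
size_t min_size32(void) { return offsetof(lcd32, GuardFlags) + sizeof(uint32_t); }
size_t min_size64(void) { return offsetof(lcd64, GuardFlags) + sizeof(uint32_t); }

/* Aborts if the layout does not match the numbers given in the text. */
void check_layout(void)
{
    assert(dlf_offset32() == 54 && dlf_offset64() == 78);
    assert(min_size32() == 92 && min_size64() == 148);
}
```

Calling check_layout() from any test program is sufficient; it returns silently when the stand-in layouts agree with the values stated in the caveat.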
The documentation of the /GS compiler option states:
The /GS compiler option requires that the security cookie be initialized before any function that uses the cookie is run. The security cookie must be initialized immediately on entry to an EXE or DLL. This is done automatically if you use the default VCRuntime entry points: mainCRTStartup, wmainCRTStartup, WinMainCRTStartup, wWinMainCRTStartup, or _DllMainCRTStartup. If you use an alternate entry point, you must manually initialize the security cookie by calling __security_init_cookie.
OOPS¹: contrary to the first statement, the code generated by the compiler requires only that the (arbitrary) value of the security cookie does not change between entry and exit of any function which uses it!
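That only the stability of the cookie value matters, not its initialisation, can be illustrated with a small portable model. This is a sketch under stated assumptions, not the code the compiler emits: struct demo_frame, demo_cookie and frame_intact_after_copy() are hypothetical names, and the structure merely imitates a stack frame in which a local buffer is followed by the saved cookie.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Arbitrary value that is never (re)initialised; stands in for __security_cookie. */
static const uintptr_t demo_cookie = (uintptr_t) 2654435769UL;

/* Models a stack frame: a local buffer followed by the saved cookie. */
struct demo_frame {
    char      buffer[8];
    uintptr_t saved_cookie;
};

/* Copies n bytes (clamped to the frame size) to the start of the frame, like
   an unchecked copy into buffer would, then performs the comparison of the
   function epilogue. Returns 1 if the saved cookie is intact, else 0. */
int frame_intact_after_copy(const void *src, size_t n)
{
    struct demo_frame frame;

    assert(src != NULL);
    frame.saved_cookie = demo_cookie;           /* "prologue": save the cookie */
    if (n > sizeof(frame))
        n = sizeof(frame);
    memcpy(&frame, src, n);                     /* may run past buffer         */
    return frame.saved_cookie == demo_cookie;   /* "epilogue": compare it      */
}
```

A copy of 4 bytes leaves the model frame intact, while a copy of 24 bytes overwrites the saved cookie and is detected, although demo_cookie keeps the same constant value throughout.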
OOPS²: the documentation cited above, however, fails to provide the following (implementation) details:
[…] _load_config_used structure matches the size of the IMAGE_LOAD_CONFIG_DIRECTORY64 structure in the eleventh entry of the IMAGE_DATA_DIRECTORY array in the IMAGE_OPTIONAL_HEADER structure;
[…] __security_init_cookie() provided in the MSVCRT libraries (re)initialises the security cookie only if it has this default value or is 0;
[…] mainCRTStartup, wmainCRTStartup, WinMainCRTStartup, wWinMainCRTStartup and _DllMainCRTStartup!
[…] __security_init_cookie() function to (re)initialise the security cookie any more!
Note: while compiler and linker reference the
security cookie by its symbol name __security_cookie
,
the module loader references it through the virtual address stored
in the SecurityCookie
member of the
IMAGE_LOAD_CONFIG_DIRECTORY
structure.
The MSDN magazine articles Protecting Your Code with Visual C++ Defenses and Visual C++ Support for Stack-Based Buffer Protection provide additional information.
strict_gs_check pragma
Security Checks at Runtime and Compile Time
Compiler Security Checks In Depth
CAVEAT: when an exception is thrown in a function and not handled in place, but by one of the calling functions, i.e. when the function’s epilog is not executed, an overwritten stack cookie is not detected!
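The effect described in this caveat can be modelled in portable C. The sketch below is an analogy only: longjmp() stands in for the unwinding performed when an exception is handled by a calling function, epilogue_check() stands in for the cookie comparison in the epilogue, and all names are hypothetical.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf unwind_target;     /* "handler" in the calling function    */
static int     checks_executed;   /* number of "epilogue" checks that ran */

/* Stands in for the cookie comparison in a function's epilogue. */
static void epilogue_check(void)
{
    ++checks_executed;
}

/* Stands in for a function protected with /GS: when leave_by_unwind is
   nonzero, control leaves through longjmp() and the epilogue is skipped. */
static void protected_function(int leave_by_unwind)
{
    if (leave_by_unwind)
        longjmp(unwind_target, 1);
    epilogue_check();
}

/* Returns how many epilogue checks ran during one call. */
int checks_for_call(int leave_by_unwind)
{
    checks_executed = 0;
    if (setjmp(unwind_target) == 0)
        protected_function(leave_by_unwind);
    return checks_executed;
}

/* Aborts unless a normal return runs the check and an unwind skips it. */
void unwind_selftest(void)
{
    assert(checks_for_call(0) == 1);
    assert(checks_for_call(1) == 0);
}
```

In this model, an overwritten cookie in protected_function() would go unnoticed whenever the function is left through the unwind path, since the comparison is never reached.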
Just as it uses the public symbol __tls_used on the i386 platform and _tls_used on other platforms to locate the IMAGE_TLS_DIRECTORY, the linker locates the IMAGE_LOAD_CONFIG_DIRECTORY structure via the public symbol __load_config_used on the i386 platform and _load_config_used on other platforms.
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#ifndef LOAD_LIBRARY_SEARCH_SYSTEM32
#define LOAD_LIBRARY_SEARCH_SYSTEM32 0x00000800UL
#endif
#ifndef _WIN64
#if 0
DWORD __security_cookie = 0xBB40E64EUL;
// = 3141592654 = 10**9 * pi
#else
const DWORD __security_cookie = 2654435769UL;
// = 0x9E3779B9UL
// = 2**32 / phi
#endif
extern LPVOID __safe_se_handler_table[];
extern BYTE __safe_se_handler_count;
const struct _IMAGE_LOAD_CONFIG_DIRECTORY_32
{
DWORD Size;
DWORD TimeDateStamp;
WORD MajorVersion;
WORD MinorVersion;
DWORD GlobalFlagsClear;
DWORD GlobalFlagsSet;
DWORD CriticalSectionDefaultTimeout;
DWORD DeCommitFreeBlockThreshold;
DWORD DeCommitTotalFreeThreshold;
DWORD LockPrefixTable;
DWORD MaximumAllocationSize;
DWORD VirtualMemoryThreshold;
DWORD ProcessHeapFlags;
DWORD ProcessAffinityMask;
WORD CSDVersion;
#if LCU > 2 // Redstone 1 (1607)
WORD DependentLoadFlags;
#else
WORD Reserved1;
#endif
DWORD EditList;
DWORD SecurityCookie;
DWORD SEHandlerTable;
DWORD SEHandlerCount;
#if LCU > 0 // Threshold 1 (1507)
DWORD GuardCFCheckFunctionPointer;
DWORD GuardCFDispatchFunctionPointer;
DWORD GuardCFFunctionTable;
DWORD GuardCFFunctionCount;
DWORD GuardFlags;
#if LCU > 1 // Threshold 2 (1511)
struct // _IMAGE_LOAD_CONFIG_CODE_INTEGRITY
{
WORD Flags;
WORD Catalog;
DWORD CatalogOffset;
DWORD Reserved;
} CodeIntegrity;
#if LCU > 2 // Redstone 1 (1607)
DWORD GuardAddressTakenIatEntryTable;
DWORD GuardAddressTakenIatEntryCount;
DWORD GuardLongJumpTargetTable;
DWORD GuardLongJumpTargetCount;
DWORD DynamicValueRelocTable;
DWORD CHPEMetadataPointer;
#if LCU > 3 // Redstone 2 (1703)
DWORD GuardRFFailureRoutine;
DWORD GuardRFFailureRoutineFunctionPointer;
DWORD DynamicValueRelocTableOffset;
WORD DynamicValueRelocTableSection;
WORD Reserved2;
DWORD GuardRFVerifyStackPointerFunctionPointer;
DWORD HotPatchTableOffset;
#if LCU > 4 // Redstone 3 (1709)
DWORD Reserved3;
DWORD EnclaveConfigurationPointer;
#if LCU > 5 // Redstone 4 (1803)
DWORD VolatileMetadataPointer;
#if LCU > 6 // Redstone 5 (1809)
DWORD GuardEHContinuationTable;
DWORD GuardEHContinuationCount;
// Titanium (1903)
// Vanadium (1909)
// Vibranium 1 (2004)
// Vibranium 2 (20H2)
#if LCU > 7 // Vibranium 3 (21H1)
DWORD GuardXFGCheckFunctionPointer;
DWORD GuardXFGDispatchFunctionPointer;
DWORD GuardXFGTableDispatchFunctionPointer;
#if LCU > 8 // Vibranium 4 (21H2)
DWORD CastGuardOsDeterminedFailureMode;
#if LCU > 9 // Vibranium 5 (22H2)
DWORD GuardMemcpyFunctionPointer;
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
} _load_config_used = {sizeof(_load_config_used),
'DATE', // = 2006-04-15 20:15:01 UTC
_MSC_VER / 100,
_MSC_VER % 100,
0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL,
0U,
LOAD_LIBRARY_SEARCH_SYSTEM32,
0UL,
&__security_cookie,
__safe_se_handler_table,
&__safe_se_handler_count,
0UL, 0UL, 0UL, 0UL, 0UL};
#else // _WIN64
#if 0
DWORD64 __security_cookie = 0x00002B992DDFA232ULL;
// = 3141592653589793241 >> 16
// = 10**18 / 2**16 * pi
#else
const DWORD64 __security_cookie = 173961102589770ULL;
// = 0x00009E3779B97F4AULL
// = 2**48 / phi
#endif
const struct _IMAGE_LOAD_CONFIG_DIRECTORY_64
{
DWORD Size;
DWORD TimeDateStamp;
WORD MajorVersion;
WORD MinorVersion;
DWORD GlobalFlagsClear;
DWORD GlobalFlagsSet;
DWORD CriticalSectionDefaultTimeout;
DWORD64 DeCommitFreeBlockThreshold;
DWORD64 DeCommitTotalFreeThreshold;
DWORD64 LockPrefixTable;
DWORD64 MaximumAllocationSize;
DWORD64 VirtualMemoryThreshold;
DWORD64 ProcessAffinityMask;
DWORD ProcessHeapFlags;
WORD CSDVersion;
#if LCU > 2 // Redstone 1 (1607)
WORD DependentLoadFlags;
#else
WORD Reserved1;
#endif
DWORD64 EditList;
DWORD64 SecurityCookie;
DWORD64 SEHandlerTable;
DWORD64 SEHandlerCount;
#if LCU > 0 // Threshold 1 (1507)
DWORD64 GuardCFCheckFunctionPointer;
DWORD64 GuardCFDispatchFunctionPointer;
DWORD64 GuardCFFunctionTable;
DWORD64 GuardCFFunctionCount;
DWORD GuardFlags;
#if LCU > 1 // Threshold 2 (1511)
struct // _IMAGE_LOAD_CONFIG_CODE_INTEGRITY
{
WORD Flags;
WORD Catalog;
DWORD CatalogOffset;
DWORD Reserved;
} CodeIntegrity;
#if LCU > 2 // Redstone 1 (1607)
DWORD64 GuardAddressTakenIatEntryTable;
DWORD64 GuardAddressTakenIatEntryCount;
DWORD64 GuardLongJumpTargetTable;
DWORD64 GuardLongJumpTargetCount;
DWORD64 DynamicValueRelocTable;
DWORD64 CHPEMetadataPointer;
#if LCU > 3 // Redstone 2 (1703)
DWORD64 GuardRFFailureRoutine;
DWORD64 GuardRFFailureRoutineFunctionPointer;
DWORD DynamicValueRelocTableOffset;
WORD DynamicValueRelocTableSection;
WORD Reserved2;
DWORD64 GuardRFVerifyStackPointerFunctionPointer;
DWORD HotPatchTableOffset;
#if LCU > 4 // Redstone 3 (1709)
DWORD Reserved3;
DWORD64 EnclaveConfigurationPointer;
#if LCU > 5 // Redstone 4 (1803)
DWORD64 VolatileMetadataPointer;
#if LCU > 6 // Redstone 5 (1809)
DWORD64 GuardEHContinuationTable;
DWORD64 GuardEHContinuationCount;
// Titanium (1903)
// Vanadium (1909)
// Vibranium 1 (2004)
// Vibranium 2 (20H2)
#if LCU > 7 // Vibranium 3 (21H1)
DWORD64 GuardXFGCheckFunctionPointer;
DWORD64 GuardXFGDispatchFunctionPointer;
DWORD64 GuardXFGTableDispatchFunctionPointer;
#if LCU > 8 // Vibranium 4 (21H2)
DWORD64 CastGuardOsDeterminedFailureMode;
#if LCU > 9 // Vibranium 5 (22H2)
DWORD64 GuardMemcpyFunctionPointer;
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
} _load_config_used = {sizeof(_load_config_used),
'TIME', // = 2014-10-23 18:47:33 UTC
_MSC_VER / 100,
_MSC_VER % 100,
0UL, 0UL, 0UL,
0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
0UL,
0U,
LOAD_LIBRARY_SEARCH_SYSTEM32,
0ULL,
&__security_cookie,
0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
0UL};
#endif // _WIN64
__declspec(noreturn)
#ifdef _WIN64
VOID __security_check_cookie(DWORD64 qwCookie)
{
if (qwCookie == __security_cookie)
return;
#else // _WIN64
VOID __security_check_cookie(DWORD dwCookie)
{
if (dwCookie == __security_cookie)
return;
#endif // _WIN64
#ifdef FAST_FAIL_STACK_COOKIE_CHECK_FAILURE
__fastfail(FAST_FAIL_STACK_COOKIE_CHECK_FAILURE);
#else
#ifdef FAIL_FAST_GENERATE_EXCEPTION_ADDRESS
RaiseFailFastException((EXCEPTION_RECORD *) NULL, (CONTEXT *) NULL, FAIL_FAST_GENERATE_EXCEPTION_ADDRESS);
#else
SetUnhandledExceptionFilter(NULL);
RaiseException(EXCEPTION_STACK_BUFFER_OVERRUN, EXCEPTION_NONCONTINUABLE, 1UL, _AddressOfReturnAddress());
#endif
#pragma comment(lib, "kernel32")
#endif
}
#pragma comment(user, "(C)opyright 2004-2025, Stefan Kanthak")
Note: the
__fastfail()
intrinsic function is supported since Windows 8, the
RaiseFailFastException()
function is supported since Windows 7.
Note: see the
MSDN articles
LoadLibraryEx()
function or
SetDefaultDllDirectories()
function for the values of the DependentLoadFlags
member, the articles
HeapCreate()
function,
HeapAlloc()
function or
HeapReAlloc()
function for the values of the ProcessHeapFlags
member,
and the article
Gflags Flag Reference
for the values of the GlobalFlagsClear
as well as the
GlobalFlagsSet
member.
Managing Heap Memory
Global Flag Reference
Save the
ANSI C
source presented above as lcu.c
in the directory where
you created the object library i386.lib
before, then
execute the following 3 command lines to compile it, write the
assembly to the text file lcu.cod
and add the generated
object file i386-lcu.obj
to the existing object library
i386.lib
:
SET CL=/c /DLCU /FAsc /GAFry /Oxy /W4 /Zl
CL.EXE /Foi386-lcu.obj lcu.c
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-lcu.obj
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
lcu.c
lcu.c(117) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'const DWORD *'
lcu.c(118) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'LPVOID *'
lcu.c(119) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'BYTE *'
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
CAVEAT: verify in the assembly written to the text file
lcu.cod
that the
__security_check_cookie()
function clobbers at most
register ECX
upon return to the caller when the stack
cookie is intact!
__cdecl
__fastcall
Move the
ANSI C
source file lcu.c
into the directory where you created
the object library amd64.lib
before, then execute the
following 3 command lines to compile it, write the assembly to the
text file lcu.cod
and add the generated object
file amd64-lcu.obj
to the object library
amd64.lib
:
SET CL=/c /DLCU /FAsc /GAFy /Oxy /W4 /Zl
CL.EXE /Foamd64-lcu.obj lcu.c
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-lcu.obj
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
lcu.c
lcu.c(227) : warning C4047: 'initializing' : 'DWORD64' differs in levels of indirection from 'const DWORD64 *'
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
CAVEAT: verify in the assembly written to the text file
lcu.cod
that the
__security_check_cookie()
function clobbers no
register except RCX
, R8
, R9
,
R10
and R11
upon return to the caller when
the stack cookie is intact!
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.686
.model flat; C
extern ___safe_se_handler_count :abs
extern ___safe_se_handler_table :ptr proc
_lcu_32 struct 4
dword sizeof _lcu_32
dword 'VOID' ; 2006-04-21 21:32:06 UTC
word @Version / 100
word @Version mod 100
dword 10 dup (0)
word 0, 2048 ; LOAD_LIBRARY_SEARCH_SYSTEM32
dword 0
dword offset ___security_cookie
dword offset ___safe_se_handler_table
dword offset ___safe_se_handler_count
dword 5 dup (0)
_lcu_32 ends
.const
public __load_config_used
__load_config_used \
_lcu_32 <> ; IMAGE_LOAD_CONFIG_DIRECTORY32
.data
public ___security_cookie
___security_cookie \
dword 3141592654
.code
@__security_check_cookie@4 \
proc public ; void __fastcall __security_check_cookie(dword cookie)
cmp ecx, ___security_cookie
jne short fastfail
ret
fastfail:
mov ecx, 2 ; ecx = FAST_FAIL_STACK_COOKIE_CHECK_FAILURE
int 41
ud2
@__security_check_cookie@4 \
endp
___security_init_cookie \
proc public ; void __cdecl __security_init_cookie(void)
mov eax, ___security_cookie
cmp eax, 3141592654
je short init
test eax, eax
jne short exit
init:
rdtsc ; eax = low dword of time stamp counter,
; edx = high dword of time stamp counter
xor eax, edx ; eax = random number
mov ___security_cookie, eax
exit:
ret
___security_init_cookie \
endp
end
Microsoft Macro Assembler Reference
Save the i386 assembler source presented above as
i386-lcu.asm
in the directory where you created the
object library i386.lib
before, then execute the
following 3 command lines to generate the object file
i386-lcu.obj
and add it to the existing object library
i386.lib
:
SET ML=/c /safeseh /W3 /X
ML.EXE i386-lcu.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-lcu.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: i386-lcu.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
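The decision logic of the ___security_init_cookie routine in the i386 assembler source can be restated in portable C. This is a sketch, not a replacement: the names init_cookie32() and DEFAULT_COOKIE32 are made up for this illustration, and the time stamp counter value is passed in as a parameter because the RDTSC instruction has no portable C equivalent.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_COOKIE32 3141592654UL   /* initial value of the cookie */

/* Returns the value of the cookie after initialisation: a cookie which is
   neither 0 nor the initial value is left alone, otherwise the low and the
   high half of the time stamp counter value tsc are XORed. */
uint32_t init_cookie32(uint32_t cookie, uint64_t tsc)
{
    if (cookie != DEFAULT_COOKIE32 && cookie != 0)
        return cookie;
    return (uint32_t) tsc ^ (uint32_t) (tsc >> 32);
}

/* Aborts unless the three cases behave as described. */
void init_cookie32_selftest(void)
{
    assert(init_cookie32(0x12345678UL, 42) == 0x12345678UL);    /* kept     */
    assert(init_cookie32(0, 0x0000000100000003ULL) == 2);       /* 3 ^ 1    */
    assert(init_cookie32(DEFAULT_COOKIE32, 0xFFFFFFFFULL) == 0xFFFFFFFFUL);
}
```

Separating the decision from the entropy source this way makes the "only if 0 or initial value" rule easy to test with fixed inputs.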
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
_lcu_64 struct 8
dword sizeof _lcu_64
dword 'VOID' ; 2006-04-21 21:32:06 UTC
word @Version / 100
word @Version mod 100
dword 0, 0, 0
qword 0, 0, 0, 0, 0, 0
dword 0
word 0, 2048 ; LOAD_LIBRARY_SEARCH_SYSTEM32
qword 0
qword offset __security_cookie
qword 0, 0, 0, 0, 0, 0
dword 0
_lcu_64 ends
.const
public _load_config_used
_load_config_used \
_lcu_64 <> ; IMAGE_LOAD_CONFIG_DIRECTORY64
.data
public __security_cookie
__security_cookie \
qword 3141592653589793241 shr 16
.code
__security_check_cookie \
proc public ; void __security_check_cookie(qword cookie)
cmp rcx, __security_cookie
jne short fastfail
;; shr rcx, 48
;; jnz short fastfail
ret
fastfail:
mov ecx, 2 ; rcx = FAST_FAIL_STACK_COOKIE_CHECK_FAILURE
int 41
ud2
__security_check_cookie \
endp
__security_init_cookie \
proc public ; void __security_init_cookie(void)
mov rax, __security_cookie
mov rcx, 3141592653589793241 shr 16
cmp rcx, rax
je short init
test rax, rax
jne short exit
init:
rdtsc ; rax = low dword of time stamp counter,
; rdx = high dword of time stamp counter
mov ecx, edx ; rcx = high dword of time stamp counter
bswap edx
imul rcx, rax ; rcx = high dword of time stamp counter
; * low dword of time stamp counter
bswap rax
xor rax, rdx ; rax = byte-swapped time stamp counter
mul rcx
xor rax, rdx ; rax = random number
shr rax, 16
mov __security_cookie, rax
exit:
ret
__security_init_cookie \
endp
__GSHandlerCheck \
proc private ; int __GSHandlerCheck(void *, void *, void *, void *)
xor eax, eax
inc eax ; rax = ExceptionContinueSearch
ret
__GSHandlerCheck \
endp
end
Microsoft Macro Assembler Reference
Save the AMD64 assembler source presented above as
amd64-lcu.asm
in the directory where you created the
object library amd64.lib
before, then execute the
following 3 command lines to generate the object file
amd64-lcu.obj
and add it to the existing object library
amd64.lib
:
SET ML=/c /W3 /X
ML64.EXE amd64-lcu.asm
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-lcu.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: amd64-lcu.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
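The mixing steps of the __security_init_cookie routine in the AMD64 assembler source can be followed more easily in C. The sketch below mirrors the instruction sequence under two assumptions: the time stamp counter value is passed in as a parameter instead of being read with RDTSC, and the 64×64-bit multiplication with 128-bit product is emulated with 32-bit partial products. All names (init_cookie64(), mul64(), bswap32(), DEFAULT_COOKIE64) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEFAULT_COOKIE64 (3141592653589793241ULL >> 16)   /* initial value */

/* Reverses the byte order of a 32-bit value, like BSWAP r32. */
static uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00UL)
         | ((x << 8) & 0x00FF0000UL) | (x << 24);
}

/* 64 x 64 -> 128 bit unsigned multiplication, like MUL r64. */
static void mul64(uint64_t a, uint64_t b, uint64_t *high, uint64_t *low)
{
    uint64_t a_lo = (uint32_t) a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t) b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo, p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo, p3 = a_hi * b_hi;
    uint64_t mid = (p0 >> 32) + (uint32_t) p1 + (uint32_t) p2;

    *low  = (mid << 32) | (uint32_t) p0;
    *high = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

/* Returns the value of the cookie after initialisation; tsc stands in for
   the time stamp counter. The result never exceeds 48 bits. */
uint64_t init_cookie64(uint64_t cookie, uint64_t tsc)
{
    uint32_t lo = (uint32_t) tsc, hi = (uint32_t) (tsc >> 32);
    uint64_t factor, swapped, high, low;

    if (cookie != DEFAULT_COOKIE64 && cookie != 0)
        return cookie;
    factor  = (uint64_t) hi * lo;                            /* imul rcx, rax */
    swapped = ((uint64_t) bswap32(lo) << 32) | bswap32(hi);  /* bswap, xor    */
    mul64(swapped, factor, &high, &low);                     /* mul rcx       */
    return (low ^ high) >> 16;                               /* xor, shr      */
}

/* Aborts unless the helper and the three cases behave as described. */
void init_cookie64_selftest(void)
{
    uint64_t high, low;

    mul64(0xFFFFFFFFFFFFFFFFULL, 2, &high, &low);
    assert(high == 1 && low == 0xFFFFFFFFFFFFFFFEULL);
    assert(init_cookie64(0x1234, 99) == 0x1234);
    assert(init_cookie64(0, 0x0000000100000001ULL) == 0x0000010000000100ULL);
    assert((init_cookie64(DEFAULT_COOKIE64, 0x0123456789ABCDEFULL) >> 48) == 0);
}
```

The final shift by 16 bits keeps the upper 16 bits of the cookie zero, in line with the commented-out `shr rcx, 48` test in __security_check_cookie.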
/DELAYLOAD
Linker Support for Delay-Loaded DLLs
Specifying DLLs to Delay Load
Constraints of Delay Loading DLLs
Binding Imports
Explicitly Unloading a Delay-Loaded DLL
Understanding the Helper Function
Error Handling and Notification
Exceptions
Failure Hooks
Notification Hooks
Structure and Constant Definitions
Developing Your Own Helper Function
Calling Conventions, Parameters, and Return Type
Calculating Necessary Values
Unloading a Delay-Loaded DLL
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#pragma comment(lib, "kernel32")
#pragma comment(user, "(C)opyright 2004-2025, Stefan Kanthak")
#ifndef EXCEPTION_DELAY_LOAD_INVALID_PARAMETER
#define EXCEPTION_DELAY_LOAD_INVALID_PARAMETER 0xC06D0057
#endif
#ifndef EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND
#define EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND 0xC06D007E
#endif
#ifndef EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND
#define EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND 0xC06D007F
#endif
extern const IMAGE_DOS_HEADER __ImageBase;
typedef DWORD RVA;
typedef enum dliNotify
{
dliStartProcessing,
dliNotePreLoadLibrary,
dliNotePreGetProcAddress,
dliFailLoadLib,
dliFailGetProc,
dliNoteEndProcessing
} dliNotify;
typedef struct DelayLoadDescr
{
DWORD dwAttributes; // 1UL = all members are RVAs
DWORD dwDllName;
DWORD dwHMODULE; // RVA of module handle
DWORD dwIAT; // RVA of import address table
DWORD dwINT; // RVA of import name table
DWORD dwBoundIAT; // RVA of optional bound import address table
DWORD dwUnloadIAT; // RVA of optional copy of original import address table
DWORD dwTimeStamp;
} DelayLoadDescr;
typedef struct DelayLoadProc
{
BOOL fImportByName;
union
{
LPCSTR szProcName;
DWORD dwOrdinal;
};
} DelayLoadProc;
typedef struct DelayLoadInfo
{
DWORD cb; // size of structure
DelayLoadDescr *pidd; // raw form of data (everything is there)
FARPROC *ppfn; // points to address of function to load
LPCSTR szDll; // name of DLL
DelayLoadProc dlp; // name or ordinal of function to load
HMODULE hmodCur; // handle of DLL
FARPROC pfnCur; // actual function that will be called
DWORD dwLastError; // error received (if an error notification)
} DelayLoadInfo;
typedef FARPROC (WINAPI *PfnDliHook) (dliNotify, DelayLoadInfo *);
BOOL WINAPI __FUnloadDelayLoadedDLL2(LPCSTR szDll);
FARPROC WINAPI __delayLoadHelper2(DelayLoadDescr *lpDLD, FARPROC *lpfnIATEntry)
{
HMODULE hModule;
HMODULE *lpHMODULE = (HMODULE *) ((LPBYTE) &__ImageBase + lpDLD->dwHMODULE);
IMAGE_THUNK_DATA *lpINT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwINT);
IMAGE_THUNK_DATA *lpIAT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwIAT);
IMAGE_THUNK_DATA *lpBoundIAT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwBoundIAT);
DWORD dwEntry = (IMAGE_THUNK_DATA *) lpfnIATEntry - lpIAT;
// NOTE: *lpfnIATEntry == lpIAT[dwEntry].u1.Function
DelayLoadInfo dli = {sizeof(DelayLoadInfo),
lpDLD,
lpfnIATEntry,
(LPCSTR) &__ImageBase + lpDLD->dwDllName,
{!IMAGE_SNAP_BY_ORDINAL(lpINT[dwEntry].u1.Ordinal),
IMAGE_SNAP_BY_ORDINAL(lpINT[dwEntry].u1.Ordinal)
? IMAGE_ORDINAL(lpINT[dwEntry].u1.Ordinal)
: ((IMAGE_IMPORT_BY_NAME *) ((LPBYTE) &__ImageBase + lpINT[dwEntry].u1.AddressOfData))->Name},
*lpHMODULE,
(FARPROC) NULL,
ERROR_SUCCESS};
if ((lpDLD->dwAttributes & 1UL) == 0UL) // members are not RVAs?
{
dli.dwLastError = ERROR_INVALID_PARAMETER;
RaiseException(EXCEPTION_DELAY_LOAD_INVALID_PARAMETER,
EXCEPTION_NONCONTINUABLE,
1UL,
(DWORD_PTR *) &dli);
return (FARPROC) NULL;
}
if (dli.hmodCur == NULL) // module not yet loaded?
{
#ifndef LOAD_LIBRARY_SEARCH_SYSTEM32
dli.hmodCur = LoadLibraryA(dli.szDll);
#else
dli.hmodCur = LoadLibraryExA(dli.szDll, NULL, LOAD_LIBRARY_SEARCH_SYSTEM32);
#endif
if (dli.hmodCur == NULL)
{
dli.dwLastError = GetLastError();
RaiseException(EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND,
EXCEPTION_NONCONTINUABLE,
1UL,
(DWORD_PTR *) &dli);
return (FARPROC) NULL;
}
#ifndef _WIN64
hModule = (HMODULE) InterlockedExchangePointer((LPVOID *) lpHMODULE, dli.hmodCur);
#else
hModule = (HMODULE) _InterlockedExchangePointer((LPVOID *) lpHMODULE, dli.hmodCur);
#endif
if (hModule == dli.hmodCur)
FreeLibrary(dli.hmodCur);
else
if (lpDLD->dwUnloadIAT != 0UL)
{
// ...
}
}
if ((lpDLD->dwBoundIAT != 0UL) && (lpDLD->dwTimeStamp != 0UL))
{
IMAGE_NT_HEADERS *lpModule = (IMAGE_NT_HEADERS *) ((LPBYTE) dli.hmodCur + ((IMAGE_DOS_HEADER *) dli.hmodCur)->e_lfanew);
if ((lpModule->Signature == IMAGE_NT_SIGNATURE)
&& (lpModule->FileHeader.TimeDateStamp == lpDLD->dwTimeStamp)
&& (lpModule->OptionalHeader.ImageBase == dli.hmodCur))
{
dli.pfnCur = (FARPROC) lpBoundIAT[dwEntry].u1.Function;
if (dli.pfnCur != NULL)
return *lpfnIATEntry = dli.pfnCur;
}
}
dli.pfnCur = GetProcAddress(dli.hmodCur, dli.dlp.szProcName);
if (dli.pfnCur != NULL) // function address resolved?
return *lpfnIATEntry = dli.pfnCur;
dli.dwLastError = GetLastError();
RaiseException(EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND,
EXCEPTION_NONCONTINUABLE,
1UL,
(DWORD_PTR *) &dli);
return (FARPROC) NULL;
}
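How the helper routine above distinguishes an import by ordinal from an import by name, and how it derives the table index dwEntry, can be shown without the Windows headers. The following sketch re-implements the tests performed by the IMAGE_SNAP_BY_ORDINAL and IMAGE_ORDINAL macros with fixed-width types; the function names and the two ORDINAL_FLAG macros are stand-ins chosen for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ORDINAL_FLAG32 0x80000000UL            /* IMAGE_ORDINAL_FLAG32 */
#define ORDINAL_FLAG64 0x8000000000000000ULL   /* IMAGE_ORDINAL_FLAG64 */

/* Nonzero if the import name table entry denotes an import by ordinal. */
int snap_by_ordinal32(uint32_t thunk) { return (thunk & ORDINAL_FLAG32) != 0; }
int snap_by_ordinal64(uint64_t thunk) { return (thunk & ORDINAL_FLAG64) != 0; }

/* Ordinal number stored in the low 16 bits of such an entry. */
uint16_t thunk_ordinal(uint64_t thunk) { return (uint16_t) (thunk & 0xFFFF); }

/* Index of an import address table slot: distance between the address of
   the slot and the start of the table, in units of the slot size. */
uint32_t thunk_index(uintptr_t slot, uintptr_t table, size_t slot_size)
{
    assert(slot >= table && slot_size != 0);
    return (uint32_t) ((slot - table) / slot_size);
}

/* Aborts unless the helpers behave as described. */
void thunk_selftest(void)
{
    assert(snap_by_ordinal32(0x80000010UL) == 1);
    assert(snap_by_ordinal64(0x0000000000002000ULL) == 0);
    assert(thunk_ordinal(0x8000000000000123ULL) == 0x123);
    assert(thunk_index(0x1018, 0x1000, 8) == 3);
}
```

The same index is valid for the import address table, the import name table and the optional bound import address table, which is why the helper computes it only once.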
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.model flat, C
extern __pfnDliDefaultHook2 :ptr proc
extern __pfnDliFailureHook2 (__pfnDliDefaultHook2) :ptr proc
extern __pfnDliNotifyHook2 (__pfnDliDefaultHook2) :ptr proc
end
Microsoft Macro Assembler Reference
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
extern __pfnDliDefaultHook2 :ptr proc
extern __pfnDliFailureHook2 (__pfnDliDefaultHook2) :ptr proc
extern __pfnDliNotifyHook2 (__pfnDliDefaultHook2) :ptr proc
end
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#pragma comment(lib, "kernel32")
#ifdef _WIN64
#pragma comment(linker, "/ALTERNATENAME:__pfnDliFailureHook2=__pfnDliDefaultHook2")
#pragma comment(linker, "/ALTERNATENAME:__pfnDliNotifyHook2=__pfnDliDefaultHook2")
#else
#pragma comment(linker, "/ALTERNATENAME:___pfnDliFailureHook2=___pfnDliDefaultHook2")
#pragma comment(linker, "/ALTERNATENAME:___pfnDliNotifyHook2=___pfnDliDefaultHook2")
#endif
#pragma comment(user, "(C)opyright 2004-2025, Stefan Kanthak")
#ifndef EXCEPTION_DELAY_LOAD_INVALID_PARAMETER
#define EXCEPTION_DELAY_LOAD_INVALID_PARAMETER 0xC06D0057
#endif
#ifndef EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND
#define EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND 0xC06D007E
#endif
#ifndef EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND
#define EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND 0xC06D007F
#endif
extern const IMAGE_DOS_HEADER __ImageBase;
typedef DWORD RVA;
typedef enum dliNotify
{
dliStartProcessing,
dliNotePreLoadLibrary,
dliNotePreGetProcAddress,
dliFailLoadLib,
dliFailGetProc,
dliNoteEndProcessing
} dliNotify;
typedef struct DelayLoadDescr
{
DWORD dwAttributes; // 1UL = all members are RVAs
DWORD dwDllName;
DWORD dwHMODULE; // RVA of module handle
DWORD dwIAT; // RVA of import address table
DWORD dwINT; // RVA of import name table
DWORD dwBoundIAT; // RVA of optional bound import address table
DWORD dwUnloadIAT; // RVA of optional copy of original import address table
DWORD dwTimeStamp;
} DelayLoadDescr;
typedef struct DelayLoadProc
{
BOOL fImportByName;
union
{
LPCSTR szProcName;
DWORD dwOrdinal;
};
} DelayLoadProc;
typedef struct DelayLoadInfo
{
DWORD cb; // size of structure
DelayLoadDescr *pidd; // raw form of data (everything is there)
FARPROC *ppfn; // points to address of function to load
LPCSTR szDll; // name of DLL
DelayLoadProc dlp; // name or ordinal of function to load
HMODULE hmodCur; // handle of DLL
FARPROC pfnCur; // actual function that will be called
DWORD dwLastError; // error received (if an error notification)
} DelayLoadInfo;
typedef FARPROC (WINAPI *PfnDliHook) (dliNotify, DelayLoadInfo *);
extern PfnDliHook __pfnDliNotifyHook2;
extern PfnDliHook __pfnDliFailureHook2;
const PfnDliHook __pfnDliDefaultHook2 = NULL;
BOOL WINAPI __FUnloadDelayLoadedDLL2(LPCSTR szDll);
FARPROC WINAPI __delayLoadHelper2(DelayLoadDescr *lpDLD, FARPROC *lpfnIATEntry)
{
HMODULE hModule;
HMODULE *lpHMODULE = (HMODULE *) ((LPBYTE) &__ImageBase + lpDLD->dwHMODULE);
IMAGE_THUNK_DATA *lpINT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwINT);
IMAGE_THUNK_DATA *lpIAT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwIAT);
IMAGE_THUNK_DATA *lpBoundIAT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwBoundIAT);
DWORD dwEntry = (IMAGE_THUNK_DATA *) lpfnIATEntry - lpIAT;
// NOTE: *lpfnIATEntry == lpIAT[dwEntry].u1.Function
DelayLoadInfo dli = {sizeof(DelayLoadInfo),
lpDLD,
lpfnIATEntry,
(LPCSTR) &__ImageBase + lpDLD->dwDllName,
{!IMAGE_SNAP_BY_ORDINAL(lpINT[dwEntry].u1.Ordinal),
IMAGE_SNAP_BY_ORDINAL(lpINT[dwEntry].u1.Ordinal)
? IMAGE_ORDINAL(lpINT[dwEntry].u1.Ordinal)
: ((IMAGE_IMPORT_BY_NAME *) ((LPBYTE) &__ImageBase + lpINT[dwEntry].u1.AddressOfData))->Name},
*lpHMODULE,
(FARPROC) NULL,
ERROR_SUCCESS};
if (__pfnDliNotifyHook2 != NULL)
{
dli.pfnCur = (*__pfnDliNotifyHook2)(dliStartProcessing, &dli);
if (dli.pfnCur != NULL)
goto SUCCESS;
}
if ((lpDLD->dwAttributes & 1UL) == 0UL) // members are not RVAs?
{
dli.dwLastError = ERROR_INVALID_PARAMETER;
RaiseException(EXCEPTION_DELAY_LOAD_INVALID_PARAMETER,
EXCEPTION_NONCONTINUABLE,
1UL,
(DWORD_PTR *) &dli);
goto FAILURE;
}
if (dli.hmodCur != NULL) // module already loaded?
goto ADDRESS;
if (__pfnDliNotifyHook2 != NULL)
dli.hmodCur = (HMODULE) (*__pfnDliNotifyHook2)(dliNotePreLoadLibrary, &dli);
if (dli.hmodCur != NULL) // module handle resolved by notification routine?
goto ADDRESS;
#ifndef LOAD_LIBRARY_SEARCH_SYSTEM32
dli.hmodCur = LoadLibraryA(dli.szDll);
#else
dli.hmodCur = LoadLibraryExA(dli.szDll, NULL, LOAD_LIBRARY_SEARCH_SYSTEM32);
#endif
if (dli.hmodCur == NULL) // module not loaded?
{
dli.dwLastError = GetLastError();
if (__pfnDliFailureHook2 != NULL)
dli.hmodCur = (HMODULE) (*__pfnDliFailureHook2)(dliFailLoadLib, &dli);
if (dli.hmodCur == NULL)
{
RaiseException(EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND,
EXCEPTION_NONCONTINUABLE,
1UL,
(DWORD_PTR *) &dli);
goto FAILURE;
}
} // module loaded or supplied by failure routine: store its handle
#ifndef _WIN64
hModule = (HMODULE) InterlockedExchangePointer((LPVOID *) lpHMODULE, dli.hmodCur);
#else
hModule = (HMODULE) _InterlockedExchangePointer((LPVOID *) lpHMODULE, dli.hmodCur);
#endif
if (hModule == dli.hmodCur)
FreeLibrary(dli.hmodCur);
else
if (lpDLD->dwUnloadIAT != 0UL)
{
// ...
}
ADDRESS:
if (__pfnDliNotifyHook2 != NULL)
dli.pfnCur = (*__pfnDliNotifyHook2)(dliNotePreGetProcAddress, &dli);
if (dli.pfnCur != NULL) // function address resolved by notification routine?
goto SUCCESS;
if ((lpDLD->dwBoundIAT != 0UL) && (lpDLD->dwTimeStamp != 0UL))
{
IMAGE_NT_HEADERS *lpModule = (IMAGE_NT_HEADERS *) ((LPBYTE) dli.hmodCur + ((IMAGE_DOS_HEADER *) dli.hmodCur)->e_lfanew);
if ((lpModule->Signature == IMAGE_NT_SIGNATURE)
&& (lpModule->FileHeader.TimeDateStamp == lpDLD->dwTimeStamp)
&& (lpModule->OptionalHeader.ImageBase == dli.hmodCur))
{
dli.pfnCur = (FARPROC) lpBoundIAT[dwEntry].u1.Function;
if (dli.pfnCur != NULL)
goto SUCCESS;
}
}
dli.pfnCur = GetProcAddress(dli.hmodCur, dli.dlp.szProcName);
if (dli.pfnCur != NULL) // function address resolved?
goto SUCCESS;
dli.dwLastError = GetLastError();
if (__pfnDliFailureHook2 != NULL)
dli.pfnCur = (*__pfnDliFailureHook2)(dliFailGetProc, &dli);
if (dli.pfnCur != NULL) // function address resolved by failure routine?
goto SUCCESS;
RaiseException(EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND,
EXCEPTION_NONCONTINUABLE,
1UL,
(DWORD_PTR *) &dli);
goto FAILURE;
SUCCESS:
*lpfnIATEntry = dli.pfnCur;
FAILURE:
if (__pfnDliNotifyHook2 != NULL)
(*__pfnDliNotifyHook2)(dliNoteEndProcessing, &dli);
return dli.pfnCur;
}
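The exchange-and-compare step of the routine above stores the module handle and drops the extra reference when the same handle was already stored. The following is only a minimal portable sketch of that pattern, assuming C11 atomics in place of InterlockedExchangePointer(); the names cached_handle, release_handle() and publish_handle() are hypothetical and not part of dli.c:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static _Atomic(void *) cached_handle = NULL;  /* stands in for *lpHMODULE */
static int released = 0;                      /* counts release_handle() calls */

/* Hypothetical stand-in for FreeLibrary(): it only records the call. */
static void release_handle(void *handle)
{
    assert(handle != NULL);
    released++;
}

/* Store the freshly loaded handle atomically. If the very same handle was
   stored before (by another thread or an earlier call), the second load
   took an extra reference, which is released here.
   Returns the previously stored value. */
static void *publish_handle(void *handle)
{
    void *previous = atomic_exchange(&cached_handle, handle);

    if (previous == handle)
        release_handle(handle);
    return previous;
}
```

The first call finds nothing stored and keeps its reference; every later call with the same handle gives its reference back, so the module's reference count stays at one.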
Save the ANSI C source presented above as dli.c in the directory where you created the object library i386.lib before, then execute the following 3 command lines to compile it and add the generated object file i386-dli.obj to the existing object library i386.lib:
SET CL=/c /GAFyz /Oxy /W4 /Zl
CL.EXE /Foi386-dli.obj dli.c
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-dli.obj
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
dli.c
dli.c(65) : warning C4201: nonstandard extension used : nameless struct/union
dli.c(100) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(101) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(102) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(103) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(106) : warning C4047: ':' : 'DWORD' differs in levels of indirection from 'BYTE *'
dli.c(106) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(106) : warning C4057: 'initializing' : 'LPCSTR' differs in indirection to slightly different base types from 'BYTE *'
dli.c(107) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(135) : warning C4054: 'type cast' : from function pointer 'FARPROC' to data pointer 'HMODULE'
dli.c(140) : warning C4054: 'type cast' : from function pointer 'FARPROC' to data pointer 'HMODULE'
dli.c(186) : warning C4047: '==' : 'DWORD' differs in levels of indirection from 'HMODULE'
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
Move the ANSI C source file dli.c into the directory where you created the object library amd64.lib before, then execute the following 3 command lines to compile it and add the generated object file amd64-dli.obj to the object library amd64.lib:
SET CL=/c /GAFy /Oxy /W4 /Zl
CL.EXE /Foamd64-dli.obj dli.c
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-dli.obj
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
dli.c
dli.c(65) : warning C4201: nonstandard extension used : nameless struct/union
dli.c(95) : warning C4244: 'initializing' : conversion from '__int64' to 'DWORD', possible loss of data
dli.c(100) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(101) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(102) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(103) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(106) : warning C4047: ':' : 'ULONGLONG' differs in levels of indirection from 'BYTE *'
dli.c(106) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(106) : warning C4057: 'initializing' : 'LPCSTR' differs in indirection to slightly different base types from 'BYTE *'
dli.c(107) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(135) : warning C4054: 'type cast' : from function pointer 'FARPROC' to data pointer 'HMODULE'
dli.c(149) : warning C4054: 'type cast' : from function pointer 'FARPROC' to data pointer 'HMODULE'
dli.c(186) : warning C4047: '==' : 'ULONGLONG' differs in levels of indirection from 'HMODULE'
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
main() and wmain() Support
Under Remarks, the documentation of the linker option /ENTRY:‹symbol› states:
The /ENTRY option specifies an entry point function as the starting address for an .exe file or DLL.
The function must be defined to use the __stdcall calling convention. The parameters and return value depend on if the program is a console application, a windows application or a DLL. It is recommended that you let the linker set the entry point so that the C run-time library is initialized correctly, and C++ constructors for static objects are executed.
By default, the starting address is a function name from the C run-time library. The linker selects it according to the attributes of the program, as shown in the following table.
Function name | Default for |
---|---|
mainCRTStartup (or wmainCRTStartup) | An application that uses /SUBSYSTEM:CONSOLE; calls main (or wmain) |
WinMainCRTStartup (or wWinMainCRTStartup) | An application that uses /SUBSYSTEM:WINDOWS; calls WinMain (or wWinMain), which must be defined to use __stdcall |
_DllMainCRTStartup | A DLL; calls DllMain if it exists, which must be defined to use __stdcall |
If the /DLL or /SUBSYSTEM option is not specified, the linker selects a subsystem and entry point depending on whether main or WinMain is defined.
The functions main, WinMain, and DllMain are the three forms of the user-defined entry point.
Dynamic-Link Library Entry-Point Function
DllMain Callback Function
OUCH: mainCRTStartup(), wmainCRTStartup(), WinMainCRTStartup() and wWinMainCRTStartup(), the 4 entry point functions for applications, use only the __cdecl calling and naming convention; they take the address of the Process Environment Block as argument and return a 32-bit integer as exit code of the thread respectively the process.
Processes and Threads
The MSDN article Format of a C Decorated Name specifies:
The form of decoration for a C function depends on the calling convention used in its declaration, as shown below. Note that in a 64-bit environment, functions are not decorated.
Calling convention | Decoration |
---|---|
__cdecl (the default) | Leading underscore (_) |
__stdcall | Leading underscore (_) and a trailing at sign (@) followed by a number representing the number of bytes in the parameter list |
__fastcall | Same as __stdcall, but prepended by an at sign instead of an underscore |
__vectorcall | Two trailing at signs (@@) followed by the decimal number of bytes in the parameter list. |
The MSDN articles __cdecl and __stdcall provide more details.
__fastcall
__vectorcall
Calling Conventions
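The decoration rules quoted above can be expressed in a few lines of portable C. This is only an illustrative sketch: the enumeration callconv and the function decorate() are hypothetical names, and the sketch covers C functions on the i386 platform only, since names are not decorated in a 64-bit environment:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum callconv { CC_CDECL, CC_STDCALL, CC_FASTCALL, CC_VECTORCALL };

/* Writes the decorated i386 name of a C function into buffer and returns
   buffer; bytes is the size of the parameter list (ignored for __cdecl). */
static char *decorate(char *buffer, size_t size, const char *name,
                      enum callconv cc, unsigned bytes)
{
    /* room for the name, up to 2 at signs, up to 10 digits and the NUL */
    assert(buffer != NULL && name != NULL && size > strlen(name) + 13);

    switch (cc) {
    case CC_STDCALL:    /* _name@bytes */
        snprintf(buffer, size, "_%s@%u", name, bytes);
        break;
    case CC_FASTCALL:   /* @name@bytes */
        snprintf(buffer, size, "@%s@%u", name, bytes);
        break;
    case CC_VECTORCALL: /* name@@bytes */
        snprintf(buffer, size, "%s@@%u", name, bytes);
        break;
    default:            /* __cdecl: _name */
        snprintf(buffer, size, "_%s", name);
        break;
    }
    return buffer;
}
```

With these rules, a __cdecl function mainCRTStartup becomes _mainCRTStartup, while a __stdcall function _DllMainCRTStartup with 12 bytes of parameters becomes __DllMainCRTStartup@12.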
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
typedef unsigned short wchar_t;
#ifdef CONSOLE
#ifdef UNICODE
int wmain(int argc, wchar_t *argv[], wchar_t *envp[])
#else
int main(int argc, char *argv[], char *envp[])
#endif
{
return *envp != argv[argc];
}
#else // WINDOWS
#ifdef UNICODE
int wWinMain(void *current, void *previous, wchar_t *cmdline, int show)
#else
int WinMain(void *current, void *previous, char *cmdline, int show)
#endif
{
return cmdline[current == previous] != show;
}
#endif // WINDOWS
Save the ANSI C source presented above as i386-sys.c in an arbitrary, preferably empty directory, then execute the following 5 command lines to compile and (attempt to) link it:
SET CL=/W4 /X /Zl
CL.EXE /DUNICODE /Gz i386-sys.c
CL.EXE /Gz i386-sys.c
CL.EXE /DCONSOLE /Gd i386-sys.c
CL.EXE /DCONSOLE /DUNICODE /Gd i386-sys.c
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-sys.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/out:i386-sys.exe
i386-sys.obj
LINK : error LNK2001: unresolved external symbol _wWinMainCRTStartup
i386-sys.exe : fatal error LNK1120: 1 unresolved externals
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-sys.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/out:i386-sys.exe
i386-sys.obj
LINK : error LNK2001: unresolved external symbol _WinMainCRTStartup
i386-sys.exe : fatal error LNK1120: 1 unresolved externals
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-sys.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/out:i386-sys.exe
i386-sys.obj
LINK : error LNK2001: unresolved external symbol _mainCRTStartup
i386-sys.exe : fatal error LNK1120: 1 unresolved externals
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-sys.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/out:i386-sys.exe
i386-sys.obj
LINK : error LNK2001: unresolved external symbol _wmainCRTStartup
i386-sys.exe : fatal error LNK1120: 1 unresolved externals
OUCH: the linker expects the 4 entry point functions for applications, mainCRTStartup(), wmainCRTStartup(), WinMainCRTStartup() and wWinMainCRTStartup(), to be defined using the __cdecl naming convention, i.e. without the decoration appended to the name of functions defined using the __stdcall naming convention!
; Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.386
.model flat, C
.code
main proc public
assume fs :flat ; fs = address of TEB
mov eax, fs:[48] ; eax = address of PEB
xor eax, [esp+4] ; eax = 0
ret
main endp
end main ; writes "/ENTRY:main" to '.drectve' section
Save the i386 assembler source presented above as i386-sys.asm in an arbitrary, preferably empty directory, then execute the following 5 command lines to build the console application i386-sys.exe, execute it and display its exit code:
SET ML=/safeseh /W3 /X
SET LINK=
ML.EXE i386-sys.asm
.\i386-sys.exe
ECHO %ERRORLEVEL%
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: i386-sys.asm
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/OUT:i386-sys.exe
i386-sys.obj
0
; Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
.code
wmain proc public
; gs = address of TEB
; rcx = address of PEB
xor eax, eax ; rax = 0
cmp rcx, gs:[96]
sete al ; rax = 1
ret
wmain endp
end
Save the AMD64 assembler source presented above as amd64-sys.asm in an arbitrary, preferably empty directory, then execute the following 5 command lines to build the console application amd64-sys.exe, execute it and display its exit code:
SET ML=/W3 /X
SET LINK=/ENTRY:wmain
ML64.EXE amd64-sys.asm
.\amd64-sys.exe
ECHO %ERRORLEVEL%
Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: amd64-sys.asm
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/ENTRY:wmain
/OUT:amd64-sys.exe
amd64-sys.obj
1
main()
and wmain()
functions,
their arguments, how to parse the command line returned from the
GetCommandLine()
function and how to split the environment block returned from the
GetEnvironmentStrings()
function to derive these arguments.
Using wmain instead of main
Argument Definitions
main Function Restrictions
Parsing C Command-Line Arguments
WinMain function
Note: the
CommandLineToArgvW()
function uses the same algorithm, but supports only
UTF-16LE
encoding.
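The backslash and double quote rules for command lines are easiest to follow on a small example. The following is a minimal portable sketch, not the startup code of this article: the function split_cmdline() and its parameters are hypothetical, and it treats the program name like any other argument. It assumes these rules: unquoted spaces and tabs separate arguments, 2n backslashes before a double quote yield n backslashes and toggle the quoted state, 2n+1 backslashes before a double quote yield n backslashes and a literal double quote, and two consecutive double quotes inside quotes yield one literal double quote.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits cmd into at most max arguments, which are copied into buf;
   buf must provide room for strlen(cmd) + 1 characters.
   Returns the number of arguments stored in argv[]. */
static int split_cmdline(const char *cmd, char *buf, size_t size,
                         char *argv[], int max)
{
    int argc = 0;
    int quoted = 0;

    assert(size > strlen(cmd));
    for (;;) {
        while (*cmd == ' ' || *cmd == '\t')     /* skip unquoted whitespace */
            cmd++;
        if (*cmd == '\0' || argc == max)
            break;
        argv[argc++] = buf;
        while (*cmd != '\0' && (quoted || (*cmd != ' ' && *cmd != '\t'))) {
            size_t slashes = 0;

            while (*cmd == '\\') {              /* count run of backslashes */
                cmd++;
                slashes++;
            }
            if (*cmd == '"') {
                size_t keep = slashes / 2;      /* keep half of them */

                while (keep-- > 0)
                    *buf++ = '\\';
                if (slashes & 1) {              /* odd: literal double quote */
                    *buf++ = *cmd++;
                } else if (quoted && cmd[1] == '"') {
                    *buf++ = '"';               /* "" inside quotes: one " */
                    cmd += 2;
                } else {                        /* even: toggle quoted state */
                    quoted = !quoted;
                    cmd++;
                }
            } else {
                while (slashes-- > 0)           /* backslashes are literal */
                    *buf++ = '\\';
                if (*cmd != '\0' && (quoted || (*cmd != ' ' && *cmd != '\t')))
                    *buf++ = *cmd++;            /* regular character */
            }
        }
        *buf++ = '\0';                          /* terminate argument */
    }
    return argc;
}
```

For example, the command line `x "c:\dir\\" y` yields the three arguments `x`, `c:\dir\` and `y`: the single backslash before `d` is literal, while the two backslashes before the closing double quote collapse into one.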
The following
ANSI C
program provides the glue
between the
mainCRTStartup()
or wmainCRTStartup()
entry point functions and the main()
or
wmain()
functions:
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#pragma comment(lib, "kernel32")
#pragma comment(user, "(C)opyright 2004-2025, Stefan Kanthak")
extern int main(int argc, char const *argv[], char const *envp[]);
extern int wmain(int argc, wchar_t const *argv[], wchar_t const *envp[]);
__declspec(noreturn)
__declspec(safebuffers)
VOID CDECL mainCRTStartup(VOID)
{
LPSTR lpArgument;
LPCSTR lpCmdLine = GetCommandLineA();
LPCSTR lpBlock = GetEnvironmentStringsA();
LPCSTR lpCount = lpCmdLine;
UINT uiCount = 0U;
UINT uiQuote = 0U;
UINT argc;
LPCSTR *argv;
LPCSTR *envp;
argc = (*lpCount != ' ' && *lpCount != '\t');
while (*lpCount != '\0') // count arguments
{
if (!uiQuote // whitespace outside double quotes?
&& (*lpCount == ' ' || *lpCount == '\t'))
{
do // skip unquoted whitespace
lpCount++;
while (*lpCount == ' ' || *lpCount == '\t');
argc += (*lpCount != '\0');
uiCount = 0U;
continue;
}
else if (*lpCount == '\\')
uiCount ^= ~0U;
else if (!uiCount // unescaped double quote?
&& *lpCount == '"')
uiQuote ^= ~0U;
else // regular character
uiCount = 0U;
lpCount++;
}
if (uiQuote) // unpaired double quote?
SetLastError(ERROR_BAD_ARGUMENTS);
argv = (LPCSTR *) _alloca((argc + 1U) * sizeof(*argv) + (lpCount + 1U - lpCmdLine) * sizeof(**argv));
argv[0] = lpArgument = (LPSTR) (argv + argc + 1U);
argc = uiCount = uiQuote = 0U;
while (*lpCmdLine != '\0') // process arguments
{
if (!uiQuote // whitespace outside double quotes?
&& (*lpCmdLine == ' ' || *lpCmdLine == '\t'))
{ // terminate current argument
*lpArgument = '\0';
do // skip unquoted whitespace
lpCmdLine++;
while (*lpCmdLine == ' ' || *lpCmdLine == '\t');
if (*lpCmdLine != '\0')
// store address of next argument
argv[++argc] = lpArgument = (LPSTR) lpCmdLine;
uiCount = 0U;
}
else if (*lpCmdLine == '\\')
{
*lpArgument++ = *lpCmdLine++;
uiCount++; // count backslash
}
else if (*lpCmdLine == '"')
{
lpArgument -= (uiCount + 1U) / 2U;
if (uiCount & 1U)
// double quote preceded by odd number
// of backslashes: keep half of them
// and the (escaped) double quote
*lpArgument++ = *lpCmdLine++;
else // double quote preceded by even number
// of backslashes: keep half of them
if (*++lpCmdLine == '"' && uiQuote)
// double quote inside double quotes and
// followed by another double quote:
// keep one double quote
*lpArgument++ = *lpCmdLine++;
else // skip double quote and toggle state
uiQuote ^= ~0U;
uiCount = 0U;
}
else // regular character
{
*lpArgument++ = *lpCmdLine++;
uiCount = 0U;
}
}
*lpArgument = '\0'; // terminate (last) argument
argv[++argc] = NULL; // store terminating NULL pointer
envp = argv + argc;
if (lpBlock != NULL)
{
for (uiCount = 0U, // count environment strings
lpCount = lpBlock; *lpCount != '\0'; lpCount += strlen(lpCount) + 1U)
if (*lpCount != '=')
uiCount++;
if (uiCount > 0U) // process environment strings
{
envp = (LPCSTR *) _alloca((uiCount + 1U) * sizeof(*envp));
for (uiCount = 0U,
lpCount = lpBlock; *lpCount != '\0'; lpCount += strlen(lpCount) + 1U)
if (*lpCount != '=')
envp[uiCount++] = lpCount;
envp[uiCount] = (LPCSTR) NULL;
}
}
ExitProcess(main(argc, argv, envp));
}
__declspec(noreturn)
__declspec(safebuffers)
VOID CDECL wmainCRTStartup(VOID)
{
LPWSTR lpArgument;
LPCWSTR lpCmdLine = GetCommandLineW();
LPCWSTR lpBlock = GetEnvironmentStringsW();
LPCWSTR lpCount = lpCmdLine;
UINT uiCount = 0U;
UINT uiQuote = 0U;
UINT argc;
LPCWSTR *argv;
LPCWSTR *envp;
argc = (*lpCount != L' ' && *lpCount != L'\t');
while (*lpCount != L'\0') // count arguments
{
if (!uiQuote // whitespace outside double quotes?
&& (*lpCount == L' ' || *lpCount == L'\t'))
{
do // skip unquoted whitespace
lpCount++;
while (*lpCount == L' ' || *lpCount == L'\t');
argc += (*lpCount != L'\0');
uiCount = 0U;
continue;
}
else if (*lpCount == L'\\')
uiCount ^= ~0U;
else if (!uiCount // unescaped double quote?
&& *lpCount == L'"')
uiQuote ^= ~0U;
else // regular character
uiCount = 0U;
lpCount++;
}
if (uiQuote) // unpaired double quote?
SetLastError(ERROR_BAD_ARGUMENTS);
argv = (LPCWSTR *) _alloca((argc + 1U) * sizeof(*argv) + (lpCount + 1U - lpCmdLine) * sizeof(**argv));
argv[0] = lpArgument = (LPWSTR) (argv + argc + 1U);
argc = uiCount = uiQuote = 0U;
while (*lpCmdLine != L'\0') // process arguments
{
if (!uiQuote // whitespace outside double quotes?
&& (*lpCmdLine == L' ' || *lpCmdLine == L'\t'))
{ // terminate current argument
*lpArgument = L'\0';
do // skip unquoted whitespace
lpCmdLine++;
while (*lpCmdLine == L' ' || *lpCmdLine == L'\t');
if (*lpCmdLine != L'\0')
// store address of next argument
argv[++argc] = lpArgument = (LPWSTR) lpCmdLine;
uiCount = 0U;
}
else if (*lpCmdLine == L'\\')
{
*lpArgument++ = *lpCmdLine++;
uiCount++; // count backslash
}
else if (*lpCmdLine == L'"')
{
lpArgument -= (uiCount + 1U) / 2U;
if (uiCount & 1U)
// double quote preceded by odd number
// of backslashes: keep half of them
// and the (escaped) double quote
*lpArgument++ = *lpCmdLine++;
else // double quote preceded by even number
// of backslashes: keep half of them
if (*++lpCmdLine == L'"' && uiQuote)
// double quote inside double quotes and
// followed by another double quote:
// keep one double quote
*lpArgument++ = *lpCmdLine++;
else // skip double quote and toggle state
uiQuote ^= ~0U;
uiCount = 0U;
}
else // regular character
{
*lpArgument++ = *lpCmdLine++;
uiCount = 0U;
}
}
*lpArgument = L'\0'; // terminate (last) argument
argv[++argc] = NULL; // store terminating NULL pointer
envp = argv + argc;
if (lpBlock != NULL)
{
for (uiCount = 0U, // count environment strings
lpCount = lpBlock; *lpCount != L'\0'; lpCount += wcslen(lpCount) + 1U)
if (*lpCount != L'=')
uiCount++;
if (uiCount > 0U) // process environment strings
{
envp = (LPCWSTR *) _alloca((uiCount + 1U) * sizeof(*envp));
for (uiCount = 0U,
lpCount = lpBlock; *lpCount != L'\0'; lpCount += wcslen(lpCount) + 1U)
if (*lpCount != L'=')
envp[uiCount++] = lpCount;
envp[uiCount] = (LPCWSTR) NULL;
}
}
ExitProcess(wmain(argc, argv, envp));
}
Note: the mainCRTStartup()
function
allocates up to 32768 bytes for the command line plus
16384×4 bytes (32-bit platforms) or 16384×8 bytes
(64-bit platforms) for the argv[]
array on the stack,
i.e. at most 96 kiB on 32-bit platforms and 160 kiB on
64-bit platforms; the wmainCRTStartup()
function
allocates up to 65536 bytes for the command line plus 16384×4
bytes (32-bit platforms) or 16384×8 bytes (64-bit platforms)
for the argv[]
array on the stack, i.e. at most
128 kiB on 32-bit platforms and 192 kiB on 64-bit
platforms.
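The figures in this note follow from simple arithmetic: the copy of the command line plus one pointer per argv[] slot. The helper below is only a sketch to check them; startup_stack_bytes() is a hypothetical name, and like the note it leaves the separately allocated envp[] array out of account:

```c
#include <assert.h>
#include <stddef.h>

/* Worst-case stack usage in bytes: the copy of the command line plus
   one pointer per argv[] slot. */
static size_t startup_stack_bytes(size_t cmdline_bytes, size_t argv_slots,
                                  size_t pointer_bytes)
{
    assert(pointer_bytes == 4 || pointer_bytes == 8);
    return cmdline_bytes + argv_slots * pointer_bytes;
}
```

For mainCRTStartup() on a 32-bit platform this gives 32768 + 16384×4 = 98304 bytes, i.e. 96 kiB; for wmainCRTStartup() on a 64-bit platform it gives 65536 + 16384×8 = 196608 bytes, i.e. 192 kiB.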
Save the ANSI C source presented above as sys.c in the directory where you created the object library i386.lib before, then execute the following 3 command lines to compile it and add the generated object file i386-sys.obj to the existing object library i386.lib:
SET CL=/c /GAFdy /J /Oxy /W4 /Zl
CL.EXE /Foi386-sys.obj sys.c
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-sys.obj
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
sys.c
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
Move the ANSI C source file sys.c into the directory where you created the object library amd64.lib before, then execute the following 3 command lines to compile it and add the generated object file amd64-sys.obj to the object library amd64.lib:
SET CL=/c /GAFy /J /Oxy /W4 /Zl
CL.EXE /Foamd64-sys.obj sys.c
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-sys.obj
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
sys.c
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
amd64.lib
respectively
i386.lib
before or instead of the
MSVCRT
libraries.
Option | Description |
---|---|
/LD | Creates a DLL. Passes the /DLL option to the linker. The linker looks for, but does not require, a DllMain function. If you do not write a DllMain function, the linker inserts a DllMain function that returns TRUE. Links the DLL startup code. Creates an import library (.lib), if an export (.exp) file is not specified on the command line. You link the import library to applications that call your DLL. Interprets /Fe (Name EXE File) as naming a DLL rather than an .exe file. By default, the program name becomes basename.dll instead of basename.exe. […] Implies /MT unless you explicitly specify /MD. |
Create the empty source file .c in an arbitrary, preferably empty directory, then compile and (attempt to) link it with the object library msvcrt.lib against the Visual C runtime DLL:
COPY NUL: .c
SET CL=/LD /W4 /X
SET LINK=/MACHINE:I386 /MAP /OPT:ICF,REF
CL.EXE /MD .c
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
.c(1) : warning C4206: nonstandard extension used : translation unit is empty
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/MACHINE:I386 /MAP /OPT:ICF,REF
/out:.dll
/dll
/implib:.lib
.obj
LINK : error LNK2001: unresolved external symbol __DllMainCRTStartup@12
.dll : fatal error LNK1120: 1 unresolved externals
OUCH: the combined import and object library msvcrt.lib shipped with the Visual C compiler does not provide the entry point function _DllMainCRTStartup() required to build DLLs!
Repeat the last command line without the compiler option /MD to link with the object library libcmt.lib shipped with the Visual C compiler, then display the size of the generated empty DLL .dll:
CL.EXE .c
DIR .dll
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
.c(1) : warning C4206: nonstandard extension used : translation unit is empty
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/MACHINE:I386 /MAP /OPT:ICF,REF
/out:.dll
/dll
/implib:.lib
.obj
Volume in drive C has no label.
Volume Serial Number is 1957-0427
Directory of C:\Users\Stefan\Desktop
04/27/2015 08:15 PM 32,256 .dll
1 File(s) 32,256 bytes
0 Dir(s) 9,876,543,210 bytes free
OOPS: an empty DLL is 31.5 kiB (in words: thirty-one and a half kilobyte) small!
Note: the inspection of the generated text file
.map
to determine what the linker included is left as
an exercise to the reader.
Note: the corresponding demonstration for console applications with empty main() and wmain() functions as well as Windows applications with empty WinMain() and wWinMain() functions is also left as an exercise to the reader.
Note: a repetition of these demonstrations using the 64-bit build environment is left as an exercise to the reader too.
.CRT
Section Usage.CRT
section.
The following table shows how the Visual C
compiler and its runtime use the .CRT
section:
Section$Group | Public Name | Purpose and Usage |
---|---|---|
.CRT$XCA | __xc_a | NULL pointer before array of C++ constructor and initialiser function pointers. |
.CRT$XCAA | | pre_cpp_init() function pointer. |
.CRT$XCU | | Dynamic initialiser function pointers. |
.CRT$XCZ | __xc_z | Terminating NULL pointer after array of C++ constructor and initialiser function pointers. |
.CRT$XDA | __xd_a | NULL pointer before array of C++ TLS initialiser callback function pointers. |
.CRT$XDC | | C++ TLS initialiser callback function pointers. |
.CRT$XDL | | C++ TLS initialiser callback function pointers. |
.CRT$XDU | | C++ TLS initialiser callback function pointers. |
.CRT$XDZ | __xd_z | Terminating NULL pointer after array of C++ TLS initialiser callback function pointers. |
.CRT$XIA | __xi_a | NULL pointer before array of C initialiser function pointers. |
.CRT$XIAA | | pre_c_init() and _mixed_pre_c_init() function pointers. |
.CRT$XIC | | __initmbctable(), __initstdio(), __inittime(), __lconv_init() and __onexitinit() function pointers. |
.CRT$XID | | __set_emptyinvalidparamhandler(), __set_loosefpmath() and _InitCPLocHash() function pointers. |
.CRT$XIY | | __CxxSetUnhandledExceptionFilter() function pointer. |
.CRT$XIZ | __xi_z | Terminating NULL pointer after array of C initialiser function pointers. |
.CRT$XLA | __xl_a | NULL pointer before array of TLS callback function pointers. |
.CRT$XLC | | __dyn_tls_dtor() function pointer. |
.CRT$XLD | | __dyn_tls_init() function pointer. |
.CRT$XLZ | __xl_z | Terminating NULL pointer after array of TLS callback function pointers. |
.CRT$XPA | __xp_a | NULL pointer before array of C pre-termination function pointers. |
.CRT$XPB | | _concrt_static_cleanup() function pointer. |
.CRT$XPX | | __termconin(), __termconout(), _locterm() and _rmtmp() function pointers. |
.CRT$XPXA | | __endstdio() function pointer. |
.CRT$XPZ | __xp_z | Terminating NULL pointer after array of C pre-termination function pointers. |
.CRT$XTA | __xt_a | NULL pointer before array of C termination function pointers. |
.CRT$XTZ | __xt_z | Terminating NULL pointer after array of C termination function pointers. |
CAVEAT: all symbols have global scope and pollute the name space without necessity!
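Each pair of symbols in the table brackets an array of function pointers that the startup or termination code walks from the first to the last entry, skipping NULL pointers. The following portable sketch only illustrates that walk: run_table(), the table[] array and the two initialiser functions are hypothetical, and an ordinary array stands in for the array which the linker assembles from the alphabetically sorted .CRT$XI* groups:

```c
#include <assert.h>
#include <stddef.h>

typedef void (*init_fn)(void);

static int calls = 0;

static void init_one(void) { calls += 1; }
static void init_two(void) { calls += 10; }

/* Hypothetical table: in a real image the linker builds it from the
   section groups; here it is an ordinary array. */
static init_fn table[] = {
    NULL,       /* like __xi_a in .CRT$XIA */
    init_one,   /* like an entry in .CRT$XIC */
    NULL,       /* empty slots are skipped */
    init_two,   /* like an entry in .CRT$XIY */
    NULL        /* like __xi_z in .CRT$XIZ */
};

/* Calls every non-NULL function pointer in [first, last). */
static void run_table(init_fn *first, init_fn *last)
{
    assert(first != NULL && first <= last);
    for (; first < last; first++)
        if (*first != NULL)
            (*first)();
}
```

Because the bracketing entries are NULL pointers, the walk can start at the first symbol and stop at the second without treating them specially.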
.rtc
Section Usage_RTC_Initialize()
_RTC_Terminate()
The following table shows how the Visual C
compiler and its runtime use the .rtc
section:
Section$Group | Public Name | Purpose and Usage |
---|---|---|
.rtc$IAA | __rtc_iaa | NULL pointer before array of RTC initialisation function pointers. |
.rtc$IZZ | __rtc_izz | Terminating NULL pointer after array of RTC initialisation function pointers. |
.rtc$TAA | __rtc_taa | NULL pointer before array of RTC termination function pointers. |
.rtc$TZZ | __rtc_tzz | Terminating NULL pointer after array of RTC termination function pointers. |
CAVEAT: all symbols have global scope and pollute the name space without necessity!
The web service is operated and provided by
Telekom Deutschland GmbH The web service provider stores a session cookie
in the web
browser and records every visit of this web site with the following
data in an access log on their server(s):