 
        
        
             
        
        
             
        
__chkstk Routine
_alloca Routine
_chkstk Routine
_allmul Routine
_alldiv Routine
_alldvrm Routine
_allrem Routine
_aulldiv Routine
_aulldvrm Routine
_aullrem Routine
_aullshr Routine
_allshl Routine
_allshr Routine
_all* and _aull* Routines in Leaked Source
_all* and _aull* Routines
_rotl64() and _rotr64() Intrinsic Functions for i386 Platform
_allrol() and _allror() Functions in i386 Assembler
_abs64() Intrinsic Function for i386 Platform
_allabs() Function in i386 Assembler
_allneg() Function in i386 Assembler
_allsgn() Function in i386 Assembler
_allcmp() and _aullcmp() Functions in i386 Assembler
_allmax() and _aullmax() Functions in i386 Assembler
_allmin() and _aullmin() Functions in i386 Assembler
acos(), asin(), atan(), atan2(), cos(), cosh(), exp(), fmod(), log(), log10(), pow(), sin(), sinh(), sqrt(), tan() and tanh() Standard Functions for i386 Platform
_CI* and _ftol* Routines
memchr() Standard Function for i386 Platform
Smart Implementation in i386 Assembler
Smart Implementation in AMD64 Assembler
mem*() Standard Functions
memcpy() and memset() with Intrinsic Functions
strchr() Standard Function for i386 Platform
strlen() Standard Function for i386 Platform
strrchr() and strstr() Standard Functions for i386 Platform
str*() Standard Functions
wcs*() Standard Functions
_load_config_used and __security_check_cookie() Function (/GS Support)
main() and wmain() Support
.CRT Section Usage
.rtc Section Usage

portable executable image files, i.e. applications, DLLs and (kernel) drivers.
Additionally, present properly optimised implementations which are several times faster, especially for 64÷64-bit integer division on the i386 platform.
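 For code running on the AMD64 alias x64
            processor architecture, the Microsoft
            Visual C compiler generates calls to the (almost)
            undocumented helper routine
            __chkstk
            for memory allocations on the stack, and to the standard functions
            memcpy()
            and
            memset()
            for assignment and initialisation of arrays and structures.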
         For code running on the i386 alias x86
            processor architecture, the Microsoft
            Visual C compiler generates calls to the (almost)
            undocumented helper routines
            _alloca
            and
            _chkstk
            for memory allocations on the stack, to the standard functions
            memcpy()
            and
            memset()
            for assignment and initialisation of arrays and structures, to the
            (almost) undocumented helper routines
            _alldiv,
            _alldvrm,
            _allmul,
            _allrem, _allshl and
            _allshr for signed 64-bit integer division,
            multiplication and shift operations, to the likewise (almost)
            undocumented helper routines
            _aulldiv,
            _aulldvrm, _aullrem and
            _aullshr for unsigned 64-bit integer division
            and shift operations, and to the helper routines
            _CIacos, _CIasin,
            _CIatan,
            _CIatan2,
            _CIcos,
            _CIcosh,
            _CIexp,
            _CIfmod,
            _CIlog,
            _CIlog10,
            _CIpow,
            _CIsin,
            _CIsinh,
            _CIsqrt,
            _CItan,
            _CItanh and
            _ftol
            for transcendental and other floating-point functions and for
            the conversion of floating-point numbers to integers; see the
            MSDN article Internal CRT globals and functions.
        
 Note: except for the mem*() and
            str*() standard functions, all helper routines use a
            non-standard
            calling or naming convention
            and can’t be called from C or C++
            sources by their name!
        
 These routines are provided in the object file
            chkstk.obj, the object libraries
            libcmt.lib,
            libcmtd.lib, msvcrt.lib and
            msvcrtd.lib, partially also in
            runtmchk.lib;
            their i386 assembler sources are provided in the files
            alloca16.asm, chkstk.asm,
            lldiv.asm, lldvrm.asm,
            llmul.asm, llrem.asm,
            llshl.asm, llshr.asm,
            ulldiv.asm, ulldvrm.asm,
            ullrem.asm, ullshr.asm,
            memchr.asm, memcpy.asm,
            memset.asm, strchr.asm,
            strlen.asm etc., and their
            ANSI C
            sources are provided in the files strcat.c,
            strrchr.c, strstr.c etc.
        
 Note: many of these routines are exported from
            NTDLL.dll.
        
 Import libraries
            amd64.lib and i386.lib can be generated
            with the following 2 command lines:
        
LINK.EXE /LIB /DEF /EXPORT:__C_specific_handler /EXPORT:__chkstk /EXPORT:atoi /EXPORT:atol /EXPORT:isalnum /EXPORT:isalpha /EXPORT:iscntrl /EXPORT:isdigit /EXPORT:isgraph /EXPORT:islower /EXPORT:isprint /EXPORT:ispunct /EXPORT:isspace /EXPORT:isupper /EXPORT:isxdigit /EXPORT:iswalpha /EXPORT:iswctype /EXPORT:iswdigit /EXPORT:iswlower /EXPORT:iswspace /EXPORT:iswxdigit /EXPORT:memchr /EXPORT:memcmp /EXPORT:memcpy /EXPORT:memmove /EXPORT:memset /EXPORT:qsort /EXPORT:strcat /EXPORT:strcat_s /EXPORT:strchr /EXPORT:strcmp /EXPORT:strcpy /EXPORT:strcpy_s /EXPORT:strcspn /EXPORT:strlen /EXPORT:strncat /EXPORT:strncat_s /EXPORT:strncmp /EXPORT:strncpy /EXPORT:strncpy_s /EXPORT:strnlen /EXPORT:strpbrk /EXPORT:strrchr /EXPORT:strspn /EXPORT:strstr /EXPORT:strtok_s /EXPORT:strtol /EXPORT:strtoul /EXPORT:tolower /EXPORT:toupper /EXPORT:towlower /EXPORT:towupper /EXPORT:wcscat /EXPORT:wcscat_s /EXPORT:wcschr /EXPORT:wcscmp /EXPORT:wcscpy /EXPORT:wcscpy_s /EXPORT:wcscspn /EXPORT:wcslen /EXPORT:wcsncat /EXPORT:wcsncat_s /EXPORT:wcsncmp /EXPORT:wcsncpy /EXPORT:wcsncpy_s /EXPORT:wcsnlen /EXPORT:wcspbrk /EXPORT:wcsspn /EXPORT:wcsstr /EXPORT:wcstol /EXPORT:wcstoul /MACHINE:AMD64 /NAME:NTDLL /NODEFAULTLIB /OUT:amd64.lib
LINK.EXE /LIB /DEF /EXPORT:_CIcos /EXPORT:_CIlog /EXPORT:_CIpow /EXPORT:_CIsin /EXPORT:_CIsqrt /EXPORT:_alldiv /EXPORT:_alldvrm /EXPORT:_allmul /EXPORT:_alloca_probe /EXPORT:_alloca_probe_8 /EXPORT:_alloca_probe_16 /EXPORT:_allrem /EXPORT:_allshl /EXPORT:_allshr /EXPORT:_aulldiv /EXPORT:_aulldvrm /EXPORT:_aullrem /EXPORT:_aullshr /EXPORT:_chkstk /EXPORT:_fltused /EXPORT:_ftol /EXPORT:atoi /EXPORT:atol /EXPORT:isalnum /EXPORT:isalpha /EXPORT:iscntrl /EXPORT:isdigit /EXPORT:isgraph /EXPORT:islower /EXPORT:isprint /EXPORT:ispunct /EXPORT:isspace /EXPORT:isupper /EXPORT:isxdigit /EXPORT:iswalpha /EXPORT:iswctype /EXPORT:iswdigit /EXPORT:iswlower /EXPORT:iswspace /EXPORT:iswxdigit /EXPORT:memchr /EXPORT:memcmp /EXPORT:memcpy /EXPORT:memmove /EXPORT:memset /EXPORT:qsort /EXPORT:strcat /EXPORT:strcat_s /EXPORT:strchr /EXPORT:strcmp /EXPORT:strcpy /EXPORT:strcpy_s /EXPORT:strcspn /EXPORT:strlen /EXPORT:strncat /EXPORT:strncat_s /EXPORT:strncmp /EXPORT:strncpy /EXPORT:strncpy_s /EXPORT:strnlen /EXPORT:strpbrk /EXPORT:strrchr /EXPORT:strspn /EXPORT:strstr /EXPORT:strtok_s /EXPORT:strtol /EXPORT:strtoul /EXPORT:tolower /EXPORT:toupper /EXPORT:towlower /EXPORT:towupper /EXPORT:wcscat /EXPORT:wcscat_s /EXPORT:wcschr /EXPORT:wcscmp /EXPORT:wcscpy /EXPORT:wcscpy_s /EXPORT:wcscspn /EXPORT:wcslen /EXPORT:wcsncat /EXPORT:wcsncat_s /EXPORT:wcsncmp /EXPORT:wcsncpy /EXPORT:wcsncpy_s /EXPORT:wcsnlen /EXPORT:wcspbrk /EXPORT:wcsspn /EXPORT:wcsstr /EXPORT:wcstol /EXPORT:wcstoul /MACHINE:I386 /NAME:NTDLL /NODEFAULTLIB /OUT:i386.lib

For details and reference see the MSDN article LIB Reference.

Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
   Creating library amd64.lib and object amd64.exp
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
   Creating library i386.lib and object i386.exp

Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
__chkstk Routine

 The documentation for the __chkstk routine states:

    Called by the compiler when you have more than one page of local variables in your function.

    Remarks
    __chkstk Routine is a helper routine for the C compiler. For x86 compilers, __chkstk Routine is called when the local variables exceed 4K bytes; for x64 compilers it is 8K.
    This function is not defined in an SDK header and must be declared by the caller. This function is exported from kernelbase.dll.

OUCH¹: contrary to the statement "for x64 compilers it is 8K", the correct number for AMD64 alias x64 processors is 4k too, not 8k!
OUCH²: contrary to the statement "must be declared by the caller", which is complete and dangerous nonsense, this routine uses a non-standard calling convention; it must not be declared and can not be called from C or C++ sources by its name!
The MSDN article x64 Prolog and Epilog specifies:
            The __chkstk helper will not modify any registers other
            than R10, R11, and the condition codes. In particular, it will
            return RAX unchanged and leave all nonvolatile registers and
            argument-passing registers unmodified.
        
            Note: the Visual C compiler calls
            it through the
            _alloca()
            intrinsic
            function, using register RAX for its argument.
            Overview of x64 Calling Conventions
            Register Usage
            Types and Storage
            Scalar Types
            Aggregates and Unions
            Examples of Structure Alignment
            Bitfields
            Calling Convention
            Parameter Passing
            Varargs
            Unprototyped Functions
            
            Caller/Callee Saved Registers
            Function Pointers
            Legacy Floating-Point Support
            FPSCR
            MXCSR
            Stack Usage
         Start the command prompt of the Visual C
            development environment for the AMD64 platform, then
            execute the following 3 command lines to locate the object file
            chkstk.obj and display its disassembly:
        
FOR %? IN (chkstk.obj) DO SET chkstk=%~$LIB:?
DIR "%chkstk%"
LINK.EXE /DUMP /DISASM "%chkstk%"
SET chkstk=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\amd64\chkstk.obj
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\amd64
02/18/2011  03:08 PM             1,922 chkstk.obj
               1 File(s)          1,922 bytes
               0 Dir(s)    9,876,543,210 bytes free
Microsoft (R) COFF/PE Dumper Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
Dump of file C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\amd64\chkstk.obj
File Type: COFF OBJECT
$$000000:
  0000000000000000: CC                 int         3
  0000000000000001: CC                 int         3
  0000000000000002: CC                 int         3
  0000000000000003: CC                 int         3
  0000000000000004: CC                 int         3
  0000000000000005: CC                 int         3
  0000000000000006: 66 66 0F 1F 84 00  nop         word ptr [rax+rax+00000000h]
                    00 00 00 00
__chkstk:
  0000000000000010: 48 83 EC 10        sub         rsp,10h
  0000000000000014: 4C 89 14 24        mov         qword ptr [rsp],r10
  0000000000000018: 4C 89 5C 24 08     mov         qword ptr [rsp+8],r11
  000000000000001D: 4D 33 DB           xor         r11,r11
  0000000000000020: 4C 8D 54 24 18     lea         r10,[rsp+18h]
  0000000000000020: 4C 8D 54 24 08     lea         r10,[rsp+8]
  0000000000000025: 4C 2B D0           sub         r10,rax
  0000000000000028: 4D 0F 42 D3        cmovb       r10,r11
  000000000000002C: 65 4C 8B 1C 25 10  mov         r11,qword ptr gs:[00000010h]
                    00 00 00
  000000000000002C: 65 4D 8B 5B 10     mov         r11,qword ptr gs:[r11+10h]
  0000000000000035: 4D 3B D3           cmp         r10,r11
  0000000000000038: 73 16              jae         cs20
  000000000000003A: 66 41 81 E2 00 F0  and         r10w,0F000h
cs10:
  0000000000000040: 4D 8D 9B 00 F0 FF  lea         r11,[r11+FFFFF000h]
                    FF
  0000000000000047: 41 C6 03 00        mov         byte ptr [r11],0
  0000000000000047: 4D 85 1B           test        qword ptr [r11],r11
  000000000000004B: 4D 3B D3           cmp         r10,r11
  000000000000004E: 75 F0              jne         cs10
  000000000000004E: 72 F0              jnae        cs10
cs20:
  0000000000000050: 4C 8B 14 24        mov         r10,qword ptr [rsp]
  0000000000000054: 4C 8B 5C 24 08     mov         r11,qword ptr [rsp+8]
  0000000000000059: 48 83 C4 10        add         rsp,10h
  000000000000005D: C3                 ret
  Summary
           0 .data
         3A8 .debug$S
          70 .debug$T
           C .pdata
          5E .text
           8 .xdata
            19 (plus 7) instructions in 78 (plus 16) bytes.
         OUCH¹: the __chkstk routine saves
            and restores the volatile registers
            R10 and R11 without necessity, and
            very clumsily too; instead of using 2
            PUSH
            plus 2 POP
            instructions with just 8 bytes, it decrements respectively
            increments the stack pointer with SUB
            and ADD instructions and writes
            respectively reads the stack with 4
            MOV instructions, wasting 26 bytes!
        
 OUCH²: replacing the deleted
            JNE instruction at
            address 0x4E with the inserted
            JNAE alias
            JB instruction makes
            the deleted AND
            instruction at address 0x3A superfluous and saves
            6 bytes!
        
 OUCH³: replacing the deleted
            MOV instruction at address 0x47 with the
            inserted TEST
            instruction avoids a superfluous memory write and
            saves 1 byte!
        
Note: 7 of the 19 instructions and 37 of the total 78 code bytes are completely superfluous, they only waste memory, processor cycles – and every user’s time!
 Oops: replacing the deleted
            MOV instruction at address 0x2C with the
            inserted one saves 4 more bytes!
        
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; * The software is provided "as is" without any warranty, neither express
;   nor implied.
; * In no event will the author be held liable for any damage(s) arising
;   from the use of the software.
; * Redistribution of the software is allowed only in unmodified form.
; * Permission is granted to use the software solely for personal private
;   and non-commercial purposes.
; * An individuals use of the software in his or her capacity or function
;   as an agent, (independent) contractor, employee, member or officer of
;   a business, corporation or organization (commercial or non-commercial)
;   does not qualify as personal private and non-commercial purpose.
; * Without written approval from the author the software must not be used
;   for a business, for commercial, corporate, governmental, military or
;   organizational purposes of any kind, or in a commercial, corporate,
;   governmental, military or organizational environment of any kind.
_nt_tib	struct	8		; thread information block
chain	qword	?		; address of first exception registration record
base	qword	?		; stack base
limit	qword	?		; stack limit
	qword	?		; address of subsystem thread information block
fiber	qword	?		; fiber data
pointer	qword	?		; arbitrary user pointer
self	qword	?		; address of _nt_tib
_nt_tib	ends
	.code
; MSC internal intrinsic _alloca() alias __chkstk():
; receives argument in rax
; NOTE: _alloca() must preserve rax and all argument registers;
;       it can raise 'stack overflow' exception!
;;	alias	<_alloca> = <__chkstk>
__chkstk proc	public		; qword __chkstk(qword size)
	mov	r10, gs:[_nt_tib.limit]
				; r10 = (current) stack limit
	lea	r11, [rsp+8]	; r11 = stack pointer of caller
	sub	r11, rax	; r11 = new stack pointer
	jnb	short limit
overflow:
	xor	r11, r11	; r11 = 0
probe:
	sub	r10, 4096	; r10 = address of guard page
	test	r10, [r10]	; r10 = new stack limit via 'guard page' exception
limit:
	cmp	r10, r11
	ja	short probe	; stack limit > new stack pointer?
	ret
__chkstk endp
	end

 Save the AMD64 assembler source presented above as
            chkstk.asm in an arbitrary, preferably empty directory,
            then execute the following 3 command lines to generate the object
            file chkstk.obj and put it into the new object library
            amd64.lib:

SET ML=/c /W3 /X
ML64.EXE chkstk.asm
LINK.EXE /LIB /OUT:amd64.lib chkstk.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.

Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.
 Assembling: chkstk.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
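 The probe loop of the AMD64 routine presented above can be modelled in C to see how many guard pages get touched for a given request. The following is only a sketch under stated assumptions: the function name count_probes, the parameter bottom (lowest address of the reserved stack region) and the return value -1 for 'stack overflow' are hypothetical, and decrementing a variable stands in for the guard-page exception through which the operating system lowers the stack limit.

```c
#include <assert.h>
#include <stdint.h>

#define PROBE_PAGE 4096u   /* page size used by the probe loop */

/* Hypothetical model, not the routine itself:
   limit  = current stack limit from the thread information block,
   bottom = lowest address of the reserved stack (assumption of this model),
   sp     = stack pointer of the caller, size = requested bytes.
   Returns the number of guard-page probes, or -1 for 'stack overflow'. */
long count_probes(uint64_t limit, uint64_t bottom, uint64_t sp, uint64_t size)
{
    uint64_t new_sp = (size > sp) ? 0 : sp - size;  /* 0 on wraparound */
    long probes = 0;

    assert(bottom <= limit && limit <= sp);

    while (limit > new_sp) {            /* cmp r10, r11 / ja probe */
        if (limit - bottom < PROBE_PAGE)
            return -1;                  /* no page left below the limit */
        limit -= PROBE_PAGE;            /* sub r10, 4096 / test r10, [r10] */
        ++probes;
    }
    return probes;
}
```

 With a stack limit of 0x10000 and a caller's stack pointer of 0x10800, a request of 0x100 bytes needs no probe in this model, while a request of 0x3000 bytes touches 3 pages.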
_alloca Routine

 The documentation for the _alloca
            compiler helper routine states:

    Allocates memory on the stack. […]
    […]
    void *_alloca( size_t size );
    The _alloca routine returns a void pointer to the allocated space, which is suitably aligned for storage of any type of object. If size is 0, _alloca allocates a zero-length item and returns a valid pointer to that item.
    A stack overflow exception is generated if the space can't be allocated. […]

 OUCH: the
            _alloca_probe alias
            _chkstk
            routine returns but an unaligned pointer; only the
            _alloca_probe_8 and _alloca_probe_16
            routines return an aligned pointer, the first suitable to store
            MMX™
            variables, the second suitable to store
            SSE variables!
         CAVEAT: for constant arguments less than 64,
            the Visual C compiler generates calls to the
            _alloca_probe routine!
        
 Start the command prompt of the Visual C
            development environment for the i386 platform, then
            execute the following 4 command lines to locate the assembler source
            file alloca16.asm and display its content:
        
FOR %? IN (msvcrt.lib) DO SET msvcrt=%~$LIB:?
SET source=%msvcrt:\lib\msvcrt.lib=\crt\src%
DIR "%source%\intel\alloca16.asm"
TYPE "%source%\intel\alloca16.asm"
SET msvcrt=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\msvcrt.lib
SET source=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             2,241 alloca16.asm
               1 File(s)          2,241 bytes
               0 Dir(s)    9,876,543,210 bytes free
        page    ,132
        title   alloca16 - aligned C stack checking routine
;***
;chkstk.asm - aligned C stack checking routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       Provides 16 and 8 bit aligned alloca routines.
;
;*******************************************************************************
.xlist
        include cruntime.inc
.list
extern  _chkstk:near
; size of a page of memory
        CODESEG
page
;***
; _alloca_probe_16, _alloca_probe_8 - align allocation to 16/8 byte boundary
;
;Purpose:
;       Adjust allocation size so the ESP returned from chkstk will be aligned
;       to 16/8 bit boundary. Call chkstk to do the real allocation.
;
;Entry:
;       EAX = size of local frame
;
;Exit:
;       Adjusted EAX.
;
;Uses:
;       EAX
;
;*******************************************************************************
public  _alloca_probe_8
_alloca_probe_16 proc                   ; 16 byte aligned alloca
        push    ecx
        lea     ecx, [esp] + 8          ; TOS before entering this function
        sub     ecx, eax                ; New TOS
        and     ecx, (16 - 1)           ; Distance from 16 bit align (align down)
        add     eax, ecx                ; Increase allocation size
        sbb     ecx, ecx                ; ecx = 0xFFFFFFFF if size wrapped around
        or      eax, ecx                ; cap allocation size on wraparound
        pop     ecx                     ; Restore ecx
        jmp     _chkstk
alloca_8:                               ; 8 byte aligned alloca
_alloca_probe_8 = alloca_8
        push    ecx
        lea     ecx, [esp] + 8          ; TOS before entering this function
        sub     ecx, eax                ; New TOS
        and     ecx, (8 - 1)            ; Distance from 8 bit align (align down)
        add     eax, ecx                ; Increase allocation Size
        sbb     ecx, ecx                ; ecx = 0xFFFFFFFF if size wrapped around
        or      eax, ecx                ; cap allocation size on wraparound
        pop     ecx                     ; Restore ecx
        jmp     _chkstk
_alloca_probe_16 endp
        end
            18 instructions in 44 bytes (plus 4 bytes for alignment).
         Oops: since both routines are contained in one
            (linkable) function, they occupy 48 bytes instead of 32 bytes;
            together with the referenced
            _chkstk
            routine they occupy 96 bytes.
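 The size adjustment performed by both routines before they jump to _chkstk can be written down in C. This is a minimal sketch of the arithmetic only, not a replacement for the assembler source; the function name adjust_alloca_size and its parameters are hypothetical, with tos standing for the caller's stack pointer ("TOS before entering this function").

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the size adjustment in _alloca_probe_8/_16:
   the request grows by the distance of (tos - size) from the next
   lower multiple of 'align', and is capped at 0xFFFFFFFF on wraparound. */
uint32_t adjust_alloca_size(uint32_t tos, uint32_t size, uint32_t align)
{
    uint32_t slack;
    uint32_t adjusted;

    assert(align == 8 || align == 16);

    slack = (tos - size) & (align - 1);  /* sub ecx, eax / and ecx, align-1 */
    adjusted = size + slack;             /* add eax, ecx */
    if (adjusted < size)                 /* sbb ecx, ecx / or eax, ecx */
        adjusted = UINT32_MAX;
    return adjusted;
}
```

 For tos = 0x0012FF74 and a request of 10 bytes with 16-byte alignment the adjusted size is 20, so the resulting stack pointer 0x0012FF60 is a multiple of 16.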
        
_chkstk Routine

 The documentation for the _chkstk
            compiler helper routine states:

    _chkstk Routine is a helper routine for the C compiler. For x86 compilers, _chkstk Routine is called when the local variables exceed 4K bytes; for x64 compilers it is 8K.

 OUCH: contrary to the statement about x64 compilers, the correct number for x64 alias AMD64 processors is 4096 too, not 8192!

 The documentation for the
            /Gs
            compiler option states:

    A stack probe is a sequence of code that the compiler inserts at the beginning of a function call. When initiated, a stack probe reaches benignly into memory by the amount of space required to store the function's local variables. This probe causes the operating system to transparently page in more stack memory if necessary, before the rest of the function runs.

    By default, the compiler generates code that initiates a stack probe when a function requires more than one page of stack space. This default is equivalent to a compiler option of /Gs4096 for x86, x64, ARM, and ARM64 platforms. This value allows an application and the Windows memory manager to increase the amount of memory committed to the program stack dynamically at run time.

 Execute the following 2 command lines to display the content of the
            assembler source file chkstk.asm shipped with the
            Visual C compiler:

DIR "%source%\intel\chkstk.asm"
TYPE "%source%\intel\chkstk.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             3,465 chkstk.asm
               1 File(s)          3,465 bytes
               0 Dir(s)    9,876,543,210 bytes free
        page    ,132
        title   chkstk - C stack checking routine
;***
;chkstk.asm - C stack checking routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       Provides support for automatic stack checking in C procedures
;       when stack checking is enabled.
;
;*******************************************************************************
.xlist
        include cruntime.inc
.list
; size of a page of memory
_PAGESIZE_      equ     1000h
        CODESEG
page
;***
;_chkstk - check stack upon procedure entry
;
;Purpose:
;       Provide stack checking on procedure entry. Method is to simply probe
;       each page of memory required for the stack in descending order. This
;       causes the necessary pages of memory to be allocated via the guard
;       page scheme, if possible. In the event of failure, the OS raises the
;       _XCPT_UNABLE_TO_GROW_STACK exception.
;
;       NOTE:  Currently, the (EAX < _PAGESIZE_) code path falls through
;       to the "lastpage" label of the (EAX >= _PAGESIZE_) code path.  This
;       is small; a minor speed optimization would be to special case
;       this up top.  This would avoid the painful save/restore of
;       ecx and would shorten the code path by 4-6 instructions.
;
;Entry:
;       EAX = size of local frame
;
;Exit:
;       ESP = new stackframe, if successful
;
;Uses:
;       EAX
;
;Exceptions:
;       _XCPT_GUARD_PAGE_VIOLATION - May be raised on a page probe. NEVER TRAP
;                                    THIS!!!! It is used by the OS to grow the
;                                    stack on demand.
;       _XCPT_UNABLE_TO_GROW_STACK - The stack cannot be grown. More precisely,
;                                    the attempt by the OS memory manager to
;                                    allocate another guard page in response
;                                    to a _XCPT_GUARD_PAGE_VIOLATION has
;                                    failed.
;
;*******************************************************************************
public  _alloca_probe
_chkstk proc
_alloca_probe    =  _chkstk
        push    ecx
; Calculate new TOS.
        lea     ecx, [esp] + 8 - 4      ; TOS before entering function + size for ret value
        sub     ecx, eax                ; new TOS
; Handle allocation size that results in wraparound.
; Wraparound will result in StackOverflow exception.
        cmc
        sbb     eax, eax                ; 0 if CF==0, ~0 if CF==1
        not     eax                     ; ~0 if TOS did not wrapped around, 0 otherwise
        and     ecx, eax                ; set to 0 if wraparound
        mov     eax, esp                ; current TOS
        and     eax, not ( _PAGESIZE_ - 1) ; Round down to current page boundary
cs10:
        cmp     ecx, eax                ; Is new TOS
        jb      short cs20              ; in probed page?
        mov     eax, ecx                ; yes.
        pop     ecx
        xchg    esp, eax                ; update esp
        mov     eax, dword ptr [eax]    ; get return address
        mov     dword ptr [esp], eax    ; and put it at new TOS
        push    [eax]
        ret
; Find next lower page and probe
cs20:
        sub     eax, _PAGESIZE_         ; decrease by PAGESIZE
        test    dword ptr [eax],eax     ; probe page.
        jmp     short cs10
_chkstk endp
        end
            19 instructions in 43 bytes (plus 5 bytes for alignment).
 Oops¹: every programmer really should know
            that two’s-complement binary arithmetic exhibits the identity
            −value = not (value − 1)!
        
 Oops²: instead of the deleted
            NOT instruction
            the CMC instruction
            inserted before the
            SBB instruction
            should be used, saving 1 byte.
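 The identity and the equivalence of the two instruction sequences can be checked with a few lines of C. This is an illustrative sketch only; the function names are hypothetical, and the carry flag is modelled as an ordinary parameter.

```c
#include <stdint.h>

/* 'sbb eax, eax' yields 0 - CF; the shipped code follows it with 'not eax' */
uint32_t mask_sbb_not(unsigned carry)
{
    uint32_t eax = 0u - (carry & 1u);   /* sbb eax, eax */
    return ~eax;                        /* not eax */
}

/* 'cmc' complements CF first, so 'sbb eax, eax' yields the mask directly */
uint32_t mask_cmc_sbb(unsigned carry)
{
    carry = (carry & 1u) ^ 1u;          /* cmc */
    return 0u - carry;                  /* sbb eax, eax */
}

/* two's-complement identity: -value == ~(value - 1) */
uint32_t negate(uint32_t value)
{
    return ~(value - 1u);
}
```

 Both mask functions return 0xFFFFFFFF for a clear carry flag and 0 for a set carry flag, which is the identity applied to the value 1 − CF.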
        
 OOPS: on Pentium® and
            later processors, instead of the 2 deleted
            MOV instructions (2 plus 3 bytes) the single
            inserted
            PUSH
            instruction (2 bytes) should be used, saving 3 bytes; the term
            - 4 of the initial
            LEA instruction must
            then be removed to account for the additional 4 bytes pushed onto
            the stack!
        
OUCH: even when the new TOS lies within an already allocated stack page, this stupid implementation performs superfluous page probes, i.e. superfluous slow memory accesses, which load stale data into the cache hierarchy in the best case and trigger page faults that transfer stale data into memory in the worst case!
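 How many pages the shipped loop touches can be estimated with a small C model. This is a simplified sketch, not the shipped code: the function name probes_from_current_page is hypothetical, and a single value esp stands for the stack pointer on entry, ignoring the 4 bytes of the saved ECX register.

```c
#include <stdint.h>

#define PROBE_PAGE 0x1000u   /* _PAGESIZE_ of the listing above */

/* Hypothetical model of the cs10/cs20 loop: probing starts at the page
   of the current stack pointer and never consults the stack limit. */
unsigned probes_from_current_page(uint32_t esp, uint32_t size)
{
    uint32_t new_tos = (size > esp) ? 0 : esp - size;  /* 0 on wraparound */
    uint32_t page = esp & ~(PROBE_PAGE - 1u);          /* round down */
    unsigned probes = 0;

    while (new_tos < page) {    /* cs10: cmp ecx, eax / jb cs20 */
        page -= PROBE_PAGE;     /* cs20: sub eax, _PAGESIZE_ */
        ++probes;               /*       test dword ptr [eax], eax */
    }
    return probes;
}
```

 For esp = 0x0012F800 and a request of 0x3000 bytes the model counts 3 probes, and this count is the same whether or not these 3 pages were committed long before.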
FOR %? IN (chkstk.obj) DO SET chkstk=%~$LIB:?
DIR "%chkstk%"
LINK.EXE /DUMP /DISASM "%chkstk%"
SET chkstk=C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\chkstk.obj
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib
02/18/2011  03:52 PM             1,377 chkstk.obj
               1 File(s)          1,377 bytes
               0 Dir(s)    9,876,543,210 bytes free
Microsoft (R) COFF/PE Dumper Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
Dump of file C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\lib\chkstk.obj
File Type: COFF OBJECT
__chkstk:
  00000000: 51                 push        ecx
  00000001: 8D 4C 24 04        lea         ecx,[esp+4]
  00000005: 2B C8              sub         ecx,eax
  00000007: 1B C0              sbb         eax,eax
  00000009: F7 D0              not         eax
  0000000B: 23 C8              and         ecx,eax
  0000000D: 8B C4              mov         eax,esp
  0000000F: 25 00 F0 FF FF     and         eax,0FFFFF000h
cs10:
  00000014: 3B C8              cmp         ecx,eax
  00000016: 72 0A              jb          cs20
  00000018: 8B C1              mov         eax,ecx
  0000001A: 59                 pop         ecx
  0000001B: 94                 xchg        eax,esp
  0000001C: 8B 00              mov         eax,dword ptr [eax]
  0000001E: 89 04 24           mov         dword ptr [esp],eax
  00000021: C3                 ret
cs20:
  00000022: 2D 00 10 00 00     sub         eax,1000h
  00000027: 85 00              test        dword ptr [eax],eax
  00000029: EB E9              jmp         cs10
  Summary
           0 .data
         2EC .debug$S
          24 .debug$T
          2B .text
 The following implementation in i386 assembler aligns the new stack
            pointer with a single SHL instruction; it uses 17
            instructions in 40 bytes (plus 8 bytes for alignment) when the text
            macro ALLOCA is undefined, and 18 instructions in 43
            bytes (plus 5 bytes for alignment) when the text macro
            ALLOCA is defined as 8 or 16:
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
_nt_tib	struct	4		; thread information block
chain	dword	?		; address of first exception registration record
base	dword	?		; stack base
limit	dword	?		; stack limit
	dword	?		; address of subsystem thread information block
fiber	dword	?		; fiber data
pointer	dword	?		; arbitrary user pointer
self	dword	?		; address of _nt_tib
_nt_tib	ends
	.code
; MSC internal intrinsic _alloca() alias _chkstk():
; receives argument in eax, returns result in esp
; NOTE: _alloca() must preserve all registers except eax;
;       it can raise 'stack overflow' exception!
ifndef ALLOCA
_chkstk	proc	public		; void _chkstk(dword size)
_alloca_probe proc	public	; void _alloca_probe(dword size)
elseifidn ALLOCA, %8
_alloca_probe_8 proc	public	; void _alloca_probe_8(dword size)
elseifidn ALLOCA, %16
_alloca_probe_16 proc	public	; void _alloca_probe_16(dword size)
endif
	push	ebx		; decrement stack pointer, save register
	lea	ebx, [esp+8]	; ebx = stack pointer of caller
	sub	ebx, eax	; ebx = new (unaligned) stack pointer
	cmc			; CF = ~(ebx < 0)
	sbb	eax, eax	; eax = (ebx < 0) ? 0 : -1
ifndef ALLOCA
elseifidn ALLOCA, %8
	shl	eax, 3		; eax = (ebx < 0) ? 0 : -8
elseifidn ALLOCA, %16
	shl	eax, 4		; eax = (ebx < 0) ? 0 : -16
endif
	and	eax, ebx	; eax = (ebx < 0) ? 0 : new (aligned) stack pointer
	assume	fs :flat
	mov	ebx, fs:[_nt_tib.limit]
				; ebx = (current) stack limit
	cmp	ebx, eax
	jna	short ready	; stack limit <= new stack pointer?
probe:
	sub	ebx, 4096	; ebx = address of guard page
	test	ebx, [ebx]	; ebx = new stack limit via 'guard page' exception
	cmp	ebx, eax
	ja	short probe	; new stack limit > new stack pointer?
ready:
	pop	ebx		; restore register
	xchg	eax, esp	; esp = new stack pointer,
				; eax = old stack pointer
				;     = address of return address
	push	[eax]		; decrement stack pointer, write return address
	ret
ifndef ALLOCA
_alloca_probe endp
_chkstk	endp
elseifidn ALLOCA, %8
_alloca_probe_8 endp
elseifidn ALLOCA, %16
_alloca_probe_16 endp
else
	echo	ALLOCA must be 8 or 16 when defined!
	.err	ALLOCA
endif
	endalloca.asm in an arbitrary, preferable empty directory,
            then execute the following 5 command lines to generate the 3 object
            files alloca.obj, alloca8.obj plus
            alloca16.obj and put them into the new object library
            i386.lib:
        SET ML=/c /safeseh /W3 /X ML.EXE alloca.asm ML.EXE /DALLOCA=8 /Foalloca8.obj alloca.asm ML.EXE /DALLOCA=16 /Foalloca16.obj alloca.asm LINK.EXE /LIB /OUT:i386.lib alloca.obj alloca8.obj alloca16.objFor details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: alloca.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: alloca.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: alloca.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
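The pointer arithmetic and the probe loop of the routine above can be modelled in C. The following is only a minimal sketch under stated assumptions, not the routine itself: the function names new_stack_pointer and pages_to_probe are hypothetical, addresses are plain 32-bit integers, the stack limit is assumed to be page aligned, and the page size of 4096 bytes is taken from the SUB instruction of the listing.

```c
#include <stdint.h>

/* Model of the SUB/CMC/SBB/SHL/AND sequence: compute the new stack
   pointer, rounded down to a multiple of 2**shift (shift = 0, 3 or 4),
   or 0 when the requested size exceeds the caller's stack pointer. */
static uint32_t new_stack_pointer(uint32_t caller_sp, uint32_t size, unsigned shift)
{
    uint32_t sp   = caller_sp - size;               /* may wrap around */
    uint32_t mask = (caller_sp < size) ? 0 : (UINT32_MAX << shift);
    return sp & mask;
}

/* Model of the probe loop: count the pages which are touched until the
   (page aligned) stack limit is not greater than the new stack pointer. */
static unsigned pages_to_probe(uint32_t stack_limit, uint32_t sp)
{
    unsigned pages = 0;
    while (stack_limit > sp) {
        stack_limit -= 4096;    /* address of the next guard page */
        pages++;
    }
    return pages;
}
```

With shift = 4 the result corresponds to _alloca_probe_16, with shift = 3 to _alloca_probe_8, and with shift = 0 to _chkstk alias _alloca_probe.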
_allmul Routine
The documentation for the _allmul
            compiler helper routine states:
        Multiplies two LONGLONG or ULONGLONG integers. For example, to multiply two int64 values the compiler might generate a call to the _allmul routine.
Remarks
The _allmul routine is a helper routine for the C compiler. Whether the compiler uses _allmul is completely dependent on the optimization set.
This routine is used only on x86 platforms.
OUCH: contrary to the highlighted statement, the Visual C compiler generates calls to the
_allmul
            routine unconditionally, independent of any optimisation, when it
            encounters a multiplication where at least one of its operands is a
            (signed or unsigned) 64-bit integer!
         Execute the following 2 command lines to display the content of the
            assembler source file llmul.asm shipped with the
            Visual C compiler:
        
DIR "%source%\intel\llmul.asm"
TYPE "%source%\intel\llmul.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             2,570 llmul.asm
               1 File(s)          2,570 bytes
               0 Dir(s)    9,876,543,210 bytes free
        title   llmul - long multiply routine
;***
;llmul.asm - long multiply routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       Defines long multiply routine
;       Both signed and unsigned routines are the same, since multiply's
;       work out the same in 2's complement
;       creates the following routine:
;           __allmul
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;llmul - long multiply routine
;
;Purpose:
;       Does a long multiply (same for signed/unsigned)
;       Parameters are not changed.
;
;Entry:
;       Parameters are passed on the stack:
;               1st pushed: multiplier (QWORD)
;               2nd pushed: multiplicand (QWORD)
;
;Exit:
;       EDX:EAX - product of multiplier and multiplicand
;       NOTE: parameters are removed from the stack
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************
        CODESEG
_allmul PROC NEAR
.FPO (0, 4, 0, 0, 0, 0)
A       EQU     [esp + 4]       ; stack address of a
B       EQU     [esp + 12]      ; stack address of b
;
;       AHI, BHI : upper 32 bits of A and B
;       ALO, BLO : lower 32 bits of A and B
;
;             ALO * BLO
;       ALO * BHI
; +     BLO * AHI
; ---------------------
;
        mov     eax,HIWORD(A)
        mov     ecx,HIWORD(B)
        or      ecx,eax         ;test for both hiwords zero.
        mov     ecx,LOWORD(B)
        jnz     short hard      ;both are zero, just mult ALO and BLO
        mov     eax,LOWORD(A)
        mul     ecx
        ret     16              ; callee restores the stack
hard:
        push    ebx
.FPO (1, 4, 0, 0, 0, 0)
; must redefine A and B since esp has been altered
A2      EQU     [esp + 8]       ; stack address of a
B2      EQU     [esp + 16]      ; stack address of b
        mul     ecx             ;eax has AHI, ecx has BLO, so AHI * BLO
        mov     ebx,eax         ;save result
        mov     eax,LOWORD(A2)
        mul     dword ptr HIWORD(B2) ;ALO * BHI
        add     ebx,eax         ;ebx = ((ALO * BHI) + (AHI * BLO))
        mov     eax,LOWORD(A2)  ;ecx = BLO
        mul     ecx             ;so edx:eax = ALO*BLO
        add     edx,ebx         ;now edx has all the LO*HI stuff
        pop     ebx
        ret     16              ; callee restores the stack
_allmul ENDP
        end
            19 instructions in 52 bytes (plus 12 bytes for alignment).
         Ouch¹: since only the low parts of the
            products of the low and high parts of the arguments are needed, the
            2 highlighted widening
            MUL instructions should be
            replaced with 2 faster
            IMUL instructions!
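Why the low parts of the two mixed products suffice can be seen from a C model of the multiplication modulo 2**64. This is a sketch for illustration, with the hypothetical function name mul64 and an int type of 32 bits assumed: the product of the two low dwords needs the widening multiplication, while both mixed products are shifted left by 32 bits, so their upper halves never reach the result.

```c
#include <stdint.h>

/* 64x64-bit multiplication modulo 2**64 from 32-bit parts: one widening
   32x32->64-bit multiplication (MUL) and two truncating ones (IMUL) */
static uint64_t mul64(uint64_t a, uint64_t b)
{
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);

    uint64_t low   = (uint64_t)alo * blo;       /* widening: all 64 bits needed */
    uint32_t cross = alo * bhi + ahi * blo;     /* truncating: low 32 bits only */

    return low + ((uint64_t)cross << 32);       /* carry out of bit 63 is dropped */
}
```

The product of the two high dwords does not appear at all, since it contributes only to bits 64 and above.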
        
 Ouch²: on processors featuring speculative
            execution, i.e. Pentium® Pro (introduced
            November 1, 1995)
            and newer, which execute 2 IMUL
            or MUL instructions faster
            than a mispredicted conditional branch, the test whether the high
            parts of both arguments are 0 is superfluous and
            impairs performance!
        
SPACE is undefined,
            else 12 instructions in 37 bytes (plus 11 bytes for alignment), but
            no non-volatile register:
        ; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
; MSC internal _allmul():
; receives arguments on stack, returns product modulo 2**64 in edx:eax
_allmul	proc	public		; sqword _allmul(sqword multiplicand, sqword multiplier)
	mov	ecx, [esp+16]	; ecx = high dword of multiplier
	mov	edx, [esp+8]	; edx = high dword of multiplicand
	mov	eax, [esp+4]	; eax = low dword of multiplicand
	or	ecx, edx
ifdef SPACE
	jz	short zero	; high dwords are 0?
else ; SPACE
	jnz	short notzero	; high dwords are not 0?
	mul	dword ptr [esp+12]
				; edx:eax = low dword of multiplicand
				;         * low dword of multiplier
	ret	16		; callee restores stack
notzero:
endif ; SPACE
	imul	edx, [esp+12]	; edx = high dword of multiplicand
				;     * low dword of multiplier
	mov	ecx, [esp+16]	; ecx = high dword of multiplier
	imul	ecx, eax	; ecx = high dword of multiplier
				;     * low dword of multiplicand
	add	ecx, edx	; ecx = high dword of multiplier
				;     * low dword of multiplicand
				;     + high dword of multiplicand
				;     * low dword of multiplier
zero:
	mul	dword ptr [esp+12]
				; edx:eax = low dword of multiplicand
				;         * low dword of multiplier
	add	edx, ecx	; edx:eax = product % 2**64
	ret	16		; callee restores stack
_allmul	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
; MSC internal _allmul():
; receives arguments on stack, returns product modulo 2**64 in edx:eax
_allmul	proc	public		; sqword _allmul(sqword multiplicand, sqword multiplier)
	mov	eax, [esp+4]	; eax = low dword of multiplicand
	mov	edx, [esp+8]	; edx = high dword of multiplicand
	imul	edx, [esp+12]	; edx = high dword of multiplicand
				;     * low dword of multiplier
	mov	ecx, [esp+16]	; ecx = high dword of multiplier
	imul	ecx, eax	; ecx = high dword of multiplier
				;     * low dword of multiplicand
	add	ecx, edx	; ecx = high dword of multiplier
				;     * low dword of multiplicand
				;     + high dword of multiplicand
				;     * low dword of multiplier
	mul	dword ptr [esp+12]
				; edx:eax = low dword of multiplicand
				;         * low dword of multiplier
	add	edx, ecx	; edx:eax = product % 2**64
	ret	16		; callee restores stack
_allmul	endp
	end
Save the assembler source presented above as allmul.asm in the directory where you created the
            object library i386.lib before, then execute the
            following 2 command lines to generate the object file
            allmul.obj and add it to the existing object library
            i386.lib:
ML.EXE allmul.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib allmul.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: allmul.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
_alldiv Routine
The documentation for the _alldiv
            compiler helper routine states:
        Divides two LONGLONG integers. For example, to divide two int64 values the compiler might generate a call to _alldiv Routine.
Remarks
_alldiv Routine is a helper routine for the C compiler. Whether the compiler calls _alldiv Routine is completely dependent on the optimization set.
OUCH: contrary to the highlighted statement, the Visual C compiler generates calls to the
_alldiv
            routine unconditionally, independent of any optimisation, when it
            encounters a division where at least one of its operands is a signed
            64-bit integer!
         Execute the following 2 command lines to display the content of the
            assembler source file lldiv.asm shipped with the
            Visual C compiler:
        
DIR "%source%\intel\lldiv.asm"
TYPE "%source%\intel\lldiv.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             6,670 lldiv.asm
               1 File(s)          6,670 bytes
               0 Dir(s)    9,876,543,210 bytes free
        title   lldiv - signed long divide routine
;***
;lldiv.asm - signed long divide routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines the signed long divide routine
;           __alldiv
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;lldiv - signed long divide
;
;Purpose:
;       Does a signed long divide of the arguments.  Arguments are
;       not changed.
;
;Entry:
;       Arguments are passed on the stack:
;               1st pushed: divisor (QWORD)
;               2nd pushed: dividend (QWORD)
;
;Exit:
;       EDX:EAX contains the quotient (dividend/divisor)
;       NOTE: this routine removes the parameters from the stack.
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************
        CODESEG
_alldiv PROC NEAR
.FPO (3, 4, 0, 0, 0, 0)
        push    edi
        push    esi             ; [deleted]
        push    ebx
; Set up the local stack and save the index registers.  When this is done
; the stack frame will look as follows (assuming that the expression a/b will
; generate a call to lldiv(a, b)):
;
;               -----------------
;               |               |
;               |---------------|
;               |               |
;               |--divisor (b)--|
;               |               |
;               |---------------|
;               |               |
;               |--dividend (a)-|
;               |               |
;               |---------------|
;               | return addr** |
;               |---------------|
;               |      EDI      |
;               |---------------|
;               |      ESI      |
;               |---------------|
;       ESP---->|      EBX      |
;               -----------------
;
DVND    equ     [esp + 16]      ; stack address of dividend (a) [deleted]
DVSR    equ     [esp + 24]      ; stack address of divisor (b) [deleted]
DVND    equ     [esp + 12]      ; [inserted]
DVSR    equ     [esp + 20]      ; [inserted]
; Determine sign of the result (edi = 0 if result is positive, non-zero
; otherwise) and make operands positive.
        xor     edi,edi         ; result sign assumed positive
        mov     eax,HIWORD(DVND) ; hi word of a
        or      eax,eax         ; test to see if signed [deleted]
        test    eax,eax         ; [inserted]
        jge     short L1        ; skip rest if a is already positive
        inc     edi             ; complement result sign flag
        mov     edx,LOWORD(DVND) ; lo word of a
        neg     eax             ; make a positive
        neg     edx
        sbb     eax,0
        mov     HIWORD(DVND),eax ; save positive value
        mov     LOWORD(DVND),edx
L1:
        mov     eax,HIWORD(DVSR) ; hi word of b
        or      eax,eax         ; test to see if signed [deleted]
        test    eax,eax         ; [inserted]
        jge     short L2        ; skip rest if b is already positive
        inc     edi             ; complement the result sign flag
        mov     edx,LOWORD(DVSR) ; lo word of a
        neg     eax             ; make b positive
        neg     edx
        sbb     eax,0
        mov     HIWORD(DVSR),eax ; save positive value
        mov     LOWORD(DVSR),edx
L2:
;
; Now do the divide.  First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
; NOTE - eax currently contains the high order word of DVSR
;
        or      eax,eax         ; check to see if divisor < 4194304K [deleted]
        test    eax,eax         ; [inserted]
        jnz     short L3        ; nope, gotta do this the hard way
        mov     ecx,LOWORD(DVSR) ; load divisor
        mov     eax,HIWORD(DVND) ; load high word of dividend
        xor     edx,edx
        div     ecx             ; eax <- high order bits of quotient
        mov     ebx,eax         ; save high bits of quotient
        mov     eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
        div     ecx             ; eax <- low order bits of quotient
        mov     edx,ebx         ; edx:eax <- quotient
        jmp     short L4        ; set sign, restore stack and return
;
; Here we do it the hard way.  Remember, eax contains the high word of DVSR
;
L3:
        mov     ebx,eax         ; ebx:ecx <- divisor
        mov     ecx,LOWORD(DVSR)
        mov     edx,HIWORD(DVND) ; edx:eax <- dividend
        mov     eax,LOWORD(DVND)
L5:
        shr     ebx,1           ; shift divisor right one bit
        rcr     ecx,1
        shr     edx,1           ; shift dividend right one bit
        rcr     eax,1
        or      ebx,ebx         ; [deleted]
        test    ebx,ebx         ; [inserted]
        jnz     short L5        ; loop until divisor < 4194304K
        div     ecx             ; now divide, ignore remainder
        mov     esi,eax         ; save quotient [deleted]
        mov     ebx,eax         ; [inserted]
;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;
        mul     dword ptr HIWORD(DVSR) ; QUOT * HIWORD(DVSR) [deleted]
        mov     ecx,eax         ; [deleted]
        mov     ecx,HIWORD(DVSR) ; [inserted]
        imul    ecx,ebx         ; [inserted]
        mov     eax,LOWORD(DVSR)
        mul     esi             ; QUOT * LOWORD(DVSR) [deleted]
        mul     ebx             ; [inserted]
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jc      short L6        ; carry means Quotient is off by 1
;
; do long compare here between original dividend and the result of the
; multiply in edx:eax.  If original is larger or equal, we are ok, otherwise
; subtract one (1) from the quotient.
;
        cmp     edx,HIWORD(DVND) ; compare hi words of result and original [deleted]
        ja      short L6        ; if result > original, do subtract [deleted]
        jb      short L7        ; if result < original, we are ok [deleted]
        cmp     eax,LOWORD(DVND) ; hi words are equal, compare lo words
        sbb     edx,HIWORD(DVND) ; [inserted]
        jbe     short L7        ; if less or equal we are ok, else subtract
L6:
        dec     esi             ; subtract 1 from quotient [deleted]
        dec     ebx             ; [inserted]
L7:
        xor     edx,edx         ; edx:eax <- quotient
        mov     eax,esi         ; [deleted]
        mov     eax,ebx         ; [inserted]
;
; Just the cleanup left to do.  edx:eax contains the quotient.  Set the sign
; according to the save value, cleanup the stack, and return.
;
L4:
        dec     edi             ; check to see if result is negative
        jnz     short L8        ; if EDI == 0, result should be negative
        neg     edx             ; otherwise, negate the result
        neg     eax
        sbb     edx,0
;
; Restore the saved registers and return.
;
L8:
        pop     ebx
        pop     esi             ; [deleted]
        pop     edi
        ret     16
_alldiv ENDP
end
            With 70 instructions in 170 bytes (plus 6 bytes for alignment), this
            routine has several major and minor flaws: 3 major
            flaws on all kinds of processors, and 4 more only on processors
            which feature speculative execution!
         Note: unlike the
            IDIV instruction, which raises a
            divide error (#DE) exception when dividing
            −2⁶³, the smallest signed 64-bit integer, by
            −1, this routine raises no exception, but returns the (wrong) quotient
            −2⁶³!
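The wrong quotient follows directly from the two's complement arithmetic. The following sketch models it in C; the function names magnitude and quotient_overflows are hypothetical, and the divisor is assumed to be different from 0. Negating −2**63 yields the bit pattern of 2**63, which is a valid unsigned magnitude; dividing it by 1 and reinterpreting the result as signed integer gives −2**63 again.

```c
#include <stdint.h>

/* |x| computed in unsigned arithmetic, like the NEG/SBB sequence does;
   for x = -2**63 the result is 2**63, which has no signed counterpart */
static uint64_t magnitude(int64_t x)
{
    uint64_t u = (uint64_t)x;
    return x < 0 ? 0 - u : u;
}

/* 1 if the true quotient is positive, but greater than 2**63 - 1,
   i.e. does not fit a signed 64-bit integer; divisor must not be 0 */
static int quotient_overflows(int64_t dividend, int64_t divisor)
{
    return (dividend < 0) == (divisor < 0)
        && magnitude(dividend) / magnitude(divisor) > (uint64_t)INT64_MAX;
}
```

A caller who needs the behaviour of the IDIV instruction could test this condition before the division; −2**63 divided by −1 is the only pair of arguments for which it holds.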
        
 OOPS¹: instead of the 4 deleted
            OR instructions which
            perform superfluous writes, the 4 inserted
            TEST instructions should be
            used.
        
 OOPS²: instead of the deleted
            first widening
 MUL
            instruction and the following deleted
            MOV instruction, the inserted
            MOV instruction loading the high part of
            the divisor into register ECX followed by the
            inserted faster
            IMUL instruction should be
            used.
        
 OUCH¹: instead of register ESI
            register EBX should be used, saving a pair of
            PUSH
            and POP instructions
            and 2 bytes!
        
 OUCH²: for divisors less than 2³²
            and a dividend less than 2³²×divisor, i.e. if the
            quotient is less than 2³², instead of the long
            alias schoolbook
 division performed with the 2
            highlighted chained
            DIV instructions – each
            slower than a mispredicted conditional branch – after the
            conditional branch to label L3:, a single
            DIV instruction is sufficient,
            saving about 40 to 240 processor cycles!
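The distinction between both cases can be sketched in C as follows. The function name div64by32 and its parameter divs, which only reports how many DIV instructions the corresponding assembler code would execute, are hypothetical and serve for illustration; the divisor is assumed to be different from 0.

```c
#include <stdint.h>

/* Quotient of a 64-bit dividend and a non-zero 32-bit divisor; every
   division below yields a quotient less than 2**32, as DIV requires */
static uint64_t div64by32(uint64_t dividend, uint32_t divisor, int *divs)
{
    uint32_t hi = (uint32_t)(dividend >> 32);
    uint32_t lo = (uint32_t)dividend;

    if (hi < divisor) {         /* quotient < 2**32: a single DIV suffices */
        *divs = 1;
        return dividend / divisor;
    }

    /* "long" alias "schoolbook" division: 2 chained DIVs */
    uint32_t qhi = hi / divisor;                /* high dword of quotient */
    uint32_t rem = hi % divisor;                /* remainder < divisor    */
    uint32_t qlo = (uint32_t)((((uint64_t)rem << 32) | lo) / divisor);
    *divs = 2;
    return ((uint64_t)qhi << 32) | qlo;
}
```

The comparison of the high dword of the dividend with the divisor costs one CMP instruction and one conditional branch, and saves a complete DIV instruction whenever the quotient fits into 32 bits.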
        
 OUCH³: instead of the highlighted
            (brain)dead slow loop with 2 pairs of
            SHR and
            RCR instructions
            after label L5:, 2 pairs of
            SHRD and
            SHR instructions with their
            shift count determined per BSR
            instruction should be used!
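A C model of the division by a divisor of 2**32 or more shows how a shift count computed once replaces the loop. This is a sketch which follows the algorithm of the routine shown above, not the replacement presented below; the function names bsr32 and udiv64_wide are hypothetical, and the loop in bsr32 only stands in for the single BSR instruction.

```c
#include <stdint.h>

/* Index of the most significant 1 bit of a non-zero value, as BSR returns */
static unsigned bsr32(uint32_t x)
{
    unsigned index = 0;
    while (x >>= 1)
        index++;
    return index;
}

/* Unsigned 64-bit quotient for a divisor >= 2**32 */
static uint64_t udiv64_wide(uint64_t dividend, uint64_t divisor)
{
    unsigned shift = bsr32((uint32_t)(divisor >> 32)) + 1;     /* 1..32 */

    /* one shift by the computed count instead of a loop of 1-bit shifts */
    uint32_t dvsr = (uint32_t)(divisor >> shift);              /* >= 2**31 */
    uint64_t dvnd = dividend >> shift;
    uint64_t q    = dvnd / dvsr;        /* estimate, at most 1 too large */

    /* multiply back: bits 32 and above of hi signal a product >= 2**64 */
    uint64_t hi      = (divisor >> 32) * q + (((divisor & UINT32_MAX) * q) >> 32);
    uint64_t product = divisor * q;     /* product modulo 2**64 */

    if ((hi >> 32) != 0 || product > dividend)
        q--;                            /* estimate was 1 too large */
    return q;
}
```

Since the shifted divisor is at least 2**31, the estimate is less than 2**32 and exceeds the true quotient by at most 1, so a single correction step is sufficient.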
        
 Note: this
            BSR instruction would also
            replace the deleted
            OR instruction
            or, respectively, the inserted
            TEST instruction after label
            L2:.
        
 OUCH⁴: on processors which feature
            speculative execution, instead of the 3 highlighted
            conditional branches to the labels L1:,
            L2: and L8:, which are
            slow when mispredicted, and the following
            NEG plus
            SBB
            instructions to negate the arguments as well as the quotient, a
            branchless and thus faster instruction sequence should be used!
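Such a branchless sequence can be sketched in C as follows. The function names negate_if, sign_mask and sdiv64 are hypothetical; the sketch assumes a divisor different from 0 and the usual two's complement conversion from uint64_t to int64_t. The sign of each argument is turned into a mask of 0 or −1, and (value ^ mask) − mask negates the value only when the mask is −1.

```c
#include <stdint.h>

/* Negate value when mask is all ones, leave it unchanged when mask is 0;
   corresponds to a XOR/XOR/SUB/SBB sequence on a register pair */
static uint64_t negate_if(uint64_t value, uint64_t mask)
{
    return (value ^ mask) - mask;
}

/* Sign mask as produced by CDQ or SAR by 31: -1 for negative x, else 0 */
static uint64_t sign_mask(int64_t x)
{
    return 0 - ((uint64_t)x >> 63);
}

/* Signed quotient without conditional branches; divisor must not be 0 */
static int64_t sdiv64(int64_t dividend, int64_t divisor)
{
    uint64_t sa = sign_mask(dividend);
    uint64_t sb = sign_mask(divisor);
    uint64_t q  = negate_if((uint64_t)dividend, sa)
                / negate_if((uint64_t)divisor, sb);
    return (int64_t)negate_if(q, sa ^ sb);      /* sign of quotient */
}
```

The exclusive or of both masks is the sign of the quotient, so the final negation needs no branch either.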
        
 OUCH⁵: on processors which feature
            speculative execution, instead of the 2
            CMP instructions and the
            3 conditional branches before label L6:, which are
            slow when mispredicted, a faster instruction
            sequence with fewer or no conditional branches should be used!
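One way to avoid these branches is to turn the comparison into a borrow and subtract the borrow from the estimated quotient. The following C sketch, with the hypothetical function names borrow64 and corrected, models a CMP instruction on the low dwords followed by SBB instructions; it assumes that the product of estimate and divisor is less than 2**64.

```c
#include <stdint.h>

/* 1 if a < b for 64-bit values given as pairs of dwords: the borrow left
   after CMP on the low dwords followed by SBB on the high dwords */
static uint32_t borrow64(uint32_t ahi, uint32_t alo, uint32_t bhi, uint32_t blo)
{
    uint32_t borrow = alo < blo;                    /* CMP alo, blo */
    return (ahi < bhi) | ((ahi == bhi) & borrow);   /* SBB ahi, bhi */
}

/* Correct an estimate which is at most 1 too large without a branch;
   product = estimate * divisor, assumed to be less than 2**64 */
static uint32_t corrected(uint32_t estimate, uint64_t dividend, uint64_t product)
{
    return estimate - borrow64((uint32_t)(dividend >> 32), (uint32_t)dividend,
                               (uint32_t)(product >> 32), (uint32_t)product);
}
```

In assembler the borrow is available in the carry flag, so a single SBB instruction with identical operands turns it into 0 or −1, which is then added to the estimate.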
        
Note: with the modifications shown in the source, this routine has 66 instructions in 164 bytes (plus 12 bytes for alignment).
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
; MSC internal _alldiv():
; receives arguments on stack, returns quotient in edx:eax
; NOTE: _alldiv() can raise 'division by zero' exception; it does
;       not raise 'integer overflow' exception on quotient overflow,
;       but returns ±2**63 for -2**63 / -1!
_alldiv	proc	public		; sqword _alldiv(sqword dividend, sqword divisor)
	xor	ecx, ecx	; ecx = sign of quotient = 0
	; determine sign of dividend and compute |dividend|
	mov	edx, [esp+8]	; edx = high dword of dividend
	test	edx, edx
	jns	short @f	; (high dword of) dividend >= 0?
	mov	eax, [esp+4]	; eax = low dword of dividend
	neg	edx
	neg	eax
	sbb	edx, ecx	; edx:eax = -dividend = |dividend|
	mov	[esp+4], eax
	mov	[esp+8], edx	; write |dividend| back on stack
	dec	ecx		; ecx = sign of dividend = -1
@@:
	; determine sign of divisor and compute |divisor|
	mov	eax, [esp+12]
	mov	edx, [esp+16]	; edx:eax = divisor
	test	edx, edx
	jns	short @f	; (high dword of) divisor >= 0?
	neg	edx
	neg	eax
	sbb	edx, 0		; edx:eax = -divisor = |divisor|
	mov	[esp+12], eax
	mov	[esp+16], edx	; write |divisor| back on stack
	not	ecx		; ecx = sign of dividend
				;     ^ sign of divisor
				;     = sign of quotient
@@:
	push	ecx		; [esp] = (quotient < 0) ? -1 : 0
	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?
	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?
	; remainder < divisor < 2**32
	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+12]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?
	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+8]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
;;	xor	edx, edx	; edx:eax = |quotient|
	jmp	short quotient
	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	push	eax		; [esp] = high dword of quotient
	mov	eax, [esp+12]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	pop	edx		; edx:eax = |quotient|
	pop	ecx		; ecx = (quotient < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = quotient
	ret	16		; callee restores stack
	; dividend < divisor: quotient = 0
trivial:
	pop	ecx		; ecx = (quotient < 0) ? -1 : 0
	xor	eax, eax
	cdq			; edx:eax = quotient = 0
	ret	16		; callee restores stack
	; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor = 2**63?
	; perform "extended & adjusted" division
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?
	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"
	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
	mov	eax, [esp+16]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+20]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	mov	ecx, [esp+24]	; ecx = high dword of divisor
	imul	ecx, ebx	; ecx = high dword of divisor * quotient"
	add	edx, ecx	; edx:eax = divisor * quotient"
	jc	short @f	; divisor * quotient" >= 2**64?
	mov	ecx, [esp+16]	; ecx = high dword of dividend
	cmp	[esp+12], eax
	sbb	ecx, edx	; CF = (dividend < divisor * quotient")
				;    = (remainder" < 0)
@@:
	sbb	eax, eax	; eax = (quotient < quotient") ? -1 : 0
	add	eax, ebx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) |quotient|
;;	xor	edx, edx	; edx:eax = |quotient|
	pop	ebx
quotient:
	pop	edx		; edx = (quotient < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient
	ret	16		; callee restores stack
	; dividend = divisor = -2**63: quotient = 1
special:
	pop	eax		; eax = sign of quotient = 0
	inc	eax		; eax = (low dword of) quotient = 1
	cdq			; edx:eax = quotient = 1
	ret	16		; callee restores stack
_alldiv	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
; MSC internal _alldiv():
; receives arguments on stack, returns quotient in edx:eax
; NOTE: _alldiv() can raise 'division by zero' exception; it does
;       not raise 'integer overflow' exception on quotient overflow,
;       but returns ±2**63 for -2**63 / -1!
_alldiv	proc	public		; sqword _alldiv(sqword dividend, sqword divisor)
	; determine sign of dividend and compute |dividend|
	mov	eax, [esp+8]
	mov	ecx, [esp+4]	; eax:ecx = dividend
	cdq			; edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	; eax:ecx = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	; eax:ecx = (dividend < 0) ? -dividend : dividend
				;         = |dividend|
	mov	[esp+4], ecx	; write |dividend| back on stack
	mov	[esp+8], eax
	push	edx		; [esp] = (dividend < 0) ? -1 : 0
	; determine sign of divisor and compute |divisor|
	mov	edx, [esp+20]
	mov	eax, [esp+16]	; edx:eax = divisor
	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|
	mov	[esp+16], eax	; write |divisor| back on stack
	mov	[esp+20], edx
	xor	[esp], ecx	; [esp] = (dividend < 0) ^ (divisor < 0) ? -1 : 0
				;       = (quotient < 0) ? -1 : 0
	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?
	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?
	; remainder < divisor < 2**32
	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+12]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?
	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+8]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
;;	xor	edx, edx	; edx:eax = |quotient|
	jmp	short quotient
	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	push	eax		; [esp] = high dword of quotient
	mov	eax, [esp+12]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	pop	edx		; edx:eax = |quotient|
	pop	ecx		; ecx = (quotient < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = quotient
	ret	16		; callee restores stack
	; dividend < divisor: quotient = 0
trivial:
	pop	ecx		; ecx = (quotient < 0) ? -1 : 0
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0
	ret	16		; callee restores stack
	; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor = 2**63?
	; perform "extended & adjusted" division
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JCCLESS
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?
	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"
	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JCCLESS
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JCCLESS
	mov	eax, [esp+16]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+20]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	mov	ecx, [esp+24]	; ecx = high dword of divisor
	imul	ecx, ebx	; ecx = high dword of divisor * quotient"
	add	edx, ecx	; edx:eax = divisor * quotient"
	mov	ecx, [esp+16]	; ecx = high dword of dividend
	cmp	[esp+12], eax
	sbb	ecx, edx	; CF = (dividend < divisor * quotient")
				;    = (remainder" < 0)
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) |quotient|
;;	xor	edx, edx	; edx:eax = |quotient|
	pop	ebx
quotient:
	pop	edx		; edx = (quotient < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient
	ret	16		; callee restores stack
	; dividend = divisor = -2**63: quotient = 1
special:
	pop	eax		; eax = sign of quotient = 0
	inc	eax		; eax = (low dword of) quotient = 1
	xor	edx, edx	; edx:eax = quotient = 1
	ret	16		; callee restores stack
_alldiv	endp
	end
Save the assembler source presented above as alldiv.asm in the directory where you created the
            object library i386.lib before, then execute the
            following 2 command lines to generate the object file
            alldiv.obj and add it to the existing object library
            i386.lib:
ML.EXE alldiv.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib alldiv.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: alldiv.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
_alldvrm Routine
Execute the following 2 command lines to display the content of the
            assembler source file lldvrm.asm shipped with the
            Visual C compiler:
        
DIR "%source%\intel\lldvrm.asm"
TYPE "%source%\intel\lldvrm.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             8,557 lldvrm.asm
               1 File(s)          8,557 bytes
               0 Dir(s)    9,876,543,210 bytes free
        title   lldvrm - signed long divide and remainder routine
;***
;lldvrm.asm - signed long divide and remainder routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines the signed long divide and remainder routine
;           __alldvrm
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;lldvrm - signed long divide and remainder
;
;Purpose:
;       Does a signed long divide and remainder of the arguments.  Arguments are
;       not changed.
;
;Entry:
;       Arguments are passed on the stack:
;               1st pushed: divisor (QWORD)
;               2nd pushed: dividend (QWORD)
;
;Exit:
;       EDX:EAX contains the quotient (dividend/divisor)
;       EBX:ECX contains the remainder (divided % divisor)
;       NOTE: this routine removes the parameters from the stack.
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************
        CODESEG
_alldvrm PROC NEAR
.FPO (3, 4, 0, 0, 1, 0)
        push    edi
        push    esi
        push    ebp
; Set up the local stack and save the index registers.  When this is done
; the stack frame will look as follows (assuming that the expression a/b will
; generate a call to alldvrm(a, b)):
;
;               -----------------
;               |               |
;               |---------------|
;               |               |
;               |--divisor (b)--|
;               |               |
;               |---------------|
;               |               |
;               |--dividend (a)-|
;               |               |
;               |---------------|
;               | return addr** |
;               |---------------|
;               |      EDI      |
;               |---------------|
;               |      ESI      |
;               |---------------|
;       ESP---->|      EBP      |
;               -----------------
;
DVND    equ     [esp + 16]      ; stack address of dividend (a)
DVSR    equ     [esp + 24]      ; stack address of divisor (b)
; Determine sign of the quotient (edi = 0 if result is positive, non-zero
; otherwise) and make operands positive.
; Sign of the remainder is kept in ebp.
        xor     edi,edi         ; result sign assumed positive
        xor     ebp,ebp         ; result sign assumed positive
        mov     eax,HIWORD(DVND) ; hi word of a
        or      eax,eax         ; test to see if signed
        jge     short L1        ; skip rest if a is already positive
        inc     edi             ; complement result sign flag
        inc     ebp             ; complement result sign flag
        mov     edx,LOWORD(DVND) ; lo word of a
        neg     eax             ; make a positive
        neg     edx
        sbb     eax,0
        mov     HIWORD(DVND),eax ; save positive value
        mov     LOWORD(DVND),edx
L1:
        mov     eax,HIWORD(DVSR) ; hi word of b
        or      eax,eax         ; test to see if signed
        jge     short L2        ; skip rest if b is already positive
        inc     edi             ; complement the result sign flag
        mov     edx,LOWORD(DVSR) ; lo word of a
        neg     eax             ; make b positive
        neg     edx
        sbb     eax,0
        mov     HIWORD(DVSR),eax ; save positive value
        mov     LOWORD(DVSR),edx
L2:
;
; Now do the divide.  First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
; NOTE - eax currently contains the high order word of DVSR
;
        or      eax,eax         ; check to see if divisor < 4194304K
        jnz     short L3        ; nope, gotta do this the hard way
        mov     ecx,LOWORD(DVSR) ; load divisor
        mov     eax,HIWORD(DVND) ; load high word of dividend
        xor     edx,edx
        div     ecx             ; eax <- high order bits of quotient
        mov     ebx,eax         ; save high bits of quotient
        mov     eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
        div     ecx             ; eax <- low order bits of quotient
        mov     esi,eax         ; ebx:esi <- quotient
;
; Now we need to do a multiply so that we can compute the remainder.
;
        mov     eax,ebx         ; set up high word of quotient
        mul     dword ptr LOWORD(DVSR) ; HIWORD(QUOT) * DVSR
        mov     ecx,eax         ; save the result in ecx
        mov     eax,esi         ; set up low word of quotient
        mul     dword ptr LOWORD(DVSR) ; LOWORD(QUOT) * DVSR
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jmp     short L4        ; complete remainder calculation
;
; Here we do it the hard way.  Remember, eax contains the high word of DVSR
;
L3:
        mov     ebx,eax         ; ebx:ecx <- divisor
        mov     ecx,LOWORD(DVSR)
        mov     edx,HIWORD(DVND) ; edx:eax <- dividend
        mov     eax,LOWORD(DVND)
L5:
        shr     ebx,1           ; shift divisor right one bit
        rcr     ecx,1
        shr     edx,1           ; shift dividend right one bit
        rcr     eax,1
        or      ebx,ebx
        jnz     short L5        ; loop until divisor < 4194304K
        div     ecx             ; now divide, ignore remainder
        mov     esi,eax         ; save quotient
;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;
        mul     dword ptr HIWORD(DVSR) ; QUOT * HIWORD(DVSR)
        mov     ecx,eax
        mov     eax,LOWORD(DVSR)
        mul     esi             ; QUOT * LOWORD(DVSR)
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jc      short L6        ; carry means Quotient is off by 1
;
; do long compare here between original dividend and the result of the
; multiply in edx:eax.  If original is larger or equal, we are ok, otherwise
; subtract one (1) from the quotient.
;
        cmp     edx,HIWORD(DVND) ; compare hi words of result and original
        ja      short L6        ; if result > original, do subtract
        jb      short L7        ; if result < original, we are ok
        cmp     eax,LOWORD(DVND) ; hi words are equal, compare lo words
        jbe     short L7        ; if less or equal we are ok, else subtract
L6:
        dec     esi             ; subtract 1 from quotient
        sub     eax,LOWORD(DVSR) ; subtract divisor from result
        sbb     edx,HIWORD(DVSR)
L7:
        xor     ebx,ebx         ; ebx:esi <- quotient
L4:
;
; Calculate remainder by subtracting the result from the original dividend.
; Since the result is already in a register, we will do the subtract in the
; opposite direction and negate the result if necessary.
;
        sub     eax,LOWORD(DVND) ; subtract dividend from result
        sbb     edx,HIWORD(DVND)
;
; Now check the result sign flag to see if the result is supposed to be positive
; or negative.  It is currently negated (because we subtracted in the 'wrong'
; direction), so if the sign flag is set we are done, otherwise we must negate
; the result to make it positive again.
;
        dec     ebp             ; check result sign flag
        jns     short L9        ; result is ok, set up the quotient
        neg     edx             ; otherwise, negate the result
        neg     eax
        sbb     edx,0
;
; Now we need to get the quotient into edx:eax and the remainder into ebx:ecx.
;
L9:
        mov     ecx,edx
        mov     edx,ebx
        mov     ebx,ecx
        mov     ecx,eax
        mov     eax,esi
;
; Just the cleanup left to do.  edx:eax contains the quotient.  Set the sign
; according to the save value, cleanup the stack, and return.
;
        dec     edi             ; check to see if result is negative
        jnz     short L8        ; if EDI == 0, result should be negative
        neg     edx             ; otherwise, negate the result
        neg     eax
        sbb     edx,0
;
; Restore the saved registers and return.
;
L8:
        pop     ebp
        pop     esi
        pop     edi
        ret     16
_alldvrm ENDP
end
            91 instructions in 223 bytes (plus 1 byte for alignment).
         OUCH: the comment "Now we need to do a multiply so that we can
            compute the remainder." with the following code is a remarkable
            gem: after the second DIV instruction, the remainder is already
            present in register EDX!
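The point can be illustrated with a minimal C sketch of the simple path (divisor below 2**32). The type divrem32_t and the function divrem_64_by_32 are hypothetical names chosen for this illustration and are not part of the runtime; the sketch only models the two chained divisions and assumes a non-zero divisor.

```c
#include <stdint.h>

/* Hypothetical result type for this illustration only */
typedef struct { uint64_t quotient; uint32_t remainder; } divrem32_t;

/* Divides a 64-bit dividend by a non-zero 32-bit divisor in two chained
   steps, the way the two DIV instructions of the simple path do. */
divrem32_t divrem_64_by_32(uint64_t dividend, uint32_t divisor)
{
    uint32_t high = (uint32_t)(dividend >> 32);
    uint32_t low  = (uint32_t)dividend;

    /* 1st DIV: 0:high / divisor */
    uint32_t q_high = high / divisor;
    uint32_t r_high = high % divisor;

    /* 2nd DIV: r_high:low / divisor; r_high < divisor, so the quotient
       fits into 32 bits */
    uint64_t partial = ((uint64_t)r_high << 32) | low;
    uint32_t q_low = (uint32_t)(partial / divisor);
    uint32_t r_low = (uint32_t)(partial % divisor);  /* what DIV leaves in EDX */

    divrem32_t result = { ((uint64_t)q_high << 32) | q_low, r_low };
    return result;
}
```

The remainder of the second step is the remainder of the whole division, so no multiplication and subtraction is needed on this path.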
        
 Note: unlike the
            IDIV instruction, which raises a
            divide error (#DE) exception when dividing
            −2**63, the smallest signed 64-bit integer, by
            −1, this routine returns the (wrong) quotient
            −2**63 and the (correct) remainder 0, i.e. the only
            integer smaller in magnitude than the divisor −1!
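Why the quotient wraps can be seen in a small C model of the sign handling. This is a sketch under stated assumptions, not the routine itself: sdivrem_t and model_alldvrm are hypothetical names, all negations are done in unsigned arithmetic so that they wrap like NEG/SBB, and the final conversion to int64_t is assumed to wrap as on two's complement targets.

```c
#include <stdint.h>

/* Hypothetical result type for this illustration only */
typedef struct { int64_t quotient; int64_t remainder; } sdivrem_t;

/* Model: negate negative operands, divide the magnitudes as unsigned
   integers, then restore the signs of quotient and remainder. */
sdivrem_t model_alldvrm(int64_t dividend, int64_t divisor)
{
    uint64_t n = (uint64_t)dividend;
    uint64_t d = (uint64_t)divisor;
    int negative_remainder = dividend < 0;
    int negative_quotient  = (dividend < 0) != (divisor < 0);

    if (dividend < 0) n = 0 - n;    /* |dividend|; stays 2**63 for -2**63 */
    if (divisor  < 0) d = 0 - d;    /* |divisor| */

    /* a divisor of 0 is undefined behaviour in C and not handled here;
       the routine raises a divide error in that case */
    uint64_t q = n / d;
    uint64_t r = n % d;

    if (negative_quotient)  q = 0 - q;
    if (negative_remainder) r = 0 - r;

    /* assumed to wrap: 2**63 becomes -2**63 */
    sdivrem_t result = { (int64_t)q, (int64_t)r };
    return result;
}
```

For −2**63 divided by −1 both operands are negative, so the magnitude quotient 2**63 is not negated and is reinterpreted as −2**63, while the remainder is 0.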
        
The following implementation, for processors which feature speculative
            execution when the symbol JCCLESS is defined, else processors which
            don’t feature speculative execution, uses 111 instructions in 268
            bytes (plus 4 bytes for alignment) respectively 108 instructions in
            260 bytes (plus 12 bytes for alignment), including 22 instructions
            in 50 bytes for the special and trivial cases not covered by
            Microsoft’s poor implementation:
        ; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
; MSC internal _alldvrm():
; receives arguments on stack, returns quotient in edx:eax and remainder in ebx:ecx
; NOTE: _alldvrm() can raise 'division by zero' exception; it does
;       not raise 'integer overflow' exception on quotient overflow,
;       but returns ±2**63 for -2**63 / -1 and 0 for -2**63 % -1!
_alldvrm proc	public		; sqword _alldvrm(sqword dividend, sqword divisor)
	; determine sign of dividend and compute |dividend|
	mov	edx, [esp+8]
	mov	eax, [esp+4]	; edx:eax = dividend
	mov	ebx, edx
	sar	ebx, 31		; ebx = (dividend < 0) ? -1 : 0
				;     = (remainder < 0) ? -1 : 0
	xor	eax, ebx
	xor	edx, ebx	; edx:eax = (dividend < 0) ? ~dividend : dividend
	sub	eax, ebx
	sbb	edx, ebx	; edx:eax = (dividend < 0) ? -dividend : dividend
				;         = |dividend|
	mov	[esp+4], eax	; write |dividend| back on stack
	mov	[esp+8], edx
	; determine sign of divisor and compute |divisor|
	mov	edx, [esp+16]
	mov	eax, [esp+12]	; edx:eax = divisor
	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|
	mov	[esp+12], eax	; write |divisor| back on stack
	mov	[esp+16], edx
	xor	ecx, ebx	; ecx = (divisor < 0) ^ (dividend < 0) ? -1 : 0
				;     = (quotient < 0) ? -1 : 0
	push	ecx		; [esp] = (quotient < 0) ? -1 : 0
	push	ebx		; [esp] = (remainder < 0) ? -1 : 0
	mov	ecx, [esp+16]	; ecx = high dword of dividend
	cmp	[esp+12], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?
	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?
	; remainder < divisor < 2**32
	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+16]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?
	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	xor	ebx, ebx	; ebx = high dword of quotient = 0
	jmp	short next
	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	ebx, eax	; ebx = high dword of quotient
next:
	mov	eax, [esp+12]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	ecx, edx	; ecx = (low dword of) |remainder|
	mov	edx, ebx	; edx:eax = |quotient|
;;	xor	ebx, ebx	; ebx:ecx = |remainder|
if 0
	mov	ebx, [esp+4]	; ebx = (quotient < 0) ? -1 : 0
	xor	eax, ebx
	xor	edx, ebx
	sub	eax, ebx
	sbb	edx, ebx	; edx:eax = quotient
	pop	ebx		; ebx = (remainder < 0) ? -1 : 0
	xor	ecx, ebx
	sub	ecx, ebx
	sbb	ebx, ebx	; ebx:ecx = remainder
else
	pop	ebx		; ebx = (remainder < 0) ? -1 : 0
	xor	ecx, ebx
	sub	ecx, ebx
	sbb	ebx, ebx	; ebx:ecx = remainder
	xor	eax, [esp]
	xor	edx, [esp]
	sub	eax, [esp]
	sbb	edx, [esp]	; edx:eax = quotient
endif
	add	esp, 4
	ret	16		; callee restores stack
	; dividend < divisor: quotient = 0, remainder = dividend
trivial:
	pop	eax		; eax = (remainder < 0) ? -1 : 0
	mov	ecx, [esp+8]
	mov	ebx, [esp+12]	; ebx:ecx = |remainder| = |dividend|
	xor	ecx, eax
	xor	ebx, eax
	sub	ecx, eax
	sbb	ebx, eax	; ebx:ecx = remainder
	pop	edx		; edx = (quotient < 0) ? -1 : 0
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0
	ret	16		; callee restores stack
	; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor = 2**63?
	; perform "extended & adjusted" division
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	mov	ebx, edx	; ebx = divisor'
ifndef JCCLESS
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?
	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"
	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JCCLESS
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JCCLESS
	mov	eax, [esp+16]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	push	ebx		; [esp] = quotient"
	mov	eax, [esp+24]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+28]	; ebx = high dword of divisor * quotient"
	add	edx, ebx	; edx:eax = divisor * quotient"
	mov	ecx, [esp+16]
	mov	ebx, [esp+20]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
ifndef JCCLESS
	pop	eax		; eax = quotient"
	jnb	short @f	; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
	add	ecx, [esp+20]
	adc	ebx, [esp+24]	; ebx:ecx = remainder" + divisor
				;         = |remainder|
	dec	eax		; eax = quotient" - 1
				;     = low dword of |quotient|
@@:
else ; JCCLESS
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	add	[esp], eax	; [esp] = quotient" - (remainder" < 0)
				;       = (low dword of) |quotient|
	and	eax, [esp+24]
	and	edx, [esp+28]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	ecx, eax
	adc	ebx, edx	; ebx:ecx = remainder" + divisor
				;         = |remainder|
	pop	eax		; eax = (low dword of) |quotient|
endif ; JCCLESS
;;	xor	edx, edx	; edx:eax = |quotient|
	pop	edx		; edx = (remainder < 0) ? -1 : 0
	xor	ecx, edx
	xor	ebx, edx
	sub	ecx, edx
	sbb	ebx, edx	; ebx:ecx = remainder
	pop	edx		; edx = (quotient < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = quotient
	ret	16		; callee restores stack
	; dividend = divisor = -2**63: quotient = 1, remainder = 0
special:
	pop	ebx		; ebx = sign of remainder = -1
	inc	ebx
;;	xor	ecx, ecx	; ebx:ecx = remainder = 0
	pop	eax		; eax = sign of quotient = 0
	inc	eax		; eax = (low dword of) quotient = 1
	xor	edx, edx	; edx:eax = quotient = 1
	ret	16		; callee restores stack
_alldvrm endp
	end
Save the assembler source presented above as alldvrm.asm in the directory where you created the
            object library i386.lib before, then execute the
            following 2 command lines to generate the object file
            alldvrm.obj and add it to the existing object library
            i386.lib:
        ML.EXE alldvrm.asm
        LINK.EXE /LIB /OUT:i386.lib i386.lib alldvrm.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: alldvrm.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
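The "extended & adjusted" path of the assembler source above, taken for divisors of at least 2**32, can be modelled in C as follows. This is a sketch, not a translation: model_extended_divide is a hypothetical name, the leading zero count uses a plain loop instead of BSR, and the preconditions stated in the comment are assumptions taken from the comments of the assembler source.

```c
#include <stdint.h>

/* Hypothetical model. Assumed preconditions, as in the signed routine
   after taking magnitudes: 2**32 <= divisor < 2**63, dividend <= 2**63. */
uint64_t model_extended_divide(uint64_t dividend, uint64_t divisor,
                               uint64_t *remainder)
{
    /* number of leading '0' bits of the divisor (1..31 here) */
    unsigned shift = 0;
    while (((divisor << shift) >> 63) == 0)
        shift++;

    /* divisor' = upper 32 bits of the left-aligned divisor */
    uint32_t divisor1 = (uint32_t)((divisor << shift) >> 32);

    /* quotient' = dividend / divisor'; C needs no overflow guard, while
       DIV needs the pre-subtraction seen in the assembler source */
    uint64_t quotient1 = dividend / divisor1;

    /* quotient" = quotient' scaled back; it is exact or 1 too large */
    uint64_t quotient2 = quotient1 >> (32 - shift);

    uint64_t product = divisor * quotient2;
    if (product > dividend) {       /* corresponds to the borrow of SUB/SBB */
        quotient2 -= 1;
        product -= divisor;
    }
    *remainder = dividend - product;
    return quotient2;
}
```

Because divisor' times the scale factor never exceeds the divisor, the estimate can only be too large, and a single correction step suffices.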
_allrem Routine
Execute the following 2 command lines to display the content of the
            assembler source file llrem.asm shipped with the
            Visual C compiler:
        DIR "%source%\intel\llrem.asm"
        TYPE "%source%\intel\llrem.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             7,067 llrem.asm
               1 File(s)          7,067 bytes
               0 Dir(s)    9,876,543,210 bytes free
        title   llrem - signed long remainder routine
;***
;llrem.asm - signed long remainder routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines the signed long remainder routine
;           __allrem
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;llrem - signed long remainder
;
;Purpose:
;       Does a signed long remainder of the arguments.  Arguments are
;       not changed.
;
;Entry:
;       Arguments are passed on the stack:
;               1st pushed: divisor (QWORD)
;               2nd pushed: dividend (QWORD)
;
;Exit:
;       EDX:EAX contains the remainder (dividend%divisor)
;       NOTE: this routine removes the parameters from the stack.
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************
        CODESEG
_allrem PROC NEAR
.FPO (2, 4, 0, 0, 0, 0)
        push    ebx
        push    edi
; Set up the local stack and save the index registers.  When this is done
; the stack frame will look as follows (assuming that the expression a%b will
; generate a call to lrem(a, b)):
;
;               -----------------
;               |               |
;               |---------------|
;               |               |
;               |--divisor (b)--|
;               |               |
;               |---------------|
;               |               |
;               |--dividend (a)-|
;               |               |
;               |---------------|
;               | return addr** |
;               |---------------|
;               |       EBX     |
;               |---------------|
;       ESP---->|       EDI     |
;               -----------------
;
DVND    equ     [esp + 12]      ; stack address of dividend (a)
DVSR    equ     [esp + 20]      ; stack address of divisor (b)
; Determine sign of the result (edi = 0 if result is positive, non-zero
; otherwise) and make operands positive.
        xor     edi,edi         ; result sign assumed positive
        mov     eax,HIWORD(DVND) ; hi word of a
        or      eax,eax         ; test to see if signed
        jge     short L1        ; skip rest if a is already positive
        inc     edi             ; complement result sign flag bit
        mov     edx,LOWORD(DVND) ; lo word of a
        neg     eax             ; make a positive
        neg     edx
        sbb     eax,0
        mov     HIWORD(DVND),eax ; save positive value
        mov     LOWORD(DVND),edx
L1:
        mov     eax,HIWORD(DVSR) ; hi word of b
        or      eax,eax         ; test to see if signed
        jge     short L2        ; skip rest if b is already positive
        mov     edx,LOWORD(DVSR) ; lo word of b
        neg     eax             ; make b positive
        neg     edx
        sbb     eax,0
        mov     HIWORD(DVSR),eax ; save positive value
        mov     LOWORD(DVSR),edx
L2:
;
; Now do the divide.  First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
; NOTE - eax currently contains the high order word of DVSR
;
        or      eax,eax         ; check to see if divisor < 4194304K
        jnz     short L3        ; nope, gotta do this the hard way
        mov     ecx,LOWORD(DVSR) ; load divisor
        mov     eax,HIWORD(DVND) ; load high word of dividend
        xor     edx,edx
        div     ecx             ; edx <- remainder
        mov     eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
        div     ecx             ; edx <- final remainder
        mov     eax,edx         ; edx:eax <- remainder
        xor     edx,edx
        dec     edi             ; check result sign flag
        jns     short L4        ; negate result, restore stack and return
        jmp     short L8        ; result sign ok, restore stack and return
;
; Here we do it the hard way.  Remember, eax contains the high word of DVSR
;
L3:
        mov     ebx,eax         ; ebx:ecx <- divisor
        mov     ecx,LOWORD(DVSR)
        mov     edx,HIWORD(DVND) ; edx:eax <- dividend
        mov     eax,LOWORD(DVND)
L5:
        shr     ebx,1           ; shift divisor right one bit
        rcr     ecx,1
        shr     edx,1           ; shift dividend right one bit
        rcr     eax,1
        or      ebx,ebx
        jnz     short L5        ; loop until divisor < 4194304K
        div     ecx             ; now divide, ignore remainder
;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;
        mov     ecx,eax         ; save a copy of quotient in ECX
        mul     dword ptr HIWORD(DVSR)
        xchg    ecx,eax         ; save product, get quotient in EAX
        mul     dword ptr LOWORD(DVSR)
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jc      short L6        ; carry means Quotient is off by 1
;
; do long compare here between original dividend and the result of the
; multiply in edx:eax.  If original is larger or equal, we are ok, otherwise
; subtract the original divisor from the result.
;
        cmp     edx,HIWORD(DVND) ; compare hi words of result and original
        ja      short L6        ; if result > original, do subtract
        jb      short L7        ; if result < original, we are ok
        cmp     eax,LOWORD(DVND) ; hi words are equal, compare lo words
        jbe     short L7        ; if less or equal we are ok, else subtract
L6:
        sub     eax,LOWORD(DVSR) ; subtract divisor from result
        sbb     edx,HIWORD(DVSR)
L7:
;
; Calculate remainder by subtracting the result from the original dividend.
; Since the result is already in a register, we will do the subtract in the
; opposite direction and negate the result if necessary.
;
        sub     eax,LOWORD(DVND) ; subtract dividend from result
        sbb     edx,HIWORD(DVND)
;
; Now check the result sign flag to see if the result is supposed to be positive
; or negative.  It is currently negated (because we subtracted in the 'wrong'
; direction), so if the sign flag is set we are done, otherwise we must negate
; the result to make it positive again.
;
        dec     edi             ; check result sign flag
        jns     short L8        ; result is ok, restore stack and return
L4:
        neg     edx             ; otherwise, negate the result
        neg     eax
        sbb     edx,0
;
; Just the cleanup left to do.  edx:eax contains the quotient.
; Restore the saved registers and return.
;
L8:
        pop     edi
        pop     ebx
        ret     16
_allrem ENDP
	end
            69 instructions in 178 bytes (plus 14 bytes for alignment).
         Note: unlike the
            IDIV instruction, which raises a
            divide error (#DE) exception when dividing
            −2**63, the smallest signed 64-bit integer, by
            −1, this routine returns the (correct) remainder 0, i.e. the
            only integer smaller in magnitude than the divisor −1!
        
The following implementation, for processors which feature speculative
            execution when the symbol JCCLESS is defined, else processors which
            don’t feature speculative execution, uses 85 instructions in 213
            bytes (plus 11 bytes for alignment) respectively 84 instructions in
            211 bytes (plus 13 bytes for alignment), including 12 instructions
            in 33 bytes for the special and trivial cases not covered by
            Microsoft’s poor implementation:
        ; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
; MSC internal _allrem():
; receives arguments on stack, returns remainder in edx:eax
; NOTE: _allrem() can raise 'division by zero' exception; it does
;       not raise 'integer overflow' exception on quotient overflow,
;       but returns 0 for -2**63 % -1!
_allrem	proc	public		; sqword _allrem(sqword dividend, sqword divisor)
	; determine sign of dividend and compute |dividend|
	mov	eax, [esp+8]
	mov	ecx, [esp+4]	; eax:ecx = dividend
	cdq			; edx = (dividend < 0) ? -1 : 0
	xor	ecx, edx
	xor	eax, edx	; eax:ecx = (dividend < 0) ? ~dividend : dividend
	sub	ecx, edx
	sbb	eax, edx	; eax:ecx = (dividend < 0) ? -dividend : dividend
				;         = |dividend|
	mov	[esp+4], ecx	; write |dividend| back on stack
	mov	[esp+8], eax
	push	edx		; [esp] = (dividend < 0) ? -1 : 0
	; determine sign of divisor and compute |divisor|
	mov	edx, [esp+20]
	mov	eax, [esp+16]	; edx:eax = divisor
	mov	ecx, edx
	sar	ecx, 31		; ecx = (divisor < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx	; edx:eax = (divisor < 0) ? ~divisor : divisor
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = (divisor < 0) ? -divisor : divisor
				;         = |divisor|
	mov	[esp+16], eax	; write |divisor| back on stack
	mov	[esp+20], edx
	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?
	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?
	; remainder < divisor < 2**32
	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+12]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?
	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	jmp	short next
	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
next:
	mov	eax, [esp+8]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) |remainder|
;;	xor	edx, edx	; edx:eax = |remainder|
	pop	edx		; edx = (remainder < 0) ? -1 : 0
	xor	eax, edx
	sub	eax, edx
	sbb	edx, edx	; edx:eax = remainder
	ret	16		; callee restores stack
	; dividend < divisor: remainder = dividend
trivial:
	mov	eax, [esp+8]
	mov	edx, [esp+12]	; edx:eax = |remainder| = |dividend|
	jmp	short remainder
	; 2**63 >= dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor = 2**63?
	; perform "extended & adjusted" division
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JCCLESS
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?
	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"
	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JCCLESS
	mov	edx, [esp+16]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JCCLESS
	mov	eax, [esp+16]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+20]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+24]	; ebx = high dword of divisor * quotient"
	add	edx, ebx	; edx:eax = divisor * quotient"
	mov	ecx, [esp+12]
	mov	ebx, [esp+16]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
ifndef JCCLESS
	jnb	short @f	; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
	add	ecx, [esp+20]
	adc	ebx, [esp+24]	; ebx:ecx = remainder" + divisor
				;         = remainder
@@:
	mov	eax, ecx
	mov	edx, ebx	; edx:eax = |remainder|
else ; JCCLESS
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+20]
	and	edx, [esp+24]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ecx
	adc	edx, ebx	; edx:eax = remainder" + divisor
				;         = |remainder|
endif ; JCCLESS
	pop	ebx
remainder:
	pop	ecx		; ecx = (remainder < 0) ? -1 : 0
	xor	eax, ecx
	xor	edx, ecx
	sub	eax, ecx
	sbb	edx, ecx	; edx:eax = remainder
	ret	16		; callee restores stack
	; dividend = divisor = -2**63: remainder = 0
special:
	pop	eax		; eax = sign of remainder = -1
	inc	eax
	xor	edx, edx	; edx:eax = remainder = 0
	ret	16		; callee restores stack
_allrem	endp
	end
Save the assembler source presented above as allrem.asm in the directory where you created the
            object library i386.lib before, then execute the
            following 2 command lines to generate the object file
            allrem.obj and add it to the existing object library
            i386.lib:
        ML.EXE allrem.asm
        LINK.EXE /LIB /OUT:i386.lib i386.lib allrem.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: allrem.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
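The branch-free idioms used in the assembler source above (sign mask via SAR or CDQ, conditional negation via XOR and SUB, conditional add-back via SBB and AND in the JCCLESS variant) can be written in C as follows. The function names negate_if, magnitude and add_back are hypothetical and serve only this illustration.

```c
#include <stdint.h>

/* mask is 0 or all ones; (x ^ mask) - mask negates x when mask is all
   ones and leaves x unchanged when mask is 0 (XOR followed by SUB/SBB) */
uint64_t negate_if(uint64_t x, uint64_t mask)
{
    return (x ^ mask) - mask;
}

/* Branch-free magnitude of a signed 64-bit integer as an unsigned value;
   the mask corresponds to SAR by 31 on the high dword or to CDQ */
uint64_t magnitude(int64_t value)
{
    uint64_t mask = 0 - (uint64_t)(value < 0);
    return negate_if((uint64_t)value, mask);
}

/* JCCLESS-style correction: add the divisor back only if the trial
   subtraction borrowed, without a conditional jump */
uint64_t add_back(uint64_t trial_remainder, uint64_t divisor, int borrow)
{
    uint64_t mask = 0 - (uint64_t)(borrow != 0);   /* like SBB EAX, EAX; CDQ */
    return trial_remainder + (divisor & mask);
}
```

Note that magnitude() yields 2**63 for −2**63, which is why the routines treat a divisor of 2**63 as a special case.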
_aulldiv Routine
The documentation for the _aulldiv compiler helper routine states:
        Divides two ULONGLONG integers.
        Remarks
        _aulldiv Routine is a helper routine for the C compiler. Whether the compiler calls _aulldiv Routine is completely dependent on the optimization set. For example, to divide two UInt64 values the compiler might generate a call to _aulldiv Routine.
OUCH: contrary to the statement that the call depends on the optimization set, the Visual C compiler generates calls to the _aulldiv routine unconditionally, independent of any optimisation, when it encounters a division where at least one of its operands is an unsigned 64-bit integer!
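As a minimal illustration of the statement above, the following C function contains nothing but such a division; the function name quotient is made up for this sketch. Built with the Visual C compiler for the i386 platform, its body is expected to consist of a call of _aulldiv; other compilers use helper routines of their own.

```c
#include <stdint.h>

/* Hypothetical example function: on the i386 platform a 64-bit unsigned
   division cannot be performed with a single instruction, so the compiler
   emits a call of a helper routine (_aulldiv for Visual C). */
uint64_t quotient(uint64_t dividend, uint64_t divisor)
{
    return dividend / divisor;
}
```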
         Execute the following 2 command lines to display the content of the
            assembler source file ulldiv.asm shipped with the
            Visual C compiler:
        
DIR "%source%\intel\ulldiv.asm"
TYPE "%source%\intel\ulldiv.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             5,079 ulldiv.asm
               1 File(s)          5,079 bytes
               0 Dir(s)    9,876,543,210 bytes free
        title   ulldiv - unsigned long divide routine
;***
;ulldiv.asm - unsigned long divide routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines the unsigned long divide routine
;           __aulldiv
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;ulldiv - unsigned long divide
;
;Purpose:
;       Does a unsigned long divide of the arguments.  Arguments are
;       not changed.
;
;Entry:
;       Arguments are passed on the stack:
;               1st pushed: divisor (QWORD)
;               2nd pushed: dividend (QWORD)
;
;Exit:
;       EDX:EAX contains the quotient (dividend/divisor)
;       NOTE: this routine removes the parameters from the stack.
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************
        CODESEG
_aulldiv        PROC NEAR
.FPO (2, 4, 0, 0, 0, 0)
        push    ebx
;;      push    esi
; Set up the local stack and save the index registers.  When this is done
; the stack frame will look as follows (assuming that the expression a/b will
; generate a call to uldiv(a, b)):
;
;               -----------------
;               |               |
;               |---------------|
;               |               |
;               |--divisor (b)--|
;               |               |
;               |---------------|
;               |               |
;               |--dividend (a)-|
;               |               |
;               |---------------|
;               | return addr** |
;               |---------------|
;               |      EBX      |
;               |---------------|
;       ESP---->|      ESI      |
;               -----------------
;
;;DVND  equ     [esp + 12]      ; stack address of dividend (a)
;;DVSR  equ     [esp + 20]      ; stack address of divisor (b)
DVND    equ     [esp + 8]
DVSR    equ     [esp + 16]
;
; Now do the divide.  First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
;;      mov     eax,HIWORD(DVSR) ; check to see if divisor < 4194304K
;;      or      eax,eax
        mov     edx,HIWORD(DVSR)
        test    edx,edx
        jnz     short L1        ; nope, gotta do this the hard way
        mov     ecx,LOWORD(DVSR) ; load divisor
        mov     eax,HIWORD(DVND) ; load high word of dividend
;;      xor     edx,edx
        div     ecx             ; get high order bits of quotient
        mov     ebx,eax         ; save high bits of quotient
        mov     eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
        div     ecx             ; get low order bits of quotient
        mov     edx,ebx         ; edx:eax <- quotient hi:quotient lo
        jmp     short L2        ; restore stack and return
;
; Here we do it the hard way.  Remember, eax contains DVSRHI
;
L1:
;;      mov     ecx,eax         ; ecx:ebx <- divisor
        mov     ecx,edx
        mov     ebx,LOWORD(DVSR)
        mov     edx,HIWORD(DVND) ; edx:eax <- dividend
        mov     eax,LOWORD(DVND)
L3:
        shr     ecx,1           ; shift divisor right one bit; hi bit <- 0
        rcr     ebx,1
        shr     edx,1           ; shift dividend right one bit; hi bit <- 0
        rcr     eax,1
;;      or      ecx,ecx
        test    ecx,ecx
        jnz     short L3        ; loop until divisor < 4194304K
        div     ebx             ; now divide, ignore remainder
;;      mov     esi,eax         ; save quotient
        mov     ebx,eax
;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;
;;      mul     dword ptr HIWORD(DVSR) ; QUOT * HIWORD(DVSR)
;;      mov     ecx,eax
        mov     ecx,HIWORD(DVSR)
        imul    ecx,ebx
        mov     eax,LOWORD(DVSR)
;;      mul     esi             ; QUOT * LOWORD(DVSR)
        mul     ebx
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jc      short L4        ; carry means Quotient is off by 1
;
; do long compare here between original dividend and the result of the
; multiply in edx:eax.  If original is larger or equal, we are ok, otherwise
; subtract one (1) from the quotient.
;
        cmp     edx,HIWORD(DVND) ; compare hi words of result and original
        ja      short L4        ; if result > original, do subtract
        jb      short L5        ; if result < original, we are ok
        cmp     eax,LOWORD(DVND) ; hi words are equal, compare lo words
        sbb     edx,HIWORD(DVND)
        jbe     short L5        ; if less or equal we are ok, else subtract
L4:
;;      dec     esi             ; subtract 1 from quotient
        dec     ebx
L5:
        xor     edx,edx         ; edx:eax <- quotient
;;      mov     eax,esi
        mov     eax,ebx
;
; Just the cleanup left to do.  edx:eax contains the quotient.
; Restore the saved registers and return.
;
L2:
;;      pop     esi
        pop     ebx
        ret     16
_aulldiv        ENDP
        end
            With 43 instructions in 104 bytes (plus 8 bytes for alignment), this
            routine has several major and minor flaws: 3 major
            flaws on all kinds of processors, and 1 more only on processors
            which feature speculative execution!
 OOPS¹: instead of the 2 deleted OR instructions, which perform superfluous write operations, the 2 inserted TEST instructions should be used.
        
 OOPS²: register EDX should be used instead of register EAX before the conditional branch to label L1:, and the following deleted XOR instruction should be removed.
        
 OOPS³: instead of the deleted first widening MUL instruction and the following deleted MOV instruction, the inserted MOV instruction loading the high part of the divisor into register ECX followed by the inserted faster IMUL instruction should be used.
        
 OUCH¹: instead of register ESI, register EBX should be used, saving a pair of PUSH and POP instructions and 2 bytes!
        
 OUCH²: for divisors less than 2³² and a dividend less than 2³²×divisor, i.e. if the quotient is less than 2³², instead of the long alias schoolbook division performed with the 2 chained DIV instructions (each slower than a mispredicted conditional branch) after the conditional branch to label L1:, a single DIV instruction is sufficient, saving about 40 to 240 processor cycles!
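The case distinction described in OUCH² can be sketched in C as follows. This is only a model under stated assumptions: the helper div64by32 stands in for one DIV instruction, and both function names are made up for this sketch. A divisor of 0 is not handled, just as the assembler routines raise a 'division by zero' exception.

```c
#include <stdint.h>

/* Hypothetical model of one 64÷32-bit DIV instruction: only valid while
   hi < divisor, i.e. while the quotient fits into 32 bits. */
static uint32_t div64by32(uint32_t hi, uint32_t lo, uint32_t divisor,
                          uint32_t *remainder)
{
    uint64_t n = ((uint64_t) hi << 32) | lo;

    *remainder = (uint32_t) (n % divisor);
    return (uint32_t) (n / divisor);
}

/* Divide a 64-bit dividend by a non-zero 32-bit divisor. */
uint64_t udiv64by32(uint64_t dividend, uint32_t divisor)
{
    uint32_t hi = (uint32_t) (dividend >> 32);
    uint32_t lo = (uint32_t) dividend;
    uint32_t rem, qhi, qlo;

    if (hi < divisor)   /* quotient < 2**32: a single division suffices */
        return div64by32(hi, lo, divisor, &rem);

    /* "long" alias "schoolbook" division: 2 chained divisions */
    qhi = div64by32(0, hi, divisor, &rem);
    qlo = div64by32(rem, lo, divisor, &rem);
    return ((uint64_t) qhi << 32) | qlo;
}
```

The comparison of the high dword of the dividend with the divisor costs one well-predictable branch and avoids the second division whenever the quotient fits into 32 bits.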
        
 OUCH³: instead of the (brain)dead slow loop with 2 pairs of SHR and RCR instructions after label L3:, 2 pairs of SHRD and SHR instructions with their shift count determined per BSR instruction should be used!
        
 Note: this BSR instruction would also replace the deleted OR instruction respectively the inserted TEST instruction before the conditional branch to label L1:.
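The loop-free normalisation proposed in OUCH³ can be modelled in C roughly as below. This is a sketch, not a transcription of any routine shown here: the names nlz32 and udiv64_extended are made up, nlz32 merely stands in for the BSR instruction (which yields 31 minus this count), and the caller is assumed to guarantee 2**32 <= divisor <= dividend.

```c
#include <stdint.h>

/* Number of leading '0' bits of a non-zero 32-bit value. */
static unsigned nlz32(uint32_t x)
{
    unsigned n = 0;

    if (x <= 0x0000FFFFu) { n += 16; x <<= 16; }
    if (x <= 0x00FFFFFFu) { n += 8;  x <<= 8; }
    if (x <= 0x0FFFFFFFu) { n += 4;  x <<= 4; }
    if (x <= 0x3FFFFFFFu) { n += 2;  x <<= 2; }
    if (x <= 0x7FFFFFFFu) { n += 1; }
    return n;
}

/* Quotient for 2**32 <= divisor <= dividend (assumed by this sketch). */
uint64_t udiv64_extended(uint64_t dividend, uint64_t divisor)
{
    unsigned n = nlz32((uint32_t) (divisor >> 32));
    uint32_t divisor1;
    uint64_t estimate, product;

    if (n == 0)         /* divisor >= 2**63: quotient = 1 */
        return 1;

    /* upper 32 significant bits of the divisor, determined with one
       shift instead of a bit-by-bit loop */
    divisor1 = (uint32_t) (divisor >> (32 - n));

    /* the truncated divisor is not greater than the divisor, so the
       estimate is either correct or 1 too large */
    estimate = (dividend / divisor1) >> (32 - n);

    /* multiply back; the product wraps around if it exceeds 2**64 - 1 */
    product = divisor * estimate;
    if (product / estimate != divisor || product > dividend)
        estimate--;
    return estimate;
}
```

The shift count depends only on the divisor, so the dividend needs no loop either; the final comparison corrects the estimate exactly as the multiplication and comparison in the assembler routines do.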
        
 OUCH⁴: on processors which feature speculative execution, instead of the 3 conditional branches before label L4:, which are slow when mispredicted, and the 2 CMP instructions, a faster instruction sequence with fewer or no conditional branches should be used!
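One way to avoid these conditional branches is a CMP instruction on the low dwords followed by an SBB instruction on the high dwords, whose final borrow is then subtracted from the quotient. The following C sketch models that idea; the names borrow64 and adjust are made up, and the overflow of the product beyond 2**64 - 1 is deliberately left out.

```c
#include <stdint.h>

/* Models CMP on the low dwords followed by SBB on the high dwords:
   returns the final borrow, i.e. 1 if a < b, else 0. */
unsigned borrow64(uint32_t alo, uint32_t ahi, uint32_t blo, uint32_t bhi)
{
    unsigned borrow = alo < blo;        /* borrow of alo - blo */

    /* borrow of ahi - bhi - borrow */
    return (unsigned) (ahi < bhi) | ((unsigned) (ahi == bhi) & borrow);
}

/* Correct a quotient estimate which is at most 1 too large without a
   conditional branch: subtract 1 if dividend < product. */
uint32_t adjust(uint32_t estimate, uint64_t dividend, uint64_t product)
{
    return estimate - borrow64((uint32_t) dividend,
                               (uint32_t) (dividend >> 32),
                               (uint32_t) product,
                               (uint32_t) (product >> 32));
}
```

In assembler the borrow lives in the carry flag, so an SBB of a register with itself followed by an ADD performs the subtraction shown in adjust.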
        
Note: with the modifications shown in the source, this routine has 38 instructions in 96 bytes.
The following implementation, which targets processors that feature speculative execution when the macro JCCLESS is defined, else processors which don’t feature speculative execution, uses 59 instructions in 147 bytes (plus 13 bytes for alignment) respectively 60 instructions in 148 bytes (plus 12 bytes for alignment), including 12 instructions in 30 bytes for the special and trivial cases not covered by Microsoft’s poor implementation:
        ; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
; MSC internal _aulldiv():
; receives arguments on stack, returns quotient in edx:eax
; NOTE: _aulldiv() can raise 'division by zero' exception!
_aulldiv proc	public		; qword _aulldiv(qword dividend, qword divisor)
	mov	ecx, [esp+8]	; ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	; edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?
	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?
	; remainder < divisor < 2**32
	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?
	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	xor	edx, edx	; edx:eax = quotient
	ret	16		; callee restores stack
	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	push	eax		; [esp] = high dword of quotient
	mov	eax, [esp+8]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	pop	edx		; edx:eax = quotient
	ret	16		; callee restores stack
	; dividend < divisor: quotient = 0
trivial:
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0
	ret	16		; callee restores stack
	; dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor >= 2**63?
	; perform "extended & adjusted" division
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JCCLESS
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?
	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"
	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JCCLESS
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JCCLESS
	mov	eax, [esp+12]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+16]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
ifndef JCCLESS
	mov	ecx, [esp+20]	; ecx = high dword of divisor
	imul	ecx, ebx	; ecx = high dword of divisor * quotient"
	add	edx, ecx	; edx:eax = divisor * quotient"
	jc	short @f	; divisor * quotient" >= 2**64?
	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx	; CF = (dividend < divisor * quotient")
				;    = (remainder" < 0)
@@:
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) quotient
	xor	edx, edx	; edx:eax = quotient
else ; JCCLESS
	mov	ecx, [esp+12]	; ecx = high dword of dividend
	cmp	[esp+8], eax
	sbb	ecx, edx	; ecx:... = dividend
				;         - low dword of divisor * quotient"
	mov	eax, [esp+20]	; eax = high dword of divisor
	imul	eax, ebx	; eax = high dword of divisor * quotient"
if 0
	sub	ecx, eax	; ecx:... = dividend - divisor * quotient"
				;         = remainder"
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	add	eax, ebx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) quotient
	xor	edx, edx	; edx:eax = quotient
else
	xor	edx, edx	; edx = high dword of quotient = 0
	sub	ecx, eax	; ecx:... = dividend - divisor * quotient"
				;         = remainder"
	mov	eax, ebx	; eax = quotient"
	sbb	eax, edx	; eax = quotient" - (remainder" < 0)
				;     = (low dword of) quotient
endif
endif ; JCCLESS
	pop	ebx
	ret	16		; callee restores stack
	; dividend >= divisor >= 2**63: quotient = 1
special:
	xor	eax, eax
	xor	edx, edx
	inc	eax		; edx:eax = quotient = 1
	ret	16		; callee restores stack
_aulldiv endp
	end
Save the i386 assembler source presented above as aulldiv.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file aulldiv.obj and add it to the existing object library i386.lib:
ML.EXE aulldiv.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib aulldiv.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: aulldiv.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
_aulldvrm Routine
Execute the following 2 command lines to display the content of the assembler source file ulldvrm.asm shipped with the Visual C compiler:
DIR "%source%\intel\ulldvrm.asm"
TYPE "%source%\intel\ulldvrm.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             6,227 ulldvrm.asm
               1 File(s)          6,227 bytes
               0 Dir(s)    9,876,543,210 bytes free
        title   ulldvrm - unsigned long divide and remainder routine
;***
;ulldvrm.asm - unsigned long divide and remainder routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines the unsigned long divide and remainder routine
;           __aulldvrm
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;ulldvrm - unsigned long divide and remainder
;
;Purpose:
;       Does a unsigned long divide and remainder of the arguments.  Arguments
;       are not changed.
;
;Entry:
;       Arguments are passed on the stack:
;               1st pushed: divisor (QWORD)
;               2nd pushed: dividend (QWORD)
;
;Exit:
;       EDX:EAX contains the quotient (dividend/divisor)
;       EBX:ECX contains the remainder (divided % divisor)
;       NOTE: this routine removes the parameters from the stack.
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************
        CODESEG
_aulldvrm PROC NEAR
.FPO (1, 4, 0, 0, 0, 0)
        push    esi
; Set up the local stack and save the index registers.  When this is done
; the stack frame will look as follows (assuming that the expression a/b will
; generate a call to aulldvrm(a, b)):
;
;               -----------------
;               |               |
;               |---------------|
;               |               |
;               |--divisor (b)--|
;               |               |
;               |---------------|
;               |               |
;               |--dividend (a)-|
;               |               |
;               |---------------|
;               | return addr** |
;               |---------------|
;       ESP---->|      ESI      |
;               -----------------
;
DVND    equ     [esp + 8]       ; stack address of dividend (a)
DVSR    equ     [esp + 16]      ; stack address of divisor (b)
;
; Now do the divide.  First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
        mov     eax,HIWORD(DVSR) ; check to see if divisor < 4194304K
        or      eax,eax
        jnz     short L1        ; nope, gotta do this the hard way
        mov     ecx,LOWORD(DVSR) ; load divisor
        mov     eax,HIWORD(DVND) ; load high word of dividend
        xor     edx,edx
        div     ecx             ; get high order bits of quotient
        mov     ebx,eax         ; save high bits of quotient
        mov     eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
        div     ecx             ; get low order bits of quotient
        mov     esi,eax         ; ebx:esi <- quotient
;
; Now we need to do a multiply so that we can compute the remainder.
;
        mov     eax,ebx         ; set up high word of quotient
        mul     dword ptr LOWORD(DVSR) ; HIWORD(QUOT) * DVSR
        mov     ecx,eax         ; save the result in ecx
        mov     eax,esi         ; set up low word of quotient
        mul     dword ptr LOWORD(DVSR) ; LOWORD(QUOT) * DVSR
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jmp     short L2        ; complete remainder calculation
;
; Here we do it the hard way.  Remember, eax contains DVSRHI
;
L1:
        mov     ecx,eax         ; ecx:ebx <- divisor
        mov     ebx,LOWORD(DVSR)
        mov     edx,HIWORD(DVND) ; edx:eax <- dividend
        mov     eax,LOWORD(DVND)
L3:
        shr     ecx,1           ; shift divisor right one bit; hi bit <- 0
        rcr     ebx,1
        shr     edx,1           ; shift dividend right one bit; hi bit <- 0
        rcr     eax,1
        or      ecx,ecx
        jnz     short L3        ; loop until divisor < 4194304K
        div     ebx             ; now divide, ignore remainder
        mov     esi,eax         ; save quotient
;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;
        mul     dword ptr HIWORD(DVSR) ; QUOT * HIWORD(DVSR)
        mov     ecx,eax
        mov     eax,LOWORD(DVSR)
        mul     esi             ; QUOT * LOWORD(DVSR)
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jc      short L4        ; carry means Quotient is off by 1
;
; do long compare here between original dividend and the result of the
; multiply in edx:eax.  If original is larger or equal, we are ok, otherwise
; subtract one (1) from the quotient.
;
        cmp     edx,HIWORD(DVND) ; compare hi words of result and original
        ja      short L4        ; if result > original, do subtract
        jb      short L5        ; if result < original, we are ok
        cmp     eax,LOWORD(DVND) ; hi words are equal, compare lo words
        jbe     short L5        ; if less or equal we are ok, else subtract
L4:
        dec     esi             ; subtract 1 from quotient
        sub     eax,LOWORD(DVSR) ; subtract divisor from result
        sbb     edx,HIWORD(DVSR)
L5:
        xor     ebx,ebx         ; ebx:esi <- quotient
L2:
;
; Calculate remainder by subtracting the result from the original dividend.
; Since the result is already in a register, we will do the subtract in the
; opposite direction and negate the result.
;
        sub     eax,LOWORD(DVND) ; subtract dividend from result
        sbb     edx,HIWORD(DVND)
        neg     edx             ; otherwise, negate the result
        neg     eax
        sbb     edx,0
;
; Now we need to get the quotient into edx:eax and the remainder into ebx:ecx.
;
        mov     ecx,edx
        mov     edx,ebx
        mov     ebx,ecx
        mov     ecx,eax
        mov     eax,esi
;
; Just the cleanup left to do.  edx:eax contains the quotient.
; Restore the saved registers and return.
;
        pop     esi
        ret     16
_aulldvrm ENDP
        end
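This routine has 58 instructions in 149 bytes (plus 11 bytes for alignment).
 OUCH: the comment “Now we need to do a multiply so that we can compute the remainder.” with the code following it is a remarkable gem: after the second DIV instruction, the remainder is already present in register EDX!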
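That the second of the 2 chained divisions already delivers the final remainder can be seen in this C model of the long alias schoolbook division; the function name udivrem64by32 is made up for this sketch, and the divisor is assumed to be non-zero.

```c
#include <stdint.h>

/* Divide a 64-bit dividend by a non-zero 32-bit divisor with 2 chained
   divisions; no multiplication is needed to obtain the remainder. */
uint64_t udivrem64by32(uint64_t dividend, uint32_t divisor,
                       uint32_t *remainder)
{
    uint32_t hi = (uint32_t) (dividend >> 32);
    uint32_t lo = (uint32_t) dividend;

    /* 1st division: quotient (EAX) and remainder (EDX) of the high dword */
    uint32_t qhi = hi / divisor;
    uint64_t partial = ((uint64_t) (hi % divisor) << 32) | lo;

    /* 2nd division: its quotient (EAX) is the low dword of the quotient,
       its remainder (EDX) is the remainder of the whole division */
    uint32_t qlo = (uint32_t) (partial / divisor);

    *remainder = (uint32_t) (partial % divisor);
    return ((uint64_t) qhi << 32) | qlo;
}
```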
        
The following implementation, which targets processors that feature speculative execution when the macro JCCLESS is defined, else processors which don’t feature speculative execution, uses 75 instructions in 193 bytes (plus 15 bytes for alignment) respectively 72 instructions in 185 bytes (plus 7 bytes for alignment), including 18 instructions in 50 bytes for the special and trivial cases not covered by Microsoft’s poor implementation:
        ; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
; MSC internal _aulldvrm():
; receives arguments on stack, returns quotient in edx:eax and remainder in ebx:ecx
; NOTE: _aulldvrm() can raise 'division by zero' exception!
_aulldvrm proc	public		; qword _aulldvrm(qword dividend, qword divisor)
	mov	ecx, [esp+8]	; ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	; edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?
	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?
	; remainder < divisor < 2**32
	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?
	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	mov	ecx, edx	; ecx = (low dword of) remainder
	xor	ebx, ebx	; ebx:ecx = remainder
	xor	edx, edx	; edx:eax = quotient
	ret	16		; callee restores stack
	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	ebx, eax	; ebx = high dword of quotient
	mov	eax, [esp+4]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	ecx, edx	; ecx = (low dword of) remainder
	mov	edx, ebx	; edx:eax = quotient
	xor	ebx, ebx	; ebx:ecx = remainder
	ret	16		; callee restores stack
	; dividend < divisor: quotient = 0, remainder = dividend
trivial:
	mov	ecx, [esp+4]
	mov	ebx, [esp+8]	; ebx:ecx = remainder = dividend
	xor	eax, eax
	xor	edx, edx	; edx:eax = quotient = 0
	ret	16		; callee restores stack
	; dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor >= 2**63?
	; perform "extended & adjusted" division
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	mov	ebx, edx	; ebx = divisor'
ifndef JCCLESS
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+8]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?
	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"
	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JCCLESS
	mov	edx, [esp+8]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JCCLESS
	mov	eax, [esp+8]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+12]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	mov	ecx, [esp+16]	; ecx = high dword of divisor
	imul	ecx, ebx	; ecx = high dword of divisor * quotient"
	push	ebx		; [esp] = quotient"
	mov	ebx, [esp+12]	; ebx = high dword of dividend
	sub	ebx, ecx	; ebx = high dword of dividend
				;     - high dword of divisor * quotient"
	mov	ecx, [esp+8]	; ecx = low dword of dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - divisor * quotient"
				;         = remainder"
ifndef JCCLESS
	pop	eax		; eax = quotient"
	jnb	short @f	; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
	add	ecx, [esp+12]
	adc	ebx, [esp+16]	; ebx:ecx = remainder" + divisor
				;         = remainder
	dec	eax		; eax = quotient" - 1
				;     = (low dword of) quotient
@@:
else ; JCCLESS
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	add	[esp], eax	; [esp] = quotient" - (remainder" < 0)
				;       = (low dword of) quotient
	and	eax, [esp+16]
	and	edx, [esp+20]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	ecx, eax
	adc	ebx, edx	; ebx:ecx = remainder" + divisor
				;         = remainder
	pop	eax		; eax = (low dword of) quotient
endif ; JCCLESS
	xor	edx, edx	; edx:eax = quotient
	ret	16		; callee restores stack
	; dividend >= divisor >= 2**63:
	; quotient = 1, remainder = dividend - divisor
special:
	mov	ecx, [esp+4]
	mov	ebx, [esp+8]	; ebx:ecx = dividend
	sub	ecx, eax
	sbb	ebx, edx	; ebx:ecx = dividend - divisor
				;         = remainder
	xor	eax, eax
	xor	edx, edx
	inc	eax		; edx:eax = quotient = 1
	ret	16		; callee restores stack
_aulldvrm endp
	end
Save the i386 assembler source presented above as aulldvrm.asm in the directory where you created the object library i386.lib before, then execute the following 2 command lines to generate the object file aulldvrm.obj and add it to the existing object library i386.lib:
ML.EXE aulldvrm.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib aulldvrm.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: aulldvrm.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
_aullrem Routine
Execute the following 2 command lines to display the content of the assembler source file ullrem.asm shipped with the Visual C compiler:
DIR "%source%\intel\ullrem.asm"
TYPE "%source%\intel\ullrem.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             5,330 ullrem.asm
               1 File(s)          5,330 bytes
               0 Dir(s)    9,876,543,210 bytes free
        title   ullrem - unsigned long remainder routine
;***
;ullrem.asm - unsigned long remainder routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines the unsigned long remainder routine
;           __aullrem
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;ullrem - unsigned long remainder
;
;Purpose:
;       Does a unsigned long remainder of the arguments.  Arguments are
;       not changed.
;
;Entry:
;       Arguments are passed on the stack:
;               1st pushed: divisor (QWORD)
;               2nd pushed: dividend (QWORD)
;
;Exit:
;       EDX:EAX contains the remainder (dividend%divisor)
;       NOTE: this routine removes the parameters from the stack.
;
;Uses:
;       ECX
;
;Exceptions:
;
;*******************************************************************************
        CODESEG
_aullrem        PROC NEAR
.FPO (1, 4, 0, 0, 0, 0)
        push    ebx
; Set up the local stack and save the index registers.  When this is done
; the stack frame will look as follows (assuming that the expression a%b will
; generate a call to ullrem(a, b)):
;
;               -----------------
;               |               |
;               |---------------|
;               |               |
;               |--divisor (b)--|
;               |               |
;               |---------------|
;               |               |
;               |--dividend (a)-|
;               |               |
;               |---------------|
;               | return addr** |
;               |---------------|
;       ESP---->|      EBX      |
;               -----------------
;
DVND    equ     [esp + 8]       ; stack address of dividend (a)
DVSR    equ     [esp + 16]      ; stack address of divisor (b)
; Now do the divide.  First look to see if the divisor is less than 4194304K.
; If so, then we can use a simple algorithm with word divides, otherwise
; things get a little more complex.
;
        mov     eax,HIWORD(DVSR) ; check to see if divisor < 4194304K
        or      eax,eax
        jnz     short L1        ; nope, gotta do this the hard way
        mov     ecx,LOWORD(DVSR) ; load divisor
        mov     eax,HIWORD(DVND) ; load high word of dividend
        xor     edx,edx
        div     ecx             ; edx <- remainder, eax <- quotient
        mov     eax,LOWORD(DVND) ; edx:eax <- remainder:lo word of dividend
        div     ecx             ; edx <- final remainder
        mov     eax,edx         ; edx:eax <- remainder
        xor     edx,edx
        jmp     short L2        ; restore stack and return
;
; Here we do it the hard way.  Remember, eax contains DVSRHI
;
L1:
        mov     ecx,eax         ; ecx:ebx <- divisor
        mov     ebx,LOWORD(DVSR)
        mov     edx,HIWORD(DVND) ; edx:eax <- dividend
        mov     eax,LOWORD(DVND)
L3:
        shr     ecx,1           ; shift divisor right one bit; hi bit <- 0
        rcr     ebx,1
        shr     edx,1           ; shift dividend right one bit; hi bit <- 0
        rcr     eax,1
        or      ecx,ecx
        jnz     short L3        ; loop until divisor < 4194304K
        div     ebx             ; now divide, ignore remainder
;
; We may be off by one, so to check, we will multiply the quotient
; by the divisor and check the result against the orignal dividend
; Note that we must also check for overflow, which can occur if the
; dividend is close to 2**64 and the quotient is off by 1.
;
        mov     ecx,eax         ; save a copy of quotient in ECX
        mul     dword ptr HIWORD(DVSR)
        xchg    ecx,eax         ; put partial product in ECX, get quotient in EAX
        mul     dword ptr LOWORD(DVSR)
        add     edx,ecx         ; EDX:EAX = QUOT * DVSR
        jc      short L4        ; carry means Quotient is off by 1
;
; do long compare here between original dividend and the result of the
; multiply in edx:eax.  If original is larger or equal, we're ok, otherwise
; subtract the original divisor from the result.
;
        cmp     edx,HIWORD(DVND) ; compare hi words of result and original
        ja      short L4        ; if result > original, do subtract
        jb      short L5        ; if result < original, we're ok
        cmp     eax,LOWORD(DVND) ; hi words are equal, compare lo words
        jbe     short L5        ; if less or equal we're ok, else subtract
L4:
        sub     eax,LOWORD(DVSR) ; subtract divisor from result
        sbb     edx,HIWORD(DVSR)
L5:
;
; Calculate remainder by subtracting the result from the original dividend.
; Since the result is already in a register, we will perform the subtract in
; the opposite direction and negate the result to make it positive.
;
        sub     eax,LOWORD(DVND) ; subtract original dividend from result
        sbb     edx,HIWORD(DVND)
        neg     edx             ; and negate it
        neg     eax
        sbb     edx,0
;
; Just the cleanup left to do.  dx:ax contains the remainder.
; Restore the saved registers and return.
;
L2:
        pop     ebx
        ret     16
_aullrem        ENDP
        end
            This routine uses 44 instructions in 117 bytes (plus 11 bytes
            for alignment).
        The following replacement, which substitutes branch-free arithmetic
            for two conditional branches when the macro JCCLESS is defined,
            else is intended for processors which don’t
            feature speculative execution, uses 65 instructions in 173 bytes
            (plus 3 bytes for alignment) respectively 64 instructions in 171
            bytes (plus 5 bytes for alignment), including 14 instructions in 43
            bytes for the special and trivial cases not covered by
            Microsoft’s poor implementation:
        ; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
; MSC internal _aullrem():
; receives arguments on stack, returns remainder in edx:eax
; NOTE: _aullrem() can raise 'division by zero' exception!
_aullrem proc	public		; qword _aullrem(qword dividend, qword divisor)
	mov	ecx, [esp+8]	; ecx = high dword of dividend
	mov	eax, [esp+12]
	mov	edx, [esp+16]	; edx:eax = divisor
	cmp	[esp+4], eax
	sbb	ecx, edx
	jb	short trivial	; dividend < divisor?
	bsr	ecx, edx	; ecx = index of most significant '1' bit
				;        in high dword of divisor
	jnz	short extended	; high dword of divisor <> 0?
	; remainder < divisor < 2**32
	mov	ecx, eax	; ecx = (low dword of) divisor
	mov	eax, [esp+8]	; eax = high dword of dividend
	cmp	eax, ecx
	jae	short long	; high dword of dividend >= divisor?
	; perform normal division
normal:
	mov	edx, eax	; edx = high dword of dividend
	mov	eax, [esp+4]	; edx:eax = dividend
	div	ecx		; eax = (low dword of) quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) remainder
	xor	edx, edx	; edx:eax = remainder
	ret	16		; callee restores stack
	; perform "long" alias "schoolbook" division
long:
;;	xor	edx, edx	; edx:eax = high dword of dividend
	div	ecx		; eax = high dword of quotient,
				; edx = high dword of remainder'
	mov	eax, [esp+4]	; eax = low dword of dividend
	div	ecx		; eax = low dword of quotient,
				; edx = (low dword of) remainder
	mov	eax, edx	; eax = (low dword of) remainder
	xor	edx, edx	; edx:eax = remainder
	ret	16		; callee restores stack
	; dividend < divisor: remainder = dividend
trivial:
	mov	eax, [esp+4]
	mov	edx, [esp+8]	; edx:eax = remainder = dividend
	ret	16		; callee restores stack
	; dividend >= divisor >= 2**32: quotient < 2**32
extended:
	xor	ecx, 31		; ecx = number of leading '0' bits
				;        in (high dword of) divisor
	jz	short special	; divisor >= 2**63?
	; perform "extended & adjusted" division
	shld	edx, eax, cl	; edx = divisor / 2**(index + 1)
				;     = divisor'
;;	shl	eax, cl
	push	ebx
	mov	ebx, edx	; ebx = divisor'
ifndef JCCLESS
	xor	eax, eax	; eax = high dword of quotient' = 0
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx
	jb	short @f	; high dword of dividend < divisor'?
	; high dword of dividend >= divisor':
	; subtract divisor' from high dword of dividend to prevent possible
	; division overflow and set most significant bit of quotient"
	sub	edx, ebx	; edx = high dword of dividend - divisor'
				;     = high dword of dividend'
	inc	eax		; eax = high dword of quotient' = 1
@@:
	push	eax		; [esp] = high dword of quotient'
else ; JCCLESS
	mov	edx, [esp+12]	; edx = high dword of dividend
	cmp	edx, ebx	; CF = (high dword of dividend < divisor')
	sbb	eax, eax	; eax = (high dword of dividend < divisor') ? -1 : 0
	inc	eax		; eax = (high dword of dividend < divisor') ? 0 : 1
				;     = high dword of quotient'
	push	eax		; [esp] = high dword of quotient'
if 0
	neg	eax		; eax = (high dword of dividend < divisor') ? 0 : -1
	and	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
else
	imul	eax, ebx	; eax = (high dword of dividend < divisor') ? 0 : divisor'
endif
	sub	edx, eax	; edx = high dword of dividend
				;     - (high dword of dividend < divisor') ? 0 : divisor'
				;     = high dword of dividend'
endif ; JCCLESS
	mov	eax, [esp+12]	; eax = low dword of dividend
				;     = low dword of dividend'
	div	ebx		; eax = dividend' / divisor'
				;     = low dword of quotient',
				; edx = remainder'
	pop	ebx		; ebx = high dword of quotient'
	shld	ebx, eax, cl	; ebx = quotient' / 2**(index + 1)
				;     = dividend / divisor'
				;     = quotient"
;;	shl	eax, cl
	mov	eax, [esp+16]	; eax = low dword of divisor
	mul	ebx		; edx:eax = low dword of divisor * quotient"
	imul	ebx, [esp+20]	; ebx = high dword of divisor * quotient"
	mov	ecx, [esp+12]	; ecx = high dword of dividend
	sub	ecx, ebx	; ecx = high dword of dividend
				;     - high dword of divisor * quotient"
	mov	ebx, [esp+8]	; ebx = low dword of dividend
	sub	ebx, eax
	sbb	ecx, edx	; ecx:ebx = dividend - divisor * quotient"
				;         = remainder"
ifndef JCCLESS
	jnb	short @f	; remainder" >= 0?
				;  (with borrow, it is off by divisor,
				;   and quotient" is off by 1)
	add	ebx, [esp+16]
	adc	ecx, [esp+20]	; ecx:ebx = remainder" + divisor
				;         = remainder
@@:
	mov	eax, ebx
	mov	edx, ecx	; edx:eax = remainder
else ; JCCLESS
	sbb	eax, eax	; eax = (remainder" < 0) ? -1 : 0
	cdq			; edx = (remainder" < 0) ? -1 : 0
	and	eax, [esp+16]
	and	edx, [esp+20]	; edx:eax = (remainder" < 0) ? divisor : 0
	add	eax, ebx
	adc	edx, ecx	; edx:eax = remainder" + divisor
				;         = remainder
endif ; JCCLESS
	pop	ebx
	ret	16		; callee restores stack
	; dividend >= divisor >= 2**63: remainder = dividend - divisor
special:
if 0
	mov	eax, [esp+4]
	mov	edx, [esp+8]	; edx:eax = dividend
	sub	eax, [esp+12]
	sbb	edx, [esp+16]	; edx:eax = dividend - divisor
				;         = remainder
else
	neg	edx
	neg	eax
	sbb	edx, ecx	; edx:eax = -divisor
	add	eax, [esp+4]
	adc	edx, [esp+8]	; edx:eax = dividend - divisor
				;         = remainder
endif
	ret	16		; callee restores stack
_aullrem endp
	end
        Save the assembler source presented above as
            aullrem.asm in the directory where you created the
            object library i386.lib before, then execute the
            following 2 command lines to generate the object file
            aullrem.obj and add it to the existing object library
            i386.lib:
        ML.EXE aullrem.asm
        LINK.EXE /LIB /OUT:i386.lib i386.lib aullrem.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: aullrem.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
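The case analysis of the replacement _aullrem routine above can be modelled in portable C. The following is only a minimal sketch, not the author's code: the name aullrem_model is hypothetical, the leading '0' bits are counted with a plain loop instead of BSR, the up to 33-bit intermediate quotient comes from a native 64-bit division instead of the conditional subtraction followed by DIV, and the off-by-one estimate is detected by a comparison instead of the borrow flag.

```c
#include <stdint.h>

/* Hypothetical C model of the cases "trivial", "normal"/"long", "special"
   and "extended" of the assembler routine; for illustration only. */
static uint64_t aullrem_model(uint64_t dividend, uint64_t divisor)
{
    uint32_t high = (uint32_t)(divisor >> 32);
    uint32_t scaled, quotient;
    uint64_t wide, remainder;
    unsigned zeros = 0;

    if (dividend < divisor)             /* trivial: remainder = dividend */
        return dividend;

    if (high == 0) {                    /* divisor < 2**32: schoolbook division */
        uint32_t low = (uint32_t)divisor;   /* a zero divisor traps here */
        uint64_t upper = (dividend >> 32) % low;
        return ((upper << 32) | (uint32_t)dividend) % low;
    }

    if (high & 0x80000000u)             /* special: divisor >= 2**63 */
        return dividend - divisor;

    /* extended: 1 to 31 leading '0' bits in the high dword of the divisor */
    while (((high << zeros) & 0x80000000u) == 0)
        zeros++;

    /* divisor' = upper 32 bits of the normalised divisor, at least 2**31 */
    scaled = (uint32_t)((divisor << zeros) >> 32);

    /* quotient' has up to 33 bits; undoing the normalisation yields a
       32-bit estimate which is either exact or 1 too large */
    wide = dividend / scaled;
    quotient = (uint32_t)(wide >> (32 - zeros));

    /* computed modulo 2**64; an estimate 1 too large wraps around to a
       value of at least 2**63, which exceeds every divisor in this case */
    remainder = dividend - divisor * quotient;
    if (remainder >= divisor)
        remainder += divisor;
    return remainder;
}
```

The comparison in the last step is sufficient only because this path is reached with divisors below 2**63; the assembler routine tests the borrow of the 64-bit subtraction instead.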
_aullshr Routine
        The following 2 command lines display the assembler source file
            ullshr.asm shipped with the
            Visual C compiler:
        DIR "%source%\intel\ullshr.asm"
        TYPE "%source%\intel\ullshr.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             1,545 ullshr.asm
               1 File(s)          1,545 bytes
               0 Dir(s)    9,876,543,210 bytes free
        title   ullshr - long shift right
;***
;ullshr.asm - long shift right
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       define unsigned long shift right routine
;           __aullshr
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;ullshr - long shift right
;
;Purpose:
;       Does a unsigned Long Shift Right
;       Shifts a long right any number of bits.
;
;Entry:
;       EDX:EAX - long value to be shifted
;       CL    - number of bits to shift by
;
;Exit:
;       EDX:EAX - shifted value
;
;Uses:
;       CL is destroyed.
;
;Exceptions:
;
;*******************************************************************************
        CODESEG
_aullshr        PROC NEAR
.FPO (0, 0, 0, 0, 0, 0)
;
; Handle shifts of 64 bits or more (if shifting 64 bits or more, the result
; depends only on the high order bit of edx).
;
        cmp     cl,64
        jae     short RETZERO
;
; Handle shifts of between 0 and 31 bits
;
        cmp     cl, 32
        jae     short MORE32
        shrd    eax,edx,cl
        shr     edx,cl
        ret
;
; Handle shifts of between 32 and 63 bits
;
MORE32:
        mov     eax,edx
        xor     edx,edx
        and     cl,31
        shr     eax,cl
        ret
;
; return 0 in edx:eax
;
RETZERO:
        xor     eax,eax
        xor     edx,edx
        ret
_aullshr        ENDP
        end
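            This routine uses 15 instructions in 31 bytes (plus 1 byte for
            alignment).
         OUCH: i386 and newer processors
            mask the count of a shift instruction to its low 5 bits, i.e. they
            shift 32-bit registers modulo the register size; the
            AND CL,31 instruction after the MORE32 label
            is therefore superfluous!
        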
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
; MSC internal _aullshr():
; receives arguments in edx:eax and cl, returns result in edx:eax
_aullshr proc	public		; qword _aullshr(qword value, byte count)
	cmp	cl, 31
	ja	short @f	; count > 31?
	shrd	eax, edx, cl
	shr	edx, cl		; edx:eax = result
	ret
@@:
	xor	eax, eax	; eax = high dword of result
				;     = 0
	cmp	cl, 63
	ja	short @f	; count > 63?
	xchg	eax, edx	; eax = high dword of value,
				; edx = high dword of result = 0
	shr	eax, cl		; edx:eax = result
	ret
@@:
	cdq			; edx:eax = result = 0
	ret
_aullshr endp
	end
        Save the assembler source presented above as
            aullshr.asm in the directory where you created the
            object library i386.lib before, then execute the
            following 2 command lines to generate the object file
            aullshr.obj and add it to the existing object library
            i386.lib:
        ML.EXE aullshr.asm
        LINK.EXE /LIB /OUT:i386.lib i386.lib aullshr.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: aullshr.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
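The three count ranges handled by the _aullshr routine, and the reason why no explicit AND is needed for counts from 32 to 63, can be illustrated with a small C model. This is a sketch under assumptions: the names shr32, shrd32 and aullshr_model are hypothetical, and the helpers mask the count themselves because C, unlike the processor, leaves shifts by 32 or more bits of a 32-bit value undefined.

```c
#include <stdint.h>

/* SHR r32, CL: the processor uses only the low 5 bits of the count */
static uint32_t shr32(uint32_t value, unsigned count)
{
    return value >> (count & 31);
}

/* SHRD low, high, CL: shift low right, fill vacated bits from high */
static uint32_t shrd32(uint32_t low, uint32_t high, unsigned count)
{
    count &= 31;
    return count ? (low >> count) | (high << (32 - count)) : low;
}

/* Hypothetical model of _aullshr: value in EDX:EAX, count in CL */
static uint64_t aullshr_model(uint64_t value, unsigned char count)
{
    uint32_t eax = (uint32_t)value;
    uint32_t edx = (uint32_t)(value >> 32);

    if (count <= 31) {                  /* 0 to 31 bits */
        eax = shrd32(eax, edx, count);
        edx = shr32(edx, count);
    } else if (count <= 63) {           /* 32 to 63 bits */
        eax = shr32(edx, count);        /* count is reduced modulo 32 */
        edx = 0;
    } else {                            /* 64 or more bits */
        eax = 0;
        edx = 0;
    }
    return ((uint64_t)edx << 32) | eax;
}
```

For example, a count of 36 moves the high dword into the low dword and then shifts it by 36 modulo 32, i.e. by 4 bits.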
_allshl Routine
        The following 2 command lines display the assembler source file
            llshl.asm shipped with the
            Visual C compiler:
        DIR "%source%\intel\llshl.asm"
        TYPE "%source%\intel\llshl.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             1,493 llshl.asm
               1 File(s)          1,493 bytes
               0 Dir(s)    9,876,543,210 bytes free
        title   llshl - long shift left
;***
;llshl.asm - long shift left
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       define long shift left routine (signed and unsigned are same)
;           __allshl
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;llshl - long shift left
;
;Purpose:
;       Does a Long Shift Left (signed and unsigned are identical)
;       Shifts a long left any number of bits.
;
;Entry:
;       EDX:EAX - long value to be shifted
;       CL    - number of bits to shift by
;
;Exit:
;       EDX:EAX - shifted value
;
;Uses:
;       CL is destroyed.
;
;Exceptions:
;
;*******************************************************************************
        CODESEG
_allshl PROC NEAR
.FPO (0, 0, 0, 0, 0 ,0)
;
; Handle shifts of 64 or more bits (all get 0)
;
        cmp     cl, 64
        jae     short RETZERO
;
; Handle shifts of between 0 and 31 bits
;
        cmp     cl, 32
        jae     short MORE32
        shld    edx,eax,cl
        shl     eax,cl
        ret
;
; Handle shifts of between 32 and 63 bits
;
MORE32:
        mov     edx,eax
        xor     eax,eax
        and     cl,31
        shl     edx,cl
        ret
;
; return 0 in edx:eax
;
RETZERO:
        xor     eax,eax
        xor     edx,edx
        ret
_allshl ENDP
        end
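            This routine uses 15 instructions in 31 bytes (plus 1 byte for
            alignment).
         OUCH: i386 and newer processors
            mask the count of a shift instruction to its low 5 bits, i.e. they
            shift 32-bit registers modulo the register size; the
            AND CL,31 instruction after the MORE32 label
            is therefore superfluous!
        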
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
; MSC internal _allshl():
; receives arguments in edx:eax and cl, returns result in edx:eax
_allshl	proc	public		; sqword _allshl(sqword value, byte count)
	cmp	cl, 31
	ja	short @f	; count > 31?
	shld	edx, eax, cl
	shl	eax, cl		; edx:eax = result
	ret
@@:
	mov	edx, eax	; edx = low dword of value
	xor	eax, eax	; eax = low dword of result
				;     = 0
	cmp	cl, 63
	ja	short @f	; count > 63?
	shl	edx, cl		; edx:eax = result
	ret
@@:
	cdq			; edx:eax = result = 0
	ret
_allshl	endp
	end
        Save the assembler source presented above as
            allshl.asm in the directory where you created the
            object library i386.lib before, then execute the
            following 2 command lines to generate the object file
            allshl.obj and add it to the existing object library
            i386.lib:
        ML.EXE allshl.asm
        LINK.EXE /LIB /OUT:i386.lib i386.lib allshl.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: allshl.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
_allshr Routine
        The following 2 command lines display the assembler source file
            llshr.asm shipped with the
            Visual C compiler:
        DIR "%source%\intel\llshr.asm"
        TYPE "%source%\intel\llshr.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             1,561 llshr.asm
               1 File(s)          1,561 bytes
               0 Dir(s)    9,876,543,210 bytes free
        title   llshr - long shift right
;***
;llshr.asm - long shift right
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       define signed long shift right routine
;           __allshr
;
;*******************************************************************************
.xlist
include cruntime.inc
include mm.inc
.list
;***
;llshr - long shift right
;
;Purpose:
;       Does a signed Long Shift Right
;       Shifts a long right any number of bits.
;
;Entry:
;       EDX:EAX - long value to be shifted
;       CL    - number of bits to shift by
;
;Exit:
;       EDX:EAX - shifted value
;
;Uses:
;       CL is destroyed.
;
;Exceptions:
;
;*******************************************************************************
        CODESEG
_allshr PROC NEAR
.FPO (0, 0, 0, 0, 0, 0)
;
; Handle shifts of 64 bits or more (if shifting 64 bits or more, the result
; depends only on the high order bit of edx).
;
        cmp     cl,64
        jae     short RETSIGN
;
; Handle shifts of between 0 and 31 bits
;
        cmp     cl, 32
        jae     short MORE32
        shrd    eax,edx,cl
        sar     edx,cl
        ret
;
; Handle shifts of between 32 and 63 bits
;
MORE32:
        mov     eax,edx
        sar     edx,31
        and     cl,31
        sar     eax,cl
        ret
;
; Return double precision 0 or -1, depending on the sign of edx
;
RETSIGN:
        sar     edx,31
        mov     eax,edx
        ret
_allshr ENDP
        end
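            This routine uses 15 instructions in 33 bytes (plus 15 bytes for
            alignment).
         OUCH: i386 and newer processors
            mask the count of a shift instruction to its low 5 bits, i.e. they
            shift 32-bit registers modulo the register size; the
            AND CL,31 instruction after the MORE32 label
            is therefore superfluous!
        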
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
; MSC internal _allshr():
; receives arguments in edx:eax and cl, returns result in edx:eax
_allshr	proc	public		; sqword _allshr(sqword value, byte count)
	cmp	cl, 31
	ja	short @f	; count > 31?
	shrd	eax, edx, cl	; eax = low dword of result
	sar	edx, cl		; edx:eax = result
	ret
@@:
	mov	eax, edx	; eax = high dword of value
	cdq			; edx = (value < 0) ? -1 : 0
				;     = high dword of result
	cmp	cl, 63
	ja	short @f	; count > 63?
	sar	eax, cl		; edx:eax = result
	ret
@@:
	mov	eax, edx	; edx:eax = (value < 0) ? -1 : 0
				;         = result
	ret
_allshr	endp
	end
        Save the assembler source presented above as
            allshr.asm in the directory where you created the
            object library i386.lib before, then execute the
            following 2 command lines to generate the object file
            allshr.obj and add it to the existing object library
            i386.lib:
        ML.EXE allshr.asm
        LINK.EXE /LIB /OUT:i386.lib i386.lib allshr.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: allshr.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
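For the signed _allshr routine, the low dword is shifted with SHRD taking its fill bits from the high dword, while the high dword is shifted with SAR, which replicates the sign bit. The following C model is a sketch under assumptions: the names sar32, shrd32 and allshr_model are hypothetical, the sign fill is built by hand because C leaves right shifts of negative values implementation-defined, and the final conversion back to a signed type presumes two's complement representation.

```c
#include <stdint.h>

/* SAR r32, CL: arithmetic shift, count masked to its low 5 bits */
static uint32_t sar32(uint32_t value, unsigned count)
{
    uint32_t fill = (value & 0x80000000u) ? 0xFFFFFFFFu : 0u;

    count &= 31;
    return count ? (value >> count) | (fill << (32 - count)) : value;
}

/* SHRD low, high, CL: shift low right, fill vacated bits from high */
static uint32_t shrd32(uint32_t low, uint32_t high, unsigned count)
{
    count &= 31;
    return count ? (low >> count) | (high << (32 - count)) : low;
}

/* Hypothetical model of _allshr: value in EDX:EAX, count in CL */
static int64_t allshr_model(int64_t value, unsigned char count)
{
    uint32_t eax = (uint32_t)(uint64_t)value;
    uint32_t edx = (uint32_t)((uint64_t)value >> 32);

    if (count <= 31) {                  /* 0 to 31 bits */
        eax = shrd32(eax, edx, count);  /* low dword, filled from high dword */
        edx = sar32(edx, count);        /* high dword keeps the sign */
    } else {
        eax = edx;                      /* high dword becomes low dword */
        edx = sar32(edx, 31);           /* 0 or -1, like CDQ */
        if (count <= 63)                /* 32 to 63 bits */
            eax = sar32(eax, count);    /* count is reduced modulo 32 */
        else                            /* 64 or more bits */
            eax = edx;
    }
    return (int64_t)(((uint64_t)edx << 32) | eax);
}
```

Swapping the operands of SHRD, or applying SAR to the low dword, would yield wrong results for every count from 1 to 31, for example -4 would no longer be the result of shifting -8 by 1 bit.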
_all* and _aull* Routines in LeakedSource
DIR "%source%\intel\ll*.asm"
DIR "%source%\intel\ull*.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             6,670 lldiv.asm
02/18/2011  03:40 PM             8,557 lldvrm.asm
02/18/2011  03:40 PM             2,570 llmul.asm
02/18/2011  03:40 PM             7,067 llrem.asm
02/18/2011  03:40 PM             1,493 llshl.asm
02/18/2011  03:40 PM             1,561 llshr.asm
               6 File(s)         27,918 bytes
               0 Dir(s)    9,876,543,210 bytes free
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             5,079 ulldiv.asm
02/18/2011  03:40 PM             6,227 ulldvrm.asm
02/18/2011  03:40 PM             5,330 ullrem.asm
02/18/2011  03:40 PM             1,545 ullshr.asm
               4 File(s)         18,181 bytes
               0 Dir(s)    9,876,543,210 bytes free
            The following table presents the revision history extracted from the
            i386 assembler source file
            blcrtasm.asm,
            but stripped from the 10 individual assembler source files shown
            above.
         Note: on November 19, 1993, 8 (in words: eight) years
            after Intel® introduced their 80386 processor, and 8 months
            after they introduced the Pentium® processor, these routines
            were modified to work on 64 bit integers, but without taking
            advantage of these 32-bit processors’ new instructions like
            BSF, BSR, SHLD and SHRD to replace the (dead)slow loops which
            shift both operands by just one bit per pass with SHR and
            RCR instructions.
        
 Note: even the initial versions of the
            _alldvrm and _aulldvrm routines,
            created October 6, 1998, almost 3 years
            after Intel introduced their Pentium® Pro processor, and 17 months
            after they introduced the Pentium® II processor, failed to
            rectify (not just) this performance degrading deficiency.
            Reference: Intel Microprocessor Quick Reference Guide - Product Family
        
| Routine(s) | Date | Who | Comment |
|---|---|---|---|
| llshl, llshr, ullshr | 1983-11-?? | HS | initial version |
| lldiv, llmul, llrem, ulldiv, ullrem | 1983-11-29 | DFW | initial version |
| llshl, llshr, ullshr | 1983-11-30 | DFW | added medium model support |
| llshl, llshr, ullshr | 1984-03-12 | DFW | broke apart; added long model support |
| lldiv, llmul, llrem, llshl, llshr, ulldiv, ullrem, ullshr | 1984-06-01 | RN | modified to use cmacros |
| llmul | 1985-04-17 | TC | ignore signs since they take care of themselves; do a fast multiply if both hiwords of arguments are 0 |
| llmul | 1986-10-10 | MH | slightly faster implementation, for 0 in upper words |
| lldiv, llrem, ulldiv, ullrem | 1987-10-23 | SKS | fixed off-by-1 error for dividend close to 2**32. |
| llmul | 1989-03-20 | SKS | Remove redundant "MOV SP,BP" from epilogs |
| llmul | 1989-05-18 | SKS | Preserve BX |
| lldiv, llrem, ulldiv, ullrem | 1989-05-18 | SKS | Remove redundant "MOV SP,BP" from epilog |
| lldiv, llmul, llrem, llshl, llshr, ulldiv, ullrem, ullshr | 1989-11-28 | GJF | Fixed copyright |
| lldiv, llmul, llrem, llshl, llshr, ulldiv, ullrem, ullshr | 1993-11-19 | SMK | Modified to work on 64 bit integers |
| lldiv, llmul, llrem, llshl, llshr, ulldiv, ullrem, ullshr | 1994-01-17 | GDF | Minor changes to build with NT's masm386. |
| llshl, llshr, ullshr | 1994-07-08 | GDF | Faster, fatter version for NT. |
| llshl, llshr, ullshr | 1994-07-13 | GDF | Further improvements from JonM. |
| lldiv, llmul, llrem, ulldiv, ullrem | 1994-07-22 | GJF | Use esp-relative addressing for args. Shortened conditional jumps. Also, don't use xchg to do a simple move between regs. |
| lldvrm, ulldvrm | 1998-10-06 | SMK | Initial version. |
_all* and _aull* Routines
            With the preprocessor macro CYCLES defined, the
            following program measures the execution times of signed and
            unsigned 64÷64-bit divisions as well as 64×64-bit
            multiplications in processor clock cycles and runs on
            Windows Vista® and later versions, else
            it measures the execution times in nanoseconds and runs on all
            versions of
            Windows™ NT;
            it executes each operation on 1 billion pairs of uniformly
            distributed 64-bit pseudo-random numbers in a first pass, then on 1
            billion pairs of uniformly distributed 33 to 64-bit pseudo-random
            numbers in a second pass:
        // Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
typedef	LONGLONG	SQWORD;
typedef	ULONGLONG	QWORD;
#define _(DIVIDEND, DIVISOR)	{(DIVIDEND), (DIVISOR), (DIVIDEND) / (DIVISOR), (DIVIDEND) % (DIVISOR)}
const	struct	_ull
{
	QWORD	ullDividend, ullDivisor, ullQuotient, ullRemainder;
} ullTable[] = {_(0x0000000000000000ULL, 0x0000000000000001ULL),
                _(0x0000000000000001ULL, 0x0000000000000001ULL),
                _(0x0000000000000002ULL, 0x0000000000000001ULL),
                _(0x0000000000000002ULL, 0x0000000000000002ULL),
                _(0x0000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x0000000000000001ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x0000000000000002ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x0000000000000003ULL, 0xFFFFFFFFFFFFFFFDULL),
                _(0x000000000FFFFFFFULL, 0x0000000000000001ULL),
                _(0x0000000FFFFFFFFFULL, 0x000000000000000FULL),
                _(0x0000000FFFFFFFFFULL, 0x0000000000000010ULL),
                _(0x0000000000000100ULL, 0x000000000FFFFFFFULL),
                _(0x00FFFFFFF0000000ULL, 0x0000000010000000ULL),
                _(0x07FFFFFF80000000ULL, 0x0000000080000000ULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0x0000FFFFFFFFFFFEULL),
                _(0x7FFFFFFEFFFFFFF0ULL, 0x7FFFFFFEFFFFFFF0ULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFDULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x7FFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x8000000000000000ULL, 0x0000000000000001ULL),
                _(0x8000000000000000ULL, 0x0000000000000002ULL),
                _(0x8000000000000000ULL, 0x0000000000000003ULL),
                _(0x8000000000000000ULL, 0x00000000FFFFFFFDULL),
                _(0x8000000000000000ULL, 0x00000000FFFFFFFEULL),
                _(0x8000000000000000ULL, 0x00000000FFFFFFFFULL),
                _(0x8000000000000000ULL, 0x0000000100000000ULL),
                _(0x8000000000000000ULL, 0x0000000100000001ULL),
                _(0x8000000000000000ULL, 0x0000000100000002ULL),
                _(0x8000000000000000ULL, 0x0000000100000003ULL),
                _(0x8000000000000000ULL, 0xFFFFFFFF00000000ULL),
                _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFDULL),
                _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0x8000000000000000ULL, 0xFFFFFFFFFFFFFFFFULL),
                _(0x8000000080000000ULL, 0x0000000080000000ULL),
                _(0x8000000080000001ULL, 0x0000000080000001ULL),
                _(0xFFFFFFFEFFFFFFF0ULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFCULL, 0x00000000FFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFCULL, 0x0000000100000002ULL),
                _(0xFFFFFFFFFFFFFFFEULL, 0x0000000080000000ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000001ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000002ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000000000003ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFDULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000000FFFFFFFFULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000001ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000002ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000100000003ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x00000001C0000001ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x0000000380000003ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x8000000000000000ULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0x7FFFFFFFFFFFFFFFULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFEULL),
                _(0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL)};
const	struct	_ll
{
	SQWORD	llDividend, llDivisor, llQuotient, llRemainder;
} llTable[] = {_(0x0000000000000000LL, 0x0000000000000001LL),	// 0, 1
               _(0x0000000000000001LL, 0x0000000000000001LL),	// 1, 1
               _(0x0000000000000000LL, 0xFFFFFFFFFFFFFFFFLL),	// 0, -1
               _(0x0000000000000001LL, 0xFFFFFFFFFFFFFFFFLL),	// 1, -1
               _(0x0000000000000001LL, 0xFFFFFFFFFFFFFFFELL),	// 1, -2
               _(0x0000000000000002LL, 0xFFFFFFFFFFFFFFFELL),	// 2, -2
               _(0x000000000FFFFFFFLL, 0x0000000000000001LL),
               _(0x0000000FFFFFFFFFLL, 0x000000000000000FLL),
               _(0x0000000FFFFFFFFFLL, 0x0000000000000010LL),
               _(0x0000000000000100LL, 0x000000000FFFFFFFLL),
               _(0x00FFFFFFF0000000LL, 0x0000000010000000LL),
               _(0x07FFFFFF80000000LL, 0x0000000080000000LL),
               _(0x7FFFFFFEFFFFFFF0LL, 0xFFFFFFFFFFFFFFFELL),
               _(0x7FFFFFFEFFFFFFF0LL, 0x0000FFFFFFFFFFFELL),
               _(0x7FFFFFFEFFFFFFF0LL, 0x7FFFFFFEFFFFFFF0LL),
               _(0x7FFFFFFFFFFFFFFFLL, 0x8000000000000000LL),	// llmax, llmin
               _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFDLL),	// llmax, -3
               _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFELL),	// llmax, -2
               _(0x7FFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFFLL),	// llmax, -1
               _(0x8000000000000000LL, 0x0000000000000001LL),	// llmin, 1
               _(0x8000000000000000LL, 0x0000000000000002LL),	// llmin, 2
               _(0x8000000000000000LL, 0x0000000000000003LL),	// llmin, 3
               _(0x8000000000000000LL, 0x00000000FFFFFFFELL),
               _(0x8000000000000000LL, 0x00000000FFFFFFFFLL),
               _(0x8000000000000000LL, 0x0000000100000000LL),
               _(0x8000000000000000LL, 0x0000000100000001LL),
               _(0x8000000000000000LL, 0x0000000100000002LL),
               _(0x8000000000000000LL, 0x8000000000000000LL),	// llmin, llmin
               _(0x8000000000000000LL, 0xFFFFFFFF00000000LL),
               _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFDLL),	// llmin, -3
               _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFELL),	// llmin, -2
               _(0x8000000000000000LL, 0xFFFFFFFFFFFFFFFFLL),	// llmin, -1
               _(0x8000000080000000LL, 0x0000000080000000LL),
               _(0x8000000080000001LL, 0x0000000080000001LL),
               _(0xFFFFFFFEFFFFFFF0LL, 0xFFFFFFFFFFFFFFFELL),
               _(0xFFFFFFFFFFFFFFFELL, 0x0000000080000000LL),
               _(0xFFFFFFFFFFFFFFFELL, 0x0000000000000001LL),	// -2, 1
               _(0xFFFFFFFFFFFFFFFELL, 0x0000000000000002LL),	// -2, 2
               _(0xFFFFFFFFFFFFFFFELL, 0xFFFFFFFFFFFFFFFELL),	// -2, -2
               _(0xFFFFFFFFFFFFFFFELL, 0xFFFFFFFFFFFFFFFFLL),	// -2, -1
               _(0xFFFFFFFFFFFFFFFFLL, 0x0000000000000001LL),	// -1, 1
               _(0xFFFFFFFFFFFFFFFFLL, 0x0000000000000002LL),	// -1, 2
               _(0xFFFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFELL),	// -1, -2
               _(0xFFFFFFFFFFFFFFFFLL, 0xFFFFFFFFFFFFFFFFLL)};	// -1, -1
#undef _
__declspec(naked)
__declspec(noinline)
QWORD	WINAPI	_aullnop(QWORD ullLeft, QWORD ullRight)
{
	__asm	ret	16
}
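__forceinline	// companion for __emulu()
struct
{
	DWORD	ulQuotient, ulRemainder;
}	WINAPI	__edivmodu(QWORD ullDividend, DWORD ulDivisor)
{
	// DIV divides EDX:EAX by its 32-bit operand and leaves the quotient in EAX
	//  and the remainder in EDX, the register pair used to return the structure;
	//  the quotient must fit in 32 bits, else DIV raises a divide error
	__asm	mov	eax, dword ptr ullDividend
	__asm	mov	edx, dword ptr ullDividend+4
	__asm	div	ulDivisor
}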
__declspec(safebuffers)
BOOL	CDECL	PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
	CHAR	szFormat[1025];
	DWORD	dwFormat;
	DWORD	dwOutput;
	va_list	vaInput;
	va_start(vaInput, lpFormat);
	dwFormat = wvsprintf(szFormat, lpFormat, vaInput);
	va_end(vaInput);
	if ((dwFormat == 0UL)
	 || !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
		return FALSE;
	return dwOutput == dwFormat;
}
__declspec(noreturn)
VOID	CDECL	mainCRTStartup(VOID)
{
	DWORD	dw, dwCPUID[12];
	QWORD	qwT0, qwT1, qwT2, qwT3, qwT4, qwT5, qwT6, qwT7, qwT8, qwT9;
	QWORD	ullQuotient, ullRemainder;
	SQWORD	llQuotient, llRemainder;
	volatile
	QWORD	qwQuotient, qwRemainder;
	QWORD	qwDividend, qwDivisor = ~0ULL;
	QWORD	qwLeft = 0x9E3779B97F4A7C15ULL;		// 2**64 / golden ratio
	QWORD	qwRight = 0x28208A20A08A28ACULL;	// bit-vector of prime numbers:
							//  2**prime is set for each prime in [0, 63]
	HANDLE	hThread = GetCurrentThread();
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
	if ((hOutput == INVALID_HANDLE_VALUE)
	 || (SetThreadIdealProcessor(hThread, 0UL) == -1L)
	 || (!SetThreadPriority(hThread, THREAD_PRIORITY_HIGHEST)))
		ExitProcess(GetLastError());
	__cpuid(dwCPUID, 0x80000000UL);
	if (*dwCPUID >= 0x80000004UL)
	{
		__cpuid(dwCPUID, 0x80000002UL);
		__cpuid(dwCPUID + 4, 0x80000003UL);
		__cpuid(dwCPUID + 8, 0x80000004UL);
	}
	else
		__movsb(dwCPUID, "unidentified processor", sizeof("unidentified processor"));
	PrintFormat(hOutput, "\r\nTesting unsigned 64-bit division...\r\n");
	for (dw = 0UL; dw < sizeof(ullTable) / sizeof(*ullTable); dw++)
	{
		PrintFormat(hOutput, "\r%lu", dw);
		ullQuotient = ullTable[dw].ullDividend / ullTable[dw].ullDivisor;
		ullRemainder = ullTable[dw].ullDividend % ullTable[dw].ullDivisor;
		if (ullQuotient != ullTable[dw].ullQuotient)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a quotient %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient, ullTable[dw].ullQuotient);
		if (ullQuotient > ullTable[dw].ullDividend)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a quotient %I64u greater dividend\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient);
		if (ullRemainder != ullTable[dw].ullRemainder)
			PrintFormat(hOutput,
			            "\t%I64u %% %I64u:\a remainder %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);
		if (ullRemainder >= ullTable[dw].ullDivisor)
			PrintFormat(hOutput,
			            "\t%I64u %% %I64u:\a remainder %I64u not less divisor\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder);
	}
	for (dw = 0UL; dw < sizeof(ullTable) / sizeof(*ullTable); dw++)
	{
		PrintFormat(hOutput, "\r%ld", ~dw);
		ullQuotient = ullTable[dw].ullDividend / ullTable[dw].ullDivisor;
		if (ullQuotient != ullTable[dw].ullQuotient)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a quotient %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient, ullTable[dw].ullQuotient);
		if (ullQuotient > ullTable[dw].ullDividend)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a quotient %I64u greater dividend\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullQuotient);
		ullRemainder = ullTable[dw].ullDividend - ullTable[dw].ullDivisor * ullQuotient;
		if (ullRemainder != ullTable[dw].ullRemainder)
			PrintFormat(hOutput,
			            "\t%I64u / %I64u:\a remainder %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);
		ullRemainder = ullTable[dw].ullDividend % ullTable[dw].ullDivisor;
		if (ullRemainder != ullTable[dw].ullRemainder)
			PrintFormat(hOutput,
			            "\t%I64u %% %I64u:\a remainder %I64u not equal %I64u\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder, ullTable[dw].ullRemainder);
		if (ullRemainder >= ullTable[dw].ullDivisor)
			PrintFormat(hOutput,
			            "\t%I64u %% %I64u:\a remainder %I64u not less divisor\r\n",
			            ullTable[dw].ullDividend, ullTable[dw].ullDivisor, ullRemainder);
	}
	PrintFormat(hOutput, "\r\nTesting signed 64-bit division...\r\n");
	for (dw = 0UL; dw < sizeof(llTable) / sizeof(*llTable); dw++)
	{
		PrintFormat(hOutput, "\r%lu", dw);
		llQuotient = llTable[dw].llDividend / llTable[dw].llDivisor;
		llRemainder = llTable[dw].llDividend % llTable[dw].llDivisor;
		if (llQuotient != llTable[dw].llQuotient)
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a quotient %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient, llTable[dw].llQuotient);
		if ((llTable[dw].llDividend < 0LL) && (llQuotient < llTable[dw].llDividend)
		 || (llTable[dw].llDividend >= 0LL) && (llQuotient > llTable[dw].llDividend))
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a quotient %I64d greater dividend\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient);
		if (llRemainder != llTable[dw].llRemainder)
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a remainder %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);
		if ((llTable[dw].llDivisor < 0LL) && (llRemainder <= llTable[dw].llDivisor)
		 || (llTable[dw].llDivisor > 0LL) && (llRemainder >= llTable[dw].llDivisor))
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a remainder %I64d not less divisor\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder);
		if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);
		llRemainder = llTable[dw].llDividend - llTable[dw].llDivisor * llQuotient;
		if (llRemainder != llTable[dw].llRemainder)
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a remainder %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);
		if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);
	}
	for (dw = 0UL; dw < sizeof(llTable) / sizeof(*llTable); dw++)
	{
		PrintFormat(hOutput, "\r%ld", ~dw);
		llQuotient = llTable[dw].llDividend / llTable[dw].llDivisor;
		if (llQuotient != llTable[dw].llQuotient)
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a quotient %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient, llTable[dw].llQuotient);
		if ((llTable[dw].llDividend < 0LL) && (llQuotient < llTable[dw].llDividend)
		 || (llTable[dw].llDividend >= 0LL) && (llQuotient > llTable[dw].llDividend))
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a quotient %I64d greater dividend\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llQuotient);
		llRemainder = llTable[dw].llDividend - llTable[dw].llDivisor * llQuotient;
		if (llRemainder != llTable[dw].llRemainder)
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a remainder %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);
		if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
			PrintFormat(hOutput,
			            "\t%I64d / %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);
		llRemainder = llTable[dw].llDividend % llTable[dw].llDivisor;
		if (llRemainder != llTable[dw].llRemainder)
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a remainder %I64d not equal %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llRemainder);
		if ((llTable[dw].llDivisor < 0LL) && (llRemainder <= llTable[dw].llDivisor)
		 || (llTable[dw].llDivisor > 0LL) && (llRemainder >= llTable[dw].llDivisor))
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a remainder %I64d not less divisor\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder);
		if ((llRemainder != 0LL) && ((llRemainder < 0LL) != (llTable[dw].llDividend < 0LL)))
			PrintFormat(hOutput,
			            "\t%I64d %% %I64d:\a sign of remainder %I64d not equal sign of dividend %I64d\r\n",
			            llTable[dw].llDividend, llTable[dw].llDivisor, llRemainder, llTable[dw].llDividend);
	}
	PrintFormat(hOutput, "\r\nTiming 64-bit division and multiplication on %.48hs\r\n", dwCPUID);
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0xAD93D23594C935A9
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwQuotient = _aullnop(qwLeft, qwRight);
		// 64-bit linear feedback shift register (Galois form)
		//  using primitive polynomial 0x2B5926535897936B
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwRemainder = _aullnop(qwLeft, qwRight);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwQuotient = qwLeft / qwRight;
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwQuotient = qwLeft / qwRight;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwRemainder = qwLeft % qwRight;
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwRemainder = qwLeft % qwRight;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwQuotient = qwLeft / qwRight;
		qwRemainder = qwLeft % qwRight;
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwQuotient = qwLeft / qwRight;
		qwRemainder = qwLeft % qwRight;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT4))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT4))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwQuotient = qwLeft * qwRight;
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwRemainder = qwLeft * qwRight;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT5))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT5))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwQuotient = (SQWORD) qwLeft / (SQWORD) qwRight;
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwQuotient = (SQWORD) qwLeft / (SQWORD) qwRight;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT6))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT6))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwRemainder = (SQWORD) qwLeft % (SQWORD) qwRight;
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwRemainder = (SQWORD) qwLeft % (SQWORD) qwRight;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT7))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT7))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwQuotient = (SQWORD) qwLeft / (SQWORD) qwRight;
		qwRemainder = (SQWORD) qwLeft % (SQWORD) qwRight;
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwQuotient = (SQWORD) qwLeft / (SQWORD) qwRight;
		qwRemainder = (SQWORD) qwLeft % (SQWORD) qwRight;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT8))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT8))
#endif
		ExitProcess(GetLastError());
	qwT9 = qwT8 - qwT0;
	qwT8 -= qwT7;
	qwT7 -= qwT6;
	qwT6 -= qwT5;
	qwT5 -= qwT4;
	qwT4 -= qwT3;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
#ifdef CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "_aullnop()   %6lu.%09lu      0\r\n"
	            "_aulldiv()   %6lu.%09lu %6lu.%09lu\r\n"
	            "_aullrem()   %6lu.%09lu %6lu.%09lu\r\n"
	            "_aulldvrm()  %6lu.%09lu %6lu.%09lu\r\n"
	            "_allmul()    %6lu.%09lu %6lu.%09lu\r\n"
	            "_alldiv()    %6lu.%09lu %6lu.%09lu\r\n"
	            "_allrem()    %6lu.%09lu %6lu.%09lu\r\n"
	            "_alldvrm()   %6lu.%09lu %6lu.%09lu\r\n"
	            "             %6lu.%09lu clock cycles\r\n",
	            __edivmodu(qwT1, 1000000000UL),
	            __edivmodu(qwT2, 1000000000UL), __edivmodu(qwT2 - qwT1, 1000000000UL),
	            __edivmodu(qwT3, 1000000000UL), __edivmodu(qwT3 - qwT1, 1000000000UL),
	            __edivmodu(qwT4, 1000000000UL), __edivmodu(qwT4 - qwT1, 1000000000UL),
	            __edivmodu(qwT5, 1000000000UL), __edivmodu(qwT5 - qwT1, 1000000000UL),
	            __edivmodu(qwT6, 1000000000UL), __edivmodu(qwT6 - qwT1, 1000000000UL),
	            __edivmodu(qwT7, 1000000000UL), __edivmodu(qwT7 - qwT1, 1000000000UL),
	            __edivmodu(qwT8, 1000000000UL), __edivmodu(qwT8 - qwT1, 1000000000UL),
	            __edivmodu(qwT9, 1000000000UL));
#else // CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "_aullnop()   %6lu.%07lu      0\r\n"
	            "_aulldiv()   %6lu.%07lu %6lu.%07lu\r\n"
	            "_aullrem()   %6lu.%07lu %6lu.%07lu\r\n"
	            "_aulldvrm()  %6lu.%07lu %6lu.%07lu\r\n"
	            "_allmul()    %6lu.%07lu %6lu.%07lu\r\n"
	            "_alldiv()    %6lu.%07lu %6lu.%07lu\r\n"
	            "_allrem()    %6lu.%07lu %6lu.%07lu\r\n"
	            "_alldvrm()   %6lu.%07lu %6lu.%07lu\r\n"
	            "             %6lu.%07lu nano-seconds\r\n",
	            __edivmodu(qwT1, 10000000UL),
	            __edivmodu(qwT2, 10000000UL), __edivmodu(qwT2 - qwT1, 10000000UL),
	            __edivmodu(qwT3, 10000000UL), __edivmodu(qwT3 - qwT1, 10000000UL),
	            __edivmodu(qwT4, 10000000UL), __edivmodu(qwT4 - qwT1, 10000000UL),
	            __edivmodu(qwT5, 10000000UL), __edivmodu(qwT5 - qwT1, 10000000UL),
	            __edivmodu(qwT6, 10000000UL), __edivmodu(qwT6 - qwT1, 10000000UL),
	            __edivmodu(qwT7, 10000000UL), __edivmodu(qwT7 - qwT1, 10000000UL),
	            __edivmodu(qwT8, 10000000UL), __edivmodu(qwT8 - qwT1, 10000000UL),
	            __edivmodu(qwT9, 10000000UL));
#endif // CYCLES
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT0))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT0))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
		qwQuotient = _aullnop(qwDividend, qwDivisor);
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
		qwRemainder = _aullnop(qwDividend, qwDivisor);
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT1))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT1))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
		qwQuotient = qwDividend / qwDivisor;
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
		qwQuotient = qwDividend / qwDivisor;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT2))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT2))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
		qwRemainder = qwDividend % qwDivisor;
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
		qwRemainder = qwDividend % qwDivisor;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT3))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT3))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
		qwQuotient = qwDividend / qwDivisor;
		qwRemainder = qwDividend % qwDivisor;
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
		qwQuotient = qwDividend / qwDivisor;
		qwRemainder = qwDividend % qwDivisor;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT4))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT4))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwDividend = __ull_rshift(qwLeft, qwLeft /* & 31 */);
		qwQuotient = qwDividend * qwDivisor;
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwDivisor = __ull_rshift(qwRight, qwRight /* & 31 */);
		qwRemainder = qwDividend * qwDivisor;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT5))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT5))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwDividend = __ll_rshift(qwLeft, qwLeft /* & 31 */);
		qwQuotient = (SQWORD) qwDividend / (SQWORD) qwDivisor;
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwDivisor = __ll_rshift(qwRight, qwRight /* & 31 */);
		qwQuotient = (SQWORD) qwDividend / (SQWORD) qwDivisor;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT6))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT6))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwDividend = __ll_rshift(qwLeft, qwLeft /* & 31 */);
		qwRemainder = (SQWORD) qwDividend % (SQWORD) qwDivisor;
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwDivisor = __ll_rshift(qwRight, qwRight /* & 31 */);
		qwRemainder = (SQWORD) qwDividend % (SQWORD) qwDivisor;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT7))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT7))
#endif
		ExitProcess(GetLastError());
	for (dw = 500000000UL; dw > 0UL; dw--)
	{
		qwLeft = (qwLeft << 1)
		       ^ (((SQWORD) qwLeft >> 63) & 0xAD93D23594C935A9ULL);
		qwDividend = __ll_rshift(qwLeft, qwLeft /* & 31 */);
		qwQuotient = (SQWORD) qwDividend / (SQWORD) qwDivisor;
		qwRemainder = (SQWORD) qwDividend % (SQWORD) qwDivisor;
		qwRight = (qwRight >> 1)
		        ^ ((0ULL - (qwRight & 1ULL)) & 0x2B5926535897936BULL);
		qwDivisor = __ll_rshift(qwRight, qwRight /* & 31 */);
		qwQuotient = (SQWORD) qwDividend / (SQWORD) qwDivisor;
		qwRemainder = (SQWORD) qwDividend % (SQWORD) qwDivisor;
	}
#ifdef CYCLES
	if (!QueryThreadCycleTime(hThread, &qwT8))
#else
	if (!GetThreadTimes(hThread, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT9, (LPFILETIME) &qwT8))
#endif
		ExitProcess(GetLastError());
	qwT9 = qwT8 - qwT0;
	qwT8 -= qwT7;
	qwT7 -= qwT6;
	qwT6 -= qwT5;
	qwT5 -= qwT4;
	qwT4 -= qwT3;
	qwT3 -= qwT2;
	qwT2 -= qwT1;
	qwT1 -= qwT0;
#ifdef CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "_aullnop()   %6lu.%09lu      0\r\n"
	            "_aulldiv()   %6lu.%09lu %6lu.%09lu\r\n"
	            "_aullrem()   %6lu.%09lu %6lu.%09lu\r\n"
	            "_aulldvrm()  %6lu.%09lu %6lu.%09lu\r\n"
	            "_allmul()    %6lu.%09lu %6lu.%09lu\r\n"
	            "_alldiv()    %6lu.%09lu %6lu.%09lu\r\n"
	            "_allrem()    %6lu.%09lu %6lu.%09lu\r\n"
	            "_alldvrm()   %6lu.%09lu %6lu.%09lu\r\n"
	            "             %6lu.%09lu clock cycles\r\n",
	            __edivmodu(qwT1, 1000000000UL),
	            __edivmodu(qwT2, 1000000000UL), __edivmodu(qwT2 - qwT1, 1000000000UL),
	            __edivmodu(qwT3, 1000000000UL), __edivmodu(qwT3 - qwT1, 1000000000UL),
	            __edivmodu(qwT4, 1000000000UL), __edivmodu(qwT4 - qwT1, 1000000000UL),
	            __edivmodu(qwT5, 1000000000UL), __edivmodu(qwT5 - qwT1, 1000000000UL),
	            __edivmodu(qwT6, 1000000000UL), __edivmodu(qwT6 - qwT1, 1000000000UL),
	            __edivmodu(qwT7, 1000000000UL), __edivmodu(qwT7 - qwT1, 1000000000UL),
	            __edivmodu(qwT8, 1000000000UL), __edivmodu(qwT8 - qwT1, 1000000000UL),
	            __edivmodu(qwT9, 1000000000UL));
#else // CYCLES
	PrintFormat(hOutput,
	            "\r\n"
	            "_aullnop()   %6lu.%07lu      0\r\n"
	            "_aulldiv()   %6lu.%07lu %6lu.%07lu\r\n"
	            "_aullrem()   %6lu.%07lu %6lu.%07lu\r\n"
	            "_aulldvrm()  %6lu.%07lu %6lu.%07lu\r\n"
	            "_allmul()    %6lu.%07lu %6lu.%07lu\r\n"
	            "_alldiv()    %6lu.%07lu %6lu.%07lu\r\n"
	            "_allrem()    %6lu.%07lu %6lu.%07lu\r\n"
	            "_alldvrm()   %6lu.%07lu %6lu.%07lu\r\n"
	            "             %6lu.%07lu nano-seconds\r\n",
	            __edivmodu(qwT1, 10000000UL),
	            __edivmodu(qwT2, 10000000UL), __edivmodu(qwT2 - qwT1, 10000000UL),
	            __edivmodu(qwT3, 10000000UL), __edivmodu(qwT3 - qwT1, 10000000UL),
	            __edivmodu(qwT4, 10000000UL), __edivmodu(qwT4 - qwT1, 10000000UL),
	            __edivmodu(qwT5, 10000000UL), __edivmodu(qwT5 - qwT1, 10000000UL),
	            __edivmodu(qwT6, 10000000UL), __edivmodu(qwT6 - qwT1, 10000000UL),
	            __edivmodu(qwT7, 10000000UL), __edivmodu(qwT7 - qwT1, 10000000UL),
	            __edivmodu(qwT8, 10000000UL), __edivmodu(qwT8 - qwT1, 10000000UL),
	            __edivmodu(qwT9, 10000000UL));
#endif // CYCLES
	ExitProcess(ERROR_SUCCESS);
}
Save the source above as i386-i64.c in the directory where you created the object library i386.lib before, then execute the following 4 command lines to compile it, link the generated object file i386-i64.obj with the routines from the object library i386.lib, and finally execute the image file i386-i64.exe:
SET CL=/GAFy /Oxy /W4 /Zl
SET LINK=/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
CL.EXE /DCYCLES i386-i64.c i386.lib kernel32.lib user32.lib
.\i386-i64.exe
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
i386-i64.c
i386-i64.c(128) : warning C4100: 'ullRight' : unreferenced formal parameter
i386-i64.c(128) : warning C4100: 'ullLeft' : unreferenced formal parameter
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
/out:i386-i64.exe
i386-i64.obj
i386.lib
kernel32.lib
user32.lib
Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
_aullnop()        5.130620616      0
_aulldiv()       24.916274238     19.785653622
_aullrem()       24.686248015     19.555627399
_aulldvrm()      25.947651690     20.817031074
_allmul()         6.753417214      1.622796598
_alldiv()        27.691010847     22.560390231
_allrem()        27.923880075     22.793259459
_alldvrm()       29.695561448     24.564940832
                172.744664143 clock cycles
_aullnop()        8.388855142      0
_aulldiv()       25.816723410     17.427868268
_aullrem()       25.383319447     16.994464305
_aulldvrm()      26.106060709     17.717205567
_allmul()        10.095017621      1.706162479
_alldiv()        30.421659163     22.032804021
_allrem()        30.961920386     22.573065244
_alldvrm()       32.481759875     24.092904733
                189.655315753 clock cycles
Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
_aullnop()        4.323464204      0
_aulldiv()       20.430453818     16.106989614
_aullrem()       20.833517940     16.510053736
_aulldvrm()      21.894735828     17.571271624
_allmul()         5.704146716      1.380682512
_alldiv()        23.409770870     19.086306666
_allrem()        23.662985522     19.339521318
_alldvrm()       25.053725237     20.730261033
                145.312800135 clock cycles
_aullnop()        7.128891691      0
_aulldiv()       21.592044916     14.463153225
_aullrem()       21.438925969     14.310034278
_aulldvrm()      21.977489828     14.848598137
_allmul()         8.646452477      1.517560786
_alldiv()        25.699076924     18.570185233
_allrem()        26.119234835     18.990343144
_alldvrm()       27.478233473     20.349341782
                160.080350113 clock cycles
Oops: on this Intel® Core™ i7 processor, the division routines for signed 64-bit integers are up to 37 % slower than those for unsigned 64-bit integers.
Copy i386-i64.exe to systems with other processors and execute it there too:
Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on AMD Ryzen 7 5700X 8-Core Processor
_aullnop()        5.122143146      0
_aulldiv()       20.223817270     15.101674124
_aullrem()       20.164726800     15.042583654
_aulldvrm()      21.157922084     16.035778938
_allmul()         7.981533048      2.859389902
_alldiv()        20.836136360     15.713993214
_allrem()        22.052193688     16.930050542
_alldvrm()       21.613653936     16.491510790
                139.152126332 clock cycles
_aullnop()        7.683783760      0
_aulldiv()       21.300318504     13.616534744
_aullrem()       21.362844062     13.679060302
_aulldvrm()      22.265051198     14.581267438
_allmul()         8.956695616      1.272911856
_alldiv()        23.670479992     15.986696232
_allrem()        24.239343886     16.555560126
_alldvrm()       25.328227580     17.644443820
                154.806744598 clock cycles
Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz
_aullnop()        8.978870978      0
_aulldiv()       41.378538512     32.399667534
_aullrem()       41.459120072     32.480249094
_aulldvrm()      42.496702958     33.517831980
_allmul()        13.102594044      4.123723066
_alldiv()        48.877112646     39.898241668
_allrem()        48.630810323     39.651939345
_alldvrm()       57.155683201     48.176812223
                302.079432734 clock cycles
_aullnop()       13.583675374      0
_aulldiv()       41.220087960     27.636412586
_aullrem()       39.885909615     26.302234241
_aulldvrm()      41.619714690     28.036039316
_allmul()        18.266469971      4.682794597
_alldiv()        52.360017066     38.776341692
_allrem()        52.759947948     39.176272574
_alldvrm()       58.291439368     44.707763994
                317.987261992 clock cycles
Oops: on this 15-year-old Intel® Core™2 processor, the division routines for signed 64-bit integers are up to 54 % slower than those for unsigned 64-bit integers.
Generate the import library i386.lib and link the object file i386-i64.obj with it, then execute the image file i386-i64.exe, now calling the (several times slower) 64-bit integer division and multiplication routines of NTDLL.dll:
LINK.EXE /LIB /DEF /EXPORT:_alldiv /EXPORT:_alldvrm /EXPORT:_allmul /EXPORT:_allrem /EXPORT:_allshl /EXPORT:_allshr /EXPORT:_aulldiv /EXPORT:_aulldvrm /EXPORT:_aullrem /EXPORT:_aullshr /MACHINE:I386 /NAME:NTDLL /OUT:i386.lib
LINK.EXE i386-i64.obj
.\i386-i64.exe
Note: the existing object library i386.lib and the image file i386-i64.exe are overwritten!
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
   Creating library i386.lib and object i386.exp
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
_aullnop()        5.173417920      0
_aulldiv()      102.145250711     96.971832791
_aullrem()      103.746558934     98.573141014
_aulldvrm()     103.711363598     98.537945678
_allmul()         9.640065030      4.466647110
_alldiv()       108.071904817    102.898486897
_allrem()       109.348691219    104.175273299
_alldvrm()      111.349818320    106.176400400
                653.187070549 clock cycles
_aullnop()        8.391292312      0
_aulldiv()       70.546796218     62.155503906
_aullrem()       72.698974389     64.307682077
_aulldvrm()      72.715990565     64.324698253
_allmul()        12.941190977      4.549898665
_alldiv()        85.064559941     76.673267629
_allrem()        86.951781779     78.560489467
_alldvrm()       89.225890859     80.834598547
                498.536477040 clock cycles
Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
_aullnop()        4.293480684      0
_aulldiv()       88.508826840     84.215346156
_aullrem()       86.771270732     82.477790048
_aulldvrm()      88.330881954     84.037401270
_allmul()         8.473599720      4.180119036
_alldiv()        91.503580475     87.210099791
_allrem()        92.444818641     88.151337957
_alldvrm()       94.478769821     90.185289137
                554.805228867 clock cycles
_aullnop()        7.264421067      0
_aulldiv()       61.477534879     54.213113812
_aullrem()       60.671739257     53.407318190
_aulldvrm()      60.621554727     53.357133660
_allmul()        11.255056723      3.990635656
_alldiv()        71.552476501     64.288055434
_allrem()        72.850638899     65.586217832
_alldvrm()       73.504622168     66.240201101
                419.198044221 clock cycles
            OUCH: here Microsoft’s division
            routines are 3.2 to 5.5 times slower than those presented above,
            and their multiplication routine is 2.5 to 4.5 times slower!
         Also copy this variant of
            i386-i64.exe
            to systems with other processors and execute it there:
        
Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on AMD Ryzen 7 5700X 8-Core Processor
_aullnop()        5.121743374      0
_aulldiv()       55.833332662     50.711589288
_aullrem()       56.467057234     51.345313860
_aulldvrm()      58.395325278     53.273581904
_allmul()         7.690866062      2.569122688
_alldiv()        60.314670158     55.192926784
_allrem()        62.087331252     56.965587878
_alldvrm()       63.125034508     58.003291134
                369.035360528 clock cycles
_aullnop()        7.682569518      0
_aulldiv()       36.154348392     28.471778874
_aullrem()       37.041158934     29.358589416
_aulldvrm()      39.300191898     31.617622380
_allmul()        10.278630848      2.596061330
_alldiv()        45.203128180     37.520558662
_allrem()        46.432703432     38.750133914
_alldvrm()       47.750769056     40.068199538
                269.843500258 clock cycles
            OUCH: there Microsoft’s division
            routines are up to 4 times slower than those presented above, and
            their multiplication routine is up to 2 times slower!
        Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz
_aullnop()        8.952309405      0
_aulldiv()      129.226132812    120.273823407
_aullrem()      134.708512677    125.756203272
_aulldvrm()     143.453665206    134.501355801
_allmul()        17.856662118      8.904352713
_alldiv()       151.624513041    142.672203636
_allrem()       152.423786688    143.471477283
_alldvrm()      171.419909639    162.467600234
                909.665491586 clock cycles
_aullnop()       13.637814045      0
_aulldiv()       97.951782280     84.313968235
_aullrem()      103.122246554     89.484432509
_aulldvrm()     108.556615922     94.918801877
_allmul()        23.786421340     10.148607295
_alldiv()       117.589788695    103.951974650
_allrem()       119.202419933    105.564605888
_alldvrm()      124.590005377    110.952191332
                708.437094146 clock cycles
            OUCH: all Microsoft routines are more
            than 2 times slower than those presented above, their division
            routines are even up to 4 times slower!
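The factors quoted here can be recomputed from the second column of the tables, which is the measured time minus the _aullnop() call overhead. The following C fragment is only a sketch of that arithmetic: the values are copied from the first measurement series of the two Core 2 Duo P8700 tables in this section, while the function and array names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Net times, first measurement series, Core 2 Duo P8700, in the order
   _aulldiv, _aullrem, _aulldvrm, _allmul, _alldiv, _allrem, _alldvrm. */
static const double own_net[7] = {      /* routines presented here */
    32.399667534, 32.480249094, 33.517831980, 4.123723066,
    39.898241668, 39.651939345, 48.176812223
};
static const double ntdll_net[7] = {    /* routines of NTDLL.dll */
    120.273823407, 125.756203272, 134.501355801, 8.904352713,
    142.672203636, 143.471477283, 162.467600234
};

/* Second column of the tables: measured time minus call overhead. */
double net_time(double measured, double overhead)
{
    return measured - overhead;
}

/* Smallest ratio NTDLL time / own time over all 7 routines. */
double min_slowdown(void)
{
    double best = ntdll_net[0] / own_net[0];
    for (size_t i = 1; i < 7; i++)
        if (ntdll_net[i] / own_net[i] < best)
            best = ntdll_net[i] / own_net[i];
    return best;
}

/* Largest ratio NTDLL time / own time over all 7 routines. */
double max_slowdown(void)
{
    double best = ntdll_net[0] / own_net[0];
    for (size_t i = 1; i < 7; i++)
        if (ntdll_net[i] / own_net[i] > best)
            best = ntdll_net[i] / own_net[i];
    return best;
}

/* Sanity checks against the figures quoted in the text. */
void check_quoted_factors(void)
{
    double d = net_time(41.378538512, 8.978870978) - own_net[0];
    assert(d < 1e-9 && d > -1e-9);
    assert(min_slowdown() > 2.0);   /* "more than 2 times slower" */
    assert(max_slowdown() > 4.0);   /* "up to 4 times slower" */
}
```

The smallest factor belongs to _allmul(), the largest to _aulldvrm(); the second measurement series gives smaller factors for the division routines.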
         Finally execute the following 10 command lines to recreate the
            object library i386.lib, but now with the division
            routines for processors which feature speculative execution, then
            link the object file i386-i64.obj generated before with
            it and execute the image file i386-i64.exe:
        
SET ML=/c /DJCCLESS /safeseh /W3 /X
ML.EXE alldiv.asm
ML.EXE alldvrm.asm
ML.EXE allrem.asm
ML.EXE aulldiv.asm
ML.EXE aulldvrm.asm
ML.EXE aullrem.asm
LINK.EXE /LIB /OUT:i386.lib alldiv.obj alldvrm.obj allmul.obj alloca.obj alloca8.obj alloca16.obj allrem.obj allshl.obj allshr.obj aulldiv.obj aulldvrm.obj aullrem.obj aullshr.obj
LINK.EXE i386-i64.obj i386.lib kernel32.lib user32.lib
.\i386-i64.exe
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.
 Assembling: alldiv.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.
 Assembling: alldvrm.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.
 Assembling: allrem.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.
 Assembling: aulldiv.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.
 Assembling: aulldvrm.asm
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.
 Assembling: aullrem.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
   Creating library i386.lib and object i386.exp
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
_aullnop()        5.130523022      0
_aulldiv()       22.148703486     17.018180464
_aullrem()       25.618235017     20.487711995
_aulldvrm()      27.484994154     22.354471132
_allmul()         6.756777960      1.626254938
_alldiv()        30.000776927     24.870253905
_allrem()        31.364631604     26.234108582
_alldvrm()       34.146860035     29.016337013
                182.651502205 clock cycles
_aullnop()        8.384999331      0
_aulldiv()       26.479597369     18.094598038
_aullrem()       27.445690820     19.060691489
_aulldvrm()      28.586813456     20.201814125
_allmul()        10.095121587      1.710122256
_alldiv()        31.576787296     23.191787965
_allrem()        33.327090912     24.942091581
_alldvrm()       35.484606261     27.099606930
                201.380707032 clock cycles
Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
_aullnop()        4.441024919      0
_aulldiv()       19.396569893     14.955544974
_aullrem()       22.313397424     17.872372505
_aulldvrm()      23.100780021     18.659755102
_allmul()         5.659121355      1.218096436
_alldiv()        25.226574313     20.785549394
_allrem()        26.355810792     21.914785873
_alldvrm()       28.902499259     24.461474340
                155.395777976 clock cycles
_aullnop()        7.081996184      0
_aulldiv()       22.156429579     15.074433395
_aullrem()       22.934132959     15.852136775
_aulldvrm()      23.886275722     16.804279538
_allmul()         8.362922762      1.280926578
_alldiv()        26.645217765     19.563221581
_allrem()        27.942975969     20.860979785
_alldvrm()       29.720484454     22.638488270
                168.730435394 clock cycles
            Oops: on this
            Intel® Core™ i7
            processor, the branch-avoiding division routine runs for big
            unsigned integers about 7.5 % faster than its conditionally
            branching variant, while the others are up to 15 % slower.
         Again copy this variant of
            i386-i64.exe
            to systems with other processors and execute it there too:
        
Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on AMD Ryzen 7 5700X 8-Core Processor
_aullnop()        5.120327104      0
_aulldiv()       17.308029902     12.187702798
_aullrem()       18.250468490     13.130141386
_aulldvrm()      20.019941914     14.899614810
_allmul()         7.951225720      2.830898616
_alldiv()        22.051554624     16.931227520
_allrem()        23.685382974     18.565055870
_alldvrm()       24.193433346     19.073106242
                138.580364074 clock cycles
_aullnop()        7.679050314      0
_aulldiv()       21.642133458     13.963083144
_aullrem()       21.867652058     14.188601744
_aulldvrm()      23.638699750     15.959649436
_allmul()         8.959283832      1.280233518
_alldiv()        24.276285226     16.597234912
_allrem()        24.838598612     17.159548298
_alldvrm()       26.549134226     18.870083912
                159.450837476 clock cycles
            Oops: on this AMD® Ryzen™ processor, the branch-avoiding division routines run up to 20 % faster than their conditionally branching variants for big unsigned integers, but otherwise up to 15 % slower.
        Testing unsigned 64-bit division...
-58
Testing signed 64-bit division...
-44
Timing 64-bit division and multiplication on Intel(R) Core(TM)2 Duo CPU     P8700  @ 2.53GHz
_aullnop()        8.956408592      0
_aulldiv()       40.006629475     31.050220883
_aullrem()       45.331733367     36.375324775
_aulldvrm()      52.817756545     43.861347953
_allmul()        13.100714751      4.144306159
_alldiv()        53.679959575     44.723550983
_allrem()        60.075486247     51.119077655
_alldvrm()       73.557314195     64.600905603
                347.526002747 clock cycles
_aullnop()       13.582502272      0
_aulldiv()       43.965933963     30.383431691
_aullrem()       46.175253389     32.592751117
_aulldvrm()      49.844225982     36.261723710
_allmul()        18.252733959      4.670231687
_alldiv()        53.847451160     40.264948888
_allrem()        59.859327003     46.276824731
_alldvrm()       65.922751282     52.340249010
                351.450179010 clock cycles
            OOPS: contrary to the expected results, on this 15
            year
            old Intel® Core™2
            processor the branch-avoiding division routines run up to 34 %
            slower than their conditionally branching variants!
        _rotl64() and _rotr64() Intrinsic Functions for i386 Platform
            The Visual C compiler provides the intrinsic functions _rotl64() and _rotr64() for rotation of 64-bit integers, but with a stupid implementation.
            Compiler intrinsics
        // Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
unsigned long long _allrol(unsigned long long value, unsigned int count)
{
    return _rotl64(value, count);
}
unsigned long long _allror(unsigned long long value, unsigned int count)
{
    return _rotr64(value, count);
}
Save this source as i386-rotate.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-rotate.c
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-rotate.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
        TITLE   C:\Users\Stefan\Desktop\i386-rotate.c
        .686P
        .XMM
        include listing.inc
        .model  flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC  __allrol
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-rotate.c
_TEXT   SEGMENT
_value$ = 8                                     ; size = 8
_count$ = 16                                    ; size = 4
__allrol PROC
; 5    :     return _rotl64(value, count);
  00000 8a 4c 24 0c     mov     cl, BYTE PTR _count$[esp-4]
  00004 56              push    esi
  00005 8b 74 24 08     mov     esi, DWORD PTR _value$[esp]
  00009 57              push    edi
  0000a 8b 7c 24 10     mov     edi, DWORD PTR _value$[esp+8]
  0000e 8b c6           mov     eax, esi
  00010 8b d7           mov     edx, edi
  00012 f6 c1 20        test    cl, 32                  ; 00000020H
  00015 74 04           je      SHORT $LN3@allrol
  00017 8b c7           mov     eax, edi
  00019 8b d6           mov     edx, esi
$LN3@allrol:
  0001b 80 e1 1f        and     cl, 31                  ; 0000001fH
  0001e 74 12           je      SHORT $LN4@allrol
  00020 8b f0           mov     esi, eax
  00022 8b c2           mov     eax, edx
  00024 8b d6           mov     edx, esi
  00026 0f a5 c2        shld    edx, eax, cl
  00029 0f a5 f0        shld    eax, esi, cl
  0002c 8b ca           mov     ecx, edx
  0002e 8b d0           mov     edx, eax
  00030 8b c1           mov     eax, ecx
$LN4@allrol:
; 6    : }
  00032 5f              pop     edi
  00033 5e              pop     esi
  00034 c3              ret     0
__allrol ENDP
_TEXT   ENDS
PUBLIC  __allror
; Function compile flags: /Ogtpy
_TEXT   SEGMENT
_value$ = 8                                     ; size = 8
_count$ = 16                                    ; size = 4
__allror PROC
; 10   :     return _rotr64(value, count);
  00040 8a 4c 24 0c     mov     cl, BYTE PTR _count$[esp-4]
  00044 56              push    esi
  00045 8b 74 24 08     mov     esi, DWORD PTR _value$[esp]
  00049 57              push    edi
  0004a 8b 7c 24 10     mov     edi, DWORD PTR _value$[esp+8]
  0004e 8b c6           mov     eax, esi
  00050 8b d7           mov     edx, edi
  00052 f6 c1 20        test    cl, 32                  ; 00000020H
  00055 74 04           je      SHORT $LN3@allror
  00057 8b c7           mov     eax, edi
  00059 8b d6           mov     edx, esi
$LN3@allror:
  0005b 80 e1 1f        and     cl, 31                  ; 0000001fH
  0005e 74 12           je      SHORT $LN4@allror
  00060 8b f0           mov     esi, eax
  00062 8b c2           mov     eax, edx
  00064 8b d6           mov     edx, esi
  00066 0f ad c2        shrd    edx, eax, cl
  00069 0f ad f0        shrd    eax, esi, cl
  0006c 8b ca           mov     ecx, edx
  0006e 8b d0           mov     edx, eax
  00070 8b c1           mov     eax, ecx
$LN4@allror:
; 11   : }
  00072 5f              pop     edi
  00073 5e              pop     esi
  00074 c3              ret     0
__allror ENDP
_TEXT   ENDS
END
With 24 instructions in 53 bytes, each function is as bad as such "optimised" code can get!
 OUCH¹: instead of loading its 64-bit argument
            into register pair EDX:EAX and swapping the two registers if the shift
            count÷32 is odd, this stupid code loads the
            64-bit argument into register pair EDI:ESI first, copies it from
            there into register pair EDX:EAX, then reloads the latter
            in reverse order from register pair EDI:ESI if the
            shift count÷32 is odd, clobbering registers EDI
            and ESI without necessity!
        
 OUCH²: instead of loading register
            ESI from register EDX and then shifting the
            register pairs EDX:EAX and EAX:ESI, this
            braindead code swaps registers EAX and
            EDX through register ESI, shifts register
            pairs EDX:EAX and EAX:ESI and finally
            swaps registers EAX and EDX once more, now
            through ECX!
        
 Note: the evaluation of the code generated with
            the compiler options /Oisy is left as an exercise to
            the reader.
        
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
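The sequence both remarks ask for (swap the two halves if bit 5 of the count is set, then shift both halves by the count modulo 32) can be modelled in portable C. This is only an illustrative sketch working on 32-bit halves; the names rotl64_halves, rotr64_halves and check_rotations are made up here, and the code is neither the compiler's output nor a translation of any particular assembler routine.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left on 32-bit halves, as an i386 routine has to do it. */
uint64_t rotl64_halves(uint64_t value, unsigned count)
{
    uint32_t lo = (uint32_t) value;
    uint32_t hi = (uint32_t) (value >> 32);
    unsigned n = count & 31;        /* SHLD and SHRD use the count modulo 32 */

    if (count & 32) {               /* a rotation by 32 just swaps the halves */
        uint32_t t = lo; lo = hi; hi = t;
    }
    if (n != 0) {                   /* a shift by 32 - 0 is undefined in C */
        uint32_t t = hi;
        hi = (hi << n) | (lo >> (32 - n));
        lo = (lo << n) | (t >> (32 - n));
    }
    return ((uint64_t) hi << 32) | lo;
}

/* Rotate right, same scheme with the shift direction reversed. */
uint64_t rotr64_halves(uint64_t value, unsigned count)
{
    uint32_t lo = (uint32_t) value;
    uint32_t hi = (uint32_t) (value >> 32);
    unsigned n = count & 31;

    if (count & 32) {
        uint32_t t = lo; lo = hi; hi = t;
    }
    if (n != 0) {
        uint32_t t = lo;
        lo = (lo >> n) | (hi << (32 - n));
        hi = (hi >> n) | (t << (32 - n));
    }
    return ((uint64_t) hi << 32) | lo;
}

/* A few hand-checked cases. */
void check_rotations(void)
{
    const uint64_t v = 0x0123456789ABCDEFULL;
    assert(rotl64_halves(v, 4) == 0x123456789ABCDEF0ULL);
    assert(rotl64_halves(v, 36) == 0x9ABCDEF012345678ULL);
    assert(rotr64_halves(v, 4) == 0xF0123456789ABCDEULL);
    assert(rotr64_halves(v, 64) == v);
}
```

Only one temporary is needed per direction, which is why a single scratch register besides EDX:EAX and ECX suffices on the i386.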
_allrol() and _allror() Functions in i386 Assembler
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; Common "cdecl" calling and naming convention for i386 platform:
; - arguments are pushed on stack in reverse order (from right to left),
;   4-byte aligned;
; - 64-bit integer arguments are passed as pair of 32-bit integer arguments,
;   low part below high part;
; - 80-bit, 64-bit or 32-bit floating-point result is returned in FPU
;   register ST0;
; - 64-bit integer result is returned in registers EAX (low part) and
;   EDX (high part);
; - 32-bit integer or pointer result is returned in register EAX;
; - registers EAX, ECX and EDX are volatile and can be clobbered;
; - registers EBX, ESP, EBP, ESI and EDI must be preserved;
; - function names are prefixed with an underscore.
	.386
	.model  flat, C
	.code
_allrol	proc	public			; qword _allrol(qword value, dword count)
	mov	eax, [esp+4]
	mov	edx, [esp+8]		; edx:eax = value
	mov	ecx, [esp+12]		; ecx = count
ifndef SPACE
	test	cl, 63
	jz	short return		; count % 64 = 0?
endif
	test	cl, 32
	jz	short shift
swap:
	xchg	eax, edx
shift:
	push	ebx
	mov	ebx, edx
	shld	edx, eax, cl
	shld	eax, ebx, cl
	pop	ebx
return:
	ret
_allrol	endp
_allror	proc	public			; qword _allror(qword value, dword count)
	mov	eax, [esp+4]
	mov	edx, [esp+8]		; edx:eax = value
	mov	ecx, [esp+12]		; ecx = count
ifndef SPACE
	test	cl, 63
	jz	short return		; count % 64 = 0?
endif
	test	cl, 32
	jz	short shift
swap:
	xchg	eax, edx
shift:
	push	ebx
	mov	ebx, eax
	shrd	eax, edx, cl
	shrd	edx, ebx, cl
	pop	ebx
return:
	ret
_allror	endp
	end
_abs64() Intrinsic Function for i386 Platform
            _abs64() alias llabs(), provided by the Visual C compiler, returns the absolute value of its 64-bit integer argument – but even this trivial routine is not properly optimised.
            Compiler intrinsics
        // Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
long long _allabs(long long argument)
{
    return _abs64(argument);
}
Save this source as i386-magnitude.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-magnitude.c
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-magnitude.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
        TITLE   C:\Users\Stefan\Desktop\i386-magnitude.c
        .686P
        .XMM
        include listing.inc
        .model  flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC  __allabs
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-magnitude.c
;       COMDAT __allabs
_TEXT   SEGMENT
_argument$ = 8                                  ; size = 8
__allabs PROC                                   ; COMDAT
; 5    :     return _abs64(argument);
  00000 8b 44 24 08     mov     eax, DWORD PTR _argument$[esp]
  00004 8b 4c 24 04     mov     ecx, DWORD PTR _argument$[esp-4]
  00008 99              cdq
  00009 33 c2           xor     eax, edx
  0000b 33 ca           xor     ecx, edx
  0000d 2b ca           sub     ecx, edx
  0000f 1b c2           sbb     eax, edx
  00011 8b d0           mov     edx, eax
  00013 8b c1           mov     eax, ecx
; 6    : }
  00015 c3              ret     0
__allabs ENDP
_TEXT   ENDS
END
10 instructions in 22 bytes.
Spoiler: if you are curious whether the current version of the Visual C compiler has its flaws fixed, view its output in Compiler Explorer.
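The listing above computes the absolute value as (x XOR s) − s, where s is the sign mask produced by CDQ (all ones for a negative argument, else zero). The same value results from (x + s) XOR s; since XOR is commutative, its result may be left in either operand register, whereas the final SBB of the first form fixes the destination. A small C sketch of both identities, with made-up function names and unsigned arithmetic to stay clear of signed overflow:

```c
#include <assert.h>
#include <stdint.h>

/* Sign mask as CDQ yields it: all ones if x is negative, else zero. */
static uint64_t sign_mask(int64_t x)
{
    return (uint64_t) 0 - (uint64_t) (x < 0);
}

/* |x| computed as (x XOR s) - s, the form used in the listing. */
uint64_t abs64_xor_sub(int64_t x)
{
    uint64_t s = sign_mask(x);
    return ((uint64_t) x ^ s) - s;
}

/* |x| computed as (x + s) XOR s. */
uint64_t abs64_add_xor(int64_t x)
{
    uint64_t s = sign_mask(x);
    return ((uint64_t) x + s) ^ s;
}

/* Both forms must agree, including for the most negative value. */
void check_abs64(void)
{
    assert(abs64_xor_sub(-5) == 5 && abs64_add_xor(-5) == 5);
    assert(abs64_xor_sub(7) == 7 && abs64_add_xor(7) == 7);
    assert(abs64_xor_sub(INT64_MIN) == abs64_add_xor(INT64_MIN));
}
```

The result is returned as an unsigned value here so that the magnitude of INT64_MIN, which does not fit a signed 64-bit integer, is representable.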
_allabs() Function in i386 Assembler
            The following implementation exploits XOR's commutativity, saving
            a MOV instruction and 2 bytes:
        ; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model  flat, C
	.code
_allabs	proc	public			; sqword _allabs(sqword argument)
	mov	eax, [esp+8]
	mov	ecx, [esp+4]		; eax:ecx = argument
	cdq				; edx = (argument < 0) ? -1 : 0
	add	ecx, edx
	adc	eax, edx		; eax:ecx = (argument < 0) ? argument - 1 : argument
					;         = (argument < 0) ? ~-argument : argument
	xor	ecx, edx
	xor	edx, eax		; edx:ecx = (argument < 0) ? -argument : argument
					;         = |argument|
	mov	eax, ecx		; edx:eax = |argument|
	ret
_allabs	endp
	end
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
long long _allneg(long long argument)
{
    return -argument;
}
Save this source as i386-negate.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-negate.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-negate.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
        TITLE   C:\Users\Stefan\Desktop\i386-negate.c
        .686P
        .XMM
        include listing.inc
        .model  flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC  __allneg
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-negate.c
;       COMDAT __allneg
_TEXT   SEGMENT
_argument$ = 8                                  ; size = 8
__allneg PROC                                   ; COMDAT
; 5    :     return -argument;
  00000 8b 44 24 04     mov     eax, DWORD PTR _argument$[esp-4]
  00004 8b 54 24 08     mov     edx, DWORD PTR _argument$[esp]
  00008 f7 d8           neg     eax
  0000a 83 d2 00        adc     edx, 0
  0000d f7 da           neg     edx
; 6    : }
  0000f c3              ret     0
__allneg ENDP
_TEXT   ENDS
END
6 instructions in 16 bytes.
Spoiler: if you are curious whether the current version of the Visual C compiler has its flaws fixed, view its output in Compiler Explorer.
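Negating a 64-bit integer on 32-bit halves comes down to propagating one borrow from the low to the high half. The C sketch below models three equivalent ways to do so: the NEG, ADC, NEG sequence of the listing above, a subtraction from zero with SUB and SBB, and NOT, NEG, SBB with −1. The function names are made up for this illustration, and the carry flag is represented by an ordinary variable.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t join(uint32_t hi, uint32_t lo)
{
    return ((uint64_t) hi << 32) | lo;
}

/* NEG low; ADC high, 0; NEG high (the sequence from the listing) */
uint64_t neg64_neg_adc(uint64_t x)
{
    uint32_t lo = (uint32_t) x, hi = (uint32_t) (x >> 32);
    uint32_t carry = (lo != 0);             /* NEG sets CF if operand != 0 */
    lo = 0u - lo;
    hi = 0u - (uint32_t) (hi + carry);
    return join(hi, lo);
}

/* SUB 0, low; SBB 0, high */
uint64_t neg64_sub_sbb(uint64_t x)
{
    uint32_t lo = (uint32_t) x, hi = (uint32_t) (x >> 32);
    uint32_t borrow = (lo != 0);            /* 0 - lo borrows if lo != 0 */
    lo = 0u - lo;
    hi = 0u - hi - borrow;
    return join(hi, lo);
}

/* NOT high; NEG low; SBB high, -1 */
uint64_t neg64_not_sbb(uint64_t x)
{
    uint32_t lo = (uint32_t) x, hi = (uint32_t) (x >> 32);
    uint32_t borrow = (lo != 0);
    lo = 0u - lo;
    hi = (uint32_t) ~hi + 1u - borrow;      /* ~hi - (-1) - CF */
    return join(hi, lo);
}

/* All three must equal the plain 64-bit negation. */
void check_neg64(void)
{
    const uint64_t samples[] = { 0, 1, 0xFFFFFFFFULL, 0x100000000ULL,
                                 0x8000000000000000ULL, 0x0123456789ABCDEFULL };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        uint64_t expected = (uint64_t) 0 - samples[i];
        assert(neg64_neg_adc(samples[i]) == expected);
        assert(neg64_sub_sbb(samples[i]) == expected);
        assert(neg64_not_sbb(samples[i]) == expected);
    }
}
```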
_allneg() Function in i386 Assembler
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model  flat, C
	.code
_allneg	proc	public			; sqword _allneg(sqword argument)
	xor	eax, eax
	cdq				; edx:eax = 0
	sub	eax, [esp+4]
	sbb	edx, [esp+8]		; edx:eax = -argument
	ret
_allneg	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model  flat, C
	.code
_allneg	proc	public			; sqword _allneg(sqword argument)
	mov	edx, [esp+8]
	mov	eax, [esp+4]		; edx:eax = argument
	not	edx
	neg	eax
	sbb	edx, -1			; edx:eax = -argument
	ret
_allneg	endp
	end
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
void blunder(long long *argument)
{
    *argument = -*argument;
}
Save this source as i386-blunder.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-blunder.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-blunder.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
        TITLE   C:\Users\Stefan\Desktop\i386-blunder.c
        .686P
        .XMM
        include listing.inc
        .model  flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC  _blunder
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-blunder.c
;       COMDAT _blunder
_TEXT   SEGMENT
_argument$ = 8                                  ; size = 4
_blunder PROC                                   ; COMDAT
; 5    :     *argument = -*argument;
  00000 8b 44 24 04     mov     eax, DWORD PTR _argument$[esp-4]
  00004 8b 08           mov     ecx, DWORD PTR [eax]
  00006 8b 50 04        mov     edx, DWORD PTR [eax+4]
  00009 f7 d9           neg     ecx
  0000b 83 d2 00        adc     edx, 0
  0000e f7 da           neg     edx
  00010 89 08           mov     DWORD PTR [eax], ecx
  00012 89 50 04        mov     DWORD PTR [eax+4], edx
; 6    : }
  00015 c3              ret     0
_blunder ENDP
_TEXT   ENDS
END
OUCH: this atrocity is a perfect declaration of bankruptcy!
Even a non-optimising compiler should generate the following straightforward code, using 5 instructions in 14 bytes, instead of the 9 instructions in 22 bytes generated by the Visual C compiler:
	.code
	mov	eax, [esp+4]		; eax = address of argument
	neg	dword ptr [eax]
	not	dword ptr [eax+4]
	sbb	dword ptr [eax+4], -1
	ret
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
int _allsgn(long long argument)
{
    return (argument > 0) - (argument < 0);
}
Save this source as i386-signum.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-signum.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-signum.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
        TITLE   C:\Users\Stefan\Desktop\i386-signum.c
        .686P
        .XMM
        include listing.inc
        .model  flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC  __allsgn
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-signum.c
;       COMDAT __allsgn
_TEXT   SEGMENT
_argument$ = 8                                  ; size = 8
__allsgn PROC                                   ; COMDAT
; 5    :     return (argument > 0) - (argument < 0);
  00000 8b 4c 24 08     mov     ecx, DWORD PTR _argument$[esp]
  00004 8b 54 24 04     mov     edx, DWORD PTR _argument$[esp-4]
  00008 85 c9           test    ecx, ecx
  0000a 7c 0d           jl      SHORT $LN5@allsgn
  0000c 7f 04           jg      SHORT $LN7@allsgn
  0000e 85 d2           test    edx, edx
  00010 74 07           je      SHORT $LN5@allsgn
$LN7@allsgn:
  00012 b8 01 00 00 00  mov     eax, 1
  00017 eb 02           jmp     SHORT $LN6@allsgn
$LN5@allsgn:
  00019 33 c0           xor     eax, eax
$LN6@allsgn:
  0001b 85 c9           test    ecx, ecx
  0001d 7f 0e           jg      SHORT $LN3@allsgn
  0001f 7c 04           jl      SHORT $LN8@allsgn
  00021 85 d2           test    edx, edx
  00023 73 08           jae     SHORT $LN3@allsgn
$LN8@allsgn:
  00025 b9 01 00 00 00  mov     ecx, 1
  0002a 2b c1           sub     eax, ecx
; 6    : }
  0002c c3              ret     0
$LN3@allsgn:
; 5    :     return (argument > 0) - (argument < 0);
  0002d 33 c9           xor     ecx, ecx
  0002f 2b c1           sub     eax, ecx
; 6    : }
  00031 c3              ret     0
__allsgn ENDP
_TEXT   ENDS
END
21 instructions in 50 bytes.
OUCH¹: 6 (in words: six) superfluous conditional branch instructions – the optimiser and code generator of the Visual C compiler is a crime against all i386 processors since the Pentium®Pro and their users!
 OUCH²: the first 2 highlighted instructions, mov ecx, 1 and sub eax, ecx (7 bytes),
            should be replaced with a single
            DEC instruction (1 byte), saving 6 bytes.
        
 OUCH³: the last 2 highlighted instructions, xor ecx, ecx and sub eax, ecx,
            should be replaced with a 4-byte
            NOP, which can then be removed
            together with the following
            RET instruction, saving
            5 more bytes.
        
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
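None of the six conditional branches is needed: the sign of the argument gives the −1, and the sign of the negated argument gives the +1. A branch-free sketch in portable C follows; the name sgn64_branchfree is made up here, and the unsigned casts are used so that negating the most negative value is well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Signum without branches:
   negative = -1 if x < 0, else 0   (sign bit of x, spread over the word)
   positive =  1 if the sign bit of 0 - x is set, else 0
   For x = INT64_MIN both are non-zero, but -1 | 1 is still -1. */
int sgn64_branchfree(int64_t x)
{
    uint64_t ux = (uint64_t) x;
    int negative = -(int) (ux >> 63);
    int positive = (int) (((uint64_t) 0 - ux) >> 63);
    return negative | positive;
}

/* Compare against the straightforward definition. */
void check_sgn64(void)
{
    const int64_t samples[] = { 0, 1, -1, 4294967296LL, -4294967296LL,
                                INT64_MAX, INT64_MIN };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        int64_t x = samples[i];
        assert(sgn64_branchfree(x) == (x > 0) - (x < 0));
    }
}
```

Combining the two parts with OR instead of an addition is what keeps the result correct for INT64_MIN, whose negation is again negative.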
_allsgn() Function in i386 Assembler
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model  flat, C
	.code
_allsgn	proc	public			; int _allsgn(sqword argument)
if 0
	xor	eax, eax		; eax = 0
	cmp	eax, [esp+4]		; CF = (low dword of argument != 0)
	mov	ecx, [esp+8]		; ecx = high dword of argument
	cdq				; edx = 0
	sbb	edx, ecx		; edx:... = 0 - argument
	setl	al			; eax = (argument > 0) ? 1 : 0
	sar	ecx, 31			; ecx = (argument < 0) ? -1 : 0
	add	eax, ecx		; eax = (argument > 0) - (argument < 0)
					;     = {1, 0, -1}
elseif 0
	mov	eax, [esp+8]		; eax = high dword of argument
	xor	edx, edx		; edx = 0
	cmp	edx, [esp+4]		; CF = (low dword of argument != 0)
	sbb	edx, eax		; edx:... = 0 - argument
	cdq				; edx = (argument < 0) ? -1 : 0
	setl	al
	movzx	eax, al			; eax = (argument > 0) ? 1 : 0
	add	eax, edx		; eax = (argument > 0) - (argument < 0)
					;     = {1, 0, -1}
else
	xor	eax, eax
	cmp	eax, [esp+4]		; CF = (low dword of argument != 0)
	mov	edx, [esp+8]		; edx = high dword of argument
	sbb	eax, edx		; eax:... = 0 - argument
	sar	edx, 31			; edx = (argument < 0) ? -1 : 0
	shr	eax, 31			; eax = (argument > 0) ? 1 : 0
	or	eax, edx		; eax = (argument > 0) - (argument < 0)
					;     = {1, 0, -1}
endif
	ret
_allsgn	endp
	end
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifdef UNSIGNED
int _aullcmp(unsigned long long p, unsigned long long q)
#else
int _allcmp(long long p, long long q)
#endif
{
    return (p > q) - (p < q);
}
Save this source as i386-compare.c in an arbitrary, preferably empty directory, then execute the following 2 command lines to compile it and display the generated assembly a first time:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-compare.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-compare.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
        TITLE   C:\Users\Stefan\Desktop\i386-compare.c
        .686P
        .XMM
        include listing.inc
        .model  flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC  __allcmp
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-compare.c
;       COMDAT __allcmp
_TEXT   SEGMENT
_p$ = 8                                         ; size = 8
_q$ = 16                                        ; size = 8
__allcmp PROC                                   ; COMDAT
; 9    :     return (p > q) - (p < q);
  00000 8b 4c 24 08     mov     ecx, DWORD PTR _p$[esp]
  00004 8b 54 24 10     mov     edx, DWORD PTR _q$[esp]
  00008 56              push    esi
  00009 8b 74 24 10     mov     esi, DWORD PTR _q$[esp]
  0000d 57              push    edi
  0000e 8b 7c 24 0c     mov     edi, DWORD PTR _p$[esp+4]
  00012 3b ca           cmp     ecx, edx
  00014 7c 0d           jl      SHORT $LN5@allcmp
  00016 7f 04           jg      SHORT $LN7@allcmp
  00018 3b fe           cmp     edi, esi
  0001a 76 07           jbe     SHORT $LN5@allcmp
$LN7@allcmp:
  0001c b8 01 00 00 00  mov     eax, 1
  00021 eb 02           jmp     SHORT $LN6@allcmp
$LN5@allcmp:
  00023 33 c0           xor     eax, eax
$LN6@allcmp:
  00025 3b ca           cmp     ecx, edx
  00027 7f 10           jg      SHORT $LN3@allcmp
  00029 7c 04           jl      SHORT $LN8@allcmp
  0002b 3b fe           cmp     edi, esi
  0002d 73 0a           jae     SHORT $LN3@allcmp
$LN8@allcmp:
  0002f b9 01 00 00 00  mov     ecx, 1
  00034 5f              pop     edi
  00035 2b c1           sub     eax, ecx
  00037 5e              pop     esi
; 10   : }
  00038 c3              ret     0
$LN3@allcmp:
; 9    :     return (p > q) - (p < q);
  00039 33 c9           xor     ecx, ecx
  0003b 5f              pop     edi
  0003c 2b c1           sub     eax, ecx
  0003e 5e              pop     esi
; 10   : }
  0003f c3              ret     0
__allcmp ENDP
_TEXT   ENDS
END
29 instructions in 64 bytes.
OUCH¹: 6 (in words: six) superfluous conditional branch instructions, and 2 registers clobbered without necessity – the optimiser and code generator of the Visual C compiler is a crime against all i386 processors since the Pentium®Pro and their users!
 OUCH²: the first 2 highlighted instructions, mov ecx, 1 and sub eax, ecx (7 bytes),
            should be replaced with a single
            DEC instruction (1 byte), saving 6 bytes.
        
 OUCH³: the last 2 highlighted instructions, xor ecx, ecx and sub eax, ecx,
            should be replaced with two 2-byte
            NOPs, which can then be removed
            together with the 2
            POP and the
            following RET
            instruction, saving 7 more bytes.
        
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
 Compile the source file i386-compare.c a second time,
            now with the preprocessor macro UNSIGNED defined, and
            display the generated assembly:
        
CL.EXE /DUNSIGNED i386-compare.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-compare.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
        TITLE   C:\Users\Stefan\Desktop\i386-compare.c
        .686P
        .XMM
        include listing.inc
        .model  flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC  __aullcmp
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-compare.c
;       COMDAT __aullcmp
_TEXT   SEGMENT
_p$ = 8                                         ; size = 8
_q$ = 16                                        ; size = 8
__aullcmp PROC                                  ; COMDAT
; 9    :     return (p > q) - (p < q);
  00000 8b 4c 24 08     mov     ecx, DWORD PTR _p$[esp]
  00004 8b 54 24 10     mov     edx, DWORD PTR _q$[esp]
  00008 56              push    esi
  00009 8b 74 24 10     mov     esi, DWORD PTR _q$[esp]
  0000d 57              push    edi
  0000e 8b 7c 24 0c     mov     edi, DWORD PTR _p$[esp+4]
  00012 3b ca           cmp     ecx, edx
  00014 72 0d           jb      SHORT $LN5@aullcmp
  00016 77 04           ja      SHORT $LN7@aullcmp
  00018 3b fe           cmp     edi, esi
  0001a 76 07           jbe     SHORT $LN5@aullcmp
$LN7@aullcmp:
  0001c b8 01 00 00 00  mov     eax, 1
  00021 eb 02           jmp     SHORT $LN6@aullcmp
$LN5@aullcmp:
  00023 33 c0           xor     eax, eax
$LN6@aullcmp:
  00025 3b ca           cmp     ecx, edx
  00027 77 10           ja      SHORT $LN3@aullcmp
  00029 72 04           jb      SHORT $LN8@aullcmp
  0002b 3b fe           cmp     edi, esi
  0002d 73 0a           jae     SHORT $LN3@aullcmp
$LN8@aullcmp:
  0002f b9 01 00 00 00  mov     ecx, 1
  00034 5f              pop     edi
  00035 2b c1           sub     eax, ecx
  00037 5e              pop     esi
; 10   : }
  00038 c3              ret     0
$LN3@aullcmp:
; 9    :     return (p > q) - (p < q);
  00039 33 c9           xor     ecx, ecx
  0003b 5f              pop     edi
  0003c 2b c1           sub     eax, ecx
  0003e 5e              pop     esi
; 10   : }
  0003f c3              ret     0
__aullcmp ENDP
_TEXT   ENDS
END
Also 29 instructions in 64 bytes.
OUCH¹: again 6 (in words: six) superfluous conditional branch instructions, and 2 registers clobbered without necessity – the optimiser and code generator of the Visual C compiler is a crime against all i386 processors since the Pentium®Pro and their users!
 OUCH²: the first 2 highlighted instructions
            should be replaced with a single
            DEC instruction, saving 6 bytes.
        
 OUCH³: the last 2 highlighted instructions
            should be replaced with two 2-byte
            NOPs, which can then be removed
            together with the 2
            POP and the
            following RET
            instruction, saving 7 more bytes.
        
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
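The expression (p > q) - (p < q) compiled above is the usual three-way comparison idiom: each relational operator yields the int value 0 or 1, so their difference is 1, 0 or -1 and cannot overflow. The following is a minimal sketch in C, not the source file i386-compare.c itself; the names cmp_i64(), cmp_u64() and check_cmp() are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Signed three-way comparison: 1 if p > q, 0 if p == q, -1 if p < q */
int cmp_i64(int64_t p, int64_t q)
{
    return (p > q) - (p < q);
}

/* Unsigned three-way comparison with the same result convention */
int cmp_u64(uint64_t p, uint64_t q)
{
    return (p > q) - (p < q);
}

/* Hypothetical self-check; it deliberately includes operands whose
   upper 32 bits are equal, so only the lower halves decide the result */
void check_cmp(void)
{
    assert(cmp_i64(2, 1) == 1);
    assert(cmp_i64(1, 2) == -1);
    assert(cmp_i64(7, 7) == 0);
    assert(cmp_i64(-1, 0) == -1);
    assert(cmp_i64(INT64_MIN, INT64_MAX) == -1);
    assert(cmp_u64(UINT64_MAX, 0) == 1);
    assert(cmp_u64(0x100000000ULL, 0xFFFFFFFFULL) == 1);
    assert(cmp_u64(0, 1) == -1);
}
```

Operand pairs that differ only in their lower 32 bits, such as 2 and 1, are worth keeping in any test of a hand-written 64-bit comparison, since the flags of the final SBB instruction describe only the upper half of the difference.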
_allcmp() and _aullcmp() Functions in i386 Assembler
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model  flat, C
	.code
_allcmp	proc	public			; int _allcmp(sqword p, sqword q)
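	xor	eax, eax		; eax = 0
					; NOTE: ZF after sbb reflects only the upper
					;       dword, so setg can not be used here
	mov	ecx, [esp+12]
	mov	edx, [esp+16]		; edx:ecx = q
	cmp	ecx, [esp+4]
	sbb	edx, [esp+8]		; edx:... = q - p,
					; SF != OF iff q < p
	setl	al			; eax = (p > q)
	mov	ecx, [esp+4]
	mov	edx, [esp+8]		; edx:ecx = p
	cmp	ecx, [esp+12]
	sbb	edx, [esp+16]		; edx:... = p - q,
					; SF != OF iff p < q
	cdq				; edx = 0, eflags unchanged
	setl	dl			; edx = (p < q)
	sub	eax, edx		; eax = (p > q) - (p < q)
					;     = {1, 0, -1}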
	ret
_allcmp	endp
_aullcmp proc	public			; int _aullcmp(qword p, qword q)
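	xor	eax, eax		; eax = 0
					; NOTE: ZF after sbb reflects only the upper
					;       dword, so seta can not be used here
	mov	ecx, [esp+12]
	mov	edx, [esp+16]		; edx:ecx = q
	cmp	ecx, [esp+4]
	sbb	edx, [esp+8]		; edx:... = q - p,
					; CF = (q < p)
	adc	eax, eax		; eax = (p > q)
	mov	ecx, [esp+4]
	mov	edx, [esp+8]		; edx:ecx = p
	cmp	ecx, [esp+12]
	sbb	edx, [esp+16]		; edx:... = p - q,
					; CF = (p < q)
	sbb	eax, 0			; eax = (p > q) - (p < q)
					;     = {1, 0, -1}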
	ret
_aullcmp endp
	end
            Instead of *max() functions, the
            Visual C compiler provides a preprocessor macro
            __max:
        
            #define __max(a,b) (((a) > (b)) ? (a) : (b))
        
            Note: the header files shipped with the
            Windows platform software development kit provide the
            preprocessor macros
            min
            and
            max,
            which are (fortunately) not defined when the preprocessor macro
            NOMINMAX is defined.
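One caveat of function-like macros of this shape is that the selected argument is evaluated twice. The following minimal sketch in C makes this visible; DEMO_MAX is a hypothetical stand-in with the same body as the macro shown above, and next_value(), call_count and demo_double_evaluation() are names assumed only for this illustration.

```c
#include <assert.h>

/* Hypothetical stand-in with the same replacement text as __max */
#define DEMO_MAX(a,b) (((a) > (b)) ? (a) : (b))

/* Number of calls of next_value() so far */
int call_count = 0;

/* Returns 1, 2, 3, ... on successive calls */
int next_value(void)
{
    return ++call_count;
}

/* The macro expands to
       (((next_value()) > (0)) ? (next_value()) : (0))
   so next_value() is called once for the comparison and a
   second time to produce the result */
int demo_double_evaluation(void)
{
    call_count = 0;
    return DEMO_MAX(next_value(), 0);
}

/* Hypothetical self-check: the result is the value of the SECOND call */
void check_double_evaluation(void)
{
    assert(demo_double_evaluation() == 2);
    assert(call_count == 2);
}
```

A real function such as the _allmax() presented below evaluates each argument exactly once.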
        // Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifdef UNSIGNED
unsigned long long _aullmax(unsigned long long p, unsigned long long q)
#else
long long _allmax(long long p, long long q)
#endif
{
    return p > q ? p : q;
}
            Save the C source presented above as i386-maximum.c in an
            arbitrary, preferably empty directory, then execute the following 2
            command lines to compile it and display the generated assembly a
            first time:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-maximum.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
i386-maximum.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
        TITLE   C:\Users\Stefan\Desktop\i386-maximum.c
        .686P
        .XMM
        include listing.inc
        .model  flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC  __allmax
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-maximum.c
;       COMDAT __allmax
_TEXT   SEGMENT
_p$ = 8                                 ; size = 8
_q$ = 16                                ; size = 8
__allmax PROC                           ; COMDAT
; 9    :     return p > q ? p : q;
  00000 8b 4c 24 08       mov     ecx, DWORD PTR _p$[esp]
  00004 8b 54 24 10       mov     edx, DWORD PTR _q$[esp]
  00008 56                push    esi
  00009 8b 74 24 10       mov     esi, DWORD PTR _q$[esp]
  0000d 3b ca             cmp     ecx, edx
  0000f 7c 0e             jl      SHORT $LN3@allmax
  00011 8b 44 24 08       mov     eax, DWORD PTR _p$[esp]
  00015 7f 04             jg      SHORT $LN5@allmax
  00017 3b c6             cmp     eax, esi
  00019 76 04             jbe     SHORT $LN3@allmax
$LN5@allmax:
  0001b 8b d1             mov     edx, ecx
  0001d 5e                pop     esi
; 10   : }
  0001e c3                ret     0
$LN3@allmax:
; 9    :     return p > q ? p : q;
  0001f 8b c6             mov     eax, esi
  00021 5e                pop     esi
; 10   : }
  00022 c3                ret     0
__allmax ENDP
_TEXT   ENDS
END
16 instructions in 35 bytes.
OUCH: 3 superfluous conditional branch instructions, and 1 register clobbered without necessity – the optimiser and code generator of the Visual C compiler is a crime against all i386 processors since the Pentium®Pro and their users!
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
 Compile the source file i386-maximum.c a second time,
            now with the preprocessor macro UNSIGNED defined, and
            display the generated assembly:
        
CL.EXE /DUNSIGNED i386-maximum.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
i386-maximum.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
        TITLE   C:\Users\Stefan\Desktop\i386-maximum.c
        .686P
        .XMM
        include listing.inc
        .model  flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC  __aullmax
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-maximum.c
;       COMDAT __aullmax
_TEXT   SEGMENT
_p$ = 8                                 ; size = 8
_q$ = 16                                ; size = 8
__aullmax PROC                          ; COMDAT
; 9    :     return p > q ? p : q;
  00000 8b 4c 24 08       mov     ecx, DWORD PTR _p$[esp]
  00004 8b 54 24 10       mov     edx, DWORD PTR _q$[esp]
  00008 56                push    esi
  00009 8b 74 24 10       mov     esi, DWORD PTR _q$[esp]
  0000d 3b ca             cmp     ecx, edx
  0000f 72 0e             jb      SHORT $LN3@aullmax
  00011 8b 44 24 08       mov     eax, DWORD PTR _p$[esp]
  00015 77 04             ja      SHORT $LN5@aullmax
  00017 3b c6             cmp     eax, esi
  00019 76 04             jbe     SHORT $LN3@aullmax
$LN5@aullmax:
  0001b 8b d1             mov     edx, ecx
  0001d 5e                pop     esi
; 10   : }
  0001e c3                ret     0
$LN3@aullmax:
; 9    :     return p > q ? p : q;
  0001f 8b c6             mov     eax, esi
  00021 5e                pop     esi
; 10   : }
  00022 c3                ret     0
__aullmax ENDP
_TEXT   ENDS
END
Also 16 instructions in 35 bytes.
OUCH: again 3 superfluous conditional branch instructions, and 1 register clobbered without necessity – the optimiser and code generator of the Visual C compiler is a crime against all i386 processors since the Pentium®Pro and their users!
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
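A branch-free maximum can be built from the difference p - q and an all-ones or all-zeroes mask: q + ((p - q) & mask) yields p when the mask is all ones and q when it is zero. The following is a minimal sketch of this technique in C, not a transcription of any particular routine; the names max_u64(), max_i64() and check_max() are hypothetical. The signed variant performs its arithmetic on unsigned values to avoid undefined behaviour on overflow, and assumes the usual two's complement conversion back to int64_t.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned maximum without a branch in the source */
uint64_t max_u64(uint64_t p, uint64_t q)
{
    uint64_t diff = p - q;                    /* wraps modulo 2**64 */
    uint64_t mask = (uint64_t)0 - (p >= q);   /* all ones if p >= q, else 0 */
    return q + (diff & mask);                 /* q + (p - q) = p, or q + 0 = q */
}

/* Signed maximum; the mask comes from the signed comparison,
   the arithmetic is done modulo 2**64 */
int64_t max_i64(int64_t p, int64_t q)
{
    uint64_t diff = (uint64_t)p - (uint64_t)q;
    uint64_t mask = (uint64_t)0 - (uint64_t)(p >= q);
    return (int64_t)((uint64_t)q + (diff & mask));
}

/* Hypothetical self-check including operands whose difference
   does not fit into 64 signed bits */
void check_max(void)
{
    assert(max_u64(3, 5) == 5);
    assert(max_u64(5, 3) == 5);
    assert(max_u64(UINT64_MAX, 0) == UINT64_MAX);
    assert(max_i64(-1, INT64_MAX) == INT64_MAX);
    assert(max_i64(INT64_MAX, -1) == INT64_MAX);
    assert(max_i64(INT64_MIN, 0) == 0);
}
```

Whether a compiler turns the comparison into a flag-based instruction sequence or into a branch is up to its code generator; the sketch only shows the arithmetic.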
_allmax() and _aullmax() Functions in i386 Assembler
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model  flat, C
	.code
_allmax	proc	public			; sqword _allmax(sqword p, sqword q)
	mov	ecx, [esp+4]
	mov	eax, [esp+8]		; eax:ecx = p
	sub	ecx, [esp+12]
	sbb	eax, [esp+16]		; eax:ecx = p - q
	setl	dl			; SF != OF iff p < q, even if p - q overflows
	movzx	edx, dl			; edx = (p < q) ? 1 : 0
	dec	edx			; edx = (p < q) ? 0 : -1
	and	ecx, edx
	and	edx, eax		; edx:ecx = (p < q) ? 0 : p - q
	add	ecx, [esp+12]
	adc	edx, [esp+16]		; edx:ecx = (p < q) ? q : p
	mov	eax, ecx		; edx:eax = max(p, q)
	ret
_allmax	endp
_aullmax proc	public			; qword _aullmax(qword p, qword q)
	mov	eax, [esp+4]
	mov	edx, [esp+8]		; edx:eax = p
	sub	eax, [esp+12]
	sbb	edx, [esp+16]		; edx:eax = p - q
	cmc				; CF = ~(p < q)
	sbb	ecx, ecx		; ecx = (p >= q) ? -1 : 0
	and	eax, ecx
	and	edx, ecx		; edx:eax = (p >= q) ? p - q : 0
	add	eax, [esp+12]
	adc	edx, [esp+16]		; edx:eax = (p >= q) ? p : q
					;         = max(p, q)
	ret
_aullmax endp
	end
            Instead of *min() functions, the
            Visual C compiler provides a preprocessor macro
            __min:
        
            #define __min(a,b) (((a) < (b)) ? (a) : (b))
        
            Note: the header files shipped with the
            Windows platform software development kit provide the
            preprocessor macros
            min
            and
            max,
            which are (fortunately) not defined when the preprocessor macro
            NOMINMAX is defined.
        // Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifdef UNSIGNED
unsigned long long _aullmin(unsigned long long p, unsigned long long q)
#else
long long _allmin(long long p, long long q)
#endif
{
    return p < q ? p : q;
}
            Save the C source presented above as i386-minimum.c in an
            arbitrary, preferably empty directory, then execute the following 2
            command lines to compile it and display the generated assembly a
            first time:
SET CL=/c /FAsc /FaCON: /FoNUL: /Oxy /W4 /X /Zl
CL.EXE i386-minimum.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
i386-minimum.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
        TITLE   C:\Users\Stefan\Desktop\i386-minimum.c
        .686P
        .XMM
        include listing.inc
        .model  flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC  __allmin
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-minimum.c
;       COMDAT __allmin
_TEXT   SEGMENT
_p$ = 8                                 ; size = 8
_q$ = 16                                ; size = 8
__allmin PROC                           ; COMDAT
; 9    :     return p < q ? p : q;
  00000 8b 4c 24 08       mov     ecx, DWORD PTR _p$[esp]
  00004 8b 54 24 10       mov     edx, DWORD PTR _q$[esp]
  00008 56                push    esi
  00009 8b 74 24 10       mov     esi, DWORD PTR _q$[esp]
  0000d 3b ca             cmp     ecx, edx
  0000f 7f 0e             jg      SHORT $LN3@allmin
  00011 8b 44 24 08       mov     eax, DWORD PTR _p$[esp]
  00015 7c 04             jl      SHORT $LN5@allmin
  00017 3b c6             cmp     eax, esi
  00019 73 04             jae     SHORT $LN3@allmin
$LN5@allmin:
  0001b 8b d1             mov     edx, ecx
  0001d 5e                pop     esi
; 10   : }
  0001e c3                ret     0
$LN3@allmin:
; 9    :     return p < q ? p : q;
  0001f 8b c6             mov     eax, esi
  00021 5e                pop     esi
; 10   : }
  00022 c3                ret     0
__allmin ENDP
_TEXT   ENDS
END
16 instructions in 35 bytes.
OUCH: 3 superfluous conditional branch instructions, and 1 register clobbered without necessity – the optimiser and code generator of the Visual C compiler is a crime against all i386 processors since the Pentium®Pro and their users!
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
 Compile the source file i386-minimum.c a second time,
            now with the preprocessor macro UNSIGNED defined, and
            display the generated assembly:
        
CL.EXE /DUNSIGNED i386-minimum.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
i386-minimum.c
; Listing generated by Microsoft (R) Optimizing Compiler Version 16.00.40219.01
        TITLE   C:\Users\Stefan\Desktop\i386-minimum.c
        .686P
        .XMM
        include listing.inc
        .model  flat
INCLUDELIB LIBCMT
INCLUDELIB OLDNAMES
PUBLIC  __aullmin
; Function compile flags: /Ogtpy
; File c:\users\stefan\desktop\i386-minimum.c
;       COMDAT __aullmin
_TEXT   SEGMENT
_p$ = 8                                 ; size = 8
_q$ = 16                                ; size = 8
__aullmin PROC                          ; COMDAT
; 9    :     return p < q ? p : q;
  00000 8b 4c 24 08       mov     ecx, DWORD PTR _p$[esp]
  00004 8b 54 24 10       mov     edx, DWORD PTR _q$[esp]
  00008 56                push    esi
  00009 8b 74 24 10       mov     esi, DWORD PTR _q$[esp]
  0000d 3b ca             cmp     ecx, edx
  0000f 77 0e             ja      SHORT $LN3@aullmin
  00011 8b 44 24 08       mov     eax, DWORD PTR _p$[esp]
  00015 72 04             jb      SHORT $LN5@aullmin
  00017 3b c6             cmp     eax, esi
  00019 73 04             jae     SHORT $LN3@aullmin
$LN5@aullmin:
  0001b 8b d1             mov     edx, ecx
  0001d 5e                pop     esi
; 10   : }
  0001e c3                ret     0
$LN3@aullmin:
; 9    :     return p < q ? p : q;
  0001f 8b c6             mov     eax, esi
  00021 5e                pop     esi
; 10   : }
  00022 c3                ret     0
__aullmin ENDP
_TEXT   ENDS
END
Also 16 instructions in 35 bytes.
OUCH: again 3 superfluous conditional branch instructions, and 1 register clobbered without necessity – the optimiser and code generator of the Visual C compiler is a crime against all i386 processors since the Pentium®Pro and their users!
Spoiler: if you are curious whether the current version of the Visual C compiler has these deficiencies fixed, view its output in Compiler Explorer.
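For signed operands, the sign bit of the difference p - q equals (p < q) only as long as the subtraction does not overflow; in assembler terms the reliable condition is SF != OF, not SF alone. The following minimal sketch in C contrasts both ways to derive the selection mask for a branch-free minimum; the names min_by_sign(), min_by_compare() and check_min() are hypothetical, and the conversion back to int64_t assumes two's complement behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Mask taken from the sign bit of the wrapped difference:
   correct only while p - q fits into 64 signed bits */
int64_t min_by_sign(int64_t p, int64_t q)
{
    uint64_t diff = (uint64_t)p - (uint64_t)q;
    uint64_t mask = (uint64_t)0 - (diff >> 63);        /* all ones if sign bit set */
    return (int64_t)((uint64_t)q + (diff & mask));
}

/* Mask taken from the signed comparison itself:
   correct for all operands */
int64_t min_by_compare(int64_t p, int64_t q)
{
    uint64_t diff = (uint64_t)p - (uint64_t)q;
    uint64_t mask = (uint64_t)0 - (uint64_t)(p < q);   /* all ones if p < q */
    return (int64_t)((uint64_t)q + (diff & mask));
}

/* Hypothetical self-check: both variants agree for small operands,
   but differ when INT64_MAX - (-1) wraps to a negative value */
void check_min(void)
{
    assert(min_by_sign(3, 5) == 3);
    assert(min_by_compare(3, 5) == 3);
    assert(min_by_sign(-9, -7) == -9);
    assert(min_by_compare(-9, -7) == -9);
    assert(min_by_compare(INT64_MAX, -1) == -1);
    assert(min_by_sign(INT64_MAX, -1) == INT64_MAX);   /* wrong minimum */
}
```

The last assertion documents the failure mode instead of hiding it: the wrapped difference is negative although p is greater than q, so the sign-based mask selects the wrong operand.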
_allmin() and _aullmin() Functions in i386 Assembler
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model  flat, C
	.code
_allmin	proc	public			; sqword _allmin(sqword p, sqword q)
	mov	ecx, [esp+4]
	mov	eax, [esp+8]		; eax:ecx = p
	sub	ecx, [esp+12]
	sbb	eax, [esp+16]		; eax:ecx = p - q
	setl	dl			; SF != OF iff p < q, even if p - q overflows
	movzx	edx, dl			; edx = (p < q) ? 1 : 0
	neg	edx			; edx = (p < q) ? -1 : 0
	and	ecx, edx
	and	edx, eax		; edx:ecx = (p < q) ? p - q : 0
	add	ecx, [esp+12]
	adc	edx, [esp+16]		; edx:ecx = (p < q) ? p : q
	mov	eax, ecx		; edx:eax = min(p, q)
	ret
_allmin	endp
_aullmin proc	public			; qword _aullmin(qword p, qword q)
	mov	eax, [esp+4]
	mov	edx, [esp+8]		; edx:eax = p
	sub	eax, [esp+12]
	sbb	edx, [esp+16]		; edx:eax = p - q
	sbb	ecx, ecx		; ecx = (p < q) ? -1 : 0
	and	eax, ecx
	and	edx, ecx		; edx:eax = (p < q) ? p - q : 0
	add	eax, [esp+12]
	adc	edx, [esp+16]		; edx:eax = (p < q) ? p : q
					;         = min(p, q)
	ret
_aullmin endp
	end
acos(), asin(), atan(), atan2(), cos(), cosh(), exp(), fmod(), log(), log10(), pow(), sin(), sinh(), sqrt(), tan() and tanh() Standard Functions for i386 Platform
The following math library functions have intrinsic forms on all architectures:
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
double blunder(double x)
{
    x = acos(x);
    x = asin(x);
    x = atan(x);
    x = atan2(x, x);
    x = cos(x);
    x = cosh(x);
    x = exp(x);
    x = fmod(x, x);
    x = log(x);
    x = log10(x);
    x = pow(x, x);
    x = sin(x);
    x = sinh(x);
    x = sqrt(x);
    x = tan(x);
    x = tanh(x);
    return x;
}
            Save the C source presented above as i386-blunder.c in an
            arbitrary, preferably empty directory, then execute the following 3
            command lines to compile and (attempt to) link it a first time:
SET CL=/Oi /W4
SET LINK=/ENTRY:blunder /MACHINE:I386 /SUBSYSTEM:CONSOLE
CL.EXE /MD i386-blunder.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
i386-blunder.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/ENTRY:blunder /MACHINE:I386 /SUBSYSTEM:CONSOLE
/out:i386-blunder.exe
i386-blunder.obj
i386-blunder.obj : error LNK2019: unresolved external symbol __CItanh referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CItan referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIsinh referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIlog10 referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIfmod referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIexp referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIcosh referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIatan2 referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIatan referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIasin referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol __CIacos referenced in function _fault
i386-blunder.exe : fatal error LNK1120: 11 unresolved externals
            OUCH¹: most obviously nobody at
            Microsoft ever built an application for the
            i386 platform which uses one of the floating-point
            functions
            acos(),
            asin(),
            atan(),
            atan2(),
            cosh(),
            exp(),
            fmod(),
            log10(),
            sinh(),
            tan()
            or
            tanh()
            with
            msvcrt.lib!
        Repeat the first trial without using the intrinsic functions:
CL.EXE /fp:strict /FImath.h /MD i386-blunder.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
i386-blunder.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/ENTRY:blunder /MACHINE:I386 /SUBSYSTEM:CONSOLE
/out:i386-blunder.exe
i386-blunder.obj
i386-blunder.obj : error LNK2019: unresolved external symbol _tanh referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _tan referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _sqrt referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _sinh referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _sin referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _pow referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _log10 referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _log referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _fmod referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _exp referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _cosh referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _cos referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _atan2 referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _atan referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _asin referenced in function _fault
i386-blunder.obj : error LNK2019: unresolved external symbol _acos referenced in function _fault
i386-blunder.exe : fatal error LNK1120: 16 unresolved externals
            OUCH²: the steaming pile of crap got even
            worse!
         Execute the following 2 command lines to compile and (attempt to)
            link i386-blunder.c a third time, now as
            DLL
            and with the static runtime library
            libcmt.lib:
        
SET LINK=/MACHINE:I386 /NOENTRY
CL.EXE /LD i386-blunder.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
i386-blunder.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/MACHINE:I386 /NOENTRY
/out:i386-blunder.dll
/dll
/implib:i386-blunder.lib
i386-blunder.obj
LIBCMT.lib(crt0.obj) : error LNK2019: unresolved external symbol _main referenced in function ___tmainCRTStartup
i386-blunder.dll : fatal error LNK1120: 1 unresolved externals
            OUCH³: despite building a
            DLL,
            the (intrinsic) floating-point functions reference an undocumented
            internal (startup) routine __tmainCRTStartup() for
            console applications, which in turn references a
            main()
            function – most obviously also nobody at
            Microsoft ever tried to build a
            DLL
            which uses floating-point functions!
        Repeat the third trial without using the intrinsic functions:
CL.EXE /fp:strict /FImath.h /LD i386-blunder.c
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
i386-blunder.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/MACHINE:I386 /NOENTRY
/out:i386-blunder.dll
/dll
/implib:i386-blunder.lib
i386-blunder.obj
LIBCMT.lib(crt0.obj) : error LNK2019: unresolved external symbol _main referenced in function ___tmainCRTStartup
i386-blunder.dll : fatal error LNK1120: 1 unresolved externals
            OUCH⁴: like before!
        Note: a repetition of the last 2 trials in the 64-bit build environment is left as an exercise to the reader!
_CI* and _ftol* Routines
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.686
	.model	flat, C
single	record	sign:1, exponent:8, mantissa:23
bias	equ	1 shl (width exponent - 1) - 1
	.const
	public	_fltused
_fltused dword	9876h
	.code
; MSC internal intrinsic _CIacos():
; receives argument in FPU st(0), returns result in FPU st(0)
; NOTE: _CIacos() returns correct result for ±0.0 and ±1.0
_CIacos	proc	public
	fld	st(0)		; st(0) = st(1) = argument
	fmul	st(0), st(0)	; st(0) = argument**2,
				; st(1) = argument
	fld1			; st(0) = 1.0,
				; st(1) = argument**2,
				; st(2) = argument
	fsubrp	st(1), st(0)	; st(0) = 1.0 - argument**2,
				; st(1) = argument
	fsqrt			; st(0) = square root of (1.0 - argument**2),
				; st(1) = argument
	fxch	st(1)		; st(0) = argument,
				; st(1) = square root of (1.0 - argument**2)
	fpatan			; st(0) = inverse circular cosine of argument
	ret
_CIacos	endp
; MSC internal intrinsic _CIasin():
; receives argument in FPU st(0), returns result in FPU st(0)
; NOTE: _CIasin() returns correct result for ±0.0 and ±1.0
_CIasin	proc	public
	fld	st(0)		; st(0) = st(1) = argument
	fmul	st(0), st(0)	; st(0) = argument**2,
				; st(1) = argument
	fld1			; st(0) = 1.0,
				; st(1) = argument**2,
				; st(2) = argument
	fsubrp	st(1), st(0)	; st(0) = 1.0 - argument**2,
				; st(1) = argument
	fsqrt			; st(0) = square root of (1.0 - argument**2),
				; st(1) = argument
	fpatan			; st(0) = inverse circular sine of argument
	ret
_CIasin	endp
; MSC internal intrinsic _CIatan():
; receives argument in FPU st(0), returns result in FPU st(0)
; NOTE: _CIatan() returns correct result for ±0.0 and ±INFINITY
_CIatan	proc	public
	fld1			; st(0) = 1.0,
				; st(1) = argument
	fpatan			; st(0) = inverse circular tangent of (argument / 1.0)
	ret
_CIatan	endp
; MSC internal intrinsic _CIatan2():
; receives arguments in FPU st(0) and st(1), returns result in FPU st(0)
; NOTE: _CIatan2() returns correct result for ±0.0 and ±INFINITY
_CIatan2 proc	public
	fxch	st(1)		; st(0) = denominator,
				; st(1) = numerator
	fpatan			; st(0) = inverse circular tangent of (numerator / denominator)
	ret
_CIatan2 endp
; MSC internal intrinsic _CIcos():
; receives argument in FPU st(0), returns result in FPU st(0)
_CIcos	proc	public
	fcos			; st(0) = cosine of argument
	fstsw	ax		; ax = FPU status word,
				; ah = B:C3:T:O:P:C2:C1:C0
	sahf			; SF:ZF:0:AF:0:PF:1:CF = ah
	jnp	short return	; |argument| < 2**63?
	fldpi			; st(0) = pi,
				; st(1) = argument
	fadd	st(0), st(0)	; st(0) = 2.0 * pi,
				; st(1) = argument
	fxch	st(1)		; st(0) = argument,
				; st(1) = 2.0 * pi
reduce:
	fprem1			; st(0) = argument modulo (2.0 * pi),
				; st(1) = 2.0 * pi
	fstsw	ax		; ax = FPU status word,
				; ah = B:C3:T:O:P:C2:C1:C0
	sahf			; SF:ZF:0:AF:0:PF:1:CF = ah
	jp	short reduce
	fstp	st(1)		; st(0) = argument modulo (2.0 * pi)
	fcos			; st(0) = cosine of argument modulo (2.0 * pi)
return:
	ret
_CIcos	endp
; MSC internal intrinsic _CIcosh():
; receives argument in FPU st(0), returns result in FPU st(0)
_CIcosh	proc	public
	call	_CIexp		; st(0) = e**argument
	fld1			; st(0) = 1.0,
				; st(1) = e**argument
	fdiv	st(0), st(1)	; st(0) = 1.0 / e**argument = e**-argument,
				; st(1) = e**argument
	faddp	st(1), st(0)	; st(0) = e**argument + e**-argument
	push	(bias - 1) shl width mantissa
				; [esp] = 0x3F000000
				;       = 0.5F
	fmul	real4 ptr [esp]	; st(0) = hyperbolic cosine of argument
	pop	eax
	ret
_CIcosh	endp
; MSC internal intrinsic _CIexp():
; receives argument in FPU st(0), returns result in FPU st(0)
; NOTE: _CIexp() returns correct result for ±INFINITY
_CIexp	proc	public
	fldl2e			; st(0) = log2(e),
				; st(1) = exponent
	fmulp	st(1), st(0)	; st(0) = exponent * log2(e)
if 0
	fld1			; st(0) = 1.0,
				; st(1) = exponent * log2(e)
	fld	st(1)		; st(0) = exponent * log2(e),
				; st(1) = 1.0,
				; st(2) = exponent * log2(e)
	fprem			; st(0) = (exponent * log2(e)) modulo 1.0,
				; st(1) = 1.0,
				; st(2) = exponent * log2(e)
	f2xm1			; st(0) = 2.0**((exponent * log2(e)) modulo 1.0) - 1.0,
				; st(1) = 1.0,
				; st(2) = exponent * log2(e)
	faddp	st(1), st(0)	; st(0) = 2.0**((exponent * log2(e)) modulo 1.0),
				; st(1) = exponent * log2(e)
	fscale			; st(0) = e**exponent,
				; st(1) = exponent * log2(e)
else
	fld	st(0)		; st(0) = st(1) = exponent * log2(e)
	frndint			; st(0) = integer(exponent * log2(e)),
				; st(1) = exponent * log2(e)
	fsub	st(1), st(0)	; st(0) = integer(exponent * log2(e)),
				; st(1) = fraction(exponent * log2(e))
	fxch	st(1)		; st(0) = fraction(exponent * log2(e)),
				; st(1) = integer(exponent * log2(e))
	f2xm1			; st(0) = 2.0**fraction(exponent * log2(e)) - 1.0,
				; st(1) = integer(exponent * log2(e))
	fld1			; st(0) = 1.0,
				; st(1) = 2.0**fraction(exponent * log2(e)) - 1.0,
				; st(2) = integer(exponent * log2(e))
	faddp	st(1), st(0)	; st(0) = 2.0**fraction(exponent * log2(e)),
				; st(1) = integer(exponent * log2(e))
	fscale			; st(0) = e**exponent,
				; st(1) = integer(exponent * log2(e))
endif
	fstp	st(1)		; st(0) = e**exponent
	ret
_CIexp	endp
; MSC internal intrinsic _CIfmod():
; receives arguments in FPU st(0) and st(1), returns result in FPU st(0)
_CIfmod	proc	public
reduce:
	fprem			; st(0) = remainder,
				; st(1) = divisor
	fstsw	ax		; ax = FPU status word,
				; ah = B:C3:T:O:P:C2:C1:C0
	sahf			; SF:ZF:0:AF:0:PF:1:CF = ah
	jp	short reduce
	fstp	st(1)		; st(0) = remainder
	ret
_CIfmod	endp
; MSC internal intrinsic _CIlog():
; receives argument in FPU st(0), returns result in FPU st(0)
_CIlog	proc	public
	fldln2			; st(0) = ln(2.0),
				; st(1) = argument
	fxch	st(1)		; st(0) = argument,
				; st(1) = ln(2.0)
	fyl2x			; st(0) = natural logarithm of argument
	ret
_CIlog	endp
; MSC internal intrinsic _CIlog10():
; receives argument in FPU st(0), returns result in FPU st(0)
_CIlog10 proc	public
	fldlg2			; st(0) = log10(2.0),
				; st(1) = argument
	fxch	st(1)		; st(0) = argument,
				; st(1) = log10(2.0)
	fyl2x			; st(0) = logarithm to base 10 of argument
	ret
_CIlog10 endp
; MSC internal intrinsic _CIpow():
; receives arguments in FPU st(0) and st(1), returns result in FPU st(0)
_CIpow	proc	public
	fxch	st(1)		; st(0) = base,
				; st(1) = exponent
	fyl2x			; st(0) = exponent * log2(base)
	fld	st(0)		; st(0) = st(1) = exponent * log2(base)
	frndint			; st(0) = integer(exponent * log2(base)),
				; st(1) = exponent * log2(base)
	fsub	st(1), st(0)	; st(0) = integer(exponent * log2(base)),
				; st(1) = fraction(exponent * log2(base))
	fxch	st(1)		; st(0) = fraction(exponent * log2(base)),
				; st(1) = integer(exponent * log2(base))
	f2xm1			; st(0) = 2.0**fraction(exponent * log2(base)) - 1.0,
				; st(1) = integer(exponent * log2(base))
	fld1			; st(0) = 1.0,
				; st(1) = 2.0**fraction(exponent * log2(base)) - 1.0,
				; st(2) = integer(exponent * log2(base))
	faddp	st(1), st(0)	; st(0) = 2.0**fraction(exponent * log2(base)),
				; st(1) = integer(exponent * log2(base))
	fscale			; st(0) = base**exponent,
				; st(1) = integer(exponent * log2(base))
	fstp	st(1)		; st(0) = base**exponent
	ret
_CIpow	endp
; MSC internal intrinsic _CIsin():
; receives argument in FPU st(0), returns result in FPU st(0)
_CIsin	proc	public
	fsin			; st(0) = sine of argument
	fstsw	ax		; ax = FPU status word,
				; ah = B:C3:T:O:P:C2:C1:C0
	sahf			; SF:ZF:0:AF:0:PF:1:CF = ah
	jnp	short return	; |argument| < 2**63?
	fldpi			; st(0) = pi,
				; st(1) = argument
	fadd	st(0), st(0)	; st(0) = 2.0 * pi,
				; st(1) = argument
	fxch	st(1)		; st(0) = argument,
				; st(1) = 2.0 * pi
reduce:
	fprem1			; st(0) = argument modulo (2.0 * pi),
				; st(1) = 2.0 * pi
	fstsw	ax		; ax = FPU status word,
				; ah = B:C3:T:O:P:C2:C1:C0
	sahf			; SF:ZF:0:AF:0:PF:1:CF = ah
	jp	short reduce
	fstp	st(1)		; st(0) = argument modulo (2.0 * pi)
	fsin			; st(0) = sine of argument modulo (2.0 * pi)
return:
	ret
_CIsin	endp
; MSC internal intrinsic _CIsinh():
; receives argument in FPU st(0), returns result in FPU st(0)
_CIsinh	proc	public
	call	_CIexp		; st(0) = e**argument
	fld1			; st(0) = 1.0,
				; st(1) = e**argument
	fdiv	st(0), st(1)	; st(0) = 1.0 / e**argument = e**-argument,
				; st(1) = e**argument
	fsubp	st(1), st(0)	; st(0) = e**argument - e**-argument
	push	(bias - 1) shl width mantissa
				; [esp] = 0x3F000000
				;       = 0.5F
	fmul	real4 ptr [esp]	; st(0) = hyperbolic sine of argument
	pop	eax
	ret
_CIsinh	endp
; MSC internal intrinsic _CIsqrt():
; receives argument in FPU st(0), returns result in FPU st(0)
_CIsqrt	proc	public
	fsqrt			; st(0) = square root of radicand
	ret
_CIsqrt	endp
; MSC internal intrinsic _CItan():
; receives argument in FPU st(0), returns result in FPU st(0)
_CItan	proc	public
	fptan			; st(0) = 1.0,
				; st(1) = tangent of argument
	fstsw	ax		; ax = FPU status word,
				; ah = B:C3:T:O:P:C2:C1:C0
	sahf			; SF:ZF:0:AF:0:PF:1:CF = ah
	jnp	short return	; |argument| < 2**63?
	fldpi			; st(0) = pi,
				; st(1) = argument
	fadd	st(0), st(0)	; st(0) = 2.0 * pi,
				; st(1) = argument
	fxch	st(1)		; st(0) = argument,
				; st(1) = 2.0 * pi
reduce:
	fprem1			; st(0) = argument modulo (2.0 * pi),
				; st(1) = 2.0 * pi
	fstsw	ax		; ax = FPU status word,
				; ah = B:C3:T:O:P:C2:C1:C0
	sahf			; SF:ZF:0:AF:0:PF:1:CF = ah
	jp	short reduce
	fstp	st(1)		; st(0) = argument modulo (2.0 * pi)
	fptan			; st(0) = 1.0,
				; st(1) = tangent of argument modulo (2.0 * pi)
return:
	fstp	st(0)		; st(0) = tangent of argument
	ret
_CItan	endp
; MSC internal intrinsic _CItanh():
; receives argument in FPU st(0), returns result in FPU st(0)
_CItanh	proc	public
	call	_CIexp		; st(0) = e**argument
	fmul	st(0), st(0)	; st(0) = e**argument * e**argument
				;       = e**(argument + argument)
	fld1			; st(0) = 1.0,
				; st(1) = e**(argument + argument)
	fadd	st(1), st(0)	; st(0) = 1.0,
				; st(1) = e**(argument + argument) + 1.0
	fadd	st(0), st(0)	; st(0) = 2.0,
				; st(1) = e**(argument + argument) + 1.0
	fdivrp	st(1), st(0)	; st(0) = 2.0 / (e**(argument + argument) + 1.0)
	fld1			; st(0) = 1.0,
				; st(1) = 2.0 / (e**(argument + argument) + 1.0)
	fsubrp	st(1), st(0)	; st(0) = 1.0 - 2.0 / (e**(argument + argument) + 1.0)
				;       = hyperbolic tangent of argument
	ret
_CItanh	endp
; MSC internal intrinsic _ftol():
; receives argument in FPU st(0), returns result in eax
; NOTE: fistp rounds to nearest (even) integer!
_ftol	proc	public
	push	eax
	fistp	dword ptr [esp]	; [esp] = integer(argument)
	pop	eax		; eax = integer(argument)
	ret
_ftol	endp
; MSC internal intrinsic _ftol2():
; receives argument in FPU st(0), returns result in edx:eax
; NOTE: fistp rounds to nearest (even) integer!
_ftol2	proc	public
	push	edx
	push	eax
	fistp	qword ptr [esp]	; [esp] = integer(argument)
	pop	eax
	pop	edx		; edx:eax = integer(argument)
	ret
_ftol2	endp
; MSC internal intrinsic _ftol2_sse():
; receives argument in FPU st(0), returns result in edx:eax
; NOTE: fisttp truncates, i.e. rounds towards ±0!
_ftol2_sse proc	public
	push	edx
	push	eax
	fisttp	qword ptr [esp]	; [esp] = integer(argument)
	pop	eax
	pop	edx		; edx:eax = integer(argument)
	ret
_ftol2_sse endp
	end
            Save the i386 assembler source presented above as
            i386-fpu.asm in the directory where you created the
            object library i386.lib before, then execute the
            following 3 command lines to generate the object file
            i386-fpu.obj and add it to the existing object library
            i386.lib:
SET ML=/c /Gy /safeseh /W3 /X
ML.EXE i386-fpu.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-fpu.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
 Note: if the
            /Gy option
            to package every function in its own, separately linkable
            COMDAT
            section is not available with the version of the macro assembler
            ML.EXE you use,
            split the i386 assembler source into multiple pieces,
            with one function per source file.
        
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 14.16.27023.1
Copyright (C) Microsoft Corporation.  All rights reserved.
 Assembling: i386-fpu.asm
Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation.  All rights reserved.
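For reference, the _CI* routines in i386-fpu.asm rest on a handful of identities, summarised here as a sketch read off the comments of the source rather than as a formal derivation. The building blocks are FPATAN, which computes atan2(st(1), st(0)), FYL2X, which computes st(1) · log2(st(0)), and F2XM1, which computes 2**st(0) - 1 for |st(0)| ≤ 1; this is why the exponent t is first split into the rounded integer n (FRNDINT) and a fraction of magnitude at most 0.5, with FSCALE applying the factor 2**n afterwards.

```latex
\begin{aligned}
\arccos x &= \operatorname{atan2}\bigl(\sqrt{1 - x^2},\; x\bigr) \\
\arcsin x &= \operatorname{atan2}\bigl(x,\; \sqrt{1 - x^2}\bigr) \\
\arctan x &= \operatorname{atan2}(x,\; 1) \\
\ln x &= \ln 2 \cdot \log_2 x, \qquad
\log_{10} x = \log_{10} 2 \cdot \log_2 x \\
e^x &= 2^{n} \cdot 2^{\,t - n}, \qquad
t = x \log_2 e, \quad n = \operatorname{round}(t), \quad |t - n| \le \tfrac{1}{2} \\
x^y &= 2^{n} \cdot 2^{\,t - n}, \qquad
t = y \log_2 x, \quad n = \operatorname{round}(t) \\
\cosh x &= \tfrac{1}{2}\bigl(e^{x} + e^{-x}\bigr), \qquad
\sinh x = \tfrac{1}{2}\bigl(e^{x} - e^{-x}\bigr) \\
\tanh x &= \frac{e^{2x} - 1}{e^{2x} + 1} = 1 - \frac{2}{e^{2x} + 1}
\end{aligned}
```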
memchr() Standard Function for i386 Platform
            The memchr()
            function is neither a compiler helper nor an
            intrinsic
            function; it is included here for entertainment due to its
DIR "%source%\intel\mem*.asm"
TYPE "%source%\intel\memchr.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             4,097 memccpy.asm
02/18/2011  03:40 PM             5,003 memchr.asm
02/18/2011  03:40 PM            22,486 memcpy.asm
02/18/2011  03:40 PM               475 memmove.asm
02/18/2011  03:40 PM             4,426 memset.asm
               5 File(s)         36,307 bytes
               0 Dir(s)    9,876,543,210 bytes free
        page    ,132
        title   memchr - search memory for a given character
;***
;memchr.asm - search block of memory for a given character
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines memchr() - search memory until a character is
;       found or a limit is reached.
;
;*******************************************************************************
        .xlist
        include cruntime.inc
        .list
page
;***
;char *memchr(buf, chr, cnt) - search memory for given character.
;
;Purpose:
;       Searched at buf for the given character, stopping when chr is
;       first found or cnt bytes have been searched through.
;
;       Algorithm:
;       char *
;       memchr (buf, chr, cnt)
;               char *buf;
;               int chr;
;               unsigned cnt;
;       {
;               while (cnt && *buf++ != c)
;                       cnt--;
;               return(cnt ? --buf : NULL);
;       }
;
;Entry:
;       char *buf - memory buffer to be searched
;       char chr - character to search for
;       unsigned cnt - max number of bytes to search
;
;Exit:
;       returns pointer to first occurence of chr in buf
;       returns NULL if chr not found in the first cnt bytes
;
;Uses:
;
;Exceptions:
;
;*******************************************************************************
        CODESEG
        public  memchr
memchr  proc \
        buf:ptr byte, \
        chr:byte, \
        cnt:dword
        OPTION PROLOGUE:NONE, EPILOGUE:NONE
        .FPO    ( 0, 1, 0, 0, 0, 0 )
        mov     eax,[esp+0ch]   ; eax = count
        push    ebx             ; Preserve ebx
        test    eax,eax         ; check if count=0
        jz      short retnull   ; if count=0, leave
        mov     edx,[esp+8]     ; edx = buffer
-       xor     ebx,ebx
-       mov     bl,[esp+0ch]    ; bl = search char
+       movzx   ebx,byte ptr [esp+12]
-       test    edx,3           ; test if string is aligned on 32 bits
+       test    dl,3
        jz      short main_loop_start
str_misaligned:                 ; simple byte loop until string is aligned
        mov     cl,byte ptr [edx]
-       add     edx,1
-       xor     cl,bl
+       inc     edx
+       cmp     cl,bl
        je      short found
-       sub     eax,1           ; counter--
+       dec     eax
        jz      short retnull
-       test    edx,3           ; already aligned ?
+       test    dl,3
        jne     short str_misaligned
main_loop_start:
        sub     eax,4
        jb      short tail_less_then_4
; set all 4 bytes of ebx to [value]
        push    edi             ; Preserve edi
-       mov     edi,ebx         ; edi=0/0/0/char
-       shl     ebx,8           ; ebx=0/0/char/0
-       add     ebx,edi         ; ebx=0/0/char/char
-       mov     edi,ebx         ; edi=0/0/char/char
-       shl     ebx,10h         ; ebx=char/char/0/0
-       add     ebx,edi         ; ebx = all 4 bytes = [search char]
+       imul    ebx,01010101h
        jmp     short main_loop_entry   ; ecx >=0
return_from_main:
        pop     edi
tail_less_then_4:
        add     eax,4
        jz      retnull
tail_loop:                      ; 0 < eax < 4
        mov     cl,byte ptr [edx]
-       add     edx,1
-       xor     cl,bl
+       inc     edx
+       cmp     cl,bl
        je      short found
-       sub     eax,1
+       dec     eax
        jnz     short tail_loop
retnull:
        pop     ebx
        ret                     ; _cdecl return
main_loop:
        sub     eax,4
        jb      short return_from_main
main_loop_entry:
        mov     ecx,dword ptr [edx]     ; read 4 bytes
        xor     ecx,ebx         ; ebx is byte\byte\byte\byte
-       mov     edi,7efefeffh
-       add     edi,ecx
-       xor     ecx,-1
-       xor     ecx,edi
        add     edx,4
+       lea     edi,[ecx-01010101h]
+       not     ecx
+       and     ecx,80808080h
+       and     ecx,edi
-       and     ecx,81010100h
        je      short main_loop
; found zero byte in the loop?
char_is_found:
+       bsf     ecx,ecx
+       shr     ecx,3
+       lea     eax,[edx+ecx-4]
+       pop     edi
+       pop     ebx
+       ret
-       mov     ecx,[edx - 4]
-       xor     cl,bl           ; is it byte 0
-       je      short byte_0
-       xor     ch,bl           ; is it byte 1
-       je      short byte_1
-       shr     ecx,10h         ; is it byte 2
-       xor     cl,bl
-       je      short byte_2
-       xor     ch,bl           ; is it byte 3
-       je      short byte_3
-       jmp     short main_loop ; taken if bits 24-30 are clear and bit
-                               ; 31 is set
-byte_3:
-       pop     edi             ; restore edi
found:
        lea     eax,[edx - 1]
        pop     ebx             ; restore ebx
        ret                     ; _cdecl return
-byte_2:
-       lea     eax,[edx - 2]
-       pop     edi
-       pop     ebx
-       ret                     ; _cdecl return
-byte_1:
-       lea     eax,[edx - 3]
-       pop     edi
-       pop     ebx
-       ret                     ; _cdecl return
-byte_0:
-       lea     eax,[edx - 4]
-       pop     edi             ; restore edi
-       pop     ebx             ; restore ebx
-       ret                     ; _cdecl return
memchr  endp
        end
            With 76 instructions in 173 bytes, this routine is yet another
            true gem!
         Oops¹: the deleted
            XOR instruction followed
            by the deleted MOV instruction
            should be replaced with the inserted
            MOVZX instruction.
        
 Oops²: the deleted
            TEST instructions with
            immediate value 3 should be replaced with the inserted
            shorter ones, saving 6 bytes.
        
 OUCH¹: the deleted
            ADD and SUB
            instructions, which increment or decrement by 1, should be
            replaced with the inserted shorter
            INC or
            DEC instructions, saving
            4 bytes!
        
 OUCH²: instead of the 6 deleted
            XOR instructions which
            perform superfluous partial register write operations the
            inserted CMP
            instructions should be used!
        
 OUCH³: instead of the 6 deleted
            instructions which copy the byte from register BL into
            the upper 3 bytes of register EBX the
            single inserted
            IMUL instruction should be
            used, saving 8 bytes!
        
 OUCH⁴: instead of the deleted
            XOR instruction with
            immediate operand -1 the inserted shorter
            NOT instruction
            should be used, saving 1 byte!
        
 OUCH⁵: when the 5 deleted
            instructions after label main_loop_entry: are replaced
            with the 4 inserted instructions, the 24 (in words:
            twenty-four) deleted instructions after
            label char_is_found: can be replaced with the 6 faster
            and shorter instructions inserted there, saving 45 (in
            words: forty-five) bytes!
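The replacement described in OUCH³ can be checked in C. The following is a minimal sketch, not part of the original sources; the function names broadcast_shift_add() and broadcast_multiply() are hypothetical and chosen for illustration only. Both functions copy a byte into all 4 bytes of a 32-bit word, the first in the manner of the 6 deleted instructions, the second in the manner of the single inserted IMUL instruction.

```c
#include <stdint.h>

/* Shift-and-add sequence, modelled on the 6 deleted instructions */
static uint32_t broadcast_shift_add(uint8_t c)
{
    uint32_t x = c;     /* x = 0/0/0/c */
    x = (x << 8) + x;   /* x = 0/0/c/c */
    x = (x << 16) + x;  /* x = c/c/c/c */
    return x;
}

/* Single multiplication, modelled on the inserted IMUL instruction */
static uint32_t broadcast_multiply(uint8_t c)
{
    return c * UINT32_C(0x01010101);
}
```

Both functions return the same value for every byte; for the character '$' (0x24) this is 0x24242424.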
        
Note: Alan Mycroft posted the better, faster and shorter method on April 8, 1987 to the USENET news group comp.lang.c:
You might be interested to know that such detection of null bytes in words
can be done in 3 or 4 instructions on almost any hardware (nay even in C).
(Code that follows relies on x being a 32 bit unsigned (or 2's complement
int with overflow ignored)...)
#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
Then if e is an expression without side effects (e.g. variable)
has_nullbyte_(e)
is nonzero iff the value of e has a null byte.
(One can view this as explicit programming of the Manchester carry chain
present in many processors which is hidden from the instruction level).
Note: see Bit Twiddling Hacks – Determine if a word has a zero byte for a comparison of both methods and more details.
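Applied to a character search, the test quoted above can be sketched in C as follows. This is an illustration under stated assumptions, not code from the original sources: the helper names has_zero_byte(), has_byte() and first_byte_index() are hypothetical, and byte index 0 denotes the least significant byte, which on i386 is the byte at the lowest address.

```c
#include <stdint.h>

/* Mycroft's test: nonzero iff at least one byte of x is zero */
static uint32_t has_zero_byte(uint32_t x)
{
    return (x - UINT32_C(0x01010101)) & ~x & UINT32_C(0x80808080);
}

/* Nonzero iff at least one byte of word equals c: the exclusive-or
   with the multiplied character turns matching bytes into zero bytes */
static uint32_t has_byte(uint32_t word, uint8_t c)
{
    return has_zero_byte(word ^ (c * UINT32_C(0x01010101)));
}

/* Index of the lowest byte of word equal to c, or -1 if there is none */
static int first_byte_index(uint32_t word, uint8_t c)
{
    uint32_t flags = has_byte(word, c);
    int index = 0;

    if (flags == 0)
        return -1;
    while ((flags & 0x80) == 0) {   /* skip bytes without a flag */
        flags >>= 8;
        index++;
    }
    return index;
}
```

The lowest flag always marks a true zero byte, so a forward search can rely on it. Flags above it can be spurious: has_zero_byte(0x00000100) yields 0x80808080 although byte 1 is 0x01, because the borrow from the zero byte below propagates into it.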
Note: with the modifications shown in the source, this routine has 51 instructions in 118 bytes, i.e. two thirds of the original’s instructions and bytes.
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model  flat, C
	.code
memchr	proc	public		; void *memchr(void const *buffer, int character, size_t count)
	mov	eax, [esp+12]	; eax = count
	test	eax, eax
	jz	short return	; count = 0?
	movzx	ecx, byte ptr [esp+8]
	mov	edx, [esp+4]	; edx = address of buffer
head:
	test	dl, 3
	jz	short aligned	; address % 4 = 0?
unaligned:
	cmp	cl, [edx]
	je	short break
	inc	edx
	dec	eax
	jnz	short head	; count > 0?
	ret
aligned:
	sub	eax, 4
	jb	short tail	; count < 4?
	push	edi
	push	esi
	imul	ecx, 01010101h	; ecx = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
next:
	mov	edi, [edx]	; edi = next 4 aligned bytes
	xor	edi, ecx
	lea	esi, [edi-01010101h]
	not	edi
	and	edi, 80808080h
	and	edi, esi
	jnz	short match
	add	edx, 4
	sub	eax, 4
	jae	short next	; count >= 4?
	pop	esi
	pop	edi
tail:
	add	eax, 4
	jz	short return	; count = 0?
@@:
	cmp	cl, [edx]
	je	short break
	inc	edx
	dec	eax
	jnz	short @b	; count > 0?
return:
	ret
break:
	mov	eax, edx	; eax = address of character
	ret
match:
	bsf	eax, edi	; eax = offset of character * 8 + 7
				;     = {7, 15, 23, 31}
	shr	eax, 3		; eax = offset of character
				;     = {0, 1, 2, 3}
	add	eax, edx	; eax = address of character
	pop	esi
	pop     edi
	ret
memchr	endp
	end
Save the i386 assembler source presented above as i386-memchr.asm and the
            ANSI C
            source presented below as i386-memchr.c, then execute
            the 6 command lines following the
            ANSI C
            source to assemble i386-memchr.asm, compile
            i386-memchr.c, link the generated object files
            i386-memchr.obj and i386-memchr.tmp, and
            finally execute the image file i386-memchr.exe to
            demonstrate the correct operation:
        // Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
__declspec(safebuffers)
BOOL	CDECL	PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
	CHAR	szFormat[1025];
	DWORD	dwFormat;
	DWORD	dwOutput;
	va_list	vaInput;
	va_start(vaInput, lpFormat);
	dwFormat = wvsprintf(szFormat, lpFormat, vaInput);
	va_end(vaInput);
	if ((dwFormat == 0UL)
	 || !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
		return FALSE;
	return dwOutput == dwFormat;
}
const	CHAR	szString[] = "^^9876543210$$";
const	LPCSTR	szFormat[] = {"0x%p: memchr(\"%hs\", \'$\', %lu) = NULL\r\n",
		              "0x%p: memchr(\"%hs\", \'$\', %lu) = 0x%p = \"%hs\"\r\n"};
__declspec(noreturn)
VOID	CDECL	mainCRTStartup(VOID)
{
	LPCSTR	lp;
	LPCSTR	lpString = szString + sizeof(szString);
	DWORD	dwError = ERROR_SUCCESS;
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
	if (hOutput == INVALID_HANDLE_VALUE)
		dwError = GetLastError();
	else
		while (--lpString >= szString)
		{
			lp = memchr(lpString, '$', szString + sizeof(szString) - lpString);
			PrintFormat(hOutput,
			            szFormat[lp != NULL],
			            lpString, lpString, szString + sizeof(szString) - lpString, lp, lp);
		}
	ExitProcess(dwError);
}
SET ML=/c /safeseh /W3 /X
ML.EXE i386-memchr.asm
SET CL=/GAFy /Oy /W4 /Zl
SET LINK=/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
CL.EXE /Foi386-memchr.tmp i386-memchr.obj i386-memchr.c kernel32.lib user32.lib
.\i386-memchr.exe
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.
 Assembling: i386-memchr.asm
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
i386-memchr.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
/out:i386-memchr.exe
i386-memchr.obj
i386-memchr.tmp
kernel32.lib
user32.lib
0x002F2082: memchr("", '$', 1) = NULL
0x002F2081: memchr("$", '$', 2) = 0x002F2081 = "$"
0x002F2080: memchr("$$", '$', 3) = 0x002F2080 = "$$"
0x002F207F: memchr("0$$", '$', 4) = 0x002F2080 = "$$"
0x002F207E: memchr("10$$", '$', 5) = 0x002F2080 = "$$"
0x002F207D: memchr("210$$", '$', 6) = 0x002F2080 = "$$"
0x002F207C: memchr("3210$$", '$', 7) = 0x002F2080 = "$$"
0x002F207B: memchr("43210$$", '$', 8) = 0x002F2080 = "$$"
0x002F207A: memchr("543210$$", '$', 9) = 0x002F2080 = "$$"
0x002F2079: memchr("6543210$$", '$', 10) = 0x002F2080 = "$$"
0x002F2078: memchr("76543210$$", '$', 11) = 0x002F2080 = "$$"
0x002F2077: memchr("876543210$$", '$', 12) = 0x002F2080 = "$$"
0x002F2076: memchr("9876543210$$", '$', 13) = 0x002F2080 = "$$"
0x002F2075: memchr("^9876543210$$", '$', 14) = 0x002F2080 = "$$"
0x002F2074: memchr("^^9876543210$$", '$', 15) = 0x002F2080 = "$$"
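A search that processes only aligned words does not need separate byte loops for the unaligned head and tail: bytes in front of the buffer are forced to a non-matching value, and a match behind the buffer is rejected afterwards. The following C sketch illustrates this idea under stated assumptions; it is not a translation of any listing in this document. The names load32() and find_byte_masked() are hypothetical, the block is addressed by byte offsets instead of pointers, and the little-endian byte order of the i386 platform is emulated by load32() so that the sketch stays portable.

```c
#include <stddef.h>
#include <stdint.h>

/* Little-endian load of 4 bytes, written out to stay portable */
static uint32_t load32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Searches the bytes at offsets [begin, end) of block for c and returns
   the offset of the first match, or (size_t)-1 if there is none.
   ASSUMPTION: block is readable up to the next multiple of 4 at or
   after end, like an aligned word that holds the last buffer bytes. */
static size_t find_byte_masked(const unsigned char *block,
                               size_t begin, size_t end, unsigned char c)
{
    const uint32_t pattern = c * UINT32_C(0x01010101);
    size_t word = begin & ~(size_t)3;   /* aligned offset at or before begin */
    /* 0xFF for the bytes in front of begin, 0 elsewhere */
    uint32_t force = (uint32_t)~(UINT32_MAX << (8 * (begin & 3)));

    for (; word < end; word += 4, force = 0) {
        /* matching bytes become 0, bytes in front of begin become 0xFF */
        uint32_t x = (load32(block + word) ^ pattern) | force;
        uint32_t hit = (x - UINT32_C(0x01010101)) & ~x & UINT32_C(0x80808080);

        if (hit != 0) {
            size_t index = 0;
            while ((hit & 0x80) == 0) { /* the lowest flag is the first match */
                hit >>= 8;
                index++;
            }
            /* reject a match behind the end of the buffer */
            return word + index < end ? word + index : (size_t)-1;
        }
    }
    return (size_t)-1;
}
```

Rejecting a match at or behind end is sufficient because the lowest flag marks the first matching byte of the word: if that byte lies behind the buffer, no byte inside the buffer matches.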
Smart Implementation in i386 Assembler
The following smart implementation without loops for head and tail needs only 40 instructions in 102 bytes, i.e. about half the instructions of Microsoft’s poor implementation; the corresponding smart implementation of the missing memrchr() function also has 40 instructions in 101 bytes:
        ; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model  flat, C
	.code
memchr	proc	public		; void *memchr(void const *buffer, int character, size_t count)
	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short return	; count = 0?
	cdq			; edx = 0
	mov	dl, [esp+8]	; edx = character
	imul	edx, 01010101h	; edx = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	[esp+12], ecx	; count = address after buffer
	push	ebx
	mov	ebx, ecx	; ebx = address of buffer
	and	ecx, 3		; ecx = address of buffer % 4
				;     = 4 - number of unaligned bytes
	jz	short aligned	; address of buffer % 4 = 0?
unaligned:
	sub	ebx, ecx	; ebx = aligned address before buffer
	shl	ecx, 3		; ecx = (4 - number of unaligned bytes) * 8
				;     = 32 - number of unaligned bits
	dec	eax		; eax = ~0
	shl	eax, cl		; eax = ~0 for unaligned bytes, 0 elsewhere
	not	eax		; eax = 0 for unaligned bytes, ~0 elsewhere
	mov	ecx, [ebx]	; ecx = unaligned bytes
	xor	ecx, edx	; ecx = unaligned bytes ^ multiplied character
	or	eax, ecx	; eax = '\0' for matching bytes
	jmp	short mycroft
next:
	add	ebx, 4		; ebx = address of next 4 aligned bytes
	cmp	ebx, [esp+16]
	jae	short null	; address after buffer?
aligned:
	mov	eax, [ebx]	; eax = next 4 aligned bytes
	xor	eax, edx	; eax = next 4 aligned bytes ^ multiplied character
				;     = '\0' for matching bytes
mycroft:
	mov	ecx, eax
	sub	eax, 01010101h
	not	ecx
	and	eax, 80808080h
	and	eax, ecx	; eax = '\200' for matching bytes, '\0' elsewhere
	jz	short next	; no match in any byte?
match:
	bsf	eax, eax	; eax = offset of matching byte * 8 + 7
				;     = {7, 15, 23, 31}
	shr	eax, 3		; eax = offset of matching byte
				;     = {0, 1, 2, 3}
	add	eax, ebx	; eax = address of matching byte
	cmp	eax, [esp+16]	; CF = (address inside buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	pop	ebx
return:
	ret
memchr	endp
memrchr	proc	public		; void *memrchr(void const *buffer, int character, size_t count)
	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short return	; count = 0?
	cdq			; edx = 0
	mov	dl, [esp+8]	; edx = character
	imul	edx, 01010101h	; edx = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	ecx, [esp+12]	; ecx = address after buffer
	push	ebx
	mov	ebx, ecx	; ebx = address after buffer
	and	ecx, 3		; ecx = address after buffer % 4
				;     = number of tail bytes
	jz	short previous	; address after buffer % 4 = 0?
unaligned:
	sub	ebx, ecx	; ebx = aligned address of tail bytes
	shl	ecx, 3		; ecx = number of tail bytes * 8
				;     = number of tail bits
	dec	eax		; eax = ~0
	shl	eax, cl		; eax = 0 for tail bytes, ~0 elsewhere
	mov	ecx, [ebx]	; ecx = tail bytes
	xor	ecx, edx	; ecx = tail bytes ^ multiplied character
	or	eax, ecx	; eax = '\0' for matching tail bytes
	jmp	short mycroft
previous:
	cmp	ebx, [esp+8]
	jbe	short null	; no bytes of buffer before these 4 aligned bytes?
	sub	ebx, 4		; ebx = address of previous 4 aligned bytes
aligned:
	mov	eax, [ebx]	; eax = previous 4 aligned bytes
	xor	eax, edx	; eax = previous 4 aligned bytes ^ multiplied character
				;     = '\0' for matching bytes
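mycroft:
				; NOTE: the shorter test (x - 01010101h) & ~x & 80808080h
				;       also flags a byte '\001' following a '\0' byte,
				;       which bsr would select; the exact test is used instead
	mov	ecx, eax
	and	eax, 7F7F7F7Fh
	add	eax, 7F7F7F7Fh	; eax = bit 7 set in bytes with any of bits 0 to 6 set
	or	eax, ecx	; eax = bit 7 set in non-matching bytes
	not	eax
	and	eax, 80808080h	; eax = '\200' for matching bytes, '\0' elsewhere
	jz	short previous	; no match in any byte?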
match:
	bsr	eax, eax	; eax = offset of matching byte * 8 + 7
				;     = {31, 23, 15, 7}
	shr	eax, 3		; eax = offset of matching byte
				;     = {3, 2, 1, 0}
	add	eax, ebx	; eax = address of matching byte
	cmp	eax, [esp+8]	; CF = (address of matching byte < address of buffer)
	cmc			; CF = (address of matching byte >= address of buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	pop	ebx
return:
	ret
memrchr	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.xmm
	.model	flat, C
	.code
memchr	proc	public		; void *memchr(void const *buffer, int character, size_t count)
	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?
if 0
	movd	xmm0, dword ptr [esp+8]
	punpcklbw xmm0, xmm0
	punpcklwd xmm0, xmm0
else
	mov	al, [esp+8]	; eax = character
	imul	eax, 01010101h	; eax = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
	movd	xmm0, eax
endif
	pshufd	xmm0, xmm0, 0	; xmm0 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	[esp+12], ecx	; count = address after buffer
	mov	edx, ecx
	and	ecx, 15		; ecx = address of buffer % 16
				;     = 16 - number of unaligned bytes
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before buffer
	movdqa	xmm1, [edx]	; xmm1 = chunk of 16 bytes
	pcmpeqb	xmm1, xmm0	; xmm1 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm1	; eax = bitmask for matching bytes in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned bytes
	cmp	edx, [esp+12]
	jae	short null	; address after buffer?
aligned:
	movdqa	xmm1, [edx]	; xmm1 = chunk of 16 bytes
	pcmpeqb	xmm1, xmm0	; xmm1 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm1	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short next	; no matching byte in chunk?
match:
	bsf	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+12]	; CF = (address inside buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret
memchr	endp
memrchr	proc	public		; void *memrchr(void const *buffer, int character, size_t count)
	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?
if 0
	movd	xmm0, dword ptr [esp+8]
	punpcklbw xmm0, xmm0
	punpcklwd xmm0, xmm0
else
	mov	al, [esp+8]	; eax = character
	imul	eax, 01010101h	; eax = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
	movd	xmm0, eax
endif
	pshufd	xmm0, xmm0, 0	; xmm0 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	ecx, [esp+12]	; ecx = address after buffer
	mov	edx, ecx
	and	ecx, 15		; ecx = address after buffer % 16
				;     = number of tail bytes
	jz	short previous	; address after buffer % 16 = 0?
unaligned:
	sub	edx, ecx	; edx = aligned address of tail bytes
	movdqa	xmm1, [edx]	; xmm1 = chunk of 16 bytes
	pcmpeqb	xmm1, xmm0	; xmm1 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm1	; eax = bitmask for matching bytes in chunk
	neg	ecx		; ecx = -number of tail bytes
	shl	eax, cl
	shr	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
previous:
	cmp	edx, [esp+4]
	jbe	short null	; no bytes of buffer before this chunk?
	sub	edx, 16		; edx = address of previous chunk of aligned bytes
aligned:
	movdqa	xmm1, [edx]	; xmm1 = chunk of 16 bytes
	pcmpeqb	xmm1, xmm0	; xmm1 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm1	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short previous	; no matching byte in chunk?
match:
	bsr	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+4]	; CF = (address of matching byte < address of buffer)
	cmc			; CF = (address of matching byte >= address of buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret
memrchr	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.xmm
	.model	flat, C
	.code
memchr	proc	public		; void *memchr(void const *buffer, int character, size_t count)
	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?
	pxor	xmm0, xmm0	; xmm0 = 0
	movd	xmm1, dword ptr [esp+8]
				; xmm1 = character
	pshufb	xmm1, xmm0	; xmm1 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	[esp+12], ecx	; count = address after buffer
	mov	edx, ecx
	and	ecx, 15		; ecx = address of buffer % 16
				;     = 16 - number of unaligned bytes
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before buffer
	movdqa	xmm0, [edx]	; xmm0 = chunk of 16 bytes
	pcmpeqb	xmm0, xmm1	; xmm0 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned bytes
	cmp	edx, [esp+12]
	jae	short null	; address after buffer?
aligned:
	movdqa	xmm0, [edx]	; xmm0 = chunk of 16 bytes
	pcmpeqb	xmm0, xmm1	; xmm0 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short next	; no matching byte in chunk?
match:
	bsf	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+12]	; CF = (address inside buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret
memchr	endp
memrchr	proc	public		; void *memrchr(void const *buffer, int character, size_t count)
	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?
	pxor	xmm0, xmm0	; xmm0 = 0
	movd	xmm1, dword ptr [esp+8]
				; xmm1 = character
	pshufb	xmm1, xmm0	; xmm1 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	ecx, [esp+12]	; ecx = address after buffer
	mov	edx, ecx
	and	ecx, 15		; ecx = address after buffer % 16
				;     = number of tail bytes
	jz	short previous	; address after buffer % 16 = 0?
unaligned:
	sub	edx, ecx	; edx = aligned address of tail bytes
	movdqa	xmm0, [edx]	; xmm0 = chunk of 16 bytes
	pcmpeqb	xmm0, xmm1	; xmm0 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	neg	ecx		; ecx = -number of tail bytes
	shl	eax, cl
	shr	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
previous:
	cmp	edx, [esp+4]
	jbe	short null	; no bytes of buffer before this chunk?
	sub	edx, 16		; edx = address of previous chunk of aligned bytes
aligned:
	movdqa	xmm0, [edx]	; xmm0 = chunk of 16 bytes
	pcmpeqb	xmm0, xmm1	; xmm0 = '\377' for each matching byte in chunk
	pmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short previous	; no matching byte in chunk?
match:
	bsr	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+4]	; CF = (address of matching byte < address of buffer)
	cmc			; CF = (address of matching byte >= address of buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret
memrchr	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.xmm
	.model	flat, C
	.code
memchr	proc	public		; void *memchr(void const *buffer, int character, size_t count)
	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?
	vpxor	xmm0, xmm0, xmm0; xmm0 = 0
	vmovd	xmm1, dword ptr [esp+8]
				; xmm1 = character
	vpshufb	xmm1, xmm1, xmm0; xmm1 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	[esp+12], ecx	; count = address after buffer
	mov	edx, ecx
	and	ecx, 15		; ecx = address of buffer % 16
				;     = 16 - number of unaligned bytes
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before buffer
	vpcmpeqb xmm0, xmm1, [edx]
				; xmm0 = '\377' for each matching byte in chunk
	vpmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned bytes
	cmp	edx, [esp+12]
	jae	short null	; address after buffer?
aligned:
	vpcmpeqb xmm0, xmm1, [edx]
				; xmm0 = '\377' for each matching byte in chunk
	vpmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short next	; no matching byte in chunk?
match:
	bsf	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+12]	; CF = (address inside buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret
memchr	endp
memrchr	proc	public		; void *memrchr(void const *buffer, int character, size_t count)
	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?
	vpxor	xmm0, xmm0, xmm0; xmm0 = 0
	vmovd	xmm1, dword ptr [esp+8]
				; xmm1 = character
	vpshufb	xmm1, xmm1, xmm0; xmm1 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	ecx, [esp+12]	; ecx = address after buffer
	mov	edx, ecx
	and	ecx, 15		; ecx = address after buffer % 16
				;     = number of tail bytes
	jz	short previous	; address after buffer % 16 = 0?
unaligned:
	sub	edx, ecx	; edx = aligned address of tail bytes
	vpcmpeqb xmm0, xmm1, [edx]
				; xmm0 = '\377' for each matching byte in chunk
	vpmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	neg	ecx		; ecx = -number of tail bytes
	shl	eax, cl
	shr	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
previous:
	cmp	edx, [esp+4]
	jbe	short null	; no bytes of buffer before this chunk?
	sub	edx, 16		; edx = address of previous chunk of aligned bytes
aligned:
	vpcmpeqb xmm0, xmm1, [edx]
				; xmm0 = '\377' for each matching byte in chunk
	vpmovmskb eax, xmm0	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short previous	; no matching byte in chunk?
match:
	bsr	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+4]	; CF = (address of matching byte < address of buffer)
	cmc			; CF = (address of matching byte >= address of buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret
memrchr	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.ymm
	.model	flat, C
	.code
memchr	proc	public		; void *memchr(void const *buffer, int character, size_t count)
	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?
	vpbroadcastb ymm0, byte ptr [esp+8]
				; ymm0 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	[esp+12], ecx	; count = address after buffer
	mov	edx, ecx
	and	ecx, 31		; ecx = address of buffer % 32
				;     = 32 - number of unaligned bytes
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before buffer
	vpcmpeqb ymm1, ymm0, [edx]
				; ymm1 = '\377' for each matching byte in chunk
	vpmovmskb eax, ymm1	; eax = bitmask for matching bytes in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
next:
	add	edx, 32		; edx = address of next chunk of aligned bytes
	cmp	edx, [esp+12]
	jae	short null	; address after buffer?
aligned:
	vpcmpeqb ymm1, ymm0, [edx]
				; ymm1 = '\377' for each matching byte in chunk
	vpmovmskb eax, ymm1	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short next	; no matching byte in chunk?
match:
	bsf	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+12]	; CF = (address inside buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret
memchr	endp
memrchr	proc	public		; void *memrchr(void const *buffer, int character, size_t count)
	xor	eax, eax	; eax = 0
	cmp	eax, [esp+12]
	je	short null	; count = 0?
	vpbroadcastb ymm0, byte ptr [esp+8]
				; ymm0 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of buffer
	add	ecx, [esp+12]	; ecx = address after buffer
	mov	edx, ecx
	and	ecx, 31		; ecx = address after buffer % 32
				;     = number of tail bytes
	jz	short previous	; address after buffer % 32 = 0?
unaligned:
	sub	edx, ecx	; edx = aligned address of tail bytes
	vpcmpeqb ymm1, ymm0, [edx]
				; ymm1 = '\377' for each matching byte in chunk
	vpmovmskb eax, ymm1	; eax = bitmask for matching bytes in chunk
	neg	ecx		; ecx = -number of tail bytes
	shl	eax, cl
	shr	eax, cl		; eax = bitmask for matching bytes in buffer
	jnz	short match
previous:
	cmp	edx, [esp+4]
	jbe	short null	; no bytes of buffer before this chunk?
	sub	edx, 32		; edx = address of previous chunk of aligned bytes
aligned:
	vpcmpeqb ymm1, ymm0, [edx]
				; ymm1 = '\377' for each matching byte in chunk
	vpmovmskb eax, ymm1	; eax = bitmask for matching bytes in chunk
	test	eax, eax
	jz	short previous	; no matching byte in chunk?
match:
	bsr	eax, eax	; eax = offset of matching byte in chunk
	add	eax, edx	; eax = address of matching byte
	cmp	eax, [esp+4]	; CF = (address of matching byte < address of buffer)
	cmc			; CF = (address of matching byte >= address of buffer)
	sbb	ecx, ecx	; ecx = (address inside buffer) ? -1 : 0
	and	eax, ecx	; eax = address of character
null:
	ret
memrchr	endp
	end
Smart Implementation in AMD64 Assembler
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.code
memchr	proc	public		; void *memchr(void const *buffer, int character, size_t count)
	xor	eax, eax	; rax = 0
	test	r8, r8
	jz	short null	; count = 0?
	mov	r10, 0101010101010101h
if 0
	mov	r11, 8080808080808080h
elseif 0
	imul	r11, r10, 128	; r11 = 0x8080808080808080
else
	mov	r11, r10
	ror	r11, 1		; r11 = 0x8080808080808080
endif
	movzx	edx, dl		; rdx = character & 255
	imul	rdx, r10	; rdx = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
				;     | character << 32
				;     | character << 40
				;     | character << 48
				;     | character << 56
	add	r8, rcx		; r8 = address after buffer
	mov	r9, rcx		; r9 = address of buffer
	and	ecx, 7		; rcx = address of buffer % 8
				;     = 8 - number of unaligned bytes
	jz	short aligned	; address of buffer % 8 = 0?
unaligned:
	sub	r9, rcx		; r9 = aligned address before buffer
	shl	ecx, 3		; rcx = (8 - number of unaligned bytes) * 8
				;     = 64 - number of unaligned bits
	dec	rax		; rax = ~0
	shl	rax, cl		; rax = ~0 for unaligned bytes, 0 elsewhere
	not	rax		; rax = 0 for unaligned bytes, ~0 elsewhere
	mov	rcx, [r9]	; rcx = unaligned bytes
	xor	rcx, rdx	; rcx = unaligned bytes ^ multiplied character
	or	rcx, rax	; rcx = '\0' for matching bytes
	jmp	short mycroft
next:
	add	r9, 8		; r9 = address of next 8 aligned bytes
	cmp	r9, r8
	jae	short null	; address after buffer?
aligned:
	mov	rcx, [r9]	; rcx = next 8 aligned bytes
	xor	rcx, rdx	; rcx = next 8 aligned bytes ^ multiplied character
				;     = '\0' for matching bytes
mycroft:
	mov	rax, rcx
	sub	rcx, r10
	not	rax
	and	rcx, r11
	and	rax, rcx	; rax = '\200' for matching bytes, '\0' elsewhere
	jz	short next	; no match in any byte?
match:
	bsf	rax, rax	; rax = offset of matching byte * 8 + 7
				;     = {7, 15, 23, 31, 39, 47, 55, 63}
	shr	eax, 3		; rax = offset of matching byte
				;     = {0, 1, 2, 3, 4, 5, 6, 7}
	add	rax, r9		; rax = address of matching byte
if 0
	cmp	rax, r8		; CF = (address inside buffer)
	sbb	rcx, rcx	; rcx = (address inside buffer) ? -1 : 0
	and	rax, rcx	; rax = address of character
else
	xor	ecx, ecx	; rcx = 0
	cmp	rax, r8		; CF = (address inside buffer)
	cmovnb	rax, rcx	; rax = address of character
endif
null:
	ret
memchr	endp
memrchr	proc	public		; void *memrchr(void const *buffer, int character, size_t count)
	xor	eax, eax	; rax = 0
	test	r8, r8
	jz	short null	; count = 0?
	mov	r10, 0101010101010101h
if 0
	mov	r11, 8080808080808080h
elseif 0
	imul	r11, r10, 128	; r11 = 0x8080808080808080
else
	mov	r11, r10
	ror	r11, 1		; r11 = 0x8080808080808080
endif
	movzx	edx, dl		; rdx = character & 255
	imul	rdx, r10	; rdx = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
				;     | character << 32
				;     | character << 40
				;     | character << 48
				;     | character << 56
	add	r8, rcx		; r8 = address after buffer
	mov	r9, rcx		; r9 = address of buffer
	not	r11		; r11 = 0x7F7F7F7F7F7F7F7F
	mov	rcx, r8
	and	ecx, 7		; rcx = address after buffer % 8
				;     = number of tail bytes
	jz	short previous	; address after buffer % 8 = 0?
unaligned:
	sub	r8, rcx		; r8 = aligned address of tail bytes
	shl	ecx, 3		; rcx = number of tail bytes * 8
				;     = number of tail bits
	dec	rax		; rax = ~0
	shl	rax, cl		; rax = '\0' for tail bytes, ~0 elsewhere
	mov	rcx, [r8]	; rcx = tail bytes
	xor	rcx, rdx	; rcx = tail bytes ^ multiplied character
	or	rcx, rax	; rcx = '\0' for matching tail bytes
	jmp	short mycroft
previous:
	cmp	r8, r9
	jbe	short null	; no byte of buffer before current address?
	sub	r8, 8		; r8 = address of previous 8 aligned bytes
aligned:
	mov	rcx, [r8]	; rcx = previous 8 aligned bytes
	xor	rcx, rdx	; rcx = previous 8 aligned bytes ^ multiplied character
				;     = '\0' for matching bytes
mycroft:
	mov	rax, rcx
	and	rax, r11
	add	rax, r11	; bit 7 of each byte = (bits 0 to 6 of byte <> 0)
	or	rax, rcx	; bit 7 of each byte = (byte <> '\0')
	or	rax, r11
	xor	rax, -1		; rax = '\200' for matching bytes, '\0' elsewhere
	jz	short previous	; no match in any byte?
match:
	bsr	rax, rax	; rax = offset of matching byte * 8 + 7
				;     = {63, 55, 47, 39, 31, 23, 15, 7}
	shr	eax, 3		; rax = offset of matching byte
				;     = {7, 6, 5, 4, 3, 2, 1, 0}
	add	rax, r8		; rax = address of matching byte
if 0
	cmp	rax, r9		; CF = (address of matching byte < address of buffer)
	cmc			; CF = (address of matching byte >= address of buffer)
	sbb	rcx, rcx	; rcx = (address inside buffer) ? -1 : 0
	and	rax, rcx	; rax = address of character
else
	xor	ecx, ecx	; rcx = 0
	cmp	rax, r9		; CF = (address of matching byte < address of buffer)
	cmovb	rax, rcx	; rax = address of character
endif
null:
	ret
memrchr	endp
	end

mem*() Standard Functions

Although memcpy() and memset() are intrinsic functions, the Visual C compiler provides no inline implementation, but generates calls to external routines.

Proper implementations of these plus the memchr(), memcmp(), memmem(), memmove() and memrchr() functions for the i386 and the AMD64 platform follow with build instructions.
        
Note: the memmem() function is like the strstr() function and uses the same algorithm!

Note: neither memmem() nor memrchr() is provided by the Visual C compiler or its runtime libraries!
        
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define NULL	((void *) 0)
#ifndef _WIN64
typedef	unsigned int	size_t;
#endif
#pragma function(memcmp, memcpy, memset)
void	*memccpy(void *destination, void const *source, int character, size_t count)
{
	unsigned char *dst = (unsigned char *) destination;
	unsigned char const *src = (unsigned char const *) source;
	while (count)
	{
		*dst++ = *src;
		if (*src == (unsigned char) character)
			return (void *) dst;
		src++;
		--count;
	}
	return NULL;
}
void	*memchr(void const *destination, int character, size_t count)
{
	unsigned char const *mem = (unsigned char const *) destination;
	while (count)
	{
		if (*mem == (unsigned char) character)
			return (void *) mem;
		mem++;
		--count;
	}
	return NULL;
}
int	memcmp(void const *source, void const *destination, size_t count)
{
	unsigned char const *dst = (unsigned char const *) destination;
	unsigned char const *src = (unsigned char const *) source;
	if (count && source != destination)
		do
			if (*src - *dst)
#if 0
				return *src - *dst;
#else
				return (*src > *dst) - (*src < *dst);
#endif
		while (src++, dst++, --count);
	return 0;
}
void	*memcpy(void *destination, void const *source, size_t count)
{
	unsigned char *dst = (unsigned char *) destination;
	unsigned char const *src = (unsigned char const *) source;
	while (count)
		*dst++ = *src++, --count;
	return destination;
}
void	*memmem(void const *haystack, size_t count, void const *needle, size_t length)
{
	unsigned char const *mem;
	unsigned char const *hay = (unsigned char const *) haystack;
	unsigned char const *pin = (unsigned char const *) needle;
	if (!count || length > count)
		return NULL;
	if (!length)
		return (void *) haystack;
	if (!--length)		// needle is a single character?
		return memchr(haystack, *pin, count);
	count -= length;	// maximum number of characters to scan in haystack
	while (mem = hay, hay = (unsigned char const *) memchr(hay, *pin, count), hay)
	{			// *hay is first character of pin; compare
				//  last character of pin first, then proceed
		if (hay[length] == pin[length]
#if 0
		 && length == 1 || !memcmp(hay + 1, pin + 1, length - 1))
#else
		 && !memcmp(hay, pin, length))
#endif
			return (void *) hay;
				// skip character in haystack,
				//  adjust number of characters left in haystack
		count -= ++hay - mem;
		if (!count)
			break;
	}
	return NULL;
}
void	*memmove(void *destination, void const *source, size_t count)
{
	unsigned char *dst = (unsigned char *) destination;
	unsigned char const *src = (unsigned char const *) source;
	if (dst < src || dst - src >= count)
		while (count)
			*dst++ = *src++, --count;
	else if (dst > src)
	{			// overlapping buffers
		dst += count;
		src += count;
		while (count)
			*--dst = *--src, --count;
	}
	return destination;
}
void	*memrchr(void const *destination, int character, size_t count)
{
	unsigned char const *mem = (unsigned char const *) destination + count;
	while (count)
	{
		if (*--mem == (unsigned char) character)
			return (void *) mem;
		--count;
	}
	return NULL;
}
void	*memset(void *destination, int character, size_t count)
{
	unsigned char *dst = (unsigned char *) destination;
	while (count)
		*dst++ = (unsigned char) character, --count;
	return destination;
}

; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
memccpy	proc	public		; void *memccpy(void *destination, void const *source, int character, size_t count)
	push	esi
	push	edi
	mov	edx, [esp+12]	; edx = address of destination
	mov	esi, [esp+16]	; esi = address of source
	mov	eax, [esp+20]	; eax = character
	mov	ecx, [esp+24]	; ecx = count
	mov	edi, esi	; edi = address of source
	test	esi, esi	; ZF = 0 (required when count is 0)
	repne	scasb		; edi = address past character in source,
				; ZF = ([edi-1] = character)
	setne	al
	movzx	eax, al		; eax = ([edi-1] = character) ? 0 : 1
	dec	eax		; eax = ([edi-1] = character) ? -1 : 0
	sub	edi, esi	; edi = length of source (including character)
				;     = count'
	mov	ecx, edi	; ecx = count'
	mov	edi, edx	; edi = address of destination
				; esi = address of source
	rep	movsb		; edi = address past character in destination
	and	eax, edi	; eax = ([edi-1] = character)
				;     ? address past character in destination : 0
	pop	edi
	pop	esi
	ret
memccpy	endp
memchr	proc	public		; void *memchr(void const *destination, int character, size_t count)
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of destination
	mov	eax, [esp+8]	; eax = character
	mov	ecx, [esp+12]	; ecx = count
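	test	esp, esp	; ZF = 0 (required when count is 0)
	repne	scasb		; ZF = ([edi-1] = character)
	setne	al
	movzx	eax, al		; eax = ([edi-1] = character) ? 0 : 1
	dec	eax		; eax = ([edi-1] = character) ? -1 : 0
	dec	edi		; edi = address of character
	and	eax, edi	; eax = ([edi] = character) ? address of character : 0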
	mov	edi, edx
	ret
memchr	endp
memcmp	proc	public		; int memcmp(void const *source, void const *destination, size_t count)
	mov	eax, [esp+4]	; eax = address of source
	mov	edx, [esp+8]	; edx = address of destination
	cmp	edx, eax
	je	short equal	; address of destination = address of source?
	mov	ecx, [esp+12]	; ecx = count
if 0
	jecxz	short equal	; count = 0?
else
	cmp	ebx, ebx	; CF = 0,
				; ZF = 1 (required when count is 0)
endif
	xchg	esi, eax	; esi = address of source
	xchg	edi, edx	; edi = address of destination
	repe	cmpsb
	mov	edi, edx
	mov	esi, eax
	seta	al
	movzx	eax, al		; eax = (*source > *destination)
	sbb	eax, 0		; eax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
	ret
equal:
	xor	eax, eax
	ret
memcmp	endp
memcpy	proc	public		; void *memcpy(void *destination, void const *source, size_t count)
	mov	eax, [esp+4]	; eax = address of destination
	mov	edx, [esp+8]	; edx = address of source
	cmp	edx, eax
	je	short return	; address of source = address of destination?
	mov	ecx, [esp+12]	; ecx = count
;;	jecxz	short return	; count = 0?
	xchg	esi, edx	; esi = address of source
	xchg	edi, eax	; edi = address of destination
if 1
	rep	movsb
else
	shr	ecx, 1		; ecx = count / 2
	jnc	short @f	; count % 2 = 0?
	movsb
@@:
	shr	ecx, 1		; ecx = count / 4
	jnc	short @f	; count % 4 = 0?
	movsw
@@:
	rep	movsd
endif
	mov	esi, edx
	mov	edi, eax
	mov	eax, [esp+4]	; eax = address of destination
return:
	ret
memcpy	endp
memmem	proc	public		; void *memmem(void const *haystack, size_t count,
				;              void const *needle, size_t length)
	xor	eax, eax	; eax = address of needle in haystack = 0
	mov	ecx, [esp+8]	; ecx = length of haystack
	test	ecx, ecx
	jz	short empty	; length of haystack = 0?
	mov	edx, [esp+16]	; edx = length of needle
	cmp	edx, ecx
	ja	short empty	; length of needle > length of haystack?
	mov	eax, [esp+4]	; eax = address of haystack
	test	edx, edx
	jz	short empty	; length of needle = 0?
	push	ebx
	push	edi
	mov	edi, eax	; edi = address of haystack
	push	esi
	sub	ecx, edx
	inc	ecx		; ecx = length of haystack - length of needle + 1
				;     = number of candidate positions in haystack
search:
	mov	esi, [esp+24]	; esi = address of needle
	mov	al, [esi]	; al = first character of needle
				; edi = current address in haystack
	repne	scasb		; edi = next address in haystack,
				; ecx = number of candidate positions left
	jne	short break	; (first character of) needle not in haystack?
	mov	al, [esi+edx-1]	; al = last character of needle
	cmp	al, [edi+edx-2]
	jne	short continue	; last character of needle not in haystack?
compare:
	mov	eax, edi	; eax = next address in haystack
	mov	ebx, ecx	; ebx = number of candidate positions left
if 0
	dec	edi		; edi = current address in haystack
				;     = address of matching character
				; esi = address of needle
	mov	ecx, edx	; ecx = length of needle
else
				; edi = next address in haystack
	inc	esi		; esi = address of needle + 1
	mov	ecx, edx
	dec	ecx		; ecx = length of needle - 1,
				; ZF = (ecx = 0)
;;	jz	short match	; needle in haystack?
endif
	repe	cmpsb
	je	short match	; needle in haystack?
	mov	edi, eax	; edi = next address in haystack
	mov	ecx, ebx	; ecx = number of candidate positions left
continue:
	test	ecx, ecx
	jnz	short search	; candidate positions left in haystack?
break:
	xor	eax, eax
	pop	esi
	pop	edi
	pop	ebx
empty:
	ret
match:
	dec	eax		; eax = address of needle in haystack
	pop	esi
	pop	edi
	pop	ebx
	ret
memmem	endp
memmove	proc	public		; void *memmove(void *destination, void const *source, size_t count)
	mov	eax, [esp+4]	; eax = address of destination
	mov	edx, [esp+8]	; edx = address of source
	cmp	edx, eax
	je	short return	; address of source = address of destination?
	mov	ecx, [esp+12]	; ecx = count
;;	jecxz	short return	; count = 0?
	xchg	esi, edx	; esi = address of source
	xchg	edi, eax	; edi = address of destination
	ja	short default	; address of source > address of destination?
overlap:
	lea	edi, [edi+ecx-1]
	lea	esi, [esi+ecx-1]
	std
default:
	rep	movsb
	cld
	mov	esi, edx
	mov	edi, eax
	mov	eax, [esp+4]	; eax = address of destination
return:
	ret
memmove	endp
memrchr	proc	public		; void *memrchr(void const *destination, int character, size_t count)
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of destination
	mov	eax, [esp+8]	; eax = character
	mov	ecx, [esp+12]	; ecx = count
	lea	edi, [edi+ecx-1]
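	test	esp, esp	; ZF = 0 (required when count is 0)
	std
	repne	scasb		; ZF = ([edi+1] = character)
	cld
	setne	al
	movzx	eax, al		; eax = ([edi+1] = character) ? 0 : 1
	dec	eax		; eax = ([edi+1] = character) ? -1 : 0
	inc	edi		; edi = address of character
	and	eax, edi	; eax = ([edi] = character) ? address of character : 0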
	mov	edi, edx
	ret
memrchr	endp
memset	proc	public		; void *memset(void *destination, int character, size_t count)
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of destination
	mov	eax, [esp+8]	; eax = character
	mov	ecx, [esp+12]	; ecx = count
;;	jecxz	short @f	; count = 0?
	rep	stosb
@@:
	mov	eax, [esp+4]	; eax = address of destination
	mov	edi, edx
	ret
memset	endp
	end

Save the i386 assembler source presented above as i386-mem.asm in the directory where you created the object library i386.lib before, then execute the following 3 command lines to generate the object file i386-mem.obj and add it to the existing object library i386.lib:

SET ML=/c /Gy /safeseh /W3 /X
ML.EXE i386-mem.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-mem.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
 Note: if the
            /Gy option
            to package every function in its own, separately linkable
            COMDAT
            section is not available with the version of the macro assembler
            ML.EXE you use,
            split the i386 assembler source into multiple pieces,
            with one function per source file.
        
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 14.16.27023.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: i386-mem.asm
Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation. All rights reserved.

; Copyright © 2009-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; Microsoft calling convention for AMD64 platform:
; - first 4 integer or pointer arguments (from left to right) are passed
;   in registers RCX/R1, RDX/R2, R8 and R9;
; - arguments larger than 8 bytes are passed by reference;
; - surplus arguments are pushed on stack in reverse order (from right
;   to left), 8-byte aligned;
; - caller allocates memory for return value larger than 8 bytes and
;   passes pointer to it as (hidden) first argument, thus shifting
;   all other arguments;
; - caller always allocates "home space" for 4 arguments on stack,
;   even when less than 4 arguments are passed, but does not need to push
;   first 4 arguments;
; - callee can spill first 4 arguments from registers to "home space";
; - callee can clobber "home space";
; - stack is 16-byte aligned: callee must decrement RSP by 8+n*16 bytes
;   when it calls other functions (CALL instruction pushes 8 bytes);
; - 64-bit integer or pointer result is returned in register RAX/R0;
; - 32-bit integer result is returned in register EAX;
; - registers RAX/R0, RCX/R1, RDX/R2, R8, R9, R10 and R11 are volatile
;   and can be clobbered;
; - registers RBX/R3, RSP/R4, RBP/R5, RSI/R6, RDI/R7, R12, R13, R14 and
;   R15 must be preserved.
	.code
memccpy	proc	public		; void *memccpy(void *destination, void const *source, int character, size_t count)
	mov	r11, rsi
	mov	r10, rdi
	mov	rsi, rcx	; rsi = address of destination
	mov	rdi, rdx	; rdi = address of source
	mov	rax, r8		; rax = character
	mov	rcx, r9		; rcx = count
	test	rdx, rdx	; ZF = 0 (required when count is 0)
	repne	scasb		; rdi = address past character in source,
				; ZF = ([rdi-1] = character)
	setne	al
	movzx	eax, al		; rax = ([rdi-1] = character) ? 0 : 1
	dec	rax		; rax = ([rdi-1] = character) ? -1 : 0
	sub	rdi, rdx	; rdi = length of source (including character)
				;     = count'
	mov	rcx, rdi	; rcx = count'
	mov	rdi, rsi	; rdi = address of destination
	mov	rsi, rdx	; rsi = address of source
	rep	movsb		; rdi = address past character in destination
	and	rax, rdi	; rax = ([rdi-1] = character)
				;     ? address past character in destination : 0
	mov	rdi, r10
	mov	rsi, r11
	ret
memccpy	endp
memchr	proc	public		; void *memchr(void const *destination, int character, size_t count)
	mov	r9, rdi
	mov	rdi, rcx	; rdi = address of destination
	mov	rcx, r8		; rcx = count
	mov	eax, edx	; rax = character
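	test	rsp, rsp	; ZF = 0 (required when count is 0)
	repne	scasb		; ZF = ([rdi-1] = character),
				; rcx = 0 when character was not found
	lea	rax, [rdi-1]	; rax = address of character
	cmovne	rax, rcx	; rax = ([rdi-1] = character) ? address of character : 0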
	mov	rdi, r9
	ret
memchr	endp
memcmp	proc	public		; int memcmp(void const *source, void const *destination, size_t count)
;;	xor	eax, eax	; rax = 0
;;	test	r8, r8
;;	jz	short equal	; count = 0?
;;	cmp	rcx, rdx
;;	je	short equal	; address of source = address of destination?
	mov	r9, rsi
	mov	rsi, rcx	; rsi = address of source
	mov	rcx, r8		; rcx = count
	mov	r8, rdi
	mov	rdi, rdx	; rdi = address of destination
	xor	eax, eax	; rax = 0,
				; CF = 0,
				; ZF = 1 (required when count is 0)
	repe	cmpsb
	seta	al		; rax = (*source > *destination)
	sbb	rax, 0		; rax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
	mov	rdi, r8
	mov	rsi, r9
equal:
	ret
memcmp	endp
memcpy	proc	public		; void *memcpy(void *destination, void const *source, size_t count)
	mov	rax, rcx	; rax = address of destination
;;	test	r8, r8
;;	jz	short return	; count = 0?
;;	cmp	rcx, rdx
;;	je	short return	; address of destination = address of source?
	mov	r9, rdi
	mov	rdi, rcx	; rdi = address of destination
	mov	rcx, r8		; rcx = count
	mov	r8, rsi
	mov	rsi, rdx	; rsi = address of source
if 1
	rep	movsb
else
	shr	rcx, 1		; rcx = count / 2
	jnc	short @f	; count % 2 = 0?
	movsb
@@:
	shr	rcx, 1		; rcx = count / 4
	jnc	short @f	; count % 4 = 0?
	movsw
@@:
	shr	rcx, 1		; rcx = count / 8
	jnc	short @f	; count % 8 = 0?
	movsd
@@:
	rep	movsq
endif
	mov	rdi, r9
	mov	rsi, r8
return:
	ret
memcpy	endp
memmem	proc	public		; void *memmem(void const *haystack, size_t count,
				;              void const *needle, size_t length)
	xor	eax, eax	; rax = address of needle in haystack = 0
	test	rdx, rdx
	jz	short empty	; length of haystack = 0?
	cmp	rdx, r9
	jb	short empty	; length of haystack < length of needle?
	mov	rax, rcx	; rax = address of haystack
	test	r9, r9
	jz	short empty	; length of needle = 0?
	mov	r10, rdi
	mov	rdi, rcx	; rdi = address of haystack
	mov	rcx, rdx	; rcx = length of haystack
	sub	rcx, r9
	inc	rcx		; rcx = length of haystack - length of needle + 1
				;     = number of candidate positions in haystack
	mov	r11, rsi
search:
	mov	al, [r8]	; al = first character of needle
				; rdi = current address in haystack
	repne	scasb		; rdi = next address in haystack,
				; rcx = number of candidate positions left
	jne	short break	; (first character of) needle not in haystack?
	mov	al, [r8+r9-1]	; al = last character of needle
	cmp	al, [rdi+r9-2]
	jne	short continue	; last character of needle not in haystack?
compare:
	mov	rax, rdi	; rax = next address in haystack
	mov	rdx, rcx	; rdx = number of candidate positions left
if 0
	dec	rdi		; rdi = current address in haystack
				;     = address of matching character
	mov	rsi, r8		; rsi = address of needle
	mov	rcx, r9		; rcx = length of needle
else
				; rdi = next address in haystack
	mov	rsi, r8
	inc	rsi		; rsi = address of needle + 1
	mov	rcx, r9
	dec	rcx		; rcx = length of needle - 1,
				; ZF = (rcx = 0)
;;	jz	short match	; needle in haystack?
endif
	repe	cmpsb
	je	short match	; needle in haystack?
	mov	rdi, rax	; rdi = next address in haystack
	mov	rcx, rdx	; rcx = number of candidate positions left
continue:
	test	rcx, rcx
	jnz	short search	; candidate positions left in haystack?
break:
	xor	eax, eax
	mov	rdi, r10
	mov	rsi, r11
empty:
	ret
match:
	dec	rax		; rax = address of needle in haystack
	mov	rdi, r10
	mov	rsi, r11
	ret
memmem	endp
memmove	proc	public		; void *memmove(void *destination, void const *source, size_t count)
	mov	rax, rcx	; rax = address of destination
;;	test	r8, r8
;;	jz	short return	; count = 0?
	cmp	rcx, rdx
	je	short return	; address of destination = address of source?
	mov	r9, rdi
	mov	rdi, rcx	; rdi = address of destination
	mov	rcx, r8		; rcx = count
	mov	r8, rsi
	mov	rsi, rdx	; rsi = address of source
	jb	short default	; address of destination < address of source?
	add	rdx, rcx	; rdx = address of source + count
	cmp	rdi, rdx
	jae	short default	; address of destination >= address of source + count?
overlap:
	lea	rdi, [rdi+rcx-1]
	lea	rsi, [rsi+rcx-1]
	std
default:
	rep	movsb
	cld
	mov	rdi, r9
	mov	rsi, r8
return:
	ret
memmove	endp
memrchr	proc	public		; void *memrchr(void const *destination, int character, size_t count)
	mov	r9, rdi
	lea	rdi, [rcx+r8-1]	; rdi = address of destination + count - 1
	mov	eax, edx	; rax = character
	mov	rcx, r8		; rcx = count
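	test	rsp, rsp	; ZF = 0 (required when count is 0)
	std
	repne	scasb		; ZF = ([rdi+1] = character),
				; rcx = 0 when character was not found
	cld
	lea	rax, [rdi+1]	; rax = address of character
	cmovne	rax, rcx	; rax = ([rdi+1] = character) ? address of character : 0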
	mov	rdi, r9
	ret
memrchr	endp
memset	proc	public		; void *memset(void *destination, int character, size_t count)
	mov	r9, rcx		; r9 = address of destination
	mov	rcx, r8		; rcx = count
;;	jrcxz	short @f	; count = 0?
	mov	r8, rdi
	mov	rdi, r9		; rdi = address of destination
	mov	eax, edx	; rax = character
	rep	stosb
	mov	rdi, r8
@@:
	mov	rax, r9		; rax = address of destination
	ret
memset	endp
	end

Save the AMD64 assembler source presented above as amd64-mem.asm in the directory where you created the object library amd64.lib before, then execute the following 3 command lines to generate the object file amd64-mem.obj and add it to the existing object library amd64.lib:

SET ML=/c /Gy /W3 /X
ML64.EXE amd64-mem.asm
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-mem.obj

For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
 Note: if the
            /Gy option
            to package every function in its own, separately linkable
            COMDAT
            section is not available with the version of the macro assembler
            ML64.EXE
            you use, split the AMD64 assembler source into
            multiple pieces, with one function per source file.
        
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler (x64) Version 14.16.27023.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: amd64-mem.asm
Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation. All rights reserved.

memcpy() and memset() with Intrinsic Functions

The following C source implements the memcpy() function as __movsb() alias REP MOVSB and the memset() function as __stosb() alias REP STOSB, but without shuffling as many registers as the external functions:

        // Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef _WIN64
typedef	unsigned int	size_t;
#endif
#pragma function(memcpy, memset)
#pragma intrinsic(__movsb, __stosb)
__inline
void	*memcpy(void *destination, void const *source, size_t count)
{
	__movsb((unsigned char *) destination, (unsigned char const *) source, count);
	return destination;
}
__inline
void	*memset(void *destination, int character, size_t count)
{
	__stosb((unsigned char *) destination, (unsigned char) character, count);
	return destination;
}

strchr() Standard Function for i386 Platform

The strchr() function is not a compiler helper function; like memchr(), it is included here for entertainment due to its implementation:

DIR "%source%\intel\strchr.asm"
TYPE "%source%\intel\strchr.asm"

 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             5,904 strchr.asm
               1 File(s)          5,904 bytes
               0 Dir(s)    9,876,543,210 bytes free
        page    ,132
        title   strchr - search string for given character
;***
;strchr.asm - search a string for a given character
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       defines strchr() - search a string for a character
;
;*******************************************************************************
        .xlist
        include cruntime.inc
        .list
page
;***
;char *strchr(string, chr) - search a string for a character
;
;Purpose:
;       Searches a string for a given character, which may be the
;       null character '\0'.
;
;       Algorithm:
;       char *
;       strchr (string, chr)
;       char *string, chr;
;       {
;         while (*string && *string != chr)
;             string++;
;         if (*string == chr)
;             return(string);
;         return((char *)0);
;       }
;
;Entry:
;       char *string - string to search in
;       char chr     - character to search for
;
;Exit:
;       returns pointer to the first occurence of c in string
;       returns NULL if chr does not occur in string
;
;Uses:
;
;Exceptions:
;
;*******************************************************************************
        CODESEG
found_bx:
        lea     eax,[edx - 1]
        pop     ebx                 ; restore ebx
        ret                         ; _cdecl return
        align   16
        public  strchr, __from_strstr_to_strchr
strchr  proc \
        string:ptr byte, \
        chr:byte
        OPTION PROLOGUE:NONE, EPILOGUE:NONE
        .FPO    ( 0, 2, 0, 0, 0, 0 )
        xor     eax,eax
        mov     al,[esp + 8]        ; al = chr (search char)
__from_strstr_to_strchr label proc
        push    ebx                 ; PRESERVE EBX
        mov     ebx,eax             ; ebx = 0/0/0/chr
        shl     eax,8               ; eax = 0/0/chr/0
        mov     edx,[esp + 8]       ; edx = buffer
        test    edx,3               ; test if string is aligned on 32 bits
        jz      short main_loop_start
str_misaligned:                     ; simple byte loop until string is aligned
        mov     cl,[edx]
        add     edx,1
        cmp     cl,bl
        je      short found_bx
        test    cl,cl
        jz      short retnull_bx
        test    edx,3               ; now aligned ?
        jne     short str_misaligned
main_loop_start:                    ; set all 4 bytes of ebx to [chr]
        or      ebx,eax             ; ebx = 0/0/chr/chr
        push    edi                 ; PRESERVE EDI
        mov     eax,ebx             ; eax = 0/0/chr/chr
        shl     ebx,10h             ; ebx = chr/chr/0/0
        push    esi                 ; PRESERVE ESI
        or      ebx,eax             ; ebx = all 4 bytes = [chr]
; in the main loop (below), we are looking for chr or for EOS (end of string)
main_loop:
        mov     ecx,[edx]           ; read  dword (4 bytes)
        mov     edi,7efefeffh       ; work with edi & ecx for looking for chr
        mov     eax,ecx             ; eax = dword
        mov     esi,edi             ; work with esi & eax for looking for EOS
        xor     ecx,ebx             ; eax = dword xor chr/chr/chr/chr
        add     esi,eax
        add     edi,ecx
        xor     ecx,-1
        xor     eax,-1
        xor     ecx,edi
        xor     eax,esi
        add     edx,4
        and     ecx,81010100h       ; test for chr
        jnz     short chr_is_found  ; chr probably has been found
        ; chr was not found, check for EOS
        and     eax,81010100h       ; is any flag set ??
        jz      short main_loop     ; EOS was not found, go get another dword
        and     eax,01010100h       ; is it in high byte?
        jnz     short retnull       ; no, definitely found EOS, return failure
        and     esi,80000000h       ; check was high byte 0 or 80h
        jnz     short main_loop     ; it just was 80h in high byte, go get
                                    ; another dword
retnull:
        pop     esi
        pop     edi
retnull_bx:
        pop     ebx
        xor     eax,eax
        ret                         ; _cdecl return
chr_is_found:
        mov     eax,[edx - 4]       ; let's look one more time on this dword
        cmp     al,bl               ; is chr in byte 0?
        je      short byte_0
        test    al,al               ; test if low byte is 0
        je      retnull
        cmp     ah,bl               ; is it byte 1
        je      short byte_1
        test    ah,ah               ; found EOS ?
        je      retnull
        shr     eax,10h             ; is it byte 2
        cmp     al,bl
        je      short byte_2
        test    al,al               ; if in al some bits were set, bl!=bh
        je      retnull
        cmp     ah,bl
        je      short byte_3
        test    ah,ah
        jz      retnull
        jmp     short main_loop     ; neither chr nor EOS found, go get
                                    ; another dword
byte_3:
        pop     esi
        pop     edi
        lea     eax,[edx - 1]
        pop     ebx                 ; restore ebx
        ret                         ; _cdecl return
byte_2:
        lea     eax,[edx - 2]
        pop     esi
        pop     edi
        pop     ebx
        ret                         ; _cdecl return
byte_1:
        lea     eax,[edx - 3]
        pop     esi
        pop     edi
        pop     ebx
        ret                         ; _cdecl return
byte_0:
        lea     eax,[edx - 4]
        pop     esi                 ; restore esi
        pop     edi                 ; restore edi
        pop     ebx                 ; restore ebx
        ret                         ; _cdecl return
strchr  endp
        end
            With 89 instructions in 206 bytes, this implementation is even
            worse than that of the memchr()
            routine shown and dissected above!
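
The idea shared by Microsoft's routine and the replacement presented next, testing four characters at once for either the searched character or the terminating '\0', can be sketched in a few lines of C. This is only an illustration under stated assumptions: the names has_zero_byte() and sketch_strchr() are hypothetical, and like the assembler routines the sketch reads whole aligned 32-bit words, which may include up to 3 bytes after the terminating '\0' (harmless on the i386 platform, since an aligned word never crosses a page boundary, but outside what strict ISO C guarantees).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Non-zero if and only if at least one byte of x is zero. */
static uint32_t has_zero_byte(uint32_t x)
{
	return (x - 0x01010101u) & ~x & 0x80808080u;
}

/* Hypothetical helper: strchr() with four characters per step. */
static char *sketch_strchr(char const *string, int character)
{
	unsigned char const *str = (unsigned char const *) string;
	unsigned char const chr = (unsigned char) character;
	uint32_t const pattern = 0x01010101u * chr;

	/* single characters until the address is a multiple of 4 */
	while ((uintptr_t) str % 4)
	{
		if (*str == chr)
			return (char *) str;
		if (*str == '\0')
			return NULL;
		str++;
	}
	/* whole aligned words until one holds '\0' or the character */
	for (;;)
	{
		uint32_t chunk;

		memcpy(&chunk, str, 4);
		if (has_zero_byte(chunk) | has_zero_byte(chunk ^ pattern))
			break;
		str += 4;
	}
	/* whichever comes first within that word decides the result */
	for (;; str++)
	{
		if (*str == chr)
			return (char *) str;
		if (*str == '\0')
			return NULL;
	}
}
```

Testing for the character before testing for '\0' keeps the standard behaviour that searching for '\0' itself returns the address of the terminator.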
        ; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
strchr	proc	public		; char *strchr(unsigned char const *string, int character)
	xor	eax, eax	; eax = 0
	cdq			; edx = 0
	mov	dl, [esp+8]	; edx = character
	imul	edx, 01010101h	; edx = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
	mov	[esp+8], edx
	mov	ecx, [esp+4]	; ecx = address of string
	push	ebx
	mov	ebx, ecx
	and	ecx, 3		; ecx = address of string % 4
				;     = 4 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	ebx, ecx	; ebx = aligned address before string
	shl	ecx, 3		; ecx = (4 - number of unaligned characters) * 8
				;     = 32 - number of unaligned bits
	dec	eax		; eax = ~0
	shl	eax, cl		; eax = ~0 for unaligned characters, 0 elsewhere
	not	eax		; eax = 0 for unaligned characters, ~0 elsewhere
	mov	ecx, [ebx]	; ecx = unaligned characters
	xor	edx, ecx	; edx = unaligned characters ^ multiplied character
	or	edx, eax	; edx = '\0' for matching characters
	or	eax, ecx	; eax = unaligned characters, ~0 elsewhere
	jmp	mycroft
next:
	add	ebx, 4		; ebx = address of next 4 aligned characters
aligned:
	mov	edx, [esp+12]	; edx = multiplied character
	mov	eax, [ebx]	; eax = next 4 aligned characters
	xor	edx, eax	; edx = next 4 aligned characters ^ multiplied character
				;     = '\0' for matching characters
mycroft:
	mov	ecx, eax
	sub	eax, 01010101h
	not	ecx
	and	ecx, eax
	mov	eax, edx
	sub	eax, 01010101h
	not	edx
	and	eax, edx
	or	eax, ecx
	and	eax, 80808080h	; eax = '\200' for '\0' or matching characters
	jz	short next	; neither '\0' nor any matching character?
match:
	bsf	eax, eax	; eax = offset of '\0' or matching character * 8 + 7
				;     = {7, 15, 23, 31}
	shr	eax, 3		; eax = offset of '\0' or matching character
				;     = {0, 1, 2, 3}
	cdq			; edx = 0
	add	eax, ebx	; eax = address of '\0' or matching character
	mov	dl, [esp+12]	; edx = character
	cmp	dl, [eax]	; ZF = (character = matching character)
	setne	dl		; edx = (character = matching character) ? 0 : 1
	dec	edx		; edx = (character = matching character) ? -1 : 0
	and	eax, edx	; eax = address of matching character
	pop	ebx
	ret
strchr	endp
	end
            Save the i386 assembler source presented above as
            i386-strchr.asm and the
            ANSI C
            source presented below as i386-strchr.c, then execute
            the 6 command lines following the
            ANSI C
            source to assemble i386-strchr.asm, compile
            i386-strchr.c, link the generated object files
            i386-strchr.obj and i386-strchr.tmp, and
            finally execute the image file i386-strchr.exe to
            demonstrate the correct operation:
        // Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
__declspec(safebuffers)
BOOL	CDECL	PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
	CHAR	szFormat[1025];
	DWORD	dwFormat;
	DWORD	dwOutput;
	va_list	vaInput;
	va_start(vaInput, lpFormat);
	dwFormat = wvsprintf(szFormat, lpFormat, vaInput);
	va_end(vaInput);
	if ((dwFormat == 0UL)
	 || !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
		return FALSE;
	return dwOutput == dwFormat;
}
const	CHAR	szString[] = "01234567890";
__declspec(noreturn)
VOID	CDECL	mainCRTStartup(VOID)
{
	LPCSTR	lpString = szString + sizeof(szString);
	DWORD	dwError = ERROR_SUCCESS;
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
	if (hOutput == INVALID_HANDLE_VALUE)
		dwError = GetLastError();
	else
		while (--lpString >= szString)
		{
			PrintFormat(hOutput,
			            "0x%p: strchr(\"%hs\", '0') = 0x%p\r\n",
			            lpString, lpString, strchr(lpString, '0'));
			PrintFormat(hOutput,
			            "0x%p: strchr(\"%hs\", '%hc') = 0x%p\r\n",
			            lpString, lpString, *lpString, strchr(lpString, *lpString));
			PrintFormat(hOutput,
			            "0x%p: strchr(\"%hs\", '\\0') = 0x%p\r\n",
			            lpString, lpString, strchr(lpString, '\0'));
		}
	ExitProcess(dwError);
}
SET ML=/c /safeseh /W3 /X
ML.EXE i386-strchr.asm
SET CL=/GAFy /Oy /W4 /Zl
SET LINK=/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
CL.EXE /Foi386-strchr.tmp i386-strchr.obj i386-strchr.c kernel32.lib user32.lib
.\i386-strchr.exe
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.
 Assembling: i386-strchr.asm
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
i386-strchr.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
/out:i386-strchr.exe
i386-strchr.obj
i386-strchr.tmp
kernel32.lib
user32.lib
0x00922027: strchr("", '0') = 0x00000000
0x00922027: strchr("", '▯') = 0x00922027
0x00922027: strchr("", '\0') = 0x00922027
0x00922026: strchr("0", '0') = 0x00922026
0x00922026: strchr("0", '0') = 0x00922026
0x00922026: strchr("0", '\0') = 0x00922027
0x00922025: strchr("90", '0') = 0x00922026
0x00922025: strchr("90", '9') = 0x00922025
0x00922025: strchr("90", '\0') = 0x00922027
0x00922024: strchr("890", '0') = 0x00922026
0x00922024: strchr("890", '8') = 0x00922024
0x00922024: strchr("890", '\0') = 0x00922027
0x00922023: strchr("7890", '0') = 0x00922026
0x00922023: strchr("7890", '7') = 0x00922023
0x00922023: strchr("7890", '\0') = 0x00922027
0x00922022: strchr("67890", '0') = 0x00922026
0x00922022: strchr("67890", '6') = 0x00922022
0x00922022: strchr("67890", '\0') = 0x00922027
0x00922021: strchr("567890", '0') = 0x00922026
0x00922021: strchr("567890", '5') = 0x00922021
0x00922021: strchr("567890", '\0') = 0x00922027
0x00922020: strchr("4567890", '0') = 0x00922026
0x00922020: strchr("4567890", '4') = 0x00922020
0x00922020: strchr("4567890", '\0') = 0x00922027
0x0092201F: strchr("34567890", '0') = 0x00922026
0x0092201F: strchr("34567890", '3') = 0x0092201F
0x0092201F: strchr("34567890", '\0') = 0x00922027
0x0092201E: strchr("234567890", '0') = 0x00922026
0x0092201E: strchr("234567890", '2') = 0x0092201E
0x0092201E: strchr("234567890", '\0') = 0x00922027
0x0092201D: strchr("1234567890", '0') = 0x00922026
0x0092201D: strchr("1234567890", '1') = 0x0092201D
0x0092201D: strchr("1234567890", '\0') = 0x00922027
0x0092201C: strchr("01234567890", '0') = 0x0092201C
0x0092201C: strchr("01234567890", '0') = 0x0092201C
0x0092201C: strchr("01234567890", '\0') = 0x00922027
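For readers who prefer C, the following is a minimal sketch (not part of the original sources) of what the i386 routine above does with each aligned 32-bit word: it replicates the character into all 4 bytes, XORs it with the word so that matching bytes become zero, and applies Mycroft's zero-byte test to both the word and the XOR result. The names zero_bytes(), first_nul_or_match() and select_match() are invented for this illustration, a little-endian word is assumed, and the BSF/SHR pair is spelled out as a portable loop.

```c
#include <assert.h>
#include <stdint.h>

/* Mycroft's test: nonzero if and only if x contains a zero byte;
   the lowest set bit marks the first (lowest-addressed) zero byte. */
static uint32_t zero_bytes(uint32_t x)
{
    return (uint32_t)((x - 0x01010101u) & ~x & 0x80808080u);
}

/* Offset (0..3) of the first byte of the little-endian word that is
   either '\0' or equal to c, or -1 if the word holds neither. */
static int first_nul_or_match(uint32_t word, unsigned char c)
{
    uint32_t pattern = (uint32_t)c * 0x01010101u;   /* c in all 4 bytes */
    uint32_t hits = zero_bytes(word) | zero_bytes(word ^ pattern);
    int offset = 0;

    if (hits == 0)
        return -1;
    while ((hits & 0x80u) == 0) {   /* stand-in for BSF followed by SHR 3 */
        hits >>= 8;
        ++offset;
    }
    return offset;
}

/* Branch-free selection modelled on SETNE, DEC, AND: yields p if *p
   equals c, else a zero address (assumed here to be the null pointer,
   which holds on the usual flat-memory platforms). */
static const char *select_match(const char *p, unsigned char c)
{
    uintptr_t mask = (uintptr_t)((unsigned char)*p != c) - 1u;

    assert(p != 0);
    return (const char *)((uintptr_t)p & mask);
}
```

Since the lowest set bit of each test result is always a genuine hit, the lowest set bit of their union is the first '\0' or matching character, whichever comes first; select_match() then decides which of the two was found.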
        ; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.xmm
	.model	flat, C
	.code
strchr	proc	public		; char *strchr(unsigned char const *string, int character)
if 0
	xor	eax, eax	; eax = 0
	mov	al, [esp+8]	; eax = character
	imul	eax, 01010101h	; eax = character
				;     | character << 8
				;     | character << 16
				;     | character << 24
	movd	xmm0, eax
else
	movd	xmm0, dword ptr [esp+8]
	punpcklbw xmm0, xmm0
	punpcklwd xmm0, xmm0
endif
	pshufd	xmm0, xmm0, 0	; xmm0 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of string
	mov	edx, ecx
	and	ecx, 15		; ecx = address of string % 16
				;     = 16 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	movdqa	xmm1, [edx]	; xmm1 = chunk of 16 characters
	pxor	xmm2, xmm2	; xmm2 = 0
	pcmpeqb	xmm2, xmm1	; xmm2 = '\377' for each '\0' in chunk
	pcmpeqb	xmm1, xmm0	; xmm1 = '\377' for each matching character in chunk
	por	xmm1, xmm2	; xmm1 = '\377' for each '\0' or matching character in chunk
	pmovmskb eax, xmm1	; eax = bitmask for '\0' or matching characters in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for '\0' or matching characters in string
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned characters
aligned:
	movdqa	xmm1, [edx]	; xmm1 = chunk of 16 characters
	pxor	xmm2, xmm2	; xmm2 = 0
	pcmpeqb	xmm2, xmm1	; xmm2 = '\377' for each '\0' in chunk
	pcmpeqb	xmm1, xmm0	; xmm1 = '\377' for each matching character in chunk
	por	xmm1, xmm2	; xmm1 = '\377' for each '\0' or matching character in chunk
	pmovmskb eax, xmm1	; eax = bitmask for '\0' or matching characters in chunk
	test	eax, eax
	jz	short next	; no '\0' or matching character in chunk?
match:
	bsf	eax, eax	; eax = offset of '\0' or matching character in chunk
	add	eax, edx	; eax = address of '\0' or matching character
	mov	cl, [esp+8]	; cl = character
	cmp	cl, [eax]	; ZF = (character = matching character)
	setne	cl		; ecx = (character = matching character) ? 0 : 1
	dec	ecx		; ecx = (character = matching character) ? -1 : 0
	and	eax, ecx	; eax = address of matching character
	ret
strchr	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.xmm
	.model	flat, C
	.code
strchr	proc	public		; char *strchr(unsigned char const *string, int character)
	pxor	xmm0, xmm0	; xmm0 = 0
	movd	xmm1, dword ptr [esp+8]
				; xmm1 = character
	pshufb	xmm1, xmm0	; xmm1 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of string
	mov	edx, ecx
	and	ecx, 15		; ecx = address of string % 16
				;     = 16 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	movdqa	xmm2, [edx]	; xmm2 = chunk of 16 characters
;;	pxor	xmm0, xmm0	; xmm0 = 0
	pcmpeqb	xmm0, xmm2	; xmm0 = '\377' for each '\0' in chunk
	pcmpeqb	xmm2, xmm1	; xmm2 = '\377' for each matching character in chunk
	por	xmm0, xmm2	; xmm0 = '\377' for each '\0' or matching character in chunk
	pmovmskb eax, xmm0	; eax = bitmask for '\0' or matching characters in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for '\0' or matching characters in string
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned characters
aligned:
	movdqa	xmm2, [edx]	; xmm2 = chunk of 16 characters
	pxor	xmm0, xmm0	; xmm0 = 0
	pcmpeqb	xmm0, xmm2	; xmm0 = '\377' for each '\0' in chunk
	pcmpeqb	xmm2, xmm1	; xmm2 = '\377' for each matching character in chunk
	por	xmm0, xmm2	; xmm0 = '\377' for each '\0' or matching character in chunk
	pmovmskb eax, xmm0	; eax = bitmask for '\0' or matching characters in chunk
	test	eax, eax
	jz	short next	; no '\0' or matching character in chunk?
match:
	bsf	eax, eax	; eax = offset of '\0' or matching character in chunk
	add	eax, edx	; eax = address of '\0' or matching character
	mov	cl, [esp+8]	; cl = character
	cmp	cl, [eax]	; ZF = (character = matching character)
	setne	cl		; ecx = (character = matching character) ? 0 : 1
	dec	ecx		; ecx = (character = matching character) ? -1 : 0
	and	eax, ecx	; eax = address of matching character
	ret
strchr	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.xmm
	.model	flat, C
	.code
strchr	proc	public		; char *strchr(unsigned char const *string, int character)
	vpxor	xmm0, xmm0, xmm0; xmm0 = 0
	vmovd	xmm1, dword ptr [esp+8]
				; xmm1 = character
	vpshufb	xmm1, xmm1, xmm0; xmm1 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of string
	mov	edx, ecx
	and	ecx, 15		; ecx = address of string % 16
				;     = 16 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	vmovdqa	xmm2, xmmword ptr [edx]
				; xmm2 = chunk of 16 characters
	vpcmpeqb xmm3, xmm2, xmm1
				; xmm3 = '\377' for each matching character in chunk
	vpcmpeqb xmm2, xmm2, xmm0
				; xmm2 = '\377' for each '\0' in chunk
	vpor	xmm2, xmm2, xmm3; xmm2 = '\377' for each '\0' or matching character in chunk
	vpmovmskb eax, xmm2	; eax = bitmask for '\0' or matching characters in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for '\0' or matching characters in string
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned characters
aligned:
	vmovdqa	xmm2, xmmword ptr [edx]
				; xmm2 = chunk of 16 characters
	vpcmpeqb xmm3, xmm2, xmm1
				; xmm3 = '\377' for each matching character in chunk
	vpcmpeqb xmm2, xmm2, xmm0
				; xmm2 = '\377' for each '\0' in chunk
	vpor	xmm2, xmm2, xmm3; xmm2 = '\377' for each '\0' or matching character in chunk
	vpmovmskb eax, xmm2	; eax = bitmask for '\0' or matching characters in chunk
	test	eax, eax
	jz	short next	; no '\0' or matching character in chunk?
match:
	bsf	eax, eax	; eax = offset of '\0' or matching character in chunk
	add	eax, edx	; eax = address of '\0' or matching character
	mov	cl, [esp+8]	; cl = character
	cmp	cl, [eax]	; ZF = (character = matching character)
	setne	cl		; ecx = (character = matching character) ? 0 : 1
	dec	ecx		; ecx = (character = matching character) ? -1 : 0
	and	eax, ecx	; eax = address of matching character
	ret
strchr	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.ymm
	.model	flat, C
	.code
strchr	proc	public		; char *strchr(unsigned char const *string, int character)
	vpxor	ymm0, ymm0, ymm0; ymm0 = 0
	vpbroadcastb ymm1, byte ptr [esp+8]
				; ymm1 = multiplied character
	mov	ecx, [esp+4]	; ecx = address of string
	mov	edx, ecx
	and	ecx, 31		; ecx = address of string % 32
				;     = 32 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	vmovdqa	ymm2, ymmword ptr [edx]
				; ymm2 = chunk of 32 characters
	vpcmpeqb ymm3, ymm2, ymm1
				; ymm3 = '\377' for each matching character in chunk
	vpcmpeqb ymm2, ymm2, ymm0
				; ymm2 = '\377' for each '\0' in chunk
	vpor	ymm2, ymm2, ymm3; ymm2 = '\377' for each '\0' or matching character in chunk
	vpmovmskb eax, ymm2	; eax = bitmask for '\0' or matching characters in chunk
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for '\0' or matching characters in string
	jnz	short match
next:
	add	edx, 32		; edx = address of next chunk of aligned characters
aligned:
	vmovdqa	ymm2, ymmword ptr [edx]
				; ymm2 = chunk of 32 characters
	vpcmpeqb ymm3, ymm2, ymm1
				; ymm3 = '\377' for each matching character in chunk
	vpcmpeqb ymm2, ymm2, ymm0
				; ymm2 = '\377' for each '\0' in chunk
	vpor	ymm2, ymm2, ymm3; ymm2 = '\377' for each '\0' or matching character in chunk
	vpmovmskb eax, ymm2	; eax = bitmask for '\0' or matching characters in chunk
	test	eax, eax
	jz	short next	; no '\0' or matching character in chunk?
match:
	bsf	eax, eax	; eax = offset of '\0' or matching character in chunk
	add	eax, edx	; eax = address of '\0' or matching character
	mov	cl, [esp+8]	; cl = character
	cmp	cl, [eax]	; ZF = (character = matching character)
	setne	cl		; ecx = (character = matching character) ? 0 : 1
	dec	ecx		; ecx = (character = matching character) ? -1 : 0
	and	eax, ecx	; eax = address of matching character
	ret
strchr	endp
	end
strlen() Standard Function for i386 Platform
            The strlen()
            function is not a compiler helper function; like the
            memchr()
            function, it is included here for entertainment due to its implementation:
DIR "%source%\intel\strlen.asm"
TYPE "%source%\intel\strlen.asm"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src\intel
02/18/2011  03:40 PM             3,208 strlen.asm
               1 File(s)          3,208 bytes
               0 Dir(s)    9,876,543,210 bytes free
        page    ,132
        title   strlen - return the length of a null-terminated string
;***
;strlen.asm - contains strlen() routine
;
;       Copyright (c) Microsoft Corporation. All rights reserved.
;
;Purpose:
;       strlen returns the length of a null-terminated string,
;       not including the null byte itself.
;
;*******************************************************************************
        .xlist
        include cruntime.inc
        .list
page
;***
;strlen - return the length of a null-terminated string
;
;Purpose:
;       Finds the length in bytes of the given string, not including
;       the final null character.
;
;       Algorithm:
;       int strlen (const char * str)
;       {
;           int length = 0;
;
;           while( *str++ )
;                   ++length;
;
;           return( length );
;       }
;
;Entry:
;       const char * str - string whose length is to be computed
;
;Exit:
;       EAX = length of the string "str", exclusive of the final null byte
;
;Uses:
;       EAX, ECX, EDX
;
;Exceptions:
;
;*******************************************************************************
        CODESEG
        public  strlen
strlen  proc \
        buf:ptr byte
        OPTION PROLOGUE:NONE, EPILOGUE:NONE
        .FPO    ( 0, 1, 0, 0, 0, 0 )
string  equ     [esp + 4]
        mov     ecx,string              ; ecx -> string
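        test    ecx,3                   ; [deleted] test if string is aligned on 32 bits
        test    cl,3                    ; [inserted]
        je      short main_loop
str_misaligned:
        ; simple byte loop until string is aligned
        mov     al,byte ptr [ecx]
        add     ecx,1                   ; [deleted]
        inc     ecx                     ; [inserted]
        test    al,al
        je      short byte_3
        test    ecx,3                   ; [deleted]
        test    cl,3                    ; [inserted]
        jne     short str_misaligned
        jmp     short main_loop         ; [inserted]
byte_3:                                 ; [inserted]
        lea     eax,[ecx - 1]           ; [inserted]
        mov     ecx,string              ; [inserted]
        sub     eax,ecx                 ; [inserted]
        ret                             ; [inserted]
        add     eax,dword ptr 0         ; [deleted] 5 byte nop to align label below
        align   16                      ; should be redundant
main_loop:
        mov     eax,dword ptr [ecx]     ; read 4 bytes
        mov     edx,7efefeffh           ; [deleted]
        add     edx,eax                 ; [deleted]
        xor     eax,-1                  ; [deleted]
        xor     eax,edx                 ; [deleted]
        add     ecx,4
        test    eax,81010100h           ; [deleted]
        lea     edx,[eax-01010101h]     ; [inserted]
        not     eax                     ; [inserted]
        and     eax,edx                 ; [inserted]
        and     eax,80808080h           ; [inserted]
        je      short main_loop
        ; found zero byte in the loop
        bsf     eax,eax                 ; [inserted]
        shr     eax,3                   ; [inserted]
        lea     eax,[eax+ecx-4]         ; [inserted]
        mov     ecx,string              ; [inserted]
        sub     eax,ecx                 ; [inserted]
        ret                             ; [inserted]
        mov     eax,[ecx - 4]           ; [deleted]
        test    al,al                   ; [deleted] is it byte 0
        je      short byte_0            ; [deleted]
        test    ah,ah                   ; [deleted] is it byte 1
        je      short byte_1            ; [deleted]
        test    eax,00ff0000h           ; [deleted] is it byte 2
        je      short byte_2            ; [deleted]
        test    eax,0ff000000h          ; [deleted] is it byte 3
        je      short byte_3            ; [deleted]
        jmp     short main_loop         ; [deleted] taken if bits 24-30 are clear and bit
                                        ; 31 is set
byte_3:                                 ; [deleted]
        lea     eax,[ecx - 1]           ; [deleted]
        mov     ecx,string              ; [deleted]
        sub     eax,ecx                 ; [deleted]
        ret                             ; [deleted]
byte_2:                                 ; [deleted]
        lea     eax,[ecx - 2]           ; [deleted]
        mov     ecx,string              ; [deleted]
        sub     eax,ecx                 ; [deleted]
        ret                             ; [deleted]
byte_1:                                 ; [deleted]
        lea     eax,[ecx - 3]           ; [deleted]
        mov     ecx,string              ; [deleted]
        sub     eax,ecx                 ; [deleted]
        ret                             ; [deleted]
byte_0:                                 ; [deleted]
        lea     eax,[ecx - 4]           ; [deleted]
        mov     ecx,string              ; [deleted]
        sub     eax,ecx                 ; [deleted]
        ret                             ; [deleted]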
strlen  endp
        end
            With 44 instructions in 139 bytes, this routine is a real gem too,
            one which nobody of sound mind should even consider using!
        CAVEAT: Intel’s current Optimization Reference Manual: Volume 1, published January 2024, presents this dumb implementation as Example 14-3!
 OOPS: the deleted
            TEST instructions with
            immediate value 3 should be replaced with the inserted
            shorter ones, saving 6 bytes.
        
 OUCH¹: the deleted
            ADD instruction which increments by 1 should
            be replaced with the inserted shorter
            INC instruction, saving 1 byte.
        
 Note: the 7 saved bytes allow moving the 4
            instructions after label byte_3: before the label
            main_loop:, (ab)using them to align the loop.
        
 OUCH²: instead of the deleted
            XOR instruction with
            immediate operand -1 the inserted shorter
            NOT instruction
            should be used, saving 1 byte!
        
 OUCH³: when the 5 deleted
            instructions after label main_loop: are replaced with
            the 4 instructions inserted there, the 22 (in words:
            twenty-two) deleted instructions at the
            end of the function can be replaced with the 6 faster and shorter
            instructions inserted there, saving 42 (in words:
            forty-two) bytes!
        
Note: Alan Mycroft posted the better, faster and shorter method on April 8, 1987 to the USENET news group comp.lang.c:
You might be interested to know that such detection of null bytes in words
can be done in 3 or 4 instructions on almost any hardware (nay even in C).
(Code that follows relies on x being a 32 bit unsigned (or 2's complement
int with overflow ignored)...)
#define has_nullbyte_(x) ((x - 0x01010101) & ~x & 0x80808080)
Then if e is an expression without side effects (e.g. variable)
has_nullbyte_(e)
is nonzero iff the value of e has a null byte.
(One can view this as explicit programming of the Manchester carry chain
present in many processors which is hidden from the instruction level).
Note: see Bit Twiddling Hacks – Determine if a word has a zero byte for a comparison of both methods and more details.
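The difference between the two zero-byte tests can be tried out in C. The following is a hedged sketch, with function names chosen only for this illustration: carry_test() models the constants 7efefeffh and 81010100h of the original main_loop, mycroft_test() the macro quoted above, and has_zero_byte() is a plain byte-wise reference.

```c
#include <assert.h>
#include <stdint.h>

/* Carry-based test with the constants of the original main_loop; it can
   report a hit although no byte is zero, for example when the most
   significant byte is 0x80 (bits 24-30 clear, bit 31 set). */
static uint32_t carry_test(uint32_t x)
{
    return (uint32_t)(((x + 0x7efefeffu) ^ ~x) & 0x81010100u);
}

/* Mycroft's test: nonzero if and only if some byte of x is zero. */
static uint32_t mycroft_test(uint32_t x)
{
    return (uint32_t)((x - 0x01010101u) & ~x & 0x80808080u);
}

/* Byte-wise reference used to cross-check the two tests. */
static int has_zero_byte(uint32_t x)
{
    int i;

    for (i = 0; i < 4; ++i, x >>= 8)
        if ((x & 0xFFu) == 0)
            return 1;
    return 0;
}
```

For the word 0x80414141, which contains no zero byte, carry_test() returns a nonzero value while mycroft_test() returns 0; this false positive is why the original code needs the byte-by-byte checks and the jump back into the loop, and why the replacement can use BSF directly.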
 Note: Microsoft’s
            strcat.asm, strchr.asm,
            strncat.asm and strncpy.asm sources suffer
            from the same plus some more deficiencies and flaws!
        
Note: with the modifications shown in the source, this routine has 27 instructions in 87 bytes, i.e. less than two thirds of the original’s instructions and bytes.
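The routine that follows avoids the byte loop for misaligned strings altogether: it reads the aligned word containing the start of the string and forces the bytes in front of the string to all ones, so that a stray '\0' there cannot end the search early. A minimal C sketch of this masking step, assuming a little-endian word and using the invented name load_masked_head(), might look like this:

```c
#include <assert.h>
#include <stdint.h>

/* chunk: the 4 bytes of the aligned word that contains the string start;
   misalign: address of string % 4 (1..3), i.e. the number of bytes in
   front of the string within that word. */
static uint32_t load_masked_head(const unsigned char chunk[4], unsigned misalign)
{
    uint32_t word, filler;

    assert(misalign >= 1 && misalign <= 3);
    /* assemble the word as a little-endian load would */
    word = (uint32_t)chunk[0]
         | (uint32_t)chunk[1] << 8
         | (uint32_t)chunk[2] << 16
         | (uint32_t)chunk[3] << 24;
    /* OR -1, SHL, NOT: ones in the low misalign * 8 bits, zeros above */
    filler = ~(~(uint32_t)0 << (misalign * 8));
    return word | filler;
}
```

With the bytes { 0, 0, 'a', 'b' } and misalign = 2 the result is 0x6261FFFF: the two zero bytes in front of the string no longer look like a terminator to the zero-byte test.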
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
strlen	proc	public		; size_t strlen(unsigned char const *string)
	mov	edx, [esp+4]	; edx = address of string
	mov	ecx, edx
	and	ecx, 3		; ecx = address of string % 4
				;     = 4 - number of unaligned characters
	jz	short aligned	; address of string % 4 = 0?
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	shl	ecx, 3		; ecx = (4 - number of unaligned characters) * 8
				;     = 32 - number of unaligned bits
if 0
	xor	eax, eax
	dec	eax		; eax = ~0
else
	or	eax, -1		; eax = ~0
endif
	shl	eax, cl		; eax = ~0 for unaligned characters, 0 elsewhere
	not	eax		; eax = 0 for unaligned characters, ~0 elsewhere
	or	eax, [edx]	; eax = unaligned characters
	jmp	short mycroft
next:
	add	edx, 4		; edx = address of next 4 aligned characters
aligned:
	mov	eax, [edx]	; eax = next 4 aligned characters
mycroft:
	mov	ecx, eax
	sub	eax, 01010101h
	not	ecx
	and	eax, 80808080h
	and	eax, ecx	; eax = '\200' for matching characters, '\0' elsewhere
	jz	short next	; no '\0' in any character?
match:
	bsf	eax, eax	; eax = offset of '\0' * 8 + 7
				;     = {7, 15, 23, 31}
	shr	eax, 3		; eax = offset of '\0'
				;     = {0, 1, 2, 3}
	add	eax, edx	; eax = address of '\0'
	sub	eax, [esp+4]	; eax = length of string
	ret
strlen	endp
	end
            Save the i386 assembler source presented above as
            i386-strlen.asm and the
            ANSI C
            source presented below as i386-strlen.c, then execute
            the 6 command lines following the
            ANSI C
            source to assemble i386-strlen.asm, compile
            i386-strlen.c, link the generated object files
            i386-strlen.obj and i386-strlen.tmp, and
            finally execute the image file i386-strlen.exe to
            demonstrate the correct operation:
        // Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#pragma function(strlen)
__declspec(safebuffers)
BOOL	CDECL	PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
	CHAR	szFormat[1025];
	DWORD	dwFormat;
	DWORD	dwOutput;
	va_list	vaInput;
	va_start(vaInput, lpFormat);
	dwFormat = wvsprintf(szFormat, lpFormat, vaInput);
	va_end(vaInput);
	if ((dwFormat == 0UL)
	 || !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
		return FALSE;
	return dwOutput == dwFormat;
}
const	CHAR	szString[] = "987654321";
__declspec(noreturn)
VOID	CDECL	mainCRTStartup(VOID)
{
	LPCSTR	lpString = szString + sizeof(szString);
	DWORD	dwError = ERROR_SUCCESS;
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
	if (hOutput == INVALID_HANDLE_VALUE)
		dwError = GetLastError();
	else
		while (--lpString >= szString)
			PrintFormat(hOutput,
			            "0x%p: strlen(\"%hs\") = %lu\r\n",
			            lpString, lpString, strlen(lpString));
	ExitProcess(dwError);
}
SET ML=/c /safeseh /W3 /X
ML.EXE i386-strlen.asm
SET CL=/GAFy /Oy /W4 /Zl
SET LINK=/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
CL.EXE /Foi386-strlen.tmp i386-strlen.obj i386-strlen.c kernel32.lib user32.lib
.\i386-strlen.exe
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.
 Assembling: i386-strlen.asm
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
i386-strlen.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
/out:i386-strlen.exe
i386-strlen.obj
i386-strlen.tmp
kernel32.lib
user32.lib
0x01202025: strlen("") = 0
0x01202024: strlen("1") = 1
0x01202023: strlen("21") = 2
0x01202022: strlen("321") = 3
0x01202021: strlen("4321") = 4
0x01202020: strlen("54321") = 5
0x0120201F: strlen("654321") = 6
0x0120201E: strlen("7654321") = 7
0x0120201D: strlen("87654321") = 8
0x0120201C: strlen("987654321") = 9
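The SIMD variants that follow share one idea: compare a whole aligned chunk against '\0', collect the comparison results into a bitmask, and for a misaligned string clear the mask bits that belong to bytes in front of it with a right shift followed by a left shift. The following portable C sketch emulates this without intrinsics; nul_mask16() stands in for PCMPEQB plus PMOVMSKB, the names and the explicit chunk count are assumptions made for this illustration (the assembler routines have no such bound and rely on the terminating '\0').

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit i of the result is set if chunk[i] is '\0'. */
static uint32_t nul_mask16(const unsigned char chunk[16])
{
    uint32_t mask = 0;
    int i;

    for (i = 0; i < 16; ++i)
        if (chunk[i] == 0)
            mask |= (uint32_t)1 << i;
    return mask;
}

/* Length of the string starting at buffer[start]; buffer consists of
   'chunks' 16-byte chunks. Returns (size_t)-1 if no '\0' is found. */
static size_t chunked_strlen(const unsigned char *buffer, size_t chunks, size_t start)
{
    size_t base = start & ~(size_t)15;        /* chunk holding the start */
    unsigned skip = (unsigned)(start & 15);   /* bytes in front of string */
    size_t offset = 0;
    uint32_t mask;

    assert(start < chunks * 16);
    mask = nul_mask16(buffer + base);
    mask = (mask >> skip) << skip;            /* SHR, SHL: drop earlier hits */
    while (mask == 0) {
        base += 16;
        if (base >= chunks * 16)
            return (size_t)-1;
        mask = nul_mask16(buffer + base);
    }
    while ((mask & 1u) == 0) {                /* stand-in for BSF */
        mask >>= 1;
        ++offset;
    }
    return base + offset - start;
}
```

Reading the whole aligned chunk is safe in the assembler routines because an aligned 16- or 32-byte load never crosses a page boundary; the sketch sidesteps the question by working inside a caller-supplied buffer.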
        ; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.xmm
	.model	flat, C
	.code
strlen	proc	public		; size_t strlen(unsigned char const *string)
	mov	ecx, [esp+4]	; ecx = address of string
	mov	edx, ecx
	and	ecx, 15		; ecx = address of string % 16
				;     = 16 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	pxor	xmm0, xmm0	; xmm0 = 0
	pcmpeqb	xmm0, [edx]	; xmm0 = '\377' for each '\0' in chunk of characters
	pmovmskb eax, xmm0	; eax = bitmask for '\0' in chunk of characters
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for '\0' in string
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned characters
aligned:
	pxor	xmm0, xmm0	; xmm0 = 0
	pcmpeqb	xmm0, [edx]	; xmm0 = '\377' for each '\0' in chunk of characters
	pmovmskb eax, xmm0	; eax = bitmask for '\0' in chunk of characters
	test	eax, eax
	jz	short next	; no '\0' in chunk?
match:
	bsf	eax, eax	; eax = offset of '\0' in chunk of characters
	add	eax, edx	; eax = address of '\0'
	sub	eax, [esp+4]	; eax = length of string
	ret
strlen	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.ymm
	.model	flat, C
	.code
strlen	proc	public		; size_t strlen(unsigned char const *string)
	vpxor	xmm0, xmm0, xmm0; xmm0 = 0
	mov	ecx, [esp+4]	; ecx = address of string
	mov	edx, ecx
	and	ecx, 15		; ecx = address of string % 16
				;     = 16 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	vpcmpeqb xmm1, xmm0, [edx]
				; xmm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, xmm1	; eax = bitmask for '\0' in chunk of characters
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for '\0' in string
	jnz	short match
next:
	add	edx, 16		; edx = address of next chunk of aligned characters
aligned:
	vpcmpeqb xmm1, xmm0, [edx]
				; xmm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, xmm1	; eax = bitmask for '\0' in chunk of characters
	test	eax, eax
	jz	short next	; no '\0' in chunk?
match:
	bsf	eax, eax	; eax = offset of '\0' in chunk of characters
	add	eax, edx	; eax = address of '\0'
	sub	eax, [esp+4]	; eax = length of string
	ret
strlen	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.ymm
	.model	flat, C
	.code
strlen	proc	public		; size_t strlen(unsigned char const *string)
	vpxor	ymm0, ymm0, ymm0; ymm0 = 0
	mov	ecx, [esp+4]	; ecx = address of string
	mov	edx, ecx
	and	ecx, 31		; ecx = address of string % 32
				;     = 32 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	edx, ecx	; edx = aligned address before string
	vpcmpeqb ymm1, ymm0, [edx]
				; ymm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, ymm1	; eax = bitmask for '\0' in chunk of characters
	shr	eax, cl
	shl	eax, cl		; eax = bitmask for '\0' in string
	jnz	short match
next:
	add	edx, 32		; edx = address of next chunk of aligned characters
aligned:
	vpcmpeqb ymm1, ymm0, [edx]
				; ymm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, ymm1	; eax = bitmask for '\0' in chunk of characters
	test	eax, eax
	jz	short next	; no '\0' in chunk?
match:
	bsf	eax, eax	; eax = offset of '\0' in chunk of characters
	add	eax, edx	; eax = address of '\0'
	sub	eax, [esp+4]	; eax = length of string
	ret
strlen	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.code
strlen	proc	public		; size_t strlen(unsigned char const *string)
	mov	r8, 0101010101010101h
if 0
	mov	r9, 8080808080808080h
elseif 0
	imul	r9, r8, 128	; r9 = 0x8080808080808080
else
	mov	r9, r8
	ror	r9, 1		; r9 = 0x8080808080808080
endif
	mov	r10, rcx
	mov	rdx, rcx	; rdx = address of string
	and	rcx, 7		; rcx = address of string % 8
				;     = 8 - number of unaligned characters
	jz	short aligned	; address of string % 8 = 0?
unaligned:
	sub	rdx, rcx	; rdx = aligned address before string
	shl	ecx, 3		; rcx = (8 - number of unaligned characters) * 8
				;     = 64 - number of unaligned bits
ifdef AMD
	stc
	sbb	rax, rax	; rax = ~0
else
	or	rax, -1		; rax = ~0
endif
	shl	rax, cl		; rax = ~0 for unaligned characters, 0 elsewhere
	not	rax		; rax = 0 for unaligned characters, ~0 elsewhere
	or	rax, [rdx]	; rax = unaligned characters
	jmp	short mycroft
next:
	add	rdx, 8		; rdx = address of next 8 aligned characters
aligned:
	mov	rax, [rdx]	; rax = next 8 aligned characters
mycroft:
	mov	rcx, rax
	sub	rax, r8
	not	rcx
	and	rax, r9
	and	rcx, rax	; rcx = '\200' for matching characters, '\0' elsewhere
	jz	short next	; no '\0' in any character?
match:
	bsf	rax, rcx	; rax = offset of '\0' * 8 + 7
				;     = {7, 15, 23, 31, 39, 47, 55, 63}
	shr	eax, 3		; rax = offset of '\0'
				;     = {0, 1, 2, 3, 4, 5, 6, 7}
	add	rax, rdx	; rax = address of '\0'
	sub	rax, r10	; rax = length of string
	ret
strlen	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.code
strlen	proc	public		; size_t strlen(unsigned char const *string)
	mov	rdx, rcx	; rdx = address of string
	mov	r8, rcx
	and	ecx, 15		; rcx = address of string % 16
				;     = 16 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	rdx, rcx	; rdx = aligned address before string
	pxor	xmm0, xmm0	; xmm0 = 0
	pcmpeqb	xmm0, [rdx]	; xmm0 = '\377' for each '\0' in chunk of characters
	pmovmskb eax, xmm0	; rax = bitmask for '\0' in chunk of characters
	shr	eax, cl
	shl	eax, cl		; rax = bitmask for '\0' in string
	jnz	short match
next:
	add	rdx, 16		; rdx = address of next chunk of aligned characters
aligned:
	pxor	xmm0, xmm0	; xmm0 = 0
	pcmpeqb	xmm0, [rdx]	; xmm0 = '\377' for each '\0' in chunk of characters
	pmovmskb eax, xmm0	; rax = bitmask for '\0' in chunk of characters
	test	eax, eax
	jz	short next	; no '\0' in chunk?
match:
	bsf	eax, eax	; rax = offset of '\0' in chunk of characters
	add	rax, rdx	; rax = address of '\0'
	sub	rax, r8		; rax = length of string
	ret
strlen	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.code
strlen	proc	public		; size_t strlen(unsigned char const *string)
	vpxor	xmm0, xmm0, xmm0; xmm0 = 0
	mov	rdx, rcx	; rdx = address of string
	mov	r8, rcx
	and	ecx, 15		; rcx = address of string % 16
				;     = 16 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	rdx, rcx	; rdx = aligned address before string
	vpcmpeqb xmm1, xmm0, [rdx]
				; xmm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, xmm1	; rax = bitmask for '\0' in chunk of characters
	shr	eax, cl
	shl	eax, cl		; rax = bitmask for '\0' in string
	jnz	short match
next:
	add	rdx, 16		; rdx = address of next chunk of aligned characters
aligned:
	vpcmpeqb xmm1, xmm0, [rdx]
				; xmm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, xmm1	; rax = bitmask for '\0' in chunk of characters
	test	eax, eax
	jz	short next	; no '\0' in chunk?
match:
	bsf	eax, eax	; rax = offset of '\0' in chunk of characters
	add	rax, rdx	; rax = address of '\0'
	sub	rax, r8		; rax = length of string
	ret
strlen	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.code
strlen	proc	public		; size_t strlen(unsigned char const *string)
	vpxor	ymm0, ymm0, ymm0; ymm0 = 0
	mov	rdx, rcx	; rdx = address of string
	mov	r8, rcx
	and	ecx, 31		; rcx = address of string % 32
				;     = 32 - number of unaligned characters
	jz	short aligned
unaligned:
	sub	rdx, rcx	; rdx = aligned address before string
	vpcmpeqb ymm1, ymm0, [rdx]
				; ymm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, ymm1	; rax = bitmask for '\0' in chunk of characters
	shr	eax, cl
	shl	eax, cl		; rax = bitmask for '\0' in string
	jnz	short match
next:
	add	rdx, 32		; rdx = address of next chunk of aligned characters
aligned:
	vpcmpeqb ymm1, ymm0, [rdx]
				; ymm1 = '\377' for each '\0' in chunk of characters
	vpmovmskb eax, ymm1	; rax = bitmask for '\0' in chunk of characters
	test	eax, eax
	jz	short next	; no '\0' in chunk?
match:
	bsf	eax, eax	; rax = offset of '\0' in chunk of characters
	add	rax, rdx	; rax = address of '\0'
	sub	rax, r8		; rax = length of string
	ret
strlen	endp
	end
strrchr() and strstr() Standard Functions for i386 Platform
            The
            strrchr()
            and
            strstr()
            functions are not compiler helper functions; like the
            memchr()
            function, they are included here for entertainment due to their
            extraordinary implementation:
DIR "%source%\str*.c"
TYPE "%source%\strrchr.c"
TYPE "%source%\strstr.c"
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\crt\src
02/18/2011  03:40 PM             1,998 strcat.c
02/18/2011  03:40 PM               541 strcat_s.c
02/18/2011  03:40 PM             1,102 strchr.c
02/18/2011  03:40 PM             1,566 strcmp.c
02/18/2011  03:40 PM             2,532 strcoll.c
02/18/2011  03:40 PM               479 strcpy_s.c
02/18/2011  03:40 PM               337 strcspn.c
02/18/2011  03:40 PM             3,227 strdate.c
02/18/2011  03:40 PM             1,895 strdup.c
02/18/2011  03:40 PM             4,085 stream.c
02/18/2011  03:40 PM             4,414 strerror.c
02/18/2011  03:40 PM            42,150 strftime.c
02/18/2011  03:40 PM             2,757 stricmp.c
02/18/2011  03:40 PM             2,570 stricoll.c
02/18/2011  03:40 PM             1,009 strlen.c
02/18/2011  03:40 PM             1,276 strlen_s.c
02/18/2011  03:40 PM             5,994 strlwr.c
02/18/2011  03:40 PM             1,496 strncat.c
02/18/2011  03:40 PM               564 strncat_s.c
02/18/2011  03:40 PM             2,546 strncmp.c
02/18/2011  03:40 PM             1,250 strncnt.c
02/18/2011  03:40 PM             3,108 strncoll.c
02/18/2011  03:40 PM             1,464 strncpy.c
02/18/2011  03:40 PM               536 strncpy_s.c
02/18/2011  03:40 PM             3,628 strnicmp.c
02/18/2011  03:40 PM             2,988 strnicol.c
02/18/2011  03:40 PM             1,243 strnset.c
02/18/2011  03:40 PM               580 strnset_s.c
02/18/2011  03:40 PM               337 strpbrk.c
02/18/2011  03:40 PM             1,460 strrchr.c
02/18/2011  03:40 PM             1,204 strrev.c
02/18/2011  03:40 PM             1,204 strset.c
02/18/2011  03:40 PM               519 strset_s.c
02/18/2011  03:40 PM             4,922 strspn.c
02/18/2011  03:40 PM             1,371 strstr.c
02/18/2011  03:40 PM             3,226 strtime.c
02/18/2011  03:40 PM             3,500 strtod.c
02/18/2011  03:40 PM             4,167 strtok.c
02/18/2011  03:40 PM               450 strtok_s.c
02/18/2011  03:40 PM             8,862 strtol.c
02/18/2011  03:40 PM             7,726 strtoq.c
02/18/2011  03:40 PM             6,094 strupr.c
02/18/2011  03:40 PM             4,739 strxfrm.c
              43 File(s)        147,116 bytes
               0 Dir(s)    9,876,543,210 bytes free
/***
*strrchr.c - find last occurrence of character in string
*
*       Copyright (c) Microsoft Corporation. All rights reserved.
*
*Purpose:
*       defines strrchr() - find the last occurrence of a given character
*       in a string.
*
*******************************************************************************/
#include <cruntime.h>
#include <string.h>
/***
*char *strrchr(string, ch) - find last occurrence of ch in string
*
*Purpose:
*       Finds the last occurrence of ch in string.  The terminating
*       null character is used as part of the search.
*
*Entry:
*       char *string - string to search in
*       char ch - character to search for
*
*Exit:
*       returns a pointer to the last occurrence of ch in the given
*       string
*       returns NULL if ch does not occurr in the string
*
*Exceptions:
*
*******************************************************************************/
char * __cdecl strrchr (
        const char * string,
        int ch
        )
{
        char *start = (char *)string;
        while (*string++)                       /* find end of string */
                ;
                                                /* search towards front */
        while (--string != start && *string != (char)ch)
                ;
        if (*string == (char)ch)                /* char found ? */
                return( (char *)string );
        return(NULL);
}
/***
*strstr.c - search for one string inside another
*
*       Copyright (c) Microsoft Corporation. All rights reserved.
*
*Purpose:
*       defines strstr() - search for one string inside another
*
*******************************************************************************/
#include <cruntime.h>
#include <string.h>
/***
*char *strstr(string1, string2) - search for string2 in string1
*
*Purpose:
*       finds the first occurrence of string2 in string1
*
*Entry:
*       char *string1 - string to search in
*       char *string2 - string to search for
*
*Exit:
*       returns a pointer to the first occurrence of string2 in
*       string1, or NULL if string2 does not occur in string1
*
*Uses:
*
*Exceptions:
*
*******************************************************************************/
char * __cdecl strstr (
        const char * str1,
        const char * str2
        )
{
        char *cp = (char *) str1;
        char *s1, *s2;
        if ( !*str2 )
            return((char *)str1);
        while (*cp)
        {
                s1 = cp;
                s2 = (char *) str2;
                while ( *s1 && *s2 && !(*s1-*s2) )
                        s1++, s2++;
                if (!*s2)
                        return(cp);
                cp++;
        }
        return(NULL);
}
            OUCH¹: the
            strrchr()
            function needlessly traverses its input string
            twice!
         OUCH²: the
            strstr()
            function has quadratic, i.e.
            𝒪(n²), runtime – a real
            shame!
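The quadratic behaviour noted above can be made visible by counting character comparisons. The following sketch re-implements the search loop of the strstr() shown above in portable C and adds a counter. The names naive_strstr(), worst_case() and comparisons are made up for this illustration and are not part of the Microsoft source. It assumes a worst-case input: a haystack of n characters 'a' and a needle of n/2 characters 'a' followed by a single 'b'.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Number of character comparisons made by the last search
   (hypothetical counter, added for this illustration only). */
size_t comparisons;

/* Same search strategy as the strstr() shown above. */
char *naive_strstr(const char *str1, const char *str2)
{
    const char *cp = str1;

    if (!*str2)
        return (char *) str1;

    while (*cp) {
        const char *s1 = cp;
        const char *s2 = str2;

        /* the comparison restarts at every position of the haystack */
        while (*s1 && *s2 && (++comparisons, *s1 == *s2))
            s1++, s2++;

        if (!*s2)
            return (char *) cp;
        cp++;
    }
    return NULL;
}

/* Worst case: haystack = n times 'a', needle = n/2 times 'a' plus one 'b'.
   Returns the number of comparisons needed to find that there is no match. */
size_t worst_case(size_t n)
{
    char *haystack = malloc(n + 1);
    char *needle = malloc(n / 2 + 2);
    char *match;

    assert(haystack && needle);
    memset(haystack, 'a', n);
    haystack[n] = '\0';
    memset(needle, 'a', n / 2);
    needle[n / 2] = 'b';
    needle[n / 2 + 1] = '\0';

    comparisons = 0;
    match = naive_strstr(haystack, needle);
    assert(match == NULL);

    free(haystack);
    free(needle);
    return comparisons;
}
```

Calling worst_case(1000) and worst_case(2000) shows the comparison count roughly quadrupling when n doubles, which is the signature of quadratic runtime.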
        
            The following implementation of the
            strrchr()
            function for processors which support the
            Streaming SIMD Extensions 4.2
            alias Nehalem New Instructions, introduced
            November 11, 2008
            with the Core™ i* line of processors,
            needs only 14 instructions in just 42 bytes:
        ; Copyright © 2009-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.xmm
	.model	flat, C
	.code
strrchr	proc	public		; char *strrchr(unsigned char const *string, int character)
	xor	eax, eax	; eax = 0
	mov	edx, [esp+4]	; edx = address of string
	and	edx, -16	; edx = aligned address before string
	movzx	ecx, byte ptr [esp+8]
	movd	xmm0, ecx	; xmm0 = prototype string "‹character›"
@@:
	pcmpistri xmm0, [edx], 40h
				; CF = ('‹character›' in chunk of characters),
				; ZF = ('\0' in chunk of characters),
				; ecx = ('\0' or '‹character›' in chunk of characters)
				;     ? index of '\0' or last matching '‹character›' : 16
	lea	ecx, [ecx+edx]
	cmovc	eax, ecx	; eax = address of last matching '‹character›'
	lea	edx, [edx+16]
	jnz	short @b	; no '\0' in chunk of characters?
	xor	ecx, ecx	; ecx = 0
	cmp	eax, [esp+4]
	cmovb	eax, ecx	; eax = (address of '‹character›' < address of string) ? 0
	ret
strrchr	endp
	end
str*() Standard Functions
            Implementations of the
            strcat(),
            strchr(),
            strcmp(),
            strcoll(),
            strcpy(),
            strcspn(),
            strlen(),
            strncat(),
            strncmp(),
            strncpy(),
            strnlen(),
            strnset(),
            strpbrk(),
            strrchr(),
            strrev(),
            strset(),
            _strset(),
            strspn(),
            strstr(),
            strtok(),
            strtok_s()
            and strtok_r() functions for the i386 and
            the AMD64 platform follow with build instructions.
         Note: only
            strcat(),
            strcmp(),
            strcpy(),
            strlen()
            and
            strset()
            are available as
            intrinsic
            functions.
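Several of the implementations below (strspn(), strcspn(), strpbrk() and strtok_r()) test set membership with a 256-bit bitmap, one bit per possible unsigned character value, instead of scanning the delimiter string for every character. The following is a minimal, portable sketch of that idea. The names charset_t, charset_add(), charset_has() and span() are hypothetical and do not appear in the listings; 8-bit characters are assumed.

```c
#include <stddef.h>

#define BITS_PER_WORD (8 * sizeof(size_t))

/* 256 bits, one per possible unsigned char value (hypothetical type) */
typedef struct {
    size_t word[256 / BITS_PER_WORD];
} charset_t;

/* set the bit that belongs to character c */
void charset_add(charset_t *set, unsigned char c)
{
    set->word[c / BITS_PER_WORD] |= (size_t) 1 << (c % BITS_PER_WORD);
}

/* yield 1 if the bit that belongs to character c is set, else 0 */
int charset_has(const charset_t *set, unsigned char c)
{
    return (int) ((set->word[c / BITS_PER_WORD] >> (c % BITS_PER_WORD)) & 1);
}

/* Like strspn(): number of leading characters of 'string'
   which are all members of 'accept'. */
size_t span(const char *string, const char *accept)
{
    charset_t set = {{0}};
    const unsigned char *p;
    size_t length = 0;

    /* one pass over 'accept' builds the bitmap */
    for (p = (const unsigned char *) accept; *p; p++)
        charset_add(&set, *p);

    /* '\0' is never a member, so the scan stops at the terminator */
    for (p = (const unsigned char *) string; charset_has(&set, *p); p++)
        length++;

    return length;
}
```

Building the set costs one pass over the delimiter string and each lookup is a constant-time shift and mask, so the total work is proportional to the sum of both string lengths rather than their product.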
        
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define NULL	((void *) 0)
#ifndef _WIN64
typedef	unsigned int	size_t;
#endif
#pragma function(strcat, strcmp, strcpy, strlen, strset)
#pragma intrinsic(memcmp)
void	*memchr(void const *memory, int character, size_t count);
int	memcmp(void const *source, void const *destination, size_t count);
size_t	strlen(unsigned char const *string);
char	*strstr(unsigned char const *haystack, unsigned char const *needle)
{
#if 0
	if (*needle == '\0')	// needle is an empty string?
		return (char *) haystack;
	if (*haystack == '\0')	// haystack is an empty string?
		return NULL;
	return (char *) memmem(haystack, strlen(haystack), needle, strlen(needle));
#else
	unsigned char const *string;
	size_t length = strlen(needle);
	size_t count = strlen(haystack);
	if (!length)		// needle is an empty string?
		return (char *) haystack;
	if (!count || length > count)
		return NULL;
	if (!--length)		// needle is a single character?
		return memchr(haystack, *needle, count);
	count -= length;	// maximum number of characters to scan in haystack
	while (string = haystack,
	       haystack = (unsigned char const *) memchr(haystack, *needle, count),
	       haystack)	// *haystack is first character of needle; compare
	{			//  last character of needle first, then proceed
		if (haystack[length] == needle[length]
#if 0
		 && (length == 1 || !memcmp(haystack + 1, needle + 1, length - 1)))
#else
		 && !memcmp(haystack, needle, length))
#endif
			return (char *) haystack;
				// skip character in haystack,
				//  adjust number of characters left in haystack
		count -= ++haystack - string;
		if (!count)
			break;
	}
	return NULL;
#endif
}
char	*strrchr(unsigned char const *string, int character)
{
	char *pointer = NULL;
	do
		if (*string == (unsigned char) character)
			pointer = (char *) string;
	while (*string++);
	return pointer;
}
char	*strchr(unsigned char const *string, int character)
{
	do
		if (*string == (unsigned char) character)
			return (char *) string;
	while (*string++);
	return NULL;
}
char	*strcat(unsigned char *destination, unsigned char const *source)
{
	char *string = (char *) destination;
#if 0
	destination += strlen(destination);
#else
	while (*destination)
		destination++;
#endif
	while (*source)
		*destination++ = *source++;
	*destination = '\0';	// terminate concatenated string
	return string;
}
char	*strncat(unsigned char *destination, unsigned char const *source, size_t count)
{
	char *string = (char *) destination;
#if 0
	destination += strlen(destination);
#else
	while (*destination)
		destination++;
#endif
	while (count && *source)
		*destination++ = *source++, --count;
	*destination = '\0';
	return string;
}
char	*strcpy(unsigned char *destination, unsigned char const *source)
{
	char *string = (char *) destination;
	while (*source)
		*destination++ = *source++;
	*destination = '\0';	// terminate copied string
	return string;
}
char	*strncpy(unsigned char *destination, unsigned char const *source, size_t count)
{
	char *string = (char *) destination;
	while (count && *source)
		*destination++ = *source++, --count;
	while (count)
		*destination++ = '\0', --count;
	return string;
}
int	strcmp(unsigned char const *source, unsigned char const *destination)
{
	if (source != destination)
		do
			if (*source - *destination)
#if 0
				return *source - *destination;
#else
				return (*source > *destination) - (*source < *destination);
#endif
		while (destination++, *source++);
	return 0;
}
int	strncmp(unsigned char const *source, unsigned char const *destination, size_t count)
{
	if (count && source != destination)
		do
			if (*source - *destination)
#if 0
				return *source - *destination;
#else
				return (*source > *destination) - (*source < *destination);
#endif
		while (source++, *destination++ && --count);
	return 0;
}
size_t	strlen(unsigned char const *string)
{
#if 0
	unsigned char *source = string;
	while (*source)
		source++;
	return source - string;
#else
	return (unsigned char *) memchr(string, '\0', ~(size_t) 0) - string;
#endif
}
size_t	strnlen(unsigned char const *string, size_t count)
{
	unsigned char *nul = memchr(string, '\0', count);
	return nul ? nul - string : count;
}
__declspec(safebuffers)
size_t	strspn(unsigned char const *string, unsigned char const *delimiter)
{
	// yield number of leading characters in array 'string'
	//  which are equal to any character in array 'delimiter'
	size_t bitmap[256 / (8 * sizeof(size_t))] = {0};
	if (!*delimiter)
		return 0;
	if (!*string)
		return 0;
	do			// build bitmap
		bitmap[*delimiter / (8 * sizeof(size_t))] |= (size_t) 1 << *delimiter % (8 * sizeof(size_t));
	while (*++delimiter);
	delimiter = string;
	while (bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t))))
		string++;
	return string - delimiter;
}
__declspec(safebuffers)
size_t	strcspn(unsigned char const *string, unsigned char const *delimiter)
{
	// yield number of leading characters in array 'string'
	//  which differ from each character in array 'delimiter'
	size_t bitmap[256 / (8 * sizeof(size_t))] = {1};
	if (!*delimiter)
		return strlen(string);
	if (!*string)
		return 0;
	do			// build bitmap
		bitmap[*delimiter / (8 * sizeof(size_t))] |= (size_t) 1 << *delimiter % (8 * sizeof(size_t));
	while (*++delimiter);
	delimiter = string;
	while (!(bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t)))))
		string++;
	return string - delimiter;
}
__declspec(safebuffers)
char	*strpbrk(unsigned char const *string, unsigned char const *delimiter)
{
	// yield pointer to first character in array 'string'
	//  which is equal to any character in array 'delimiter'
#if 0
	string += strcspn(string, delimiter);
	return *string ? (char *) string : NULL;
#else
	size_t bitmap[256 / (8 * sizeof(size_t))] = {0};
	if (!*delimiter)
		return NULL;
	if (!*string)
		return NULL;
	do			// build bitmap
		bitmap[*delimiter / (8 * sizeof(size_t))] |= (size_t) 1 << *delimiter % (8 * sizeof(size_t));
	while (*++delimiter);
	do
		if (bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t))))
			return (char *) string;
	while (*++string);
	return NULL;
#endif
}
char	*strset(char *string, int character)
{
	char *destination = string;
	while (*destination)
		*destination++ = (char) character;
	return string;
}
__declspec(safebuffers)
char	*strtok_r(unsigned char *string, unsigned char const *delimiter, char **next)
{
#if 0
	if (!string)
		string = (unsigned char *) *next;
	if (!string || !*string)
		return *next = NULL;
				// skip leading delimiters
	string += strspn(string, delimiter);
	if (!*string)		// no characters left?
		return *next = NULL;
				// skip token, i.e. non-delimiters,
				//  and save its address
	*next = (char *) string + strcspn(string, delimiter);
	if (!**next)		// no characters left?
		*next = NULL;
	else			// terminate token
		*(*next)++ = '\0';
	return (char *) string;
#else
	size_t bitmap[256 / (8 * sizeof(size_t))] = {0};
	if (!string)
		string = (unsigned char *) *next;
	if (!string || !*string)
		return *next = NULL;
	if (!*delimiter)
		return *next = NULL, (char *) string;
	do			// build bitmap
		bitmap[*delimiter / (8 * sizeof(size_t))] |= (size_t) 1 << *delimiter % (8 * sizeof(size_t));
	while (*++delimiter);
				// skip leading delimiters
	while (bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t))))
		string++;
	if (!*string)		// no characters left?
		return *next = NULL;
	delimiter = string;	// save (address of) token
	*bitmap |= 1;		// add '\0' as delimiter
	do			// skip token, i.e. non-delimiters
		string++;
	while (!(bitmap[*string / (8 * sizeof(size_t))] & ((size_t) 1 << *string % (8 * sizeof(size_t)))));
	if (!*string)		// no characters left?
		string = NULL;
	else			// terminate token
		*string++ = '\0';
	*next = (char *) string;// save (address of) next character
	return (char *) delimiter;
#endif
}
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#ifndef _WIN64
typedef	unsigned int	size_t;
#endif
void	*memcpy(void *destination, void const *source, size_t count);
size_t	strlen(unsigned char const *string);
char	*strncpy(unsigned char *destination, unsigned char const *source, size_t count);
char	*stpcat(unsigned char *destination, unsigned char const *source)
{
	// returns pointer to terminating '\0'
#if 0
	return stpcpy(destination + strlen(destination), source);
#else
	while (*destination)
		destination++;
	while (*destination = *source)
		destination++, source++;
	return destination;
#endif
}
char	*stpncat(unsigned char *destination, unsigned char const *source, size_t count)
{
	// returns pointer to terminating '\0'
#if 0
	destination += strlen(destination);
#else
	while (*destination)
		destination++;
#endif
	while (count && *source)
		*destination++ = *source++, --count;
	*destination = '\0';
	return destination;
}
char	*stpcpy(unsigned char *destination, unsigned char const *source)
{
	// returns pointer to terminating '\0'
#if 0
	size_t length = strlen(source);
	return (char *) memcpy(destination, source, length + 1) + length;
#else
	while (*destination = *source)
		destination++, source++;
	return destination;
#endif
}
char	*stpncpy(unsigned char *destination, unsigned char const *source, size_t count)
{
	// returns pointer to terminating '\0' or
	//  past last character copied to destination
#if 0
	size_t length = strlen(source);
	if (length > count)
		length = count;
	return strncpy(destination, source, count) + length;
#else
	char *pointer;
	while (count && (*destination = *source))
		destination++, source++, --count;
	pointer = (char *) destination;
	while (count)
		*destination++ = '\0', --count;
	return pointer;
#endif
}
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: characters are unsigned!
	.386
	.model	flat, C
	.code
stpcpy	proc	public		; char *stpcpy(char *destination, char const *source)
	mov	edx, edi
	mov	edi, [esp+8]	; edi = address of source string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of source string (including '\0')
	sub	edi, ecx	; edi = address of source string
	mov	eax, esi
	mov	esi, edi	; esi = address of source string
	mov	edi, [esp+4]	; edi = address of destination string
	rep	movsb
	dec	edi		; edi = address of '\0'
	mov	esi, eax
	mov	eax, edi	; eax = address of '\0'
	mov	edi, edx
	ret
stpcpy	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: characters are unsigned!
	.386
	.model	flat, C
	.code
strcat	proc	public		; char *strcat(char *destination, char const *source)
	mov	edx, edi
	mov	edi, [esp+8]	; edi = address of source string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of source string (including '\0')
	push	ecx
	mov	edi, [esp+8]	; edi = address of destination string
;;	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	dec	edi		; edi = address of '\0'
				;     = end of destination string
	mov	eax, esi
	mov	esi, [esp+12]	; esi = address of source string
	pop	ecx		; ecx = length of source string (including '\0')
	rep	movsb
	mov	edi, edx
	mov	esi, eax
	mov	eax, [esp+4]	; eax = address of destination string
	ret
strcat	endp
strchr	proc	public		; char *strchr(char const *string, int character)
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of string (including '\0')
	sub	edi, ecx	; edi = address of string
	mov	eax, [esp+8]	; eax = character
	repne	scasb
	dec	edi		; edi = address of character
	neg	ecx		; CF = (ecx <> 0)
				;    = ([edi] = character)
	sbb	eax, eax	; eax = (ecx = 0) ? 0 : -1
	and	eax, edi	; eax = (ecx = 0) ? 0 : address of character
	mov	edi, edx
	ret
strchr	endp
strcmp	proc	public		; int strcmp(char const *source, char const *destination)
	push	esi
	push	edi
	xor	eax, eax	; eax = 0
	mov	esi, [esp+12]	; esi = address of source string
	mov	edi, [esp+16]	; edi = address of destination string
	cmp	edi, esi
	je	short equal	; address of destination string = address of source string?
;;	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of destination string (including '\0')
	sub	edi, ecx	; edi = address of destination string
;;	xor	eax, eax	; eax = 0
	repe	cmpsb
	seta	al		; eax = (*source > *destination)
	sbb	eax, 0		; eax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
equal:
	pop	edi
	pop	esi
	ret
strcmp	endp
; NOTE: strcoll() is a second implementation of strcmp()!
strcoll	proc	public		; int strcoll(char const *source, char const *destination)
	mov	ecx, [esp+4]	; ecx = address of source string
	mov	edx, [esp+8]	; edx = address of destination string
	sub	edx, ecx
	jz	short equal	; address of destination string = address of source string?
compare:
	mov	al, [ecx]
	cmp	al, [ecx+edx]
	jne	short different
	inc	ecx
	test	al, al
	jnz	short compare	; *source <> '\0'?
equal:
	xor	eax, eax	; eax = 0
	ret
different:
	sbb	eax, eax	; eax = (*source < *destination) ? -1 : 0
	or	eax, 1		; eax = (*source < *destination) ? -1 : 1
				;     = (*source > *destination)
				;     - (*source < *destination)
	ret
strcoll	endp
strcpy	proc	public		; char *strcpy(char *destination, char const *source)
	mov	edx, edi
	mov	edi, [esp+8]	; edi = address of source string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of source string (including '\0')
	sub	edi, ecx	; edi = address of source string
	mov	eax, esi
	mov	esi, edi	; esi = address of source string
	mov	edi, [esp+4]	; edi = address of destination string
	rep	movsb
	mov	edi, edx
	mov	esi, eax
	mov	eax, [esp+4]	; eax = address of destination string
	ret
strcpy	endp
strcspn	proc	public		; size_t strcspn(char const *string, char const *delimiter)
	mov	eax, [esp+4]	; eax = address of string
	mov	edx, [esp+8]	; edx = address of delimiter
	xor	ecx, ecx	; ecx = 0
	cmp	cl, [edx]
	je	short empty	; delimiter[0] = '\0'?
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx		; bitmap[0..255] = 0,
				; esp = address of bitmap
setup:
	bts	[esp], ecx	; bitmap[ecx] = 1
	mov	cl, [edx]	; ecx = *delimiter
	inc	edx		; edx = ++delimiter
	cmp	cl, ch
	jne	short setup	; ecx <> '\0'?
	mov	edx, eax	; edx = address of string
skip:
	mov	cl, [eax]	; ecx = *string
	inc	eax		; eax = ++string
	bt	[esp], ecx
	jnc	short skip	; bitmap[ecx] = 0 (no match)?
stop:
	sbb	eax, edx	; eax = number of non-matching characters
	add	esp, 32
	ret
empty:
	mov	edx, eax	; edx = address of string
count:
	inc	eax		; eax = ++string
	cmp	cl, [eax-1]
	jne	short count
if 1
	dec	eax		; eax = --string
	sub	eax, edx
else
	stc
	sbb	eax, edx	; eax = number of characters
endif
	ret
strcspn	endp
strlen	proc	public		; size_t strlen(char const *string)
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb		; ecx = -1 - (address of '\0' + 1 - address of string)
				;     = -1 - (length of string + 1)
				;     = -2 - length of string
if 0
	mov	eax, -2
	sub	eax, ecx	; eax = -2 + 2 + length of string
				;     = length of string
else
	mov	eax, ecx	; eax = -1 - (length of string + 1)
	not	eax		; eax = length of string + 1
	dec	eax		; eax = length of string
endif
	mov	edi, edx
	ret
strlen	endp
strncat	proc	public		; char *strncat(char *destination, char const *source, size_t count)
	push	esi
	push	edi
	mov	esi, [esp+16]	; esi = address of source string
	mov	edx, [esp+20]	; edx = count
	mov	edi, esi	; edi = address of source string
	mov	ecx, edx	; ecx = count
	xor	eax, eax	; eax = '\0'
	repne	scasb
	sub	edx, ecx	; edx = length of source string (including '\0')
	mov	edi, [esp+12]	; edi = address of destination string
;;	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	dec	edi		; edi = address of '\0'
				;     = end of destination string
	mov	ecx, edx	; ecx = length of source string (including '\0')
	rep	movsb
;;	xor	eax, eax	; eax = '\0'
	stosb
	mov	eax, [esp+12]	; eax = address of destination string
	pop	edi
	pop	esi
	ret
strncat	endp
strncmp	proc	public		; int strncmp(char const *source, char const *destination, size_t count)
	push	esi
	push	edi
	xor	eax, eax	; eax = 0
	mov	esi, [esp+12]	; esi = address of source string
	mov	edi, [esp+16]	; edi = address of destination string
	cmp	edi, esi
	je	short equal	; address of destination string = address of source string?
	mov	ecx, [esp+20]	; ecx = count
	test	ecx, ecx
	jz	short equal	; count = 0?
;;	xor	eax, eax	; eax = 0,
;;				; CF = 0,
;;				; ZF = 1 (required when count is 0)
	repe	cmpsb
	seta	al		; eax = (*source > *destination)
	sbb	eax, 0		; eax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
equal:
	pop	edi
	pop	esi
	ret
strncmp	endp
strncpy	proc	public		; char *strncpy(char *destination, char const *source, size_t count)
	push	esi
	push	edi
	mov	esi, [esp+16]	; esi = address of source string
	mov	edx, [esp+20]	; edx = count
	mov	edi, esi	; edi = address of source string
	mov	ecx, edx	; ecx = count
	xor	eax, eax	; eax = '\0'
	repne	scasb
	sub	ecx, edx
	neg	ecx		; ecx = length of source string (including '\0')
	sub	edx, ecx	; edx = count - length of source string (including '\0')
	mov	edi, [esp+12]	; edi = address of destination string
	rep	movsb
	mov	ecx, edx	; ecx = count - length of source string (including '\0')
;;	xor	eax, eax	; eax = '\0'
	rep	stosb
	mov	eax, [esp+12]	; eax = address of destination string
	pop	edi
	pop	esi
	ret
strncpy	endp
strnlen	proc	public		; size_t strnlen(char const *string, size_t count)
	mov	ecx, [esp+8]	; ecx = count
	test	ecx, ecx
	jz	short empty	; count = 0?
	xor	eax, eax	; eax = '\0'
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
;;	test	edi, edi	; ZF = 0 (required when count is 0)
	repne	scasb		; ecx = (length of string < count)
				;     ? count - (length of string + 1) : 0
	neg	ecx		; CF = (ecx <> 0)
				;    = ([edi-1] = '\0')
				;    = (length of string < count),
				; ecx = (length of string < count)
				;     ? length of string + 1 - count : 0
	sbb	ecx, eax	; ecx = (length of string < count)
				;     ? length of string - count : 0
	add	ecx, [esp+8]	; ecx = (length of string < count)
				;     ? length of string : count
	mov	edi, edx
empty:
	mov	eax, ecx	; eax = (length of string < count)
				;     ? length of string : count
	ret
strnlen	endp
strnset	proc	public		; char *strnset(char *string, int character, size_t count)
	mov	edx, [esp+4]	; edx = address of string
	mov	ecx, [esp+12]	; ecx = count
	test	ecx, ecx
	jz	short zero	; count = 0?
	xor	eax, eax	; eax = '\0'
	push	edi
	mov	edi, edx	; edi = address of string
;;	test	edi, edi	; ZF = 0 (required when count is 0)
	repne	scasb		; ecx = (length of string < count)
				;     ? count - (length of string + 1) : 0
	neg	ecx		; CF = (ecx <> 0)
				;    = ([edi-1] = '\0')
				;    = (length of string < count),
				; ecx = (length of string < count)
				;     ? length of string + 1 - count : 0
	sbb	ecx, eax	; ecx = (length of string < count)
				;     ? length of string - count : 0
	add	ecx, [esp+16]	; ecx = (length of string < count)
				;     ? length of string : count
	mov	eax, [esp+12]	; eax = character
	mov	edi, edx	; edi = address of string
	rep	stosb
	pop	edi
zero:
	mov	eax, edx	; eax = address of string
	ret
strnset	endp
strpbrk	proc	public		; char *strpbrk(char const *string, char const *delimiter)
	mov	eax, [esp+4]	; eax = address of string
	mov	edx, [esp+8]	; edx = address of delimiter
	xor	ecx, ecx	; ecx = 0
	cmp	cl, [edx]
	je	short empty	; delimiter[0] = '\0'?
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx		; bitmap[0..255] = 0,
				; esp = address of bitmap
setup:
	bts	[esp], ecx	; bitmap[ecx] = 1
	mov	cl, [edx]	; ecx = *delimiter
	inc	edx		; edx = ++delimiter
	cmp	cl, ch
	jne	short setup	; ecx <> '\0'?
skip:
	mov	cl, [eax]	; ecx = *string
	inc	eax		; eax = ++string
	bt	[esp], ecx
	jnc	short skip	; bitmap[ecx] = 0 (no match)?
stop:
	dec	eax		; eax = --string
	neg	ecx
	sbb	ecx, ecx	; ecx = (*string = '\0') ? 0 : -1
	and	eax, ecx	; eax = (*string = '\0') ? 0 : address of string
	add	esp, 32
	ret
empty:
	xor	eax, eax
	ret
strpbrk	endp
strrchr	proc	public		; char *strrchr(char const *string, int character)
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of string (including '\0')
	dec	edi		; edi = address of '\0'
				;     = end of string
	mov	eax, [esp+8]	; eax = character
	std
	repne	scasb
	cld
	inc	edi		; edi = address of character
	neg	ecx		; CF = (ecx <> 0)
				;    = ([edi] = character)
	sbb	eax, eax	; eax = (ecx = 0) ? 0 : -1
	and	eax, edi	; eax = (ecx = 0) ? 0 : address of character
	mov	edi, edx
	ret
strrchr	endp
strrev	proc	public		; char *strrev(char *string)
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	add	ecx, edi	; ecx = address of string - 1
	dec	edi		; edi = address of '\0'
				;     = end of string
	jmp	short continue
reverse:
	mov	al, [ecx]
	mov	ah, [edi]
	mov	[ecx], ah
	mov	[edi], al
continue:
	inc	ecx
	dec	edi
	cmp	edi, ecx
	ja	short reverse
	mov	edi, edx
	mov	eax, [esp+4]	; eax = address of string
	ret
strrev	endp
strset	proc	public		; char *strset(char *string, int character)
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of string (including '\0')
	sub	edi, ecx	; edi = address of string
	dec	ecx		; ecx = length of string
	mov	eax, [esp+8]	; eax = character
	rep	stosb
	mov	edi, edx
	mov	eax, [esp+4]	; eax = address of string
	ret
strset	endp
strspn	proc	public		; size_t strspn(char const *string, char const *delimiter)
	mov	eax, [esp+4]	; eax = address of string
	mov	edx, [esp+8]	; edx = address of delimiter
	xor	ecx, ecx	; ecx = 0
	cmp	cl, [edx]
	je	short empty	; delimiter[0] = '\0'?
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx
	push	ecx		; bitmap[0..255] = 0,
				; esp = address of bitmap
	mov	cl, [edx]	; ecx = *delimiter
	inc	edx		; edx = ++delimiter
setup:
	bts	[esp], ecx	; bitmap[ecx] = 1
	mov	cl, [edx]	; ecx = *delimiter
	inc	edx		; edx = ++delimiter
	cmp	cl, ch
	jne	short setup	; ecx <> '\0'?
	mov	edx, eax	; edx = address of string
skip:
	mov	cl, [eax]	; ecx = *string
	inc	eax		; eax = ++string
	bt	[esp], ecx
	jc	short skip	; bitmap[ecx] = 1 (match)?
if 1
	dec	eax		; eax = --string
	sub	eax, edx
else
	stc
	sbb	eax, edx	; eax = number of matching characters
endif
	add	esp, 32
	ret
empty:
	xor	eax, eax
	ret
strspn	endp
strstr	proc	public		; char *strstr(char const *haystack, char const *needle)
	push	edi
	mov	edi, [esp+12]	; edi = address of needle
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of needle (including '\0')
	dec	ecx		; ecx = length of needle
	mov	eax, [esp+8]	; eax = address of haystack
	jz	short empty	; length of needle = 0?
	mov	edx, ecx	; edx = length of needle
ifdef SIMPLE
	push	esi
compare:
	mov	esi, eax	; esi = current address in haystack
	mov	edi, [esp+16]	; edi = address of needle
	mov	ecx, edx	; ecx = length of needle
	repe	cmpsb
	je	short match	; needle in haystack?
	inc	eax		; eax = next address in haystack
	cmp	byte ptr [esi-1], 0
	jne	short compare	; non-matching character in haystack <> '\0'?
	xor	eax, eax
match:
else ; SIMPLE
	mov	edi, eax	; edi = address of haystack
	xor	eax, eax	; eax = '\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasb
	not	ecx		; ecx = length of haystack (including '\0')
	sub	edi, ecx	; edi = address of haystack
	dec	ecx		; ecx = length of haystack
	jz	short empty	; length of haystack = 0?
	cmp	ecx, edx
	jb	short empty	; length of haystack < length of needle?
	push	esi
	push	ebx
search:
	mov	esi, [esp+20]	; esi = address of needle
	mov	al, [esi]	; al = first character of needle
				; edi = current address in haystack
	repne	scasb		; edi = next address in haystack,
				; ecx = current length of haystack
	jne	short break	; (first character of) needle not in haystack?
	lea	ebx, [ecx+1]	; ebx = length of haystack from matching character
	cmp	ebx, edx
	jb	short break	; remaining haystack shorter than needle?
	mov	al, [esi+edx-1]	; al = last character of needle
	cmp	al, [edi+edx-2]
	jne	short continue	; last character of needle not in haystack?
compare:
	mov	eax, edi	; eax = next address in haystack
	mov	ebx, ecx	; ebx = next length of haystack
if 0
	dec	edi		; edi = current address in haystack
				;     = address of matching character
				; esi = address of needle
	mov	ecx, edx	; ecx = length of needle
else
				; edi = next address in haystack
	inc	esi		; esi = address of needle + 1
	mov	ecx, edx
	dec	ecx		; ecx = length of needle - 1,
				; ZF = (ecx = 0)
;;	jz	short match	; needle in haystack?
endif
	repe	cmpsb
	je	short match	; needle in haystack?
	mov	edi, eax	; edi = current address in haystack
	mov	ecx, ebx	; ecx = current length of haystack
continue:
	cmp	ecx, edx
	jae	short search	; length of haystack >= length of needle?
break:
	xor	eax, eax
	pop	ebx
	pop	esi
	pop	edi
	ret
match:
	dec	eax		; eax = address of needle in haystack
	pop	ebx
endif ; SIMPLE
	pop	esi
empty:
	pop	edi
	ret
strstr	endp
strtok_r proc	public		; char *strtok_r(char *string, char const *delimiter, char **next)
	mov	ecx, [esp+4]	; ecx = address of string
	mov	eax, [esp+8]	; eax = address of delimiter
	mov	edx, [esp+12]	; edx = address of address of next
	test	ecx, ecx
	jnz	short start	; address of string <> 0?
	or	ecx, [edx]	; ecx = address of next
	jz	short null	; address of next = 0 = address of string?
start:
	cmp	byte ptr [ecx], 0
	je	short null	; string[0] = '\0'?
	cmp	byte ptr [eax], 0
	je	short empty	; delimiter[0] = '\0'?
	push	ebx
	xor	ebx, ebx	; ebx = 0
	push	ebx
	push	ebx
	push	ebx
	push	ebx
	push	ebx
	push	ebx
	push	ebx
	push	ebx		; bitmap[0..255] = 0,
				; esp = address of bitmap
	mov	bl, [eax]	; ebx = *delimiter
	inc	eax		; eax = ++delimiter
setup:
	bts	[esp], ebx	; bitmap[ebx] = 1
	mov	bl, [eax]	; ebx = *delimiter
	inc	eax		; eax = ++delimiter
	cmp	bl, bh
	jne	short setup	; ebx <> '\0'?
skip:
	mov	bl, [ecx]	; ebx = *string
	inc	ecx		; ecx = ++string
	bt	[esp], ebx
	jc	short skip	; bitmap[ebx] = 1 (ebx is a delimiter)?
	cmp	bl, bh
	je	short none	; ebx = '\0'?
	mov	bl, bh		; ebx = 0
	bts	[esp], ebx	; bitmap['\0'] = 1
	mov	eax, ecx
	dec	eax		; eax = address of token
token:
	mov	bl, [ecx]	; ebx = *string
	inc	ecx		; ecx = ++string
	bt	[esp], ebx
	jnc	short token	; bitmap[ebx] = 0 (ebx is not a delimiter)?
	cmp	bl, bh
	je	short last	; ebx = '\0'?
	mov	[ecx-1], bh	; string[-1] = '\0' (terminate token)
	mov	[edx], ecx	; *next = address of string
	add	esp, 32
	pop	ebx
	ret
none:
	mov	eax, ebx	; eax = 0
last:
	mov	[edx], ebx	; *next = 0
	add	esp, 32
	pop	ebx
	ret
null:
	xor	eax, eax	; eax = 0
	mov	[edx], eax	; *next = 0
	ret
empty:
	mov	eax, ecx	; eax = address of string
	xor	ecx, ecx
	mov	[edx], ecx	; *next = 0
	ret
strtok_r endp
	end
Save the i386 assembler source above as i386-str.asm in the directory where you created the
            object library i386.lib before, then execute the
            following 3 command lines to generate the object file
            i386-str.obj and add it to the existing object library
            i386.lib:
        SET ML=/c /Gy /safeseh /W3 /X
        ML.EXE i386-str.asm
        LINK.EXE /LIB /OUT:i386.lib i386.lib i386-str.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
 Note: if the
            /Gy option
            to package every function in its own, separately linkable
            COMDAT
            section is not available with the version of the macro assembler
            ML.EXE you use,
            split the i386 assembler source into multiple pieces,
            with one function per source file.
        
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler Version 14.16.27023.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: i386-str.asm
Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation. All rights reserved.
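The strspn(), strcspn(), strpbrk() and strtok_r() routines above record every delimiter character in a 256-bit bitmap on the stack (BTS) and then test each character of the string against it (BT). The following C fragment is a minimal sketch of that idea only, not a translation of the assembler source; the name bitmap_strspn is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the library above:
   counts the leading characters of string that occur in delimiter. */
size_t bitmap_strspn(char const *string, char const *delimiter)
{
    uint32_t bitmap[8] = {0};   /* 256 bits, one per unsigned character value */
    unsigned char const *p;
    size_t count = 0;

    assert(string != NULL && delimiter != NULL);

    /* Setup: set the bit of every delimiter character (what BTS does). */
    for (p = (unsigned char const *) delimiter; *p != '\0'; p++)
        bitmap[*p >> 5] |= UINT32_C(1) << (*p & 31);

    /* Skip: test each character of string (what BT does); the bit of '\0'
       is never set, so the loop always stops at the end of string. */
    for (p = (unsigned char const *) string; (bitmap[*p >> 5] >> (*p & 31)) & 1; p++)
        count++;

    return count;
}
```

Building the bitmap costs one pass over the delimiters, after which each character of the string is classified with a single bit test instead of a scan of the delimiter list.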
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: characters are unsigned!
	.code
strcat	proc	public		; char *strcat(char *destination, char const *source)
	mov	r9, rcx		; r9 = address of destination string
	mov	r10, rdi
ifdef VARIANT
	mov	rdi, rcx	; rdi = address of destination string
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	dec	rdi		; rdi = address of '\0'
				;     = end of destination string
	mov	r11, rsi
	mov	rsi, rdi	; rsi = end of destination string
	mov	rdi, rdx	; rdi = address of source string
;;	xor	eax, eax
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of source string (including '\0')
	mov	rdi, rsi	; rdi = end of destination string
	mov	rsi, rdx	; rsi = address of source string
else ; VARIANT
	mov	rdi, rdx	; rdi = address of source string
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of source string (including '\0')
	mov	r11, rsi
	mov	rsi, rdx	; rsi = address of source string
	mov	rdx, rcx
	mov	rdi, r9		; rdi = address of destination string
;;	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	dec	rdi		; rdi = address of '\0'
				;     = end of destination string
	mov	rcx, rdx	; rcx = length of source string (including '\0')
endif ; VARIANT
	rep	movsb
	mov	rax, r9		; rax = address of destination string
	mov	rdi, r10
	mov	rsi, r11
	ret
strcat	endp
strchr	proc	public		; char *strchr(char const *string, int character)
	mov	r10, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of string (including '\0')
	mov	rax, rdx	; rax = character
	sub	rdi, rcx	; rdi = address of string
	repne	scasb
	lea	rax, [rdi-1]	; rax = address of character
	cmovne	rax, rcx	; rax = (rcx = 0) ? 0 : address of character
	mov	rdi, r10
	ret
strchr	endp
strcmp	proc	public		; ssize_t strcmp(char const *source, char const *destination)
	xor	eax, eax	; rax = 0
	cmp	rcx, rdx
	je	short equal	; address of source string = address of destination string?
	mov	r11, rsi
	mov	rsi, rcx	; rsi = address of source string
	mov	r10, rdi
	mov	rdi, rdx	; rdi = address of destination string
;;	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of destination string (including '\0')
	mov	rdi, rdx	; rdi = address of destination string
;;	xor	eax, eax	; rax = 0
	repe	cmpsb
	seta	al		; rax = (*source > *destination)
	sbb	rax, 0		; rax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
	mov	rdi, r10
	mov	rsi, r11
equal:
	ret
strcmp	endp
; NOTE: strcoll() is a second implementation of strcmp()!
strcoll	proc	public		; ssize_t strcoll(char const *source, char const *destination)
	sub	rdx, rcx
	jz	short equal	; address of destination string = address of source string?
compare:
	mov	al, [rcx]
	cmp	al, [rcx+rdx]
	jne	short different
	inc	rcx
	test	al, al
	jnz	short compare	; *source <> '\0'?
equal:
	xor	eax, eax	; rax = 0
	ret
different:
	sbb	rax, rax	; rax = (*source < *destination) ? -1 : 0
	or	rax, 1		; rax = (*source < *destination) ? -1 : 1
	ret
strcoll	endp
strcpy	proc	public		; char *strcpy(char *destination, char const *source)
	mov	r9, rcx		; r9 = address of destination string
	mov	r10, rdi
	mov	rdi, rdx	; rdi = address of source string
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of source string (including '\0')
	mov	rdi, r9		; rdi = address of destination string
	mov	r11, rsi
	mov	rsi, rdx	; rsi = address of source string
	rep	movsb
	mov	rax, r9		; rax = address of destination string
	mov	rdi, r10
	mov	rsi, r11
	ret
strcpy	endp
strcspn	proc	public		; size_t strcspn(char const *string, char const *delimiter)
	xor	eax, eax	; rax = 0
	cmp	al, [rdx]
	je	short empty	; delimiter[0] = '\0'?
	mov	[rsp+32], rax
	mov	[rsp+24], rax
	mov	[rsp+16], rax
	mov	[rsp+8], rax	; bitmap[0..255] = 0
setup:
	bts	[rsp+8], rax	; bitmap[rax] = 1
	mov	al, [rdx]	; rax = *delimiter
	inc	rdx		; rdx = ++delimiter
	cmp	al, ah
	jne	short setup	; rax <> '\0'?
	mov	rdx, rcx	; rdx = address of string
skip:
	mov	al, [rcx]	; rax = *string
	inc	rcx		; rcx = ++string
	bt	[rsp+8], rax
	jnc	short skip	; bitmap[rax] = 0 (no match)?
stop:
	sbb	rcx, rdx	; rcx = number of non-matching characters
	mov	rax, rcx
	ret
empty:
	mov	rdx, rcx	; rdx = address of string
count:
	cmp	al, [rcx]
	lea	rcx, [rcx+1]	; rcx = ++string
	jne	short count	; *string <> '\0'?
	stc
	sbb	rcx, rdx	; rcx = number of characters
	mov	rax, rcx
	ret
strcspn	endp
strlen	proc	public		; size_t strlen(char const *string)
	mov	rdx, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
if 0
	not	rcx		; rcx = length of string (including '\0')
	dec	rcx
	mov	rax, rcx	; rax = length of string
else
	mov	rax, -2
	sub	rax, rcx	; rax = length of string
endif
	mov	rdi, rdx
	ret
strlen	endp
strncat	proc	private		; char *strncat(char *destination, char const *source, size_t count)
	ud2
strncat	endp
strncmp	proc	private		; int strncmp(char const *source, char const *destination, size_t count)
	ud2
strncmp	endp
strncpy	proc	private		; char *strncpy(char *destination, char const *source, size_t count)
	ud2
strncpy	endp
strnlen	proc	private		; size_t strnlen(char const *string, size_t count)
	ud2
strnlen	endp
strnset	proc	private		; char *strnset(char const *string, int character, size_t count)
	ud2
strnset	endp
strpbrk	proc	public		; char *strpbrk(char const *string, char const *delimiter)
	xor	eax, eax	; rax = 0
	cmp	al, [rdx]
	je	short empty	; delimiter[0] = '\0'?
	mov	[rsp+32], rax
	mov	[rsp+24], rax
	mov	[rsp+16], rax
	mov	[rsp+8], rax	; bitmap[0..255] = 0
setup:
	bts	[rsp+8], rax	; bitmap[rax] = 1
	mov	al, [rdx]	; rax = *delimiter
	inc	rdx		; rdx = ++delimiter
	cmp	al, ah
	jne	short setup	; rax <> '\0'?
skip:
	mov	al, [rcx]	; rax = *string
	inc	rcx		; rcx = ++string
	bt	[rsp+8], rax
	jnc	short skip	; bitmap[rax] = 0 (no match)?
stop:
	dec	rcx		; rcx = --string
	neg	eax
	sbb	rax, rax	; rax = (*string = '\0') ? 0 : -1
	and	rax, rcx	; rax = (*string = '\0') ? 0 : address of string
empty:
	ret
strpbrk	endp
strrchr	proc	public		; char *strrchr(char const *string, int character)
	mov	r10, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of string (including '\0')
	mov	rax, rdx	; rax = character
	dec	rdi		; rdi = address of '\0'
				;     = end of string
	std
	repne	scasb
	cld
	lea	rax, [rdi+1]	; rax = address of character
	cmovne	rax, rcx	; rax = (rcx = 0) ? 0 : address of character
	mov	rdi, r10
	ret
strrchr	endp
strrev	proc	private		; char *strrev(char *string)
	ud2
strrev	endp
strset	proc	public		; char *strset(char *string, int character)
	mov	r9, rcx		; r9 = address of string
	mov	r10, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of string (including '\0')
	dec	rcx
	mov	rdi, r9		; rdi = address of string
	mov	rax, rdx	; rax = character
	rep	stosb
	mov	rax, r9		; rax = address of string
	mov	rdi, r10
	ret
strset	endp
strspn	proc	public		; size_t strspn(char const *string, char const *delimiter)
	xor	eax, eax	; rax = 0
	cmp	al, [rdx]
	je	short empty	; delimiter[0] = '\0'?
	mov	[rsp+32], rax
	mov	[rsp+24], rax
	mov	[rsp+16], rax
	mov	[rsp+8], rax	; bitmap[0..255] = 0
	mov	al, [rdx]	; rax = *delimiter
	inc	rdx		; rdx = ++delimiter
setup:
	bts	[rsp+8], rax	; bitmap[rax] = 1
	mov	al, [rdx]	; rax = *delimiter
	inc	rdx		; rdx = ++delimiter
	cmp	al, ah
	jne	short setup	; rax <> '\0'?
	mov	rdx, rcx	; rdx = address of string
skip:
	mov	al, [rcx]	; rax = *string
	inc	rcx		; rcx = ++string
	bt	[rsp+8], rax
	jc	short skip	; bitmap[rax] = 1 (match)?
if 0
	dec	rcx		; rcx = --string
	sub	rcx, rdx
else
	stc
	sbb	rcx, rdx	; rcx = number of matching characters
endif
	mov	rax, rcx
empty:
	ret
strspn	endp
strstr	proc	public		; char *strstr(char const *haystack, char const *needle)
	mov	r8, rcx		; r8 = address of haystack
	mov	r10, rdi
	mov	rdi, rdx	; rdi = address of needle
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of needle (including '\0')
	dec	rcx		; rcx = length of needle
	mov	rax, r8		; rax = address of haystack
	jz	short empty	; length of needle = 0?
	mov	r9, rcx		; r9 = length of needle
	mov	rdi, r8		; rdi = address of haystack
	xor	eax, eax	; rax = '\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasb
	not	rcx		; rcx = length of haystack (including '\0')
	sub	rdi, rcx	; rdi = address of haystack
	dec	rcx		; rcx = length of haystack
	jz	short empty	; length of haystack = 0?
	cmp	rcx, r9
	jb	short empty	; length of haystack < length of needle?
	mov	r11, rsi
search:
	mov	al, [rdx]	; al = first character of needle
				; rdi = current address in haystack
	repne	scasb		; rdi = next address in haystack,
				; rcx = current length of haystack
	jne	short break	; (first character of) needle not in haystack?
	lea	r8, [rcx+1]	; r8 = length of haystack from matching character
	cmp	r8, r9
	jb	short break	; remaining haystack shorter than needle?
	mov	al, [rdx+r9-1]	; al = last character of needle
	cmp	al, [rdi+r9-2]
	jne	short continue	; last character of needle not in haystack?
compare:
	mov	rax, rdi	; rax = next address in haystack
	mov	r8, rcx		; r8 = next length of haystack
if 0
	dec	rdi		; rdi = current address in haystack
				;     = address of matching character
	mov	rsi, rdx	; rsi = address of needle
	mov	rcx, r9		; rcx = length of needle
else
				; rdi = next address in haystack
	mov	rsi, rdx
	inc	rsi		; rsi = address of needle + 1
	mov	rcx, r9
	dec	rcx		; rcx = length of needle - 1,
				; ZF = (rcx = 0)
;;	jz	short match	; needle in haystack?
endif
	repe	cmpsb
	je	short match	; needle in haystack?
	mov	rdi, rax	; rdi = current address in haystack
	mov	rcx, r8		; rcx = current length of haystack
continue:
	cmp	rcx, r9
	jae	short search	; length of haystack >= length of needle?
break:
	xor	eax, eax
	mov	rsi, r11
empty:
	mov	rdi, r10	; rdi is nonvolatile and must be restored on every path
	ret
match:
	dec	rax		; rax = address of needle in haystack
	mov	rdi, r10
	mov	rsi, r11
	ret
strstr	endp
strtok_r proc	public		; char *strtok_r(char *string, char const *delimiter, char **next)
	xor	eax, eax	; rax = 0
	test	rcx, rcx
	jnz	short start	; string <> 0?
	or	rcx, [r8]	; rcx = next
	jz	short null	; address of next = 0 = address of string?
start:
	cmp	al, [rcx]
	je	short null	; string[0] = '\0'?
	cmp	al, [rdx]
	je	short empty	; *delimiter = '\0'?
	mov	[rsp+32], rax
	mov	[rsp+24], rax
	mov	[rsp+16], rax
	mov	[rsp+8], rax	; bitmap[0..255] = 0
	mov	al, [rdx]	; rax = *delimiter
	inc	rdx		; rdx = ++delimiter
setup:
	bts	[rsp+8], rax	; bitmap[rax] = 1
	mov	al, [rdx]	; rax = *delimiter
	inc	rdx		; rdx = ++delimiter
	cmp	al, ah
	jne	short setup	; rax <> '\0'?
skip:
	mov	al, [rcx]	; rax = *string
	inc	rcx		; rcx = ++string
	bt	[rsp+8], rax
	jc	short skip	; bitmap[rax] = 1 (rax is a delimiter)?
	cmp	al, ah
	je	short none	; rax = '\0'?
	mov	al, ah		; rax = 0
	bts	[rsp+8], rax	; bitmap['\0'] = 1
	lea	rdx, [rcx-1]	; rdx = address of token
token:
	mov	al, [rcx]	; rax = *string
	inc	rcx		; rcx = ++string
	bt	[rsp+8], rax
	jnc	short token	; bitmap[rax] = 0 (rax is not a delimiter)?
	cmp	al, ah
	je	short last	; rax = '\0'?
	mov	[rcx-1], ah	; string[-1] = '\0' (terminate token)
	mov	[r8], rcx	; *next = address of string
	mov	rax, rdx	; rax = address of token
	ret
last:
	mov	[r8], rax	; *next = 0
	mov	rax, rdx	; rax = address of token
	ret
empty:
	mov	[r8], rax	; *next = 0
	mov	rax, rcx	; rax = address of token
	ret
null:
none:
	mov	[r8], rax	; *next = 0
	ret
strtok_r endp
	end
Save the AMD64 assembler source above as amd64-str.asm in the directory where you created the
            object library amd64.lib before, then execute the
            following 3 command lines to generate the object file
            amd64-str.obj and add it to the existing object library
            amd64.lib:
        SET ML=/c /Gy /W3 /X
        ML64.EXE amd64-str.asm
        LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-str.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
 Note: if the
            /Gy option
            to package every function in its own, separately linkable
            COMDAT
            section is not available with the version of the macro assembler
            ML64.EXE
            you use, split the AMD64 assembler source into
            multiple pieces, with one function per source file.
        
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler (x64) Version 14.16.27023.1
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: amd64-str.asm
Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation. All rights reserved.
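The strstr() routines above locate the first character of the needle with REPNE SCASB, compare the last character of the needle next, and only then compare the remaining characters with REPE CMPSB. The following C fragment is a minimal sketch of this search strategy under the assumption that memchr() and memcmp() stand in for the string instructions; the name prefilter_strstr is hypothetical and the code is not a translation of the assembler source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the library above:
   finds needle in haystack, rejecting most candidates cheaply. */
char *prefilter_strstr(char const *haystack, char const *needle)
{
    size_t n, h;
    char const *p, *last;

    assert(haystack != NULL && needle != NULL);

    n = strlen(needle);
    if (n == 0)
        return (char *) haystack;       /* empty needle matches at the start */
    h = strlen(haystack);
    if (h < n)
        return NULL;                    /* needle cannot fit */

    last = haystack + (h - n);          /* last position where needle still fits */
    for (p = haystack; p <= last; p++) {
        /* Find the first character of needle (the REPNE SCASB step). */
        p = memchr(p, needle[0], (size_t) (last - p) + 1);
        if (p == NULL)
            return NULL;
        /* Check the last character, then the rest (the REPE CMPSB step). */
        if (p[n - 1] == needle[n - 1]
         && memcmp(p + 1, needle + 1, n - 1) == 0)
            return (char *) p;
    }
    return NULL;
}
```

Limiting the scan to positions where the needle still fits keeps every read inside the haystack, including the early check of the last character.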
wcs*() Standard Functions
Implementations of the wcscat(), wcschr(), wcscmp(), wcscoll(), wcscpy(), wcslen(), wcsncat(), wcsncmp(), wcsncpy(), wcsnlen(), wcsnset(), wcsrchr(), wcsrev(), wcsset() and wcsstr() functions for the i386 and the AMD64 platform follow with build instructions.
Note: only wcscat(), wcscmp(), wcscpy() and wcslen() are available as intrinsic functions.
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: characters are unsigned!
	.386
	.model	flat, C
	.code
wcpcpy	proc	public		; wchar_t *wcpcpy(wchar_t *destination, wchar_t const *source)
	mov	edx, edi
	mov	edi, [esp+8]	; edi = address of source string
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of source string (including L'\0')
	mov	eax, esi
	mov	esi, [esp+8]	; esi = address of source string
	mov	edi, [esp+4]	; edi = address of destination string
	rep	movsw
	dec	edi
	dec	edi		; edi = address of L'\0'
	mov	esi, eax
	mov	eax, edi	; eax = address of L'\0'
	mov	edi, edx
	ret
wcpcpy	endp
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: counts and lengths are numbers of wide characters, not bytes!
	.386
	.model	flat, C
	.code
wcscat	proc	public		; wchar_t *wcscat(wchar_t *destination, wchar_t const *source)
	mov	edx, edi
	mov	edi, [esp+8]	; edi = address of source string
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of source string (including L'\0')
	push	ecx
	mov	edi, [esp+8]	; edi = address of destination string
;;	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	dec	edi
	dec	edi		; edi = address of L'\0'
				;     = end of destination string
	mov	eax, esi
	mov	esi, [esp+12]	; esi = address of source string
	pop	ecx		; ecx = length of source string (including L'\0')
	rep	movsw
	mov	edi, edx
	mov	esi, eax
	mov	eax, [esp+4]	; eax = address of destination string
	ret
wcscat	endp
wcschr	proc	public		; wchar_t *wcschr(wchar_t const *string, wchar_t character)
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of string (including L'\0')
	sub	edi, ecx
	sub	edi, ecx	; edi = address of string
	mov	eax, [esp+8]	; eax = wide character
	repne	scasw		; ZF = (wide character found)
	lea	eax, [edi-2]	; eax = address of wide character
	je	short found	; wide character in string (including L'\0')?
	xor	eax, eax	; eax = 0
found:
	mov	edi, edx
	ret
wcschr	endp
wcscmp	proc	public		; int wcscmp(wchar_t const *source, wchar_t const *destination)
	push	esi
	push	edi
	xor	eax, eax	; eax = 0
	mov	esi, [esp+12]	; esi = address of source string
	mov	edi, [esp+16]	; edi = address of destination string
	cmp	edi, esi
	je	short equal	; address of destination string = address of source string?
;;	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of destination string (including L'\0')
	sub	edi, ecx
	sub	edi, ecx	; edi = address of destination string
;;	xor	eax, eax	; eax = 0
	repe	cmpsw
	seta	al		; eax = (*source > *destination)
	sbb	eax, 0		; eax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
equal:
	pop	edi
	pop	esi
	ret
wcscmp	endp
; NOTE: wcscoll() is a second implementation of wcscmp()!
wcscoll	proc	public		; int wcscoll(wchar_t const *source, wchar_t const *destination)
	mov	ecx, [esp+4]	; ecx = address of source string
	mov	edx, [esp+8]	; edx = address of destination string
	sub	edx, ecx
	jz	short equal	; address of destination string = address of source string?
compare:
	mov	ax, [ecx]
	cmp	ax, [ecx+edx]
	jne	short different
	inc	ecx
	inc	ecx
	test	ax, ax
	jnz	short compare	; *source <> L'\0'?
equal:
	xor	eax, eax	; eax = 0
	ret
different:
	sbb	eax, eax	; eax = (*source < *destination) ? -1 : 0
	or	eax, 1		; eax = (*source < *destination) ? -1 : 1
	ret
wcscoll	endp
wcscpy	proc	public		; wchar_t *wcscpy(wchar_t *destination, wchar_t const *source)
	mov	edx, edi
	mov	edi, [esp+8]	; edi = address of source string
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of source string (including L'\0')
	mov	eax, esi
	mov	esi, [esp+8]	; esi = address of source string
	mov	edi, [esp+4]	; edi = address of destination string
	rep	movsw
	mov	edi, edx
	mov	esi, eax
	mov	eax, [esp+4]	; eax = address of destination string
	ret
wcscpy	endp
wcslen	proc	public		; size_t wcslen(wchar_t const *string)
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw		; ecx = -1 - (address of L'\0' + 2 - address of string)
				;     = -1 - (length of string + 1)
				;     = -2 - length of string
if 0
	mov	eax, -2
	sub	eax, ecx	; eax = -2 + 2 + length of string
				;     = length of string
else
	mov	eax, ecx	; eax = -1 - (length of string + 1)
	not	eax		; eax = length of string + 1
	dec	eax		; eax = length of string
endif
	mov	edi, edx
	ret
wcslen	endp
wcsncat	proc	public		; wchar_t *wcsncat(wchar_t *destination, wchar_t const *source, size_t count)
	push	esi
	push	edi
	mov	esi, [esp+16]	; esi = address of source string
	mov	edx, [esp+20]	; edx = count
	mov	edi, esi	; edi = address of source string
	mov	ecx, edx	; ecx = count
	xor	eax, eax	; eax = L'\0'
	jecxz	short copy	; count = 0?
	repne	scasw
	jne	short copy	; no L'\0' within count wide characters?
	inc	ecx		; exclude L'\0' from the length
copy:
	sub	edx, ecx	; edx = min(length of source string, count)
	mov	edi, [esp+12]	; edi = address of destination string
;;	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	dec	edi
	dec	edi		; edi = address of L'\0'
				;     = end of destination string
	mov	ecx, edx	; ecx = min(length of source string, count)
	rep	movsw
;;	xor	eax, eax	; eax = L'\0'
	stosw
	mov	eax, [esp+12]	; eax = address of destination string
	pop	edi
	pop	esi
	ret
wcsncat	endp
wcsncmp	proc	public		; int wcsncmp(wchar_t const *source, wchar_t const *destination, size_t count)
	push	esi
	push	edi
	xor	eax, eax	; eax = 0
	mov	esi, [esp+12]	; esi = address of source string
	mov	edi, [esp+16]	; edi = address of destination string
	cmp	edi, esi
	je	short equal	; address of destination string = address of source string?
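	mov	ecx, [esp+20]	; ecx = count
	test	ecx, ecx
	jz	short equal	; count = 0?
	mov	edx, ecx	; edx = count
;;	xor	eax, eax	; eax = L'\0'
	repne	scasw		; scan destination string for L'\0'
	sub	edx, ecx	; edx = min(length of destination string + 1, count)
	mov	ecx, edx
	mov	edi, [esp+16]	; edi = address of destination string
	repe	cmpsw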
	seta	al		; eax = (*source > *destination)
	sbb	eax, 0		; eax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
equal:
	pop	edi
	pop	esi
	ret
wcsncmp	endp
wcsncpy	proc	public		; wchar_t *wcsncpy(wchar_t *destination, wchar_t const *source, size_t count)
	push	esi
	push	edi
	mov	esi, [esp+16]	; esi = address of source string
	mov	edx, [esp+20]	; edx = count
	mov	edi, esi	; edi = address of source string
	mov	ecx, edx	; ecx = count
	xor	eax, eax	; eax = L'\0'
	repne	scasw
	sub	ecx, edx
	neg	ecx		; ecx = length of source string (including L'\0')
	sub	edx, ecx	; edx = count - length of source string (including L'\0')
	mov	edi, [esp+12]	; edi = address of destination string
	rep	movsw
	mov	ecx, edx	; ecx = count - length of source string (including L'\0')
;;	xor	eax, eax	; eax = L'\0'
	rep	stosw
	mov	eax, [esp+12]	; eax = address of destination string
	pop	edi
	pop	esi
	ret
wcsncpy	endp
wcsnlen	proc	public		; size_t wcsnlen(wchar_t const *string, size_t count)
	mov	ecx, [esp+8]	; ecx = count
	test	ecx, ecx
	jz	short empty	; count = 0?
	xor	eax, eax	; eax = L'\0'
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
;;	test	edi, edi	; ZF = 0 (required when count is 0)
	repne	scasw		; ZF = (L'\0' found within count wide characters),
				; ecx = count - number of wide characters scanned
	sete	al		; eax = (L'\0' found) ? 1 : 0
	add	ecx, eax	; ecx = (length of string < count)
				;     ? count - length of string : 0
	neg	ecx
	add	ecx, [esp+8]	; ecx = (length of string < count)
				;     ? length of string : count
	mov	edi, edx
empty:
	mov	eax, ecx	; eax = (length of string < count)
				;     ? length of string : count
	ret
wcsnlen	endp
wcsnset	proc	public		; wchar_t *wcsnset(wchar_t *string, wchar_t character, size_t count)
	mov	edx, [esp+4]	; edx = address of string
	mov	ecx, [esp+12]	; ecx = count
	test	ecx, ecx
	jz	short zero	; count = 0?
	xor	eax, eax	; eax = L'\0'
	push	edi
	mov	edi, edx	; edi = address of string
;;	test	edi, edi	; ZF = 0 (required when count is 0)
	repne	scasw		; ZF = (L'\0' found within count wide characters),
				; ecx = count - number of wide characters scanned
	sete	al		; eax = (L'\0' found) ? 1 : 0
	add	ecx, eax	; ecx = (length of string < count)
				;     ? count - length of string : 0
	neg	ecx
	add	ecx, [esp+16]	; ecx = (length of string < count)
				;     ? length of string : count
	mov	eax, [esp+12]	; eax = wide character
	mov	edi, edx	; edi = address of string
	rep	stosw
	pop	edi
zero:
	mov	eax, edx	; eax = address of string
	ret
wcsnset	endp
wcsrchr	proc	public		; wchar_t *wcsrchr(wchar_t const *string, wchar_t character)
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of string (including L'\0')
	dec	edi
	dec	edi		; edi = address of L'\0'
				;     = end of string
	mov	eax, [esp+8]	; eax = wide character
	std
	repne	scasw		; ZF = (wide character found)
	cld
	lea	eax, [edi+2]	; eax = address of wide character
	je	short found	; wide character in string?
	xor	eax, eax	; eax = 0
found:
	mov	edi, edx
	ret
wcsrchr	endp
wcsrev	proc	private		; wchar_t *wcsrev(wchar_t *string)
	ud2
wcsrev	endp
wcsset	proc	public		; wchar_t *wcsset(wchar_t *string, wchar_t character)
	mov	edx, edi
	mov	edi, [esp+4]	; edi = address of string
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of string (including L'\0')
	sub	edi, ecx
	sub	edi, ecx	; edi = address of string
	dec	ecx		; ecx = length of string
	mov	eax, [esp+8]	; eax = wide character
	rep	stosw
	mov	edi, edx
	mov	eax, [esp+4]	; eax = address of string
	ret
wcsset	endp
wcsstr	proc	public		; wchar_t *wcsstr(wchar_t const *haystack, wchar_t const *needle)
	push	edi
	mov	edi, [esp+12]	; edi = address of needle
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of needle (including L'\0')
	dec	ecx		; ecx = length of needle
	mov	eax, [esp+8]	; eax = address of haystack
	jz	short empty	; length of needle = 0?
	mov	edx, ecx	; edx = length of needle
ifdef SIMPLE
	push	esi
compare:
	mov	esi, eax	; esi = current address in haystack
	mov	edi, [esp+16]	; edi = address of needle
	mov	ecx, edx	; ecx = length of needle
	repe	cmpsw
	je	short match	; needle in haystack?
	inc	eax
	inc	eax		; eax = next address in haystack
	cmp	word ptr [esi-2], 0
	jne	short compare	; non-matching wide character in haystack <> L'\0'?
	xor	eax, eax
match:
else ; SIMPLE
	mov	edi, eax	; edi = address of haystack
	xor	eax, eax	; eax = L'\0'
	xor	ecx, ecx
	dec	ecx		; ecx = -1
	repne	scasw
	not	ecx		; ecx = length of haystack (including L'\0')
	sub	edi, ecx
	sub	edi, ecx	; edi = address of haystack
	dec	ecx		; ecx = length of haystack
	jz	short empty	; length of haystack = 0?
	cmp	ecx, edx
	jb	short empty	; length of haystack < length of needle?
	push	esi
	push	ebx
search:
	mov	esi, [esp+20]	; esi = address of needle
	mov	ax, [esi]	; ax = first wide character of needle
				; edi = current address in haystack
	repne	scasw		; edi = next address in haystack,
				; ecx = current length of haystack
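	jne	short break	; (first wide character of) needle not in haystack?
	lea	ebx, [ecx+1]	; ebx = length of haystack from matching wide character
	cmp	ebx, edx
	jb	short break	; remaining haystack shorter than needle?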
	mov	ax, [esi+edx*2-2]
				; ax = last wide character of needle
	cmp	ax, [edi+edx*2-4]
	jne	short continue	; last wide character of needle not in haystack?
compare:
	mov	eax, edi	; eax = next address in haystack
	mov	ebx, ecx	; ebx = next length of haystack
if 0
	dec	edi
	dec	edi		; edi = current address in haystack
				;     = address of matching wide character
				; esi = address of needle
	mov	ecx, edx	; ecx = length of needle
else
				; edi = next address in haystack
	inc	esi
	inc	esi		; esi = address of needle + 2
	mov	ecx, edx
	dec	ecx		; ecx = length of needle - 1,
				; ZF = (ecx = 0)
;;	jz	short match	; needle in haystack?
endif
	repe	cmpsw
	je	short match	; needle in haystack?
	mov	edi, eax	; edi = current address in haystack
	mov	ecx, ebx	; ecx = current length of haystack
continue:
	cmp	ecx, edx
	jae	short search	; length of haystack >= length of needle?
break:
	xor	eax, eax
	pop	ebx
	pop	esi
	pop	edi
	ret
match:
	dec	eax
	dec	eax		; eax = address of needle in haystack
	pop	ebx
endif ; SIMPLE
	pop	esi
empty:
	pop	edi
	ret
wcsstr	endp
	end
Save the i386 assembler source above as i386-wcs.asm in the directory where you created the
            object library i386.lib before, then execute the
            following 3 command lines to generate the object file
            i386-wcs.obj and add it to the existing object library
            i386.lib:
        SET ML=/c /Gy /safeseh /W3 /X
        ML.EXE i386-wcs.asm
        LINK.EXE /LIB /OUT:i386.lib i386.lib i386-wcs.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
 Note: if the
            /Gy option
            to package every function in its own, separately linkable
            COMDAT
            section is not available with the version of the macro assembler
            ML.EXE you use,
            split the i386 assembler source into multiple pieces,
            with one function per source file.
        
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler Version 14.16.27023.1
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: i386-wcs.asm
Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation. All rights reserved.
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
; NOTE: counts and lengths are numbers of wide characters, not bytes!
	.code
wcscat	proc	public		; wchar_t *wcscat(wchar_t *destination, wchar_t const *source)
	mov	r9, rcx		; r9 = address of destination string
	mov	r10, rdi
ifdef VARIANT
	mov	rdi, rcx	; rdi = address of destination string
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	dec	rdi
	dec	rdi		; rdi = address of L'\0'
				;     = end of destination string
	mov	r11, rsi
	mov	rsi, rdi	; rsi = end of destination string
	mov	rdi, rdx	; rdi = address of source string
;;	xor	eax, eax
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of source string (including L'\0')
	mov	rdi, rsi	; rdi = end of destination string
	mov	rsi, rdx	; rsi = address of source string
else ; VARIANT
	mov	rdi, rdx	; rdi = address of source string
;;	xor	eax, eax
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of source string (including L'\0')
	mov	r11, rsi
	mov	rsi, rdx	; rsi = address of source string
	mov	rdx, rcx
	mov	rdi, r9		; rdi = address of destination string
;;	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	dec	rdi
	dec	rdi		; rdi = address of L'\0'
				;     = end of destination string
	mov	rcx, rdx	; rcx = length of source string (including L'\0')
endif ; VARIANT
	rep	movsw
	mov	rax, r9		; rax = address of destination string
	mov	rdi, r10
	mov	rsi, r11
	ret
wcscat	endp
wcschr	proc	public		; wchar_t *wcschr(wchar_t const *string, wchar_t character)
	mov	r10, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of string (including L'\0')
	mov	rax, rdx	; rax = wide character
	sub	rdi, rcx
	sub	rdi, rcx	; rdi = address of string
	repne	scasw
	lea	rax, [rdi-2]	; rax = address of wide character
	cmovne	rax, rcx	; rax = (rcx = 0) ? 0 : address of wide character
	mov	rdi, r10
	ret
wcschr	endp
wcscmp	proc	public		; int wcscmp(wchar_t const *source, wchar_t const *destination)
	xor	eax, eax	; rax = 0
	cmp	rcx, rdx
	je	short equal	; address of source string = address of destination string?
	mov	r11, rsi
	mov	rsi, rcx	; rsi = address of source string
	mov	r10, rdi
	mov	rdi, rdx	; rdi = address of destination string
;;	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of destination string (including L'\0')
	mov	rdi, rdx	; rdi = address of destination string
;;	xor	eax, eax	; rax = 0
	repe	cmpsw
	seta	al		; rax = (*source > *destination)
	sbb	rax, 0		; rax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, 0, -1}
	mov	rdi, r10
	mov	rsi, r11
equal:
	ret
wcscmp	endp
; NOTE: wcscoll() is a second implementation of wcscmp()!
wcscoll	proc	public		; int wcscoll(wchar_t const *source, wchar_t const *destination)
	sub	rdx, rcx
	jz	short equal	; address of destination string = address of source string?
compare:
	mov	ax, [rcx]
	cmp	ax, [rcx+rdx]
	jne	short different
	lea	rcx, [rcx+2]
	test	ax, ax
	jnz	short compare	; *source <> L'\0'?
equal:
	xor	eax, eax	; rax = 0
	ret
different:
	sbb	rax, rax	; rax = (*source < *destination) ? -1 : 0
	or	rax, 1		; rax = (*source > *destination)
				;     - (*source < *destination)
				;     = {1, -1}
	ret
wcscoll	endp
wcscpy	proc	public		; wchar_t *wcscpy(wchar_t *destination, wchar_t const *source)
	mov	r9, rcx		; r9 = address of destination string
	mov	r10, rdi
	mov	rdi, rdx	; rdi = address of source string
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of source string (including L'\0')
	mov	rdi, r9		; rdi = address of destination string
	mov	r11, rsi
	mov	rsi, rdx	; rsi = address of source string
	rep	movsw
	mov	rax, r9		; rax = address of destination string
	mov	rdi, r10
	mov	rsi, r11
	ret
wcscpy	endp
wcslen	proc	public		; size_t wcslen(wchar_t const *string)
	mov	rdx, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of string (including L'\0')
	dec	rcx		; rcx = length of string
	mov	rax, rcx	; rax = length of string
	mov	rdi, rdx
	ret
wcslen	endp
wcsncat	proc	private		; wchar_t *wcsncat(wchar_t *destination, wchar_t const *source, size_t count)
	ud2
wcsncat	endp
wcsncmp	proc	private		; int wcsncmp(wchar_t const *source, wchar_t const *destination, size_t count)
	ud2
wcsncmp	endp
wcsncpy	proc	private		; wchar_t *wcsncpy(wchar_t *destination, wchar_t const *source, size_t count)
	ud2
wcsncpy	endp
wcsnlen	proc	private		; size_t wcsnlen(wchar_t const *string, size_t count)
	ud2
wcsnlen	endp
wcsnset	proc	private		; wchar_t *wcsnset(wchar_t *string, wchar_t character, size_t count)
	ud2
wcsnset	endp
wcsrchr	proc	public		; wchar_t *wcsrchr(wchar_t const *string, wchar_t character)
	mov	r10, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of string (including L'\0')
	mov	rax, rdx	; rax = wide character
	lea	rdi, [rdi-2]	; rdi = address of L'\0'
				;     = end of string
	std
	repne	scasw
	cld
	lea	rax, [rdi+2]	; rax = address of wide character
	cmovne	rax, rcx	; rax = (rcx = 0) ? 0 : address of wide character
	mov	rdi, r10
	ret
wcsrchr	endp
wcsrev	proc	private		; wchar_t *wcsrev(wchar_t *string)
	ud2
wcsrev	endp
wcsset	proc	public		; wchar_t *wcsset(wchar_t *string, wchar_t character)
	mov	r9, rcx		; r9 = address of string
	mov	r10, rdi
	mov	rdi, rcx	; rdi = address of string
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of string (including L'\0')
	dec	rcx
	mov	rdi, r9		; rdi = address of string
	mov	rax, rdx	; rax = wide character
	rep	stosw
	mov	rax, r9		; rax = address of string
	mov	rdi, r10
	ret
wcsset	endp
wcsstr	proc	public		; wchar_t *wcsstr(wchar_t const *haystack, wchar_t const *needle)
	mov	r8, rcx		; r8 = address of haystack
	mov	r10, rdi
	mov	rdi, rdx	; rdi = address of needle
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of needle (including L'\0')
	dec	rcx		; rcx = length of needle
	mov	rax, r8		; rax = address of haystack
	jz	short empty	; length of needle = 0?
	mov	r9, rcx		; r9 = length of needle
	mov	rdi, r8		; rdi = address of haystack
	xor	eax, eax	; rax = L'\0'
ifdef AMD
	stc
	sbb	rcx, rcx	; rcx = -1
else
	or	rcx, -1		; rcx = -1
endif
	repne	scasw
	not	rcx		; rcx = length of haystack (including L'\0')
	mov	rdi, r8		; rdi = address of haystack
	dec	rcx		; rcx = length of haystack
	jz	short empty	; length of haystack = 0?
	cmp	rcx, r9
	jb	short empty	; length of haystack < length of needle?
	mov	r11, rsi
search:
	mov	ax, [rdx]	; ax = first wide character of needle
				; rdi = current address in haystack
	repne	scasw		; rdi = next address in haystack,
				; rcx = current length of haystack
	jne	short break	; (first wide character of) needle not in haystack?
	dec	rcx		; rcx = next length of haystack
	mov	ax, [rdx+r9*2-2]
				; ax = last wide character of needle
	cmp	ax, [rdi+r9*2-4]
	jne	short continue	; last wide character of needle not in haystack?
compare:
	mov	rax, rdi	; rax = next address in haystack
	mov	r8, rcx		; r8 = next length of haystack
if 0
	lea	rdi, [rdi-2]	; rdi = current address in haystack
				;     = address of matching character
	mov	rsi, rdx	; rsi = address of needle
	mov	rcx, r9		; rcx = length of needle
else
				; rdi = next address in haystack
	lea	rsi, [rdx+2]	; rsi = address of needle + 2
	mov	rcx, r9
	dec	rcx		; rcx = length of needle - 1,
				; ZF = (rcx = 0)
;;	jz	short match	; needle in haystack?
endif
	repe	cmpsw
	je	short match	; needle in haystack?
	mov	rdi, rax	; rdi = current address in haystack
	mov	rcx, r8		; rcx = current length of haystack
continue:
	cmp	rcx, r9
	jae	short search	; length of haystack >= length of needle?
break:
	xor	eax, eax
	mov	rsi, r11
empty:
	mov	rdi, r10	; rdi is nonvolatile and was overwritten above
	ret
match:
	lea	rax, [rax-2]	; rax = address of needle in haystack
	mov	rdi, r10
	mov	rsi, r11
	ret
wcsstr	endp
	end
Save the AMD64 assembler source presented above as amd64-wcs.asm in the directory where you created the
            object library amd64.lib before, then execute the
            following 3 command lines to generate the object file
            amd64-wcs.obj and add it to the existing object library
            amd64.lib:
        SET ML=/c /Gy /W3 /X
        ML64.EXE amd64-wcs.asm
        LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-wcs.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
 Note: if the
            /Gy option
            to package every function in its own, separately linkable
            COMDAT
            section is not available with the version of the macro assembler
            ML64.EXE
            you use, split the AMD64 assembler source into
            multiple pieces, with one function per source file.
        
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) Macro Assembler (x64) Version 14.16.27023.1
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: amd64-wcs.asm
Microsoft (R) Library Manager Version 14.16.27049.0
Copyright (C) Microsoft Corporation. All rights reserved.
Thread Local Storage (TLS) is the method by which each thread in a given multithreaded process can allocate locations in which to store thread-specific data. Dynamically bound (run-time) thread-specific data is supported by way of the TLS API (TlsAlloc, TlsGetValue, TlsSetValue, TlsFree). Win32 and the Microsoft C++ compiler now support statically bound (load-time) per-thread data in addition to the existing API implementation.
__declspec Rules and Limitations for TLS Under the heading
[…] Visual C also provides a Microsoft-specific attribute, thread, as extended storage class modifier. Use the __declspec keyword to declare a thread variable. For example, the following code declares an integer thread local variable and initializes it with a value:
__declspec( thread ) int tls_i = 1;
[…]
- On Windows operating systems before Windows Vista, __declspec( thread ) has some limitations. If a DLL declares any data or object as __declspec( thread ), it can cause a protection fault if dynamically loaded. After the DLL is loaded with LoadLibrary, it causes system failure whenever the code references the __declspec( thread ) data. Because the global variable space for a thread is allocated at run time, the size of this space is based on a calculation of the requirements of the application plus the requirements of all the DLLs that are statically linked. When you use LoadLibrary, you can't extend this space to allow for the thread local variables declared with __declspec( thread ). Use the TLS APIs, such as TlsAlloc, in your DLL to allocate TLS if the DLL might be loaded with LoadLibrary.
The .tls section, the specification of the PE Format states:
The .tls section provides direct PE and COFF support for static thread local storage (TLS). […] a static TLS variable can be defined as follows, without using the Windows API:
__declspec (thread) int tlsFlag = 1;
To support this programming construct, the PE and COFF .tls section specifies the following information: initialization data, callback routines for per-thread initialization and termination, and the TLS index, which are explained in the following discussion.
Note
Statically declared TLS data objects can be used only in statically loaded image files. This fact makes it unreliable to use static TLS data in a DLL unless you know that the DLL, or anything statically linked with it, will never be loaded dynamically with the LoadLibrary API function.
Executable code accesses a static TLS data object through the following steps:
1. At link time, the linker sets the Address of Index field of the TLS directory. This field points to a location where the program expects to receive the TLS index.
   The Microsoft run-time library facilitates this process by defining a memory image of the TLS directory and giving it the special name "__tls_used" (Intel x86 platforms) or "_tls_used" (other platforms). The linker looks for this memory image and uses the data there to create the TLS directory. Other compilers that support TLS and work with the Microsoft linker must use this same technique.
2. When a thread is created, the loader communicates the address of the thread's TLS array by placing the address of the thread environment block (TEB) in the FS register. A pointer to the TLS array is at the offset of 0x2C from the beginning of TEB. This behavior is Intel x86-specific.
3. The loader assigns the value of the TLS index to the place that was indicated by the Address of Index field.
4. The executable code retrieves the TLS index and also the location of the TLS array.
5. The code uses the TLS index and the TLS array location (multiplying the index by 4 and using it as an offset to the array) to get the address of the TLS data area for the given program and module. Each thread has its own TLS data area, but this is transparent to the program, which does not need to know how data is allocated for individual threads.
6. An individual TLS data object is accessed as some fixed offset into the TLS data area.
Ouch: even the very first (highlighted) sentence is wrong – the
IMAGE_TLS_DIRECTORY
            provides the
            TLS support.
         Note: the .tls section is required
            only when TLS data
            is initialised; it is not needed when data is just declared.
        
Ouch: the initial note is obsolete and wrong – Windows Vista and later versions of Windows NT support static TLS data in dynamically loaded DLLs!
Note: the multiplier 4 is of course only correct for 32-bit platforms; 64-bit platforms require the multiplier 8.
The documentation misses the following step for the x64 alias AMD64 processor architecture, and corresponding steps for other processor architectures as well:
When a thread is created, the loader communicates the address of the thread's TLS array by placing the address of the thread environment block (TEB) in the GS register. A pointer to the TLS array is at the offset of 0x58 from the beginning of the TEB. This behavior is Intel x64-specific.
Note: despite the fixed value of this offset, the Visual C compiler references the address of the external symbol
__tls_array on the i386
            alias x86 platform.
        The specification of the PE Format continues:
The TLS directory has the following format:
Offset (PE32/PE32+) | Size (PE32/PE32+) | Field | Description
0 | 4/8 | Raw Data Start VA | The starting address of the TLS template. The template is a block of data that is used to initialize TLS data. The system copies all of this data each time a thread is created, so it must not be corrupted. Note that this address is not an RVA; it is an address for which there should be a base relocation in the .reloc section.
4/8 | 4/8 | Raw Data End VA | The address of the last byte of the TLS, except for the zero fill. As with the Raw Data Start VA field, this is a VA, not an RVA.
8/16 | 4/8 | Address of Index | The location to receive the TLS index, which the loader assigns. This location is in an ordinary data section, so it can be given a symbolic name that is accessible to the program.
12/24 | 4/8 | Address of Callbacks | The pointer to an array of TLS callback functions. The array is null-terminated, so if no callback function is supported, this field points to 4 bytes set to zero. For information about the prototype for these functions, see TLS Callback Functions.
16/32 | 4 | Size of Zero Fill | The size in bytes of the template, beyond the initialized data delimited by the Raw Data Start VA and Raw Data End VA fields. The total template size should be the same as the total size of TLS data in the image file. The zero fill is the amount of data that comes after the initialized nonzero data.
20/36 | 4 | Characteristics | The four bits [23:20] describe alignment info. Possible values are those defined as IMAGE_SCN_ALIGN_*, which are also used to describe alignment of section in object files. The other 28 bits are reserved for future use.
Note: the documentation lacks the information that the Visual C compiler puts all data for the TLS template in COFF sections
.tls$‹suffix› – which it however declares
            writable instead of read-only, i.e. it fails to protect the template data against corruption, an easily avoidable safety hazard!
 OOPS: the
            Raw Data End VA
            field contains the address of the first byte after
            the TLS template!
        
 OUCH: the Size of Zero Fill
 field is
            not supported at all!
        
 Note: if the size of the initialised data of the
            .tls section in the image file is less than the section
            size, the module loader fills the additional uninitialised data with
            zeroes, i.e. the Size of Zero Fill
 field is superfluous.
        
 Under the heading
            TLS Callback Functions
,
            the specification of the
            PE Format
            states:
        
The program can provide one or more TLS callback functions […]
The prototype for a callback function (pointed to by a pointer of type PIMAGE_TLS_CALLBACK) has the same parameters as a DLL entry-point function:
typedef VOID (NTAPI *PIMAGE_TLS_CALLBACK) ( PVOID DllHandle, DWORD Reason, PVOID Reserved );
The Reserved parameter should be set to zero. The Reason parameter can take the following values:
Setting | Value | Description
DLL_PROCESS_ATTACH | 1 | A new process has started, including the first thread.
DLL_THREAD_ATTACH | 2 | A new thread has been created. This notification sent for all but the first thread.
DLL_THREAD_DETACH | 3 | A thread is about to be terminated. This notification sent for all but the first thread.
DLL_PROCESS_DETACH | 0 | A process is about to terminate, including the original thread.
mainCRTStartup()
            and
            _DllMainCRTStartup()
            of both its components, and uses a
            TLS callback
            function to log the thread’s progress:
        // Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#undef UNICODE
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
__declspec(thread)
DWORD	dwTLS = 'MSVC';			// placed in writable ".tls$" section by the compiler
#ifndef LIBRARY
#pragma data_seg(".tls")
DWORD	_tls_begin = 'JUNK';		// placed before all TLS template data by the linker
#pragma data_seg(".tls$~~~")
DWORD	_tls_end = 'JUNK';		// placed after all TLS template data by the linker
#pragma data_seg()
#pragma bss_seg(".bss$T")
DWORD	_tls_index;			// assigned by the module loader
#pragma bss_seg()
#else
extern	const	DWORD	_tls_index;
#endif // LIBRARY
__declspec(safebuffers)
BOOL	CDECL	PrintFormat(HANDLE hOutput, [SA_FormatString(Style="printf")] LPCSTR lpFormat, ...)
{
	CHAR	szFormat[1025];
	DWORD	dwFormat;
	DWORD	dwOutput;
	va_list	vaInput;
	va_start(vaInput, lpFormat);
	dwFormat = wvsprintf(szFormat, lpFormat, vaInput);
	va_end(vaInput);
	if ((dwFormat == 0UL)
	 || !WriteFile(hOutput, szFormat, dwFormat, &dwOutput, (LPOVERLAPPED) NULL))
		return FALSE;
	return dwOutput == dwFormat;
}
const	LPCSTR	szReason[4] = {"process detach",
		               "process attach",
		               "thread attach",
		               "thread detach"};
__declspec(safebuffers)
VOID	WINAPI	TLSCallback(LPVOID hModule, DWORD dwReason, LPVOID lpUnused)
{
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
	HMODULE	hCaller;
	DWORD	dwCaller;
	CHAR	szCaller[MAX_PATH];
	CHAR	szModule[MAX_PATH];
	DWORD	dwModule = GetModuleFileName(hModule, szModule, sizeof(szModule));
	if (hOutput == INVALID_HANDLE_VALUE)
		return;
	if (dwModule < sizeof(szModule))
		szModule[dwModule] = '\0';
	if (!GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
	                       _ReturnAddress(),
	                       &hCaller))
		szCaller[0] = '\0';
	else
	{
		dwCaller = GetModuleFileName(hCaller, szCaller, sizeof(szCaller));
		if (dwCaller < sizeof(szCaller))
			szCaller[dwCaller] = '\0';
	}
	PrintFormat(hOutput,
	            "\r\n"
	            __FUNCTION__ "() function @ 0x%p\r\n"
	            "\tCalled module  @ 0x%p = %hs\r\n"
	            "\tCalling module @ 0x%p = %hs\r\n"
	            "\tReturn address @ 0x%p = 0x%p\r\n"
	            "\tArguments:\r\n"
	            "\t\tModule = 0x%p\r\n"
	            "\t\tReason = %lu (%hs)\r\n"
	            "\t\tUnused = 0x%p\r\n"
	            "\tThread id = %lu\r\n",
	            TLSCallback,
	            hModule, szModule,
	            hCaller, szCaller,
	            _AddressOfReturnAddress(), _ReturnAddress(),
	            hModule, dwReason, szReason[dwReason], lpUnused,
	            GetCurrentThreadId());
}
#ifndef LIBRARY
const	PIMAGE_TLS_CALLBACK	_tls_callbacks[] = {TLSCallback, NULL};
const	IMAGE_TLS_DIRECTORY	_tls_used = {&_tls_begin,
				             &_tls_end + sizeof(_tls_end),
				             &_tls_index,
				             _tls_callbacks,
				             'VOID',
				             0UL};
#else
extern	IMAGE_TLS_DIRECTORY	_tls_used;
#pragma const_seg(".ptr$")		// added to ".ptr" section by the linker
//const	PIMAGE_TLS_CALLBACK	_tls_callback = TLSCallback;
#pragma const_seg()			// place more pointers to callback routines above here
__declspec(allocate(".ptr$"))		// added to ".ptr" section by the linker
const	PIMAGE_TLS_CALLBACK	_tls_callback = TLSCallback;
#endif // LIBRARY
extern	IMAGE_DOS_HEADER	__ImageBase;
#ifdef _DLL
__declspec(dllexport)
__declspec(safebuffers)
DWORD	WINAPI	ThreadProc(LPVOID lpParameter)
{
	HMODULE	hCaller;
	DWORD	dwCaller;
	CHAR	szCaller[MAX_PATH];
	CHAR	szModule[MAX_PATH];
	DWORD	dwModule = GetModuleFileName((HMODULE) &__ImageBase, szModule, sizeof(szModule));
	if (!GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
	                       _ReturnAddress(),
	                       &hCaller))
		szCaller[0] = '\0';
	else
	{
		dwCaller = GetModuleFileName(hCaller, szCaller, sizeof(szCaller));
		if (dwCaller < sizeof(szCaller))
			szCaller[dwCaller] = '\0';
	}
	if (dwModule < sizeof(szModule))
		szModule[dwModule] = '\0';
	PrintFormat(lpParameter,
	            "\r\n"
	            __FUNCTION__ "() function @ 0x%p\r\n"
	            "\tCalled module  @ 0x%p = %hs\r\n"
	            "\tCalling module @ 0x%p = %hs\r\n"
	            "\tReturn address @ 0x%p = 0x%p\r\n"
	            "\tParameter = 0x%p\r\n"
	            "\tThread id = %lu\r\n",
	            ThreadProc,
	            &__ImageBase, szModule,
	            hCaller, szCaller,
	            _AddressOfReturnAddress(), _ReturnAddress(),
	            lpParameter,
	            GetCurrentThreadId());
	return GetLastError();
}
__declspec(safebuffers)
BOOL	WINAPI	_DllMainCRTStartup(HMODULE hModule, DWORD dwReason, CONTEXT *lpContext)
{
	DWORD	dwThreadId = GetCurrentThreadId();
	HANDLE	hThread;
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
	HMODULE	hCaller;
	DWORD	dwCaller;
	CHAR	szCaller[MAX_PATH];
	CHAR	szModule[MAX_PATH];
	DWORD	dwModule = GetModuleFileName(hModule, szModule, sizeof(szModule));
	if (hOutput == INVALID_HANDLE_VALUE)
		return FALSE;
	if (!GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
	                       _ReturnAddress(),
	                       &hCaller))
		szCaller[0] = '\0';
	else
	{
		dwCaller = GetModuleFileName(hCaller, szCaller, sizeof(szCaller));
		if (dwCaller < sizeof(szCaller))
			szCaller[dwCaller] = '\0';
	}
	if (dwModule < sizeof(szModule))
		szModule[dwModule] = '\0';
	PrintFormat(hOutput,
	            "\r\n"
	            __FUNCTION__ "() function @ 0x%p\r\n"
	            "\tCalled module  @ 0x%p = %hs\r\n"
	            "\tCalling module @ 0x%p = %hs\r\n"
	            "\tReturn address @ 0x%p = 0x%p\r\n"
	            "\tArguments:\r\n"
	            "\t\tModule = 0x%p\r\n"
	            "\t\tReason = %lu (%hs)\r\n"
	            "\t\tUnused = 0x%p\r\n"
	            "\tThread id = %lu\r\n",
	            _DllMainCRTStartup,
	            hModule, szModule,
	            hCaller, szCaller,
	            _AddressOfReturnAddress(), _ReturnAddress(),
	            hModule, dwReason, szReason[dwReason], lpContext,
	            dwThreadId);
	if (dwReason != DLL_PROCESS_ATTACH)
		return FALSE;
	PrintFormat(hOutput,
	            "\a"
	            "\tTLS index = %lu\r\n"
	            "\tTLS value = 0x%p\r\n"
	            "\tTLS array @ 0x%p\r\n"
	            "\tTLS block @ 0x%p\r\n"
	            "\tTLS dword @ 0x%p = \"%.4hs\"\r\n"
	            "\tTLS directory     @ 0x%p\r\n"
	            "\t\tStart     @ 0x%p\r\n"
	            "\t\tEnd       @ 0x%p\r\n"
	            "\t\tIndex     @ 0x%p\r\n"
	            "\t\tCallbacks @ 0x%p\r\n"
	            "\t\tZerofill  = 0x%08lX = \"%.4hs\"\r\n"
	            "\t\tAlignment = 0x%08lX\r\n" + (dwTLS == 'MSVC'),
	            _tls_index,
	            TlsGetValue(_tls_index),
#ifdef _M_IX86
	            __readfsdword(44),
	            ((LPVOID *) __readfsdword(44))[_tls_index],
#elif _M_AMD64
	            __readgsqword(88),
	            ((LPVOID *) __readgsqword(88))[_tls_index],
#else
#error Only I386 and AMD64 supported!
#endif
	            &dwTLS, &dwTLS,
	            &_tls_used,
	            _tls_used.StartAddressOfRawData,
	            _tls_used.EndAddressOfRawData,
	            _tls_used.AddressOfIndex,
	            _tls_used.AddressOfCallBacks,
	            _tls_used.SizeOfZeroFill, &_tls_used.SizeOfZeroFill,
	            _tls_used.Characteristics);
	hThread = CreateThread((LPSECURITY_ATTRIBUTES) NULL,
	                       (SIZE_T) 65536,
	                       ThreadProc,
	                       hOutput,
	                       0,
	                       &dwThreadId);
	if (hThread == NULL)
		PrintFormat(hOutput,
		            "CreateThread() returned error %lu\r\n",
		            GetLastError());
	else
	{
		PrintFormat(hOutput,
		            "\r\n"
		            "Thread %lu created and started\r\n",
		            dwThreadId);
		if (!CloseHandle(hThread))
			PrintFormat(hOutput,
			            "CloseHandle() returned error %lu\r\n",
			            GetLastError());
	}
	return TRUE;
}
#else // _DLL
__declspec(dllimport)
DWORD	WINAPI	ThreadProc(LPVOID lpParameter);
DWORD	CDECL	mainCRTStartup(VOID)
{
	DWORD	dwError = ERROR_SUCCESS;
	DWORD	dwThreadId = GetCurrentThreadId();
	DWORD	dwThread;
	HANDLE	hThread;
	HANDLE	hOutput = GetStdHandle(STD_OUTPUT_HANDLE);
	HMODULE	hCaller;
	DWORD	dwCaller;
	CHAR	szCaller[MAX_PATH];
	CHAR	szModule[MAX_PATH];
	DWORD	dwModule = GetModuleFileName((HMODULE) &__ImageBase, szModule, sizeof(szModule));
	if (hOutput == INVALID_HANDLE_VALUE)
		return GetLastError();
	if (!GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS | GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
	                       _ReturnAddress(),
	                       &hCaller))
		szCaller[0] = '\0';
	else
	{
		dwCaller = GetModuleFileName(hCaller, szCaller, sizeof(szCaller));
		if (dwCaller < sizeof(szCaller))
			szCaller[dwCaller] = '\0';
	}
	if (dwModule < sizeof(szModule))
		szModule[dwModule] = '\0';
	PrintFormat(hOutput,
	            "\a\r\n"
	            __FUNCTION__ "() function @ 0x%p\r\n"
	            "\tCalled module  @ 0x%p = %hs\r\n"
	            "\tCalling module @ 0x%p = %hs\r\n"
	            "\tReturn address @ 0x%p = 0x%p\r\n"
	            "\tThread id = %lu\r\n"
	            "\tTLS index = %ld\r\n"
	            "\tTLS value = 0x%p\r\n"
	            "\tTLS array @ 0x%p\r\n"
	            "\tTLS block @ 0x%p\r\n"
	            "\tTLS dword @ 0x%p = \"%.4hs\"\r\n"
	            "\tTLS directory     @ 0x%p\r\n"
	            "\t\tStart     @ 0x%p\r\n"
	            "\t\tEnd       @ 0x%p\r\n"
	            "\t\tIndex     @ 0x%p\r\n"
	            "\t\tCallbacks @ 0x%p\r\n"
	            "\t\tZerofill  = 0x%08lX = \"%.4hs\"\r\n"
	            "\t\tAlignment = 0x%08lX\r\n" + (dwTLS == 'MSVC'),
	            mainCRTStartup,
	            &__ImageBase, szModule,
	            hCaller, szCaller,
	            _AddressOfReturnAddress(), _ReturnAddress(),
	            dwThreadId,
	            _tls_index,
	            TlsGetValue(_tls_index),
#ifdef _M_IX86
	            __readfsdword(44),
	            ((LPVOID *) __readfsdword(44))[_tls_index],
#elif _M_AMD64
	            __readgsqword(88),
	            ((LPVOID *) __readgsqword(88))[_tls_index],
#else
#error Only I386 and AMD64 supported!
#endif
	            &dwTLS, &dwTLS,
	            &_tls_used,
	            _tls_used.StartAddressOfRawData,
	            _tls_used.EndAddressOfRawData,
	            _tls_used.AddressOfIndex,
	            _tls_used.AddressOfCallBacks,
	            _tls_used.SizeOfZeroFill, &_tls_used.SizeOfZeroFill,
	            _tls_used.Characteristics);
	hThread = CreateThread((LPSECURITY_ATTRIBUTES) NULL,
	                       (SIZE_T) 65536,
	                       ThreadProc,
	                       hOutput,
	                       0UL,
	                       &dwThreadId);
	if (hThread == NULL)
		PrintFormat(hOutput,
		            "CreateThread() returned error %lu\r\n",
		            dwError = GetLastError());
	else
	{
		PrintFormat(hOutput,
		            "\r\n"
		            "Thread %lu created and started\r\n",
		            dwThreadId);
		if (WaitForSingleObject(hThread, INFINITE) == WAIT_FAILED)
			PrintFormat(hOutput,
			            "WaitForSingleObject() returned error %lu\r\n",
			            dwError = GetLastError());
		if (!GetExitCodeThread(hThread, &dwThread))
			PrintFormat(hOutput,
			            "GetExitCodeThread() returned error %lu\r\n",
			            dwError = GetLastError());
		else
			PrintFormat(hOutput,
			            "\r\n"
			            "Thread %lu exited with code %lu\r\n",
			            dwThreadId, dwThread);
		if (!CloseHandle(hThread))
			PrintFormat(hOutput,
			            "CloseHandle() returned error %lu\r\n",
			            GetLastError());
	}
	return dwError;
}
#endif // _DLL
Save the C source presented above as tls-demo.c in an arbitrary,
            preferably empty directory, then execute the following 6 command
            lines to compile and link it a first time to generate the
            DLL
            tls-demo.dll and its import library
            tls-demo.lib for the AMD64 platform, to
            compile and link it a second time to generate the image file
            tls-demo.exe for the AMD64 platform too,
            and finally execute the latter:
        SET CL=/GAFy /Oisy /W4 /Zl
        SET LINK=/NODEFAULTLIB /SECTION:.tls,!w
        CL.EXE /LD /MD tls-demo.c kernel32.lib user32.lib
        SET LINK=/ENTRY:mainCRTStartup /NODEFAULTLIB /SECTION:.tls,!w /SUBSYSTEM:CONSOLE
        CL.EXE tls-demo.c tls-demo.lib kernel32.lib user32.lib
        .\tls-demo.exe
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as a block into a Command Processor window.
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.
tls-demo.c
tls-demo.c(108) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *'
tls-demo.c(109) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *'
tls-demo.c(110) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *'
tls-demo.c(111) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'const PIMAGE_TLS_CALLBACK *'
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/NODEFAULTLIB /SECTION:.tls,!w
/out:tls-demo.dll
/dll
/implib:tls-demo.lib
tls-demo.obj
kernel32.lib
user32.lib
   Creating library tls-demo.lib and object tls-demo.exp
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.
tls-demo.c
tls-demo.c(108) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *'
tls-demo.c(109) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *'
tls-demo.c(110) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'DWORD *'
tls-demo.c(111) : warning C4047: 'initializing' : 'ULONGLONG' differs in levels of indirection from 'const PIMAGE_TLS_CALLBACK *'
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/ENTRY:mainCRTStartup /NODEFAULTLIB /SECTION:.tls,!w /SUBSYSTEM:CONSOLE
/out:tls-demo.exe
tls-demo.obj
tls-demo.lib
kernel32.lib
user32.lib
TLSCallback() function @ 0x000007FEFACA10D0
	Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000017F038 = 0x0000000077845078
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 1 (process attach)
		Unused = 0x0000000000000000
	Thread id = 7544
_DllMainCRTStartup() function @ 0x000007FEFACA1384
	Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000017F0A8 = 0x0000000077837C3E
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 1 (process attach)
		Unused = 0x000000000017F830
	Thread id = 7544
	TLS index = 1
	TLS value = 0x0000000000000000
	TLS array @ 0x00000000002C3280
	TLS block @ 0x00000000002EA590
	TLS dword @ 0x00000000002C32D4 = "CVSM"
	TLS directory @ 0x000007FEFACA20E0
		Start @ 0x000007FEFACA5000
		End @ 0x000007FEFACA5018
		Index @ 0x000007FEFACA3000
		Callbacks @ 0x000007FEFACA20D0
		Zerofill = 0x564F4944 = "DIOV"
		Alignment = 0x00000000
Thread 11820 created and started
TLSCallback() function @ 0x000000013F8910D0
	Called module @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000017F038 = 0x0000000077845078
	Arguments:
		Module = 0x000000013F890000
		Reason = 1 (process attach)
		Unused = 0x0000000000000000
	Thread id = 7544
mainCRTStartup() function @ 0x000000013F891258
	Called module @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x00000000776E0000 = C:\Windows\system32\kernel32.dll
	Return address @ 0x000000000017FC88 = 0x00000000776F556D
	Thread id = 7544
	TLS index = 0
	TLS value = 0x0000000000000000
	TLS array @ 0x00000000002C3280
	TLS block @ 0x00000000002C32D0
	TLS dword @ 0x00000000002C32D4 = "CVSM"
	TLS directory @ 0x000000013F892100
		Start @ 0x000000013F895000
		End @ 0x000000013F895018
		Index @ 0x000000013F893000
		Callbacks @ 0x000000013F8920F0
		Zerofill = 0x564F4944 = "DIOV"
		Alignment = 0x00000000
Thread 11888 created and started
TLSCallback() function @ 0x000007FEFACA10D0
	Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000201F458 = 0x0000000077845078
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 2 (thread attach)
		Unused = 0x0000000000000000
	Thread id = 11888
_DllMainCRTStartup() function @ 0x000007FEFACA1384
	Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000201F4C8 = 0x00000000778383CC
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 2 (thread attach)
		Unused = 0x0000000000000000
	Thread id = 11888
TLSCallback() function @ 0x000000013F8910D0
	Called module @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000201F458 = 0x0000000077845078
	Arguments:
		Module = 0x000000013F890000
		Reason = 2 (thread attach)
		Unused = 0x0000000000000000
	Thread id = 11888
ThreadProc() function @ 0x000007FEFACA1258
	Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x00000000776E0000 = C:\Windows\system32\kernel32.dll
	Return address @ 0x000000000201FAE8 = 0x00000000776F556D
	Parameter = 0x0000000000000070
	Thread id = 11888
TLSCallback() function @ 0x000007FEFACA10D0
	Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000201F688 = 0x0000000077845078
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 3 (thread detach)
		Unused = 0x0000000000000000
	Thread id = 11888
_DllMainCRTStartup() function @ 0x000007FEFACA1384
	Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000201F6F8 = 0x0000000077838785
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 3 (thread detach)
		Unused = 0x0000000000000000
	Thread id = 11888
TLSCallback() function @ 0x000000013F8910D0
	Called module @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000201F688 = 0x0000000077845078
	Arguments:
		Module = 0x000000013F890000
		Reason = 3 (thread detach)
		Unused = 0x0000000000000000
	Thread id = 11888
Thread 11888 exited with code 0
TLSCallback() function @ 0x000007FEFACA10D0
	Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x0000000001E0F2A8 = 0x0000000077845078
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 2 (thread attach)
		Unused = 0x0000000000000000
	Thread id = 11820
_DllMainCRTStartup() function @ 0x000007FEFACA1384
	Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x0000000001E0F318 = 0x00000000778383CC
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 2 (thread attach)
		Unused = 0x0000000000000000
	Thread id = 11820
TLSCallback() function @ 0x000000013F8910D0
	Called module @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x0000000001E0F2A8 = 0x0000000077845078
	Arguments:
		Module = 0x000000013F890000
		Reason = 2 (thread attach)
		Unused = 0x0000000000000000
	Thread id = 11820
TLSCallback() function @ 0x000007FEFACA10D0
	Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000017F758 = 0x0000000077845078
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 3 (thread detach)
		Unused = 0x0000000000000000
	Thread id = 7544
_DllMainCRTStartup() function @ 0x000007FEFACA1384
	Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000017F7C8 = 0x0000000077838785
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 3 (thread detach)
		Unused = 0x0000000000000000
	Thread id = 7544
TLSCallback() function @ 0x000000013F8910D0
	Called module @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x000000000017F758 = 0x0000000077845078
	Arguments:
		Module = 0x000000013F890000
		Reason = 3 (thread detach)
		Unused = 0x0000000000000000
	Thread id = 7544
ThreadProc() function @ 0x000007FEFACA1258
	Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x00000000776E0000 = C:\Windows\system32\kernel32.dll
	Return address @ 0x0000000001E0F938 = 0x00000000776F556D
	Parameter = 0x0000000000000070
	Thread id = 11820
TLSCallback() function @ 0x000007FEFACA10D0
	Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x0000000001E0F488 = 0x0000000077845078
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 0 (process detach)
		Unused = 0x0000000000000000
	Thread id = 11820
_DllMainCRTStartup() function @ 0x000007FEFACA1384
	Called module @ 0x000007FEFACA0000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x0000000001E0F4F8 = 0x000000007783775B
	Arguments:
		Module = 0x000007FEFACA0000
		Reason = 0 (process detach)
		Unused = 0x0000000000000001
	Thread id = 11820
TLSCallback() function @ 0x000000013F8910D0
	Called module @ 0x000000013F890000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x0000000077800000 = C:\Windows\SYSTEM32\ntdll.dll
	Return address @ 0x0000000001E0F488 = 0x0000000077845078
	Arguments:
		Module = 0x000000013F890000
		Reason = 0 (process detach)
		Unused = 0x0000000000000000
	Thread id = 11820
The program works as documented and intended: the variable dwTLS is initialised with the ASCII text CVSM, the TLSCallback() function runs on the secondary thread 11820 and the tertiary thread 11888 before their ThreadProc() function as well as after the latter returns, and it also runs on the primary thread 7544 before the entry point functions of both the DLL and the program as well as after the latter returns from its entry point function.
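Note: the reversed spelling of the texts in the output follows from the little-endian byte order of both platforms. The following minimal sketch is not part of tls-demo.c; the helper names dword_as_text() and dword_shows() are hypothetical. The value 0x564F4944 is taken from the Zerofill line of the output, while the value 0x4D535643 for dwTLS is an assumption derived from the displayed text CVSM.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: writes the 4 bytes of a dword in the order they
   occupy in memory on a little-endian machine (lowest byte first). */
static const char *dword_as_text(uint32_t value, char text[5])
{
	int	i;

	assert(text != NULL);
	for (i = 0; i < 4; i++)
		text[i] = (char) ((value >> (8 * i)) & 0xFF);
	text[4] = '\0';
	return text;
}

/* Hypothetical helper: nonzero if the dword is displayed as 'expected' */
static int dword_shows(uint32_t value, const char *expected)
{
	char	text[5];

	return strcmp(dword_as_text(value, text), expected) == 0;
}
```

Called as dword_shows(0x564F4944, "DIOV"), the helper confirms that the dword written as 'VOID' in the assembler source appears as DIOV when its bytes are read in memory order.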
Note: the primary thread 7544 exits before the secondary thread 11820 here; as documented in the MSDN article Terminating a Process, the program terminates with its last thread (see also ExitProcess()).
Note: the MSDN article Terminating a Thread specifies that threads are terminated upon return of their ThreadProc() callback function (see also ExitThread()).
Now (attempt to) build this application for the i386 platform, using the same command lines as before:
SET CL=/GAFy /Oisy /W4 /Zl
SET LINK=/NODEFAULTLIB /SECTION:.tls,!w
CL.EXE /LD /MD tls-demo.c kernel32.lib user32.lib
[…]
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
tls-demo.c
tls-demo.c(106) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD *'
tls-demo.c(107) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD *'
tls-demo.c(108) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'DWORD *'
tls-demo.c(109) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'const PIMAGE_TLS_CALLBACK *'
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/NODEFAULTLIB /SECTION:.tls,!w
/out:tls-demo.dll
/dll
/implib:tls-demo.lib
tls-demo.obj
kernel32.lib
user32.lib
   Creating library tls-demo.lib and object tls-demo.exp
tls-demo.obj : error LNK2019: unresolved external symbol __tls_array referenced in function __DllMainCRTStartup@12
tls-demo.dll : fatal error LNK1120: 1 unresolved externals
OUCH: due to the braindead behaviour of the Visual C compiler for the i386 platform, which references the (absolute) symbol __tls_array in the generated machine code instead of using its fixed constant value 44, this build fails!
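Note: the following minimal sketch models, in portable C, the address computation which the generated code performs with this symbol: it loads the pointer at offset 44 of the thread's TEB, indexes the array found there with the module's TLS index, and adds the offset of the variable within the TLS block. The type fake_teb and the functions set_tls_array() and tls_address() are hypothetical stand-ins and not part of any Windows header; the instruction sequence in the comment is an approximation, not a listing produced by the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TLS_ARRAY = 44 };	/* value of __tls_array: offset of the
				   'ThreadLocalStoragePointer' member in the TEB */

/* Hypothetical stand-in for a TEB: only the pointer at offset 44 matters */
typedef	struct	fake_teb
{
	unsigned char	raw[64];
} fake_teb;

/* Store the address of a thread's TLS array at offset 44 */
static void set_tls_array(fake_teb *teb, void **array)
{
	assert(teb != NULL);
	memcpy(teb->raw + TLS_ARRAY, &array, sizeof(array));
}

/* Roughly:	mov	eax, fs:[__tls_array]
		mov	ecx, [__tls_index]
		mov	eax, [eax+ecx*4]
		lea	eax, [eax+offset]	*/
static void *tls_address(const fake_teb *teb, uint32_t index, size_t offset)
{
	void	**array;

	assert(teb != NULL);
	memcpy(&array, teb->raw + TLS_ARRAY, sizeof(array));
	return (unsigned char *) array[index] + offset;
}
```

Since the offset 44 is a property of the TEB layout and not of the program, the assembler module has to supply it as an absolute symbol only because the compiler emits a symbol reference instead of the constant.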
        ; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.model	flat, C
	public	_tls_array
_tls_array equ	44		; offset of 'ThreadLocalStoragePointer' member in TEB;
				;  symbol referenced in code generated by the compiler!
_tls_32	struct	4
	dword	offset _tls_begin
	dword	offset _tls_end
	dword	offset _tls_index
	dword	offset _tls_start
	dword	'VOID'		; BUG: the module loader discards the 'SizeOfZeroFill' member!
	dword	0
_tls_32	ends
_tls_bss segment alias(".bss$T") dword read write 'BSS'
_tls_index dword ?		; assigned by the module loader
_tls_bss ends
_tls_note segment alias(".comment") discard info read 'INFO'
	byte	"(C)opyright 2004-2025, Stefan Kanthak"
_tls_note ends
_tls_info segment alias(".drectve") discard info read 'INFO'
	byte	"/MERGE:.ptr=.rdata /SECTION:.tls,!w"
_tls_info ends
_tls_start segment alias(".ptr") dword read 'CONST'
_tls_start ends
_tls_stop segment alias(".ptr$~~~") dword read 'CONST'
	dword	0		; callback function array terminator
_tls_stop ends
_tls	segment alias(".rdata$T") dword read 'CONST'
	public	_tls_used
_tls_used _tls_32 <>		; IMAGE_TLS_DIRECTORY32
_tls	ends
_tls_begin segment alias(".tls") para read write 'DATA'
_tls_begin ends
_tls_end segment alias(".tls$~~~") byte read write 'DATA'
_tls_end ends
	end
Save this assembler source as i386-tls.asm in the directory where you created the object library i386.lib before, then execute the following 3 command lines to generate the object file i386-tls.obj and add it to the existing object library i386.lib:
SET ML=/c /safeseh /W3 /X
ML.EXE i386-tls.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-tls.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.
 Assembling: i386-tls.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
Move the ANSI C source file tls-demo.c created before into the current directory, then execute the following 6 command lines to compile and link it a first time with the TLS support module from the object library i386.lib to generate the DLL tls-demo.dll and its import library tls-demo.lib for the i386 platform, to compile and link it a second time to generate the image file tls-demo.exe for the i386 platform too, and finally execute the latter:
SET CL=/c /DLIBRARY /GAFy /Oisy /W4 /Zl
SET LINK=/NODEFAULTLIB
CL.EXE /LD /MD tls-demo.c i386.lib kernel32.lib user32.lib
SET LINK=/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
CL.EXE tls-demo.c tls-demo.lib i386.lib kernel32.lib user32.lib
.\tls-demo.exe
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
tls-demo.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/NODEFAULTLIB
/out:tls-demo.dll
/dll
/implib:tls-demo.lib
tls-demo.obj
i386.lib
kernel32.lib
user32.lib
   Creating library tls-demo.lib and object tls-demo.exp
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
tls-demo.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
/out:tls-demo.exe
tls-demo.obj
tls-demo.lib
i386.lib
kernel32.lib
user32.lib
TLSCallback() function @ 0x70351078
	Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x004AF390 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 1 (process attach)
		Unused = 0x00000000
	Thread id = 1724
_DllMainCRTStartup() function @ 0x70351240
	Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x004AF3CC = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 1 (process attach)
		Unused = 0x004AF6D0
	Thread id = 1724
	TLS index = 1
	TLS value = 0x00000000
	TLS array @ 0x007F20D0
	TLS block @ 0x0080B728
	TLS dword @ 0x007F4FC8 = "CVSM"
	TLS directory @ 0x70352468
		Start @ 0x70354000
		End @ 0x70354004
		Index @ 0x70353000
		Callbacks @ 0x70352088
		Zerofill = 0x564F4944 = "DIOV"
		Alignment = 0x00000000
Thread 2716 created and started
TLSCallback() function @ 0x00331078
	Called module @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x004AF390 = 0x779F9280
	Arguments:
		Module = 0x00330000
		Reason = 1 (process attach)
		Unused = 0x00000000
	Thread id = 1724
TLSCallback() function @ 0x70351078
	Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x007CF720 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 2 (thread attach)
		Unused = 0x00000000
	Thread id = 2716
_DllMainCRTStartup() function @ 0x70351240
	Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x007CF75C = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 2 (thread attach)
		Unused = 0x00000000
	Thread id = 2716
TLSCallback() function @ 0x00331078
	Called module @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x007CF720 = 0x779F9280
	Arguments:
		Module = 0x00330000
		Reason = 2 (thread attach)
		Unused = 0x00000000
	Thread id = 2716
mainCRTStartup() function @ 0x0033116B
	Called module @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x774D0000 = C:\Windows\syswow64\kernel32.dll
	Return address @ 0x004AF938 = 0x774E343D
	Thread id = 1724
	TLS index = 0
	TLS value = 0x00000000
	TLS array @ 0x007F20D0
	TLS block @ 0x007F4FC8
	TLS dword @ 0x007F4FC8 = "CVSM"
	TLS directory @ 0x00332400
		Start @ 0x00334000
		End @ 0x00334004
		Index @ 0x00333000
		Callbacks @ 0x00332098
		Zerofill = 0x564F4944 = "DIOV"
		Alignment = 0x00000000
ThreadProc() function @ 0x7035116B
	Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x774D0000 = C:\Windows\syswow64\kernel32.dll
	Return address @ 0x007CFAF0 = 0x774E343D
	Parameter = 0x00000074
	Thread id = 2716
TLSCallback() function @ 0x70351078
	Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x007CF7B4 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 3 (thread detach)
		Unused = 0x00000000
	Thread id = 2716
_DllMainCRTStartup() function @ 0x70351240
	Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x007CF7F0 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 3 (thread detach)
		Unused = 0x00000000
	Thread id = 2716
TLSCallback() function @ 0x00331078
	Called module @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x007CF7B4 = 0x779F9280
	Arguments:
		Module = 0x00330000
		Reason = 3 (thread detach)
		Unused = 0x00000000
	Thread id = 2716
Thread 11748 created and started
TLSCallback() function @ 0x70351078
	Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x021AFBA4 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 2 (thread attach)
		Unused = 0x00000000
	Thread id = 11748
_DllMainCRTStartup() function @ 0x70351240
	Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x021AFBE0 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 2 (thread attach)
		Unused = 0x00000000
	Thread id = 11748
TLSCallback() function @ 0x00331078
	Called module @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x021AFBA4 = 0x779F9280
	Arguments:
		Module = 0x00330000
		Reason = 2 (thread attach)
		Unused = 0x00000000
	Thread id = 11748
ThreadProc() function @ 0x7035116B
	Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x774D0000 = C:\Windows\syswow64\kernel32.dll
	Return address @ 0x021AFF74 = 0x774E343D
	Parameter = 0x00000074
	Thread id = 11748
TLSCallback() function @ 0x70351078
	Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x021AFC38 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 3 (thread detach)
		Unused = 0x00000000
	Thread id = 11748
_DllMainCRTStartup() function @ 0x70351240
	Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x021AFC74 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 3 (thread detach)
		Unused = 0x00000000
	Thread id = 11748
TLSCallback() function @ 0x00331078
	Called module @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x021AFC38 = 0x779F9280
	Arguments:
		Module = 0x00330000
		Reason = 3 (thread detach)
		Unused = 0x00000000
	Thread id = 11748
Thread 11748 exited with code 0
TLSCallback() function @ 0x70351078
	Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x004AF5CC = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 0 (process detach)
		Unused = 0x00000000
	Thread id = 1724
_DllMainCRTStartup() function @ 0x70351240
	Called module @ 0x70350000 = C:\Users\Stefan\Desktop\tls-demo.dll
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x004AF608 = 0x779F9280
	Arguments:
		Module = 0x70350000
		Reason = 0 (process detach)
		Unused = 0x00000001
	Thread id = 1724
TLSCallback() function @ 0x00331078
	Called module @ 0x00330000 = C:\Users\Stefan\Desktop\tls-demo.exe
	Calling module @ 0x779C0000 = C:\Windows\SysWOW64\ntdll.dll
	Return address @ 0x004AF5CC = 0x779F9280
	Arguments:
		Module = 0x00330000
		Reason = 0 (process detach)
		Unused = 0x00000000
	Thread id = 1724
With the object module i386-tls.obj, program and DLL work as documented and intended now, exhibiting the insignificant difference that the program terminates with the primary thread 1724 here.
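Note: the i386 module above ends the callback function array with a zero entry placed in the section .ptr$~~~, which the linker places after all other contributions to the section .ptr. The following minimal sketch shows how such a zero-terminated array can be walked; it is a model under stated assumptions, not the code of the module loader. The names tls_callback, run_tls_callbacks() and demo_callback() are hypothetical, and the reason codes are those shown in the output above.

```c
#include <assert.h>
#include <stddef.h>

/* Reason codes as printed by the demonstration program */
enum
{
	PROCESS_DETACH = 0,
	PROCESS_ATTACH = 1,
	THREAD_ATTACH  = 2,
	THREAD_DETACH  = 3
};

/* Same shape as PIMAGE_TLS_CALLBACK, written without the Windows types */
typedef	void	(*tls_callback)(void *module, unsigned long reason, void *unused);

/* Hypothetical model of the loader: call every entry in front of the
   zero terminator and return the number of callbacks called */
static size_t run_tls_callbacks(const tls_callback *callbacks,
                                void *module, unsigned long reason)
{
	size_t	count = 0;

	assert(callbacks != NULL);
	while (callbacks[count] != NULL)
		callbacks[count++](module, reason, NULL);
	return count;
}

/* Callback for demonstration: records how often and why it was called */
static unsigned long	last_reason;
static unsigned int	calls;

static void demo_callback(void *module, unsigned long reason, void *unused)
{
	(void) module;
	(void) unused;
	last_reason = reason;
	calls++;
}
```

Without the terminating zero entry the walk would run past the end of the array, which is why the terminator sits in a section whose name sorts last.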
        ; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
_tls_64	struct	8
	qword	offset _tls_begin
	qword	offset _tls_end
	qword	offset _tls_index
	qword	offset _tls_start
	dword	'VOID'		; BUG: the module loader discards the 'SizeOfZeroFill' member!
	dword	0
_tls_64	ends
_bss	segment alias(".bss$T") dword read write 'BSS'
_tls_index dword ?		; assigned by the module loader
_bss	ends
_note	segment alias(".comment") discard info read 'INFO'
	byte	"(C)opyright 2004-2025, Stefan Kanthak"
_note	ends
_linker	segment alias(".drectve") discard info read 'INFO'
	byte	"/MERGE:.ptr=.rdata /SECTION:.tls,!w"
_linker	ends
_start	segment alias(".ptr") para read 'CONST'
_tls_start equ	$
_start	ends
_stop	segment alias(".ptr$~~~") read 'CONST'
	qword	0		; callback function array terminator
_stop	ends
_const	segment alias(".rdata$T") para read 'CONST'
	public	_tls_used
_tls_used _tls_64 <>		; IMAGE_TLS_DIRECTORY64
_const	ends
_begin	segment alias(".tls") para read write 'DATA'
_tls_begin equ	$
_begin	ends
_end	segment alias(".tls$~~~") byte read write 'DATA'
_tls_end equ	$
_end	ends
	end
Save this assembler source as amd64-tls.asm in the directory where you created the object library amd64.lib before, then execute the following 3 command lines to generate the object file amd64-tls.obj and add it to the existing object library amd64.lib:
SET ML=/c /W3 /X
ML64.EXE amd64-tls.asm
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-tls.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation.  All rights reserved.
 Assembling: amd64-tls.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
Move the ANSI C source file tls-demo.c created before into the current directory, then execute the following 6 command lines to compile and link it a first time with the TLS support module from the object library amd64.lib to generate the DLL tls-demo.dll and its import library tls-demo.lib for the AMD64 platform, to compile and link it a second time to generate the image file tls-demo.exe for the AMD64 platform too, and finally execute the latter:
SET CL=/c /DLIBRARY /GAFy /Oisy /W4 /Zl
SET LINK=/NODEFAULTLIB
CL.EXE /LD /MD tls-demo.c amd64.lib kernel32.lib user32.lib
SET LINK=/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
CL.EXE tls-demo.c tls-demo.lib amd64.lib kernel32.lib user32.lib
.\tls-demo.exe
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.
tls-demo.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/NODEFAULTLIB
/out:tls-demo.dll
/dll
/implib:tls-demo.lib
tls-demo.obj
amd64.lib
kernel32.lib
user32.lib
   Creating library tls-demo.lib and object tls-demo.exp
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.
tls-demo.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/ENTRY:mainCRTStartup /NODEFAULTLIB /SUBSYSTEM:CONSOLE
/out:tls-demo.exe
tls-demo.obj
tls-demo.lib
amd64.lib
kernel32.lib
user32.lib
[…]
_load_config_used and __security_check_cookie() Function (/GS Support)
In the section The Load Configuration Structure (Image Only), the specification of the PE Format states:
The load configuration structure (IMAGE_LOAD_CONFIG_DIRECTORY) was formerly used in very limited cases in the Windows NT operating system itself to describe various features too difficult or too large to describe in the file header or optional header of the image. Current versions of the Microsoft linker and Windows XP and later versions of Windows use a new version of this structure for 32-bit x86-based systems that include reserved SEH technology.
[…]
The Microsoft linker automatically provides a default load configuration structure to include the reserved SEH data. If the user code already provides a load configuration structure, it must include the new reserved SEH fields. Otherwise, the linker cannot include the reserved SEH data and the image is not marked as containing reserved SEH.
OUCH¹: the highlighted statement, that the linker automatically provides a default load configuration structure, is wrong: LINK.EXE neither provides an IMAGE_LOAD_CONFIG_DIRECTORY structure nor reports its omission with an error message!
The documentation of the /SAFESEH linker option gives proper information:

If you link with /NODEFAULTLIB and you want a table of safe exception handlers, you need to supply a load config struct (…) that contains all the entries defined for Visual C++. For example:
#include <windows.h>
extern DWORD_PTR __security_cookie;  /* /GS security cookie */
/*
 * The following two names are automatically created by the linker for any
 * image that has the safe exception table present.
 */
extern PVOID __safe_se_handler_table[]; /* base of safe handler entry table */
extern BYTE  __safe_se_handler_count;   /* absolute symbol whose address is the count of table entries */
const IMAGE_LOAD_CONFIG_DIRECTORY32 _load_config_used = {
    sizeof(IMAGE_LOAD_CONFIG_DIRECTORY32),
    0, 0, 0, 0,
    0, 0, 0, 0,
    0, 0, 0, 0,
    0, 0, 0, 0,
    &__security_cookie,
    __safe_se_handler_table,
    (DWORD)(DWORD_PTR) &__safe_se_handler_count
};
The specification of the PE format continues with the following disinformation:
Load Configuration Layout
The load configuration structure has the following layout for 32-bit and 64-bit PE files:
Offset  Size  Field            Description
0       4     Characteristics  Flags that indicate attributes of the file, currently unused.
[…]
54/78   2     Reserved         Must be zero.
OUCH²: the documentation for the IMAGE_LOAD_CONFIG_DIRECTORY structure, however, states that the field at offset 0 stores the size of the structure, and that the field at offset 54 (for 32-bit images) or 78 (for 64-bit images) stores the /DEPENDENTLOADFLAG!
Caveat: on Windows 10 1607 alias Anniversary Update, codenamed Redstone 1, and later versions of Windows NT, the module loader honors the /DEPENDENTLOADFLAG only if the GuardFlags member is present in the IMAGE_LOAD_CONFIG_DIRECTORY structure, i.e. if its Size member is at least 92 on 32-bit platforms and 148 on 64-bit platforms!
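Note: the offsets 54 and 78 as well as the sizes 92 and 148 can be checked with the following minimal sketch, which rewrites the leading members of both variants of the structure with fixed-width types so that it compiles without the Windows headers. The type names load_config_32 and load_config_64, the macro END_OF() and both functions are hypothetical; the member names after SecurityCookie are quoted from the Windows SDK header as an assumption and should be verified against it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef	struct	load_config_32	/* IMAGE_LOAD_CONFIG_DIRECTORY32, truncated */
{
	uint32_t Size, TimeDateStamp;
	uint16_t MajorVersion, MinorVersion;
	uint32_t GlobalFlagsClear, GlobalFlagsSet;
	uint32_t CriticalSectionDefaultTimeout;
	uint32_t DeCommitFreeBlockThreshold, DeCommitTotalFreeThreshold;
	uint32_t LockPrefixTable, MaximumAllocationSize;
	uint32_t VirtualMemoryThreshold;
	uint32_t ProcessHeapFlags, ProcessAffinityMask;
	uint16_t CSDVersion, DependentLoadFlags;
	uint32_t EditList, SecurityCookie;
	uint32_t SEHandlerTable, SEHandlerCount;
	uint32_t GuardCFCheckFunctionPointer, GuardCFDispatchFunctionPointer;
	uint32_t GuardCFFunctionTable, GuardCFFunctionCount;
	uint32_t GuardFlags;
} load_config_32;

typedef	struct	load_config_64	/* IMAGE_LOAD_CONFIG_DIRECTORY64, truncated */
{
	uint32_t Size, TimeDateStamp;
	uint16_t MajorVersion, MinorVersion;
	uint32_t GlobalFlagsClear, GlobalFlagsSet;
	uint32_t CriticalSectionDefaultTimeout;
	uint64_t DeCommitFreeBlockThreshold, DeCommitTotalFreeThreshold;
	uint64_t LockPrefixTable, MaximumAllocationSize;
	uint64_t VirtualMemoryThreshold;
	uint64_t ProcessAffinityMask;	/* precedes ProcessHeapFlags here */
	uint32_t ProcessHeapFlags;
	uint16_t CSDVersion, DependentLoadFlags;
	uint64_t EditList, SecurityCookie;
	uint64_t SEHandlerTable, SEHandlerCount;
	uint64_t GuardCFCheckFunctionPointer, GuardCFDispatchFunctionPointer;
	uint64_t GuardCFFunctionTable, GuardCFFunctionCount;
	uint32_t GuardFlags;
} load_config_64;

/* Offset of the first byte behind a member, i.e. the minimum value of
   the Size member for which this member is present */
#define END_OF(type, member) \
	(offsetof(type, member) + sizeof(((type *) 0)->member))

/* Nonzero if a structure of the given size includes the GuardFlags member */
static int includes_guard_flags(uint32_t size, int is_64_bit)
{
	return size >= (is_64_bit ? END_OF(load_config_64, GuardFlags)
	                          : END_OF(load_config_32, GuardFlags));
}

/* Verify the numbers quoted in the text; aborts if one of them is wrong */
static void check_layout(void)
{
	assert(offsetof(load_config_32, DependentLoadFlags) == 54);
	assert(offsetof(load_config_64, DependentLoadFlags) == 78);
	assert(END_OF(load_config_32, GuardFlags) == 92);
	assert(END_OF(load_config_64, GuardFlags) == 148);
}
```

END_OF() is used instead of sizeof() because the compiler pads the truncated 64-bit structure to a multiple of 8 bytes, while the loader compares the Size member against the end of the last member it needs.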
        
The documentation of the /GS compiler option states:
The /GS compiler option requires that the security cookie be initialized before any function that uses the cookie is run. The security cookie must be initialized immediately on entry to an EXE or DLL. This is done automatically if you use the default VCRuntime entry points: mainCRTStartup, wmainCRTStartup, WinMainCRTStartup, wWinMainCRTStartup, or _DllMainCRTStartup. If you use an alternate entry point, you must manually initialize the security cookie by calling __security_init_cookie.
OOPS¹: contrary to the first highlighted statement, the code generated by the compiler requires only that the (arbitrary) value of the security cookie does not change between entry and exit of any function which uses it!
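Note: the following minimal sketch models this requirement in portable C; it is an illustration under stated assumptions, not the code emitted by the compiler. The prolog stores the cookie, combined with the address of the frame, next to a buffer, and the epilog verifies that the stored value still matches. The names security_cookie, frame and guarded_copy() are hypothetical, and the overflow is confined to the frame object to keep the behaviour defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Any value works, as long as it does not change while a function
   which uses it is running */
static uintptr_t security_cookie = 0x9E3779B9u;

typedef	struct	frame
{
	char		buffer[8];	/* local array of the function */
	uintptr_t	guard;		/* copy of the cookie behind it */
} frame;

/* Copies 'length' bytes without checking them against the size of the
   buffer; returns 1 if the guard is intact afterwards, else 0 */
static int guarded_copy(const char *input, size_t length)
{
	frame	local;

	assert(input != NULL);
	/* prolog: store the cookie, mixed with the frame address */
	local.guard = security_cookie ^ (uintptr_t) &local;
	/* keep the demonstration inside the object */
	if (length > sizeof(local))
		length = sizeof(local);
	memcpy(&local, input, length);
	/* epilog: the stored value must still yield the cookie */
	return (local.guard ^ (uintptr_t) &local) == security_cookie;
}
```

A copy which stays within the 8 bytes of the buffer leaves the guard intact, while a longer copy overwrites it and is detected in the epilog, whatever value the cookie holds.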
OOPS²: the documentation cited above, however, fails to provide the following (implementation) details:
_load_config_used structure matches the size of the IMAGE_LOAD_CONFIG_DIRECTORY64 structure in the eleventh entry of the IMAGE_DATA_DIRECTORY array in the IMAGE_OPTIONAL_HEADER structure;
__security_init_cookie() provided in the MSVCRT libraries (re)initialises the security cookie only if it has this default value or is 0;
mainCRTStartup, wmainCRTStartup, WinMainCRTStartup, wWinMainCRTStartup and _DllMainCRTStartup!
__security_init_cookie() function to (re)initialise the security cookie any more!
         Note: while compiler and linker reference the
            security cookie by its symbol name __security_cookie,
            the module loader references it through the virtual address stored
            in the SecurityCookie member of the
            IMAGE_LOAD_CONFIG_DIRECTORY
            structure.
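The distinction drawn in this note can be pictured with a small portable model. It is only a sketch under assumptions: load_config_model, image_cookie and loader_seed_cookie() are hypothetical names, the virtual address is modelled as an ordinary pointer, and the value the real module loader stores is not specified here.

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical, heavily reduced model of the load configuration directory */
typedef	struct	load_config_model
{
	uint32_t	Size;			/* size of the structure in bytes */
	uintptr_t	*SecurityCookie;	/* models the VA of the cookie */
} load_config_model;

/* code inside the image refers to the cookie by its symbol name ... */
static	uintptr_t	image_cookie = 0xBB40E64Eu;

/* ... while the directory only records its address */
static	const	load_config_model	image_config = {sizeof(load_config_model), &image_cookie};

const	load_config_model	*get_image_config(void)
{
	return &image_config;
}

uintptr_t	read_image_cookie(void)
{
	return image_cookie;
}

/* models the loader: it knows no symbol names and reaches the cookie only
   through the address recorded in the directory; returns 1 if a value was
   stored, 0 if the directory announces no cookie */
int	loader_seed_cookie(const load_config_model *config, uintptr_t value)
{
	if ((config->Size < offsetof(load_config_model, SecurityCookie) + sizeof(config->SecurityCookie))
	 || (config->SecurityCookie == NULL))
		return 0;

	*config->SecurityCookie = value;
	return 1;
}
```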
        
The MSDN magazine articles Protecting Your Code with Visual C++ Defenses and Visual C++ Support for Stack-Based Buffer Protection provide additional information; see also the MSDN articles strict_gs_check pragma, Security Checks at Runtime and Compile Time and Compiler Security Checks In Depth.
CAVEAT: when an exception is thrown in a function and not handled in place, but by one of the calling functions, i.e. when the function’s epilog is not executed, an overwritten stack cookie is not detected!
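The caveat can be reproduced in miniature with setjmp() and longjmp(), which here stand in for an exception handled by a calling function; this is a sketch with hypothetical names (handler, guarded(), overwrites_detected()), not a model of structured exception handling itself.

```c
#include <setjmp.h>
#include <stdint.h>

static	jmp_buf	handler;		/* stands in for the handler of a calling function */
static	int	detected;		/* counts overwrites found by the "epilog" */

static	const	uintptr_t	cookie = 0x9E3779B9u;	/* arbitrary constant */

/* models a guarded function whose guard slot gets overwritten */
static	void	guarded(int raise)
{
	volatile uintptr_t	guard = cookie;	/* "prolog" */

	guard = 0;				/* simulated buffer overrun */

	if (raise)
		longjmp(handler, 1);		/* "exception": the epilog below is skipped */

	if (guard != cookie)			/* "epilog" */
		detected++;
}

/* returns the number of overwrites detected during one call of guarded() */
int	overwrites_detected(int raise)
{
	detected = 0;

	if (setjmp(handler) == 0)
		guarded(raise);

	return detected;
}
```

When the function returns normally the overwrite is detected once; when control leaves it through longjmp() the check never runs.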
__tls_used to locate the
            IMAGE_TLS_DIRECTORY on the i386 platform
            and _tls_used on other platforms, the linker locates
            the
            IMAGE_LOAD_CONFIG_DIRECTORY
            structure via the public symbol
            __load_config_used
            on the i386 platform and _load_config_used
            on other platforms.
        // Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#ifndef LOAD_LIBRARY_SEARCH_SYSTEM32
#define LOAD_LIBRARY_SEARCH_SYSTEM32	0x00000800UL
#endif
#ifndef _WIN64
#if 0
	DWORD	__security_cookie = 0xBB40E64EUL;
		               // = 3141592654 = 10**9 * pi
#else
const	DWORD	__security_cookie = 2654435769UL;
		               // = 0x9E3779B9UL
		               // = 2**32 / phi
#endif
extern	LPVOID	__safe_se_handler_table[];
extern	BYTE	__safe_se_handler_count;
const	struct	_IMAGE_LOAD_CONFIG_DIRECTORY_32
{
	DWORD	Size;
	DWORD	TimeDateStamp;
	WORD	MajorVersion;
	WORD	MinorVersion;
	DWORD	GlobalFlagsClear;
	DWORD	GlobalFlagsSet;
	DWORD	CriticalSectionDefaultTimeout;
	DWORD	DeCommitFreeBlockThreshold;
	DWORD	DeCommitTotalFreeThreshold;
	DWORD	LockPrefixTable;
	DWORD	MaximumAllocationSize;
	DWORD	VirtualMemoryThreshold;
	DWORD	ProcessHeapFlags;
	DWORD	ProcessAffinityMask;
	WORD	CSDVersion;
#if LCU > 2				// Redstone 1 (1607)
	WORD	DependentLoadFlags;
#else
	WORD	Reserved1;
#endif
	DWORD	EditList;
	DWORD	SecurityCookie;
	DWORD	SEHandlerTable;
	DWORD	SEHandlerCount;
#if LCU > 0				// Threshold 1 (1507)
	DWORD	GuardCFCheckFunctionPointer;
	DWORD	GuardCFDispatchFunctionPointer;
	DWORD	GuardCFFunctionTable;
	DWORD	GuardCFFunctionCount;
	DWORD	GuardFlags;
#if LCU > 1				// Threshold 2 (1511)
	struct	// _IMAGE_LOAD_CONFIG_CODE_INTEGRITY
	{
		WORD	Flags;
		WORD	Catalog;
		DWORD	CatalogOffset;
		DWORD	Reserved;
	} CodeIntegrity;
#if LCU > 2				// Redstone 1 (1607)
	DWORD	GuardAddressTakenIatEntryTable;
	DWORD	GuardAddressTakenIatEntryCount;
	DWORD	GuardLongJumpTargetTable;
	DWORD	GuardLongJumpTargetCount;
	DWORD	DynamicValueRelocTable;
	DWORD	CHPEMetadataPointer;
#if LCU > 3				// Redstone 2 (1703)
	DWORD	GuardRFFailureRoutine;
	DWORD	GuardRFFailureRoutineFunctionPointer;
	DWORD	DynamicValueRelocTableOffset;
	WORD	DynamicValueRelocTableSection;
	WORD	Reserved2;
	DWORD	GuardRFVerifyStackPointerFunctionPointer;
	DWORD	HotPatchTableOffset;
#if LCU > 4				// Redstone 3 (1709)
	DWORD	Reserved3;
	DWORD	EnclaveConfigurationPointer;
#if LCU > 5				// Redstone 4 (1803)
	DWORD	VolatileMetadataPointer;
#if LCU > 6				// Redstone 5 (1809)
	DWORD	GuardEHContinuationTable;
	DWORD	GuardEHContinuationCount;
					// Titanium (1903)
					// Vanadium (1909)
					// Vibranium 1 (2004)
					// Vibranium 2 (20H2)
#if LCU > 7				// Vibranium 3 (21H1)
	DWORD	GuardXFGCheckFunctionPointer;
	DWORD	GuardXFGDispatchFunctionPointer;
	DWORD	GuardXFGTableDispatchFunctionPointer;
#if LCU > 8				// Vibranium 4 (21H2)
	DWORD	CastGuardOsDeterminedFailureMode;
#if LCU > 9				// Vibranium 5 (22H2)
	DWORD	GuardMemcpyFunctionPointer;
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
} _load_config_used = {sizeof(_load_config_used),
                       'DATE',		// 0x44415445 = 2006-04-15 20:15:01 UTC
                       _MSC_VER / 100,
                       _MSC_VER % 100,
                       0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL, 0UL,
                       0U,
                       LOAD_LIBRARY_SEARCH_SYSTEM32,
                       0UL,
                       &__security_cookie,
                       __safe_se_handler_table,
                       &__safe_se_handler_count,
                       0UL, 0UL, 0UL, 0UL, 0UL};
#else // _WIN64
#if 0
	DWORD64	__security_cookie = 0x00002B992DDFA232ULL;
		               // = 3141592653589793238 >> 16
		               // = 10**18 / 2**16 * pi
#else
const	DWORD64	__security_cookie = 173961102589770ULL;
		               // = 0x00009E3779B97F4AULL
		               // = 2**48 / phi
#endif
const	struct	_IMAGE_LOAD_CONFIG_DIRECTORY_64
{
	DWORD	Size;
	DWORD	TimeDateStamp;
	WORD	MajorVersion;
	WORD	MinorVersion;
	DWORD	GlobalFlagsClear;
	DWORD	GlobalFlagsSet;
	DWORD	CriticalSectionDefaultTimeout;
	DWORD64	DeCommitFreeBlockThreshold;
	DWORD64	DeCommitTotalFreeThreshold;
	DWORD64	LockPrefixTable;
	DWORD64	MaximumAllocationSize;
	DWORD64	VirtualMemoryThreshold;
	DWORD64	ProcessAffinityMask;
	DWORD	ProcessHeapFlags;
	WORD	CSDVersion;
#if LCU > 2				// Redstone 1 (1607)
	WORD	DependentLoadFlags;
#else
	WORD	Reserved1;
#endif
	DWORD64	EditList;
	DWORD64	SecurityCookie;
	DWORD64	SEHandlerTable;
	DWORD64	SEHandlerCount;
#if LCU > 0				// Threshold 1 (1507)
	DWORD64	GuardCFCheckFunctionPointer;
	DWORD64	GuardCFDispatchFunctionPointer;
	DWORD64	GuardCFFunctionTable;
	DWORD64	GuardCFFunctionCount;
	DWORD	GuardFlags;
#if LCU > 1				// Threshold 2 (1511)
	struct	// _IMAGE_LOAD_CONFIG_CODE_INTEGRITY
	{
		WORD	Flags;
		WORD	Catalog;
		DWORD	CatalogOffset;
		DWORD	Reserved;
	} CodeIntegrity;
#if LCU > 2				// Redstone 1 (1607)
	DWORD64	GuardAddressTakenIatEntryTable;
	DWORD64	GuardAddressTakenIatEntryCount;
	DWORD64	GuardLongJumpTargetTable;
	DWORD64	GuardLongJumpTargetCount;
	DWORD64	DynamicValueRelocTable;
	DWORD64	CHPEMetadataPointer;
#if LCU > 3				// Redstone 2 (1703)
	DWORD64	GuardRFFailureRoutine;
	DWORD64	GuardRFFailureRoutineFunctionPointer;
	DWORD	DynamicValueRelocTableOffset;
	WORD	DynamicValueRelocTableSection;
	WORD	Reserved2;
	DWORD64	GuardRFVerifyStackPointerFunctionPointer;
	DWORD	HotPatchTableOffset;
#if LCU > 4				// Redstone 3 (1709)
	DWORD	Reserved3;
	DWORD64	EnclaveConfigurationPointer;
#if LCU > 5				// Redstone 4 (1803)
	DWORD64	VolatileMetadataPointer;
#if LCU > 6				// Redstone 5 (1809)
	DWORD64	GuardEHContinuationTable;
	DWORD64	GuardEHContinuationCount;
					// Titanium (1903)
					// Vanadium (1909)
					// Vibranium 1 (2004)
					// Vibranium 2 (20H2)
#if LCU > 7				// Vibranium 3 (21H1)
	DWORD64	GuardXFGCheckFunctionPointer;
	DWORD64	GuardXFGDispatchFunctionPointer;
	DWORD64	GuardXFGTableDispatchFunctionPointer;
#if LCU > 8				// Vibranium 4 (21H2)
	DWORD64	CastGuardOsDeterminedFailureMode;
#if LCU > 9				// Vibranium 5 (22H2)
	DWORD64	GuardMemcpyFunctionPointer;
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
#endif
} _load_config_used = {sizeof(_load_config_used),
                       'TIME',		// 0x54494D45 = 2014-10-23 18:47:33 UTC
                       _MSC_VER / 100,
                       _MSC_VER % 100,
                       0UL, 0UL, 0UL,
                       0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
                       0UL,
                       0U,
                       LOAD_LIBRARY_SEARCH_SYSTEM32,
                       0ULL,
                       &__security_cookie,
                       0ULL, 0ULL, 0ULL, 0ULL, 0ULL, 0ULL,
                       0UL};
#endif // _WIN64
__declspec(noreturn)
#ifdef _WIN64
VOID	__security_check_cookie(DWORD64 qwCookie)
{
	if (qwCookie == __security_cookie)
		return;
#else // _WIN64
VOID	__security_check_cookie(DWORD dwCookie)
{
	if (dwCookie == __security_cookie)
		return;
#endif // _WIN64
#ifdef FAST_FAIL_STACK_COOKIE_CHECK_FAILURE
	__fastfail(FAST_FAIL_STACK_COOKIE_CHECK_FAILURE);
#else
#ifdef FAIL_FAST_GENERATE_EXCEPTION_ADDRESS
	RaiseFailFastException((EXCEPTION_RECORD *) NULL, (CONTEXT *) NULL, FAIL_FAST_GENERATE_EXCEPTION_ADDRESS);
#else
	SetUnhandledExceptionFilter(NULL);
	RaiseException(EXCEPTION_STACK_BUFFER_OVERRUN, EXCEPTION_NONCONTINUABLE, 1UL, _AddressOfReturnAddress());
#endif
#pragma comment(lib, "kernel32.lib")
#endif
}
#pragma comment(user, "(C)opyright 2004-2025, Stefan Kanthak")
 Note: the __fastfail()
            intrinsic function is supported since Windows 8, the
            RaiseFailFastException()
            function is supported since Windows 7.
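The member offsets of the 32-bit structure declared in the listing above can be checked without the Windows headers. The following is a sketch only: load_config_32_model is a hypothetical replica that ends after SEHandlerCount and uses uint32_t and uint16_t in place of DWORD and WORD.

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical replica of the 32-bit structure up to SEHandlerCount */
typedef	struct	load_config_32_model
{
	uint32_t	Size;
	uint32_t	TimeDateStamp;
	uint16_t	MajorVersion;
	uint16_t	MinorVersion;
	uint32_t	GlobalFlagsClear;
	uint32_t	GlobalFlagsSet;
	uint32_t	CriticalSectionDefaultTimeout;
	uint32_t	DeCommitFreeBlockThreshold;
	uint32_t	DeCommitTotalFreeThreshold;
	uint32_t	LockPrefixTable;
	uint32_t	MaximumAllocationSize;
	uint32_t	VirtualMemoryThreshold;
	uint32_t	ProcessHeapFlags;
	uint32_t	ProcessAffinityMask;
	uint16_t	CSDVersion;
	uint16_t	DependentLoadFlags;
	uint32_t	EditList;
	uint32_t	SecurityCookie;
	uint32_t	SEHandlerTable;
	uint32_t	SEHandlerCount;
} load_config_32_model;

/* byte offset of the DependentLoadFlags member */
size_t	offset_of_dependent_load_flags(void)
{
	return offsetof(load_config_32_model, DependentLoadFlags);
}

/* byte offset of the SecurityCookie member */
size_t	offset_of_security_cookie(void)
{
	return offsetof(load_config_32_model, SecurityCookie);
}

/* size of the structure when it ends after SEHandlerCount */
size_t	size_through_se_handler_count(void)
{
	return sizeof(load_config_32_model);
}
```

With these member sizes DependentLoadFlags lies at offset 54, SecurityCookie at offset 60, and the structure measures 72 bytes up to and including SEHandlerCount; every member enabled through the LCU macro adds to that size.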
         Note: see the
            MSDN articles
            LoadLibraryEx()
            function or
            SetDefaultDllDirectories()
            function for the values of the DependentLoadFlags
            member, the articles
            HeapCreate()
            function,
            HeapAlloc()
            function or
            HeapReAlloc()
            function for the values of the ProcessHeapFlags member,
            and the article
            Gflags Flag Reference
            for the values of the GlobalFlagsClear as well as the
            GlobalFlagsSet member.
            Managing Heap Memory
            Global Flag Reference
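As a small illustration of the DependentLoadFlags member: it is only 16 bits wide, so only LOAD_LIBRARY_SEARCH_* flags that fit into a WORD can be recorded. The sketch below repeats the flag values of the Windows SDK so that it compiles anywhere; SEARCH_MASK and the two helper functions are hypothetical names.

```c
#include <stdint.h>

/* flag values as defined by the Windows SDK */
#define LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR	0x0100u
#define LOAD_LIBRARY_SEARCH_APPLICATION_DIR	0x0200u
#define LOAD_LIBRARY_SEARCH_USER_DIRS		0x0400u
#define LOAD_LIBRARY_SEARCH_SYSTEM32		0x0800u
#define LOAD_LIBRARY_SEARCH_DEFAULT_DIRS	0x1000u

/* hypothetical helper mask: all search flags listed above */
#define SEARCH_MASK	(LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR \
			| LOAD_LIBRARY_SEARCH_APPLICATION_DIR \
			| LOAD_LIBRARY_SEARCH_USER_DIRS \
			| LOAD_LIBRARY_SEARCH_SYSTEM32 \
			| LOAD_LIBRARY_SEARCH_DEFAULT_DIRS)

/* returns 1 if the flags fit into the 16-bit DependentLoadFlags member */
int	fits_dependent_load_flags(uint32_t flags)
{
	return flags <= UINT16_MAX;
}

/* returns 1 if the flags select the system directory as the only search location */
int	system32_only(uint16_t flags)
{
	return (flags & SEARCH_MASK) == LOAD_LIBRARY_SEARCH_SYSTEM32;
}
```

The value 2048 used in the listings equals LOAD_LIBRARY_SEARCH_SYSTEM32.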
        
 Save the
            ANSI C
            source presented above as lcu.c in the directory where
            you created the object library i386.lib before, then
            execute the following 3 command lines to compile it, write the
            assembly to the text file lcu.cod and add the generated
            object file i386-lcu.obj to the existing object library
            i386.lib:
        
SET CL=/c /DLCU /FAsc /GAFry /Oxy /W4 /Zl
CL.EXE /Foi386-lcu.obj lcu.c
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-lcu.obj
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
lcu.c
lcu.c(117) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'const DWORD *'
lcu.c(118) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'LPVOID *'
lcu.c(119) : warning C4047: 'initializing' : 'DWORD' differs in levels of indirection from 'BYTE *'
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
CAVEAT: verify in the assembly written to the text file
lcu.cod that the
            __security_check_cookie() function clobbers at most
            register ECX upon return to the caller when the stack
            cookie is intact!
            __cdecl
            __fastcall
         Move the
            ANSI C
            source file lcu.c into the directory where you created
            the object library amd64.lib before, then execute the
            following 3 command lines to compile it, write the assembly to the
            text file lcu.cod and add the generated object
            file amd64-lcu.obj to the object library
            amd64.lib:
        
SET CL=/c /DLCU /FAsc /GAFy /Oxy /W4 /Zl
CL.EXE /Foamd64-lcu.obj lcu.c
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-lcu.obj
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
lcu.c
lcu.c(227) : warning C4047: 'initializing' : 'DWORD64' differs in levels of indirection from 'const DWORD64 *'
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
CAVEAT: verify in the assembly written to the text file
lcu.cod that the
            __security_check_cookie() function clobbers no
            register except RCX, R8, R9,
            R10 and R11 upon return to the caller when
            the stack cookie is intact!
        ; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.686
	.model	flat; C
	extern	___safe_se_handler_count :abs
	extern	___safe_se_handler_table :ptr proc
_lcu_32	struct	4
	dword	sizeof _lcu_32
	dword	'VOID'		; 2006-04-21 21:32:06 UTC
	word	@Version / 100
	word	@Version mod 100
	dword	10 dup (0)
	word	0, 2048		; LOAD_LIBRARY_SEARCH_SYSTEM32
	dword	0
	dword	offset ___security_cookie
	dword	offset ___safe_se_handler_table
	dword	offset ___safe_se_handler_count
	dword	5 dup (0)
_lcu_32	ends
	.const
	public	__load_config_used
__load_config_used \
	_lcu_32	<>		; IMAGE_LOAD_CONFIG_DIRECTORY32
	.data
	public	___security_cookie
___security_cookie \
	dword	3141592654
	.code
@__security_check_cookie@4 \
	proc	public		; void __fastcall __security_check_cookie(dword cookie)
	cmp	ecx, ___security_cookie
	jne	short fastfail
	ret
fastfail:
	mov	ecx, 2		; ecx = FAST_FAIL_STACK_COOKIE_CHECK_FAILURE
	int	41
	ud2
@__security_check_cookie@4 \
	endp
___security_init_cookie \
	proc	public		; void __cdecl __security_init_cookie(void)
	mov	eax, ___security_cookie
	cmp	eax, 3141592654
	je	short init
	test	eax, eax
	jne	short exit
init:
	rdtsc			; eax = low dword of time stamp counter,
				; edx = high dword of time stamp counter
	xor	eax, edx	; eax = random number
	mov	___security_cookie, eax
exit:
	ret
___security_init_cookie	\
	endp
	end
 Save the i386 assembler source presented above as
            i386-lcu.asm in the directory where you created the
            object library i386.lib before, then execute the
            following 3 command lines to generate the object file
            i386-lcu.obj and add it to the existing object library
            i386.lib:
SET ML=/c /safeseh /W3 /X
ML.EXE i386-lcu.asm
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-lcu.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: i386-lcu.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
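For readers less familiar with assembler, the logic of the i386 routine ___security_init_cookie shown above can be paraphrased in portable C. This is a sketch, not a replacement: init_cookie_model() is a hypothetical name, and the time stamp counter is passed in as a parameter because reading it is not portable.

```c
#include <stdint.h>

#define DEFAULT_COOKIE	0xBB40E64EUL	/* = 3141592654, the initial value in the listing */

/* returns the cookie to store: an already initialised cookie is kept,
   while the default value and 0 are replaced by the low dword of the
   time stamp counter xor'ed with its high dword */
uint32_t	init_cookie_model(uint32_t current, uint64_t tsc)
{
	if ((current != DEFAULT_COOKIE) && (current != 0))
		return current;

	return (uint32_t) tsc ^ (uint32_t) (tsc >> 32);	/* xor eax, edx */
}
```

Note that the xor can by chance yield 0 or the default value again; the routine in the listing does not guard against that either.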
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
_lcu_64	struct	8
	dword	sizeof _lcu_64
	dword	'VOID'		; 2006-04-21 21:32:06 UTC
	word	@Version / 100
	word	@Version mod 100
	dword	0, 0, 0
	qword	0, 0, 0, 0, 0, 0
	dword	0
	word	0, 2048		; LOAD_LIBRARY_SEARCH_SYSTEM32
	qword	0
	qword	offset __security_cookie
	qword	0, 0, 0, 0, 0, 0
	dword	0
_lcu_64	ends
	.const
	public	_load_config_used
_load_config_used \
	_lcu_64	<>		; IMAGE_LOAD_CONFIG_DIRECTORY64
	.data
	public	__security_cookie
__security_cookie \
	qword	3141592653589793238 shr 16
	.code
__security_check_cookie \
	proc	public		; void __security_check_cookie(qword cookie)
	cmp	rcx, __security_cookie
	jne	short fastfail
;;	shr	rcx, 48
;;	jnz	short fastfail
	ret
fastfail:
	mov	ecx, 2		; rcx = FAST_FAIL_STACK_COOKIE_CHECK_FAILURE
	int	41
	ud2
__security_check_cookie \
	endp
__security_init_cookie \
	proc	public		; void __security_init_cookie(void)
	mov	rax, __security_cookie
	mov	rcx, 3141592653589793238 shr 16
	cmp	rcx, rax
	je	short init
	test	rax, rax
	jne	short exit
init:
	rdtsc			; rax = low dword of time stamp counter,
				; rdx = high dword of time stamp counter
	mov	ecx, edx	; rcx = high dword of time stamp counter
	bswap	edx
	imul	rcx, rax	; rcx = high dword of time stamp counter
				;     * low dword of time stamp counter
	bswap	rax
	xor	rax, rdx	; rax = byte-swapped time stamp counter
	mul	rcx
	xor	rax, rdx	; rax = random number
	shr	rax, 16
	mov	__security_cookie, rax
exit:
	ret
__security_init_cookie \
	endp
__GSHandlerCheck \
	proc	private		; int __GSHandlerCheck(void *, void *, void *, void *)
	xor	eax, eax
	inc	eax		; rax = ExceptionContinueSearch
	ret
__GSHandlerCheck \
	endp
	end
 Save the AMD64 assembler source presented above as
            amd64-lcu.asm in the directory where you created the
            object library amd64.lib before, then execute the
            following 3 command lines to generate the object file
            amd64-lcu.obj and add it to the existing object library
            amd64.lib:
SET ML=/c /W3 /X
ML64.EXE amd64-lcu.asm
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-lcu.obj
For details and reference see the MSDN articles ML and ML64 Command-Line Reference and Running LIB.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: amd64-lcu.asm
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/DELAYLOAD
            Linker Support for Delay-Loaded DLLs
            Specifying DLLs to Delay Load
            Constraints of Delay Loading DLLs
            Binding Imports
            Explicitly Unloading a Delay-Loaded DLL
            Understanding the Helper Function
            Error Handling and Notification
            Exceptions
            Failure Hooks
            Notification Hooks
            Structure and Constant Definitions
            Developing Your Own Helper Function
            Calling Conventions, Parameters, and Return Type
            Calculating Necessary Values
            Unloading a Delay-Loaded DLL
        // Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#pragma comment(lib, "kernel32.lib")
#pragma comment(user, "(C)opyright 2004-2025, Stefan Kanthak")
#ifndef EXCEPTION_DELAY_LOAD_INVALID_PARAMETER
#define EXCEPTION_DELAY_LOAD_INVALID_PARAMETER	0xC06D0057
#endif
#ifndef EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND
#define EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND	0xC06D007E
#endif
#ifndef EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND
#define EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND	0xC06D007F
#endif
extern	const	IMAGE_DOS_HEADER	__ImageBase;
typedef	DWORD	RVA;
typedef	enum	dliNotify
{
	dliStartProcessing,
	dliNotePreLoadLibrary,
	dliNotePreGetProcAddress,
	dliFailLoadLib,
	dliFailGetProc,
	dliNoteEndProcessing
} dliNotify;
typedef	struct	DelayLoadDescr
{
	DWORD	dwAttributes;		// 1UL = all members are RVAs
	DWORD	dwDllName;
	DWORD	dwHMODULE;		// RVA of module handle
	DWORD	dwIAT;			// RVA of import address table
	DWORD	dwINT;			// RVA of import name table
	DWORD	dwBoundIAT;		// RVA of optional bound import address table
	DWORD	dwUnloadIAT;		// RVA of optional copy of original import address table
	DWORD	dwTimeStamp;
} DelayLoadDescr;
typedef	struct	DelayLoadProc
{
	BOOL	fImportByName;
	union
	{
		LPCSTR	szProcName;
		DWORD	dwOrdinal;
	};
} DelayLoadProc;
typedef	struct	DelayLoadInfo
{
	DWORD		cb;		// size of structure
	DelayLoadDescr	*pidd;		// raw form of data (everything is there)
	FARPROC		*ppfn;		// points to address of function to load
	LPCSTR		szDll;		// name of DLL
	DelayLoadProc	dlp;		// name or ordinal of function to load
	HMODULE		hmodCur;	// handle of DLL
	FARPROC		pfnCur;		// actual function that will be called
	DWORD		dwLastError;	// error received (if an error notification)
} DelayLoadInfo;
typedef	FARPROC	(WINAPI *PfnDliHook) (dliNotify, DelayLoadInfo *);
BOOL	WINAPI	__FUnloadDelayLoadedDLL2(LPCSTR szDll);
FARPROC	WINAPI	__delayLoadHelper2(DelayLoadDescr *lpDLD, FARPROC *lpfnIATEntry)
{
	HMODULE	hModule;
	HMODULE			*lpHMODULE = (HMODULE *) ((LPBYTE) &__ImageBase + lpDLD->dwHMODULE);
	IMAGE_THUNK_DATA	*lpINT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwINT);
	IMAGE_THUNK_DATA	*lpIAT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwIAT);
	IMAGE_THUNK_DATA	*lpBoundIAT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwBoundIAT);
	DWORD			dwEntry = (IMAGE_THUNK_DATA *) lpfnIATEntry - lpIAT;
	// NOTE: *lpfnIATEntry == lpIAT[dwEntry].u1.Function
	DelayLoadInfo	dli = {sizeof(DelayLoadInfo),
			       lpDLD,
			       lpfnIATEntry,
			       (LPCSTR) &__ImageBase + lpDLD->dwDllName,
			       {!IMAGE_SNAP_BY_ORDINAL(lpINT[dwEntry].u1.Ordinal),
			         IMAGE_SNAP_BY_ORDINAL(lpINT[dwEntry].u1.Ordinal)
			       ? IMAGE_ORDINAL(lpINT[dwEntry].u1.Ordinal)
			       : ((IMAGE_IMPORT_BY_NAME *) ((LPBYTE) &__ImageBase + lpINT[dwEntry].u1.AddressOfData))->Name},
			       *lpHMODULE,
			       (FARPROC) NULL,
			       ERROR_SUCCESS};
	if ((lpDLD->dwAttributes & 1UL) == 0UL)	// members are not RVAs?
	{
		dli.dwLastError = ERROR_INVALID_PARAMETER;
		RaiseException(EXCEPTION_DELAY_LOAD_INVALID_PARAMETER,
		               EXCEPTION_NONCONTINUABLE,
		               1UL,
		               (DWORD_PTR *) &dli);
		return (FARPROC) NULL;
	}
	if (dli.hmodCur == NULL)	// module not yet loaded?
	{
#ifndef LOAD_LIBRARY_SEARCH_SYSTEM32
		dli.hmodCur = LoadLibraryA(dli.szDll);
#else
		dli.hmodCur = LoadLibraryExA(dli.szDll, NULL, LOAD_LIBRARY_SEARCH_SYSTEM32);
#endif
		if (dli.hmodCur == NULL)
		{
			dli.dwLastError = GetLastError();
			RaiseException(EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND,
			               EXCEPTION_NONCONTINUABLE,
			               1UL,
			               (DWORD_PTR *) &dli);
			return (FARPROC) NULL;
		}
#ifndef _WIN64
		hModule = (HMODULE) InterlockedExchangePointer((LPVOID *) lpHMODULE, dli.hmodCur);
#else
		hModule = (HMODULE) _InterlockedExchangePointer((LPVOID *) lpHMODULE, dli.hmodCur);
#endif
		if (hModule == dli.hmodCur)
			FreeLibrary(dli.hmodCur);
		else
			if (lpDLD->dwUnloadIAT != 0UL)
			{
				// ...
			}
	}
	if ((lpBoundIAT != NULL) && (lpDLD->dwTimeStamp != 0UL))
	{
		IMAGE_NT_HEADERS	*lpModule = (IMAGE_NT_HEADERS *) ((LPBYTE) dli.hmodCur + ((IMAGE_DOS_HEADER *) dli.hmodCur)->e_lfanew);
		if ((lpModule->Signature == IMAGE_NT_SIGNATURE)
		 && (lpModule->FileHeader.TimeDateStamp == lpDLD->dwTimeStamp)
		 && (lpModule->OptionalHeader.ImageBase == dli.hmodCur))
		{
			dli.pfnCur = (FARPROC) lpBoundIAT[dwEntry].u1.Function;
			if (dli.pfnCur != NULL)
				return *lpfnIATEntry = dli.pfnCur;
		}
	}
	dli.pfnCur = GetProcAddress(dli.hmodCur, dli.dlp.szProcName);
	if (dli.pfnCur != NULL)		// function address resolved?
		return *lpfnIATEntry = dli.pfnCur;
	dli.dwLastError = GetLastError();
	RaiseException(EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND,
	               EXCEPTION_NONCONTINUABLE,
	               1UL,
	               (DWORD_PTR *) &dli);
	return (FARPROC) NULL;
}
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.model	flat, C
	extern	__pfnDliDefaultHook2 :ptr proc
	extern	__pfnDliFailureHook2 (__pfnDliDefaultHook2) :ptr proc
	extern	__pfnDliNotifyHook2 (__pfnDliDefaultHook2) :ptr proc
	end
; Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	extern	__pfnDliDefaultHook2 :ptr proc
	extern	__pfnDliFailureHook2 (__pfnDliDefaultHook2) :ptr proc
	extern	__pfnDliNotifyHook2 (__pfnDliDefaultHook2) :ptr proc
	end
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#pragma comment(lib, "kernel32.lib")
#ifdef _WIN64
#pragma comment(linker, "/ALTERNATENAME:__pfnDliFailureHook2=__pfnDliDefaultHook2")
#pragma comment(linker, "/ALTERNATENAME:__pfnDliNotifyHook2=__pfnDliDefaultHook2")
#else
#pragma comment(linker, "/ALTERNATENAME:___pfnDliFailureHook2=___pfnDliDefaultHook2")
#pragma comment(linker, "/ALTERNATENAME:___pfnDliNotifyHook2=___pfnDliDefaultHook2")
#endif
#pragma comment(user, "(C)opyright 2004-2025, Stefan Kanthak")
#ifndef EXCEPTION_DELAY_LOAD_INVALID_PARAMETER
#define EXCEPTION_DELAY_LOAD_INVALID_PARAMETER	0xC06D0057
#endif
#ifndef EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND
#define EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND	0xC06D007E
#endif
#ifndef EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND
#define EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND	0xC06D007F
#endif
extern	const	IMAGE_DOS_HEADER	__ImageBase;
typedef	DWORD	RVA;
typedef	enum	dliNotify
{
	dliStartProcessing,
	dliNotePreLoadLibrary,
	dliNotePreGetProcAddress,
	dliFailLoadLib,
	dliFailGetProc,
	dliNoteEndProcessing
} dliNotify;
typedef	struct	DelayLoadDescr
{
	DWORD	dwAttributes;		// 1UL = all members are RVAs
	DWORD	dwDllName;
	DWORD	dwHMODULE;		// RVA of module handle
	DWORD	dwIAT;			// RVA of import address table
	DWORD	dwINT;			// RVA of import name table
	DWORD	dwBoundIAT;		// RVA of optional bound import address table
	DWORD	dwUnloadIAT;		// RVA of optional copy of original import address table
	DWORD	dwTimeStamp;
} DelayLoadDescr;
typedef	struct	DelayLoadProc
{
	BOOL	fImportByName;
	union
	{
		LPCSTR	szProcName;
		DWORD	dwOrdinal;
	};
} DelayLoadProc;
typedef	struct	DelayLoadInfo
{
	DWORD		cb;		// size of structure
	DelayLoadDescr	*pidd;		// raw form of data (everything is there)
	FARPROC		*ppfn;		// points to address of function to load
	LPCSTR		szDll;		// name of DLL
	DelayLoadProc	dlp;		// name or ordinal of function to load
	HMODULE		hmodCur;	// handle of DLL
	FARPROC		pfnCur;		// actual function that will be called
	DWORD		dwLastError;	// error received (if an error notification)
} DelayLoadInfo;
typedef	FARPROC	(WINAPI *PfnDliHook) (dliNotify, DelayLoadInfo *);
extern	PfnDliHook	__pfnDliNotifyHook2;
extern	PfnDliHook	__pfnDliFailureHook2;
const	PfnDliHook	__pfnDliDefaultHook2 = NULL;
BOOL	WINAPI	__FUnloadDelayLoadedDLL2(LPCSTR szDll);
FARPROC	WINAPI	__delayLoadHelper2(DelayLoadDescr *lpDLD, FARPROC *lpfnIATEntry)
{
	HMODULE	hModule;
	HMODULE			*lpHMODULE = (HMODULE *) ((LPBYTE) &__ImageBase + lpDLD->dwHMODULE);
	IMAGE_THUNK_DATA	*lpINT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwINT);
	IMAGE_THUNK_DATA	*lpIAT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwIAT);
	IMAGE_THUNK_DATA	*lpBoundIAT = (IMAGE_THUNK_DATA *) ((LPBYTE) &__ImageBase + lpDLD->dwBoundIAT);
	DWORD			dwEntry = (IMAGE_THUNK_DATA *) lpfnIATEntry - lpIAT;
	// NOTE: *lpfnIATEntry == lpIAT[dwEntry].u1.Function
	DelayLoadInfo	dli = {sizeof(DelayLoadInfo),
			       lpDLD,
			       lpfnIATEntry,
			       (LPCSTR) &__ImageBase + lpDLD->dwDllName,
			       {!IMAGE_SNAP_BY_ORDINAL(lpINT[dwEntry].u1.Ordinal),
			         IMAGE_SNAP_BY_ORDINAL(lpINT[dwEntry].u1.Ordinal)
			       ? IMAGE_ORDINAL(lpINT[dwEntry].u1.Ordinal)
			       : ((IMAGE_IMPORT_BY_NAME *) ((LPBYTE) &__ImageBase + lpINT[dwEntry].u1.AddressOfData))->Name},
			       *lpHMODULE,
			       (FARPROC) NULL,
			       ERROR_SUCCESS};
	if (__pfnDliNotifyHook2 != NULL)
	{
		dli.pfnCur = (*__pfnDliNotifyHook2)(dliStartProcessing, &dli);
		if (dli.pfnCur != NULL)
			goto SUCCESS;
	}
	if ((lpDLD->dwAttributes & 1UL) == 0UL)	// members are not RVAs?
	{
		dli.dwLastError = ERROR_INVALID_PARAMETER;
		RaiseException(EXCEPTION_DELAY_LOAD_INVALID_PARAMETER,
		               EXCEPTION_NONCONTINUABLE,
		               1UL,
		               (DWORD_PTR *) &dli);
		goto FAILURE;
	}
	if (dli.hmodCur != NULL)	// module already loaded?
		goto ADDRESS;
	if (__pfnDliNotifyHook2 != NULL)
		dli.hmodCur = (HMODULE) (*__pfnDliNotifyHook2)(dliNotePreLoadLibrary, &dli);
	if (dli.hmodCur != NULL)	// module handle resolved by notification routine?
		goto ADDRESS;
#ifndef LOAD_LIBRARY_SEARCH_SYSTEM32
	dli.hmodCur = LoadLibraryA(dli.szDll);
#else
	dli.hmodCur = LoadLibraryExA(dli.szDll, NULL, LOAD_LIBRARY_SEARCH_SYSTEM32);
#endif
	if (dli.hmodCur == NULL)	// module not loaded?
	{
		dli.dwLastError = GetLastError();
		if (__pfnDliFailureHook2 != NULL)
			dli.hmodCur = (HMODULE) (*__pfnDliFailureHook2)(dliFailLoadLib, &dli);
		if (dli.hmodCur == NULL)
		{
			RaiseException(EXCEPTION_DELAY_LOAD_MODULE_NOT_FOUND,
			               EXCEPTION_NONCONTINUABLE,
			               1UL,
			               (DWORD_PTR *) &dli);
			goto FAILURE;
		}
	}
#ifndef _WIN64	// store the module handle, whether loaded here or supplied by the failure routine
	hModule = (HMODULE) InterlockedExchangePointer((LPVOID *) lpHMODULE, dli.hmodCur);
#else
	hModule = (HMODULE) _InterlockedExchangePointer((LPVOID *) lpHMODULE, dli.hmodCur);
#endif
	if (hModule == dli.hmodCur)	// handle already stored by another thread?
		FreeLibrary(dli.hmodCur);
	else
		if (lpDLD->dwUnloadIAT != 0UL)
		{
			// ...
		}
ADDRESS:
	if (__pfnDliNotifyHook2 != NULL)
		dli.pfnCur = (*__pfnDliNotifyHook2)(dliNotePreGetProcAddress, &dli);
	if (dli.pfnCur != NULL)		// function address resolved by notification routine?
		goto SUCCESS;
	if ((lpBoundIAT != NULL) && (lpDLD->dwTimeStamp != 0UL))
	{
		IMAGE_NT_HEADERS	*lpModule = (IMAGE_NT_HEADERS *) ((LPBYTE) dli.hmodCur + ((IMAGE_DOS_HEADER *) dli.hmodCur)->e_lfanew);
		if ((lpModule->Signature == IMAGE_NT_SIGNATURE)
		 && (lpModule->FileHeader.TimeDateStamp == lpDLD->dwTimeStamp)
		 && (lpModule->OptionalHeader.ImageBase == dli.hmodCur))
		{
			dli.pfnCur = (FARPROC) lpBoundIAT[dwEntry].u1.Function;
			if (dli.pfnCur != NULL)
				goto SUCCESS;
		}
	}
	dli.pfnCur = GetProcAddress(dli.hmodCur, dli.dlp.szProcName);
	if (dli.pfnCur != NULL)		// function address resolved?
		goto SUCCESS;
	dli.dwLastError = GetLastError();
	if (__pfnDliFailureHook2 != NULL)
		dli.pfnCur = (*__pfnDliFailureHook2)(dliFailGetProc, &dli);
	if (dli.pfnCur != NULL)		// function address resolved by failure routine?
		goto SUCCESS;
	RaiseException(EXCEPTION_DELAY_LOAD_ENTRY_NOT_FOUND,
	               EXCEPTION_NONCONTINUABLE,
	               1UL,
	               (DWORD_PTR *) &dli);
	goto FAILURE;
SUCCESS:
	*lpfnIATEntry = dli.pfnCur;
FAILURE:
	if (__pfnDliNotifyHook2 != NULL)
		(*__pfnDliNotifyHook2)(dliNoteEndProcessing, &dli);
	return dli.pfnCur;
}
 Save the ANSI C source presented above as
            dli.c in the directory where
            you created the object library i386.lib before, then
            execute the following 3 command lines to compile it and add the
            generated object file i386-dli.obj to the existing
            object library i386.lib:
SET CL=/c /GAFyz /Oxy /W4 /Zl
CL.EXE /Foi386-dli.obj dli.c
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-dli.obj
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
dli.c
dli.c(65) : warning C4201: nonstandard extension used : nameless struct/union
dli.c(100) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(101) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(102) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(103) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(106) : warning C4047: ':' : 'DWORD' differs in levels of indirection from 'BYTE *'
dli.c(106) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(106) : warning C4057: 'initializing' : 'LPCSTR' differs in indirection to slightly different base types from 'BYTE *'
dli.c(107) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(135) : warning C4054: 'type cast' : from function pointer 'FARPROC' to data pointer 'HMODULE'
dli.c(140) : warning C4054: 'type cast' : from function pointer 'FARPROC' to data pointer 'HMODULE'
dli.c(186) : warning C4047: '==' : 'DWORD' differs in levels of indirection from 'HMODULE'
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
__stdcall
 Move the ANSI C source file
dli.c into the directory where you created
            the object library amd64.lib before, then execute the
            following 3 command lines to compile it and add the generated object
            file amd64-dli.obj to the object library
            amd64.lib:
SET CL=/c /GAFy /Oxy /W4 /Zl
CL.EXE /Foamd64-dli.obj dli.c
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-dli.obj
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
dli.c
dli.c(65) : warning C4201: nonstandard extension used : nameless struct/union
dli.c(95) : warning C4244: 'initializing' : conversion from '__int64' to 'DWORD', possible loss of data
dli.c(100) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(101) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(102) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(103) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(106) : warning C4047: ':' : 'ULONGLONG' differs in levels of indirection from 'BYTE *'
dli.c(106) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(106) : warning C4057: 'initializing' : 'LPCSTR' differs in indirection to slightly different base types from 'BYTE *'
dli.c(107) : warning C4204: nonstandard extension used : non-constant aggregate initializer
dli.c(135) : warning C4054: 'type cast' : from function pointer 'FARPROC' to data pointer 'HMODULE'
dli.c(149) : warning C4054: 'type cast' : from function pointer 'FARPROC' to data pointer 'HMODULE'
dli.c(186) : warning C4047: '==' : 'ULONGLONG' differs in levels of indirection from 'HMODULE'
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
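How __delayLoadHelper2() in the listings above derives the table index from the address of an IAT slot, and how it tells an import by ordinal from an import by name, can be shown in isolation. The sketch below assumes 32-bit thunks and uses hypothetical helper names; only the constant 0x80000000 (IMAGE_ORDINAL_FLAG32) and the 16-bit ordinal mask are taken from the PE format.

```c
#include <stddef.h>
#include <stdint.h>

#define ORDINAL_FLAG32	0x80000000u	/* value of IMAGE_ORDINAL_FLAG32 */

/* index of a slot within the import address table, corresponding to
   dwEntry = (IMAGE_THUNK_DATA *) lpfnIATEntry - lpIAT in the listing */
size_t	entry_index(const uint32_t *iat, const uint32_t *slot)
{
	return (size_t) (slot - iat);
}

/* mirrors IMAGE_SNAP_BY_ORDINAL32(): the most significant bit marks an
   import by ordinal */
int	by_ordinal(uint32_t thunk)
{
	return (thunk & ORDINAL_FLAG32) != 0;
}

/* mirrors IMAGE_ORDINAL32(): the ordinal occupies the low 16 bits */
uint16_t	ordinal_of(uint32_t thunk)
{
	return (uint16_t) (thunk & 0xFFFFu);
}
```

Because the import name table and the import address table run in parallel, the same index selects the matching entry of both. An ordinal obtained this way has a high-order word of zero, which is the form GetProcAddress() accepts in place of a name pointer.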
main() and wmain() Support
In its Remarks section, the documentation of the linker option /ENTRY:‹symbol› states:

Remarks
The /ENTRY option specifies an entry point function as the starting address for an .exe file or DLL.
The function must be defined to use the __stdcall calling convention. The parameters and return value depend on if the program is a console application, a windows application or a DLL. It is recommended that you let the linker set the entry point so that the C run-time library is initialized correctly, and C++ constructors for static objects are executed.
By default, the starting address is a function name from the C run-time library. The linker selects it according to the attributes of the program, as shown in the following table.

| Function name | Default for |
|---|---|
| mainCRTStartup (or wmainCRTStartup) | An application that uses /SUBSYSTEM:CONSOLE; calls main (or wmain) |
| WinMainCRTStartup (or wWinMainCRTStartup) | An application that uses /SUBSYSTEM:WINDOWS; calls WinMain (or wWinMain), which must be defined to use __stdcall |
| _DllMainCRTStartup | A DLL; calls DllMain if it exists, which must be defined to use __stdcall |

If the /DLL or /SUBSYSTEM option is not specified, the linker selects a subsystem and entry point depending on whether main or WinMain is defined.
The functions main, WinMain, and DllMain are the three forms of the user-defined entry point.

Dynamic-Link Library Entry-Point Function
DllMain Callback Function

OUCH: mainCRTStartup(), wmainCRTStartup(), WinMainCRTStartup() and wWinMainCRTStartup(), the 4 entry point functions for applications, however use the __cdecl calling and naming convention; they take the address of the Process Environment Block as argument and return a 32-bit integer as exit code of the thread respectively the process.
Processes and Threads
The MSDN article Format of a C Decorated Name specifies:

The form of decoration for a C function depends on the calling convention used in its declaration, as shown below. Note that in a 64-bit environment, functions are not decorated.

| Calling convention | Decoration |
|---|---|
| __cdecl (the default) | Leading underscore (_) |
| __stdcall | Leading underscore (_) and a trailing at sign (@) followed by a number representing the number of bytes in the parameter list |
| __fastcall | Same as __stdcall, but prepended by an at sign instead of an underscore |
| __vectorcall | Two trailing at signs (@@) followed by the decimal number of bytes in the parameter list. |

The MSDN articles __cdecl and __stdcall provide more details.
__fastcall
__vectorcall
Calling Conventions
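The decoration rules quoted above can be sketched as a small portable C helper. This is only an illustration under stated assumptions: the names decorate() and enum convention are hypothetical and not part of any Microsoft toolchain, and the sketch covers only the four rows of the table for the i386 platform.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names, introduced for this illustration only. */
enum convention { CC_CDECL, CC_STDCALL, CC_FASTCALL, CC_VECTORCALL };

/* Writes the decorated i386 name of the C function 'name' to 'out'.
   'bytes' is the size of the parameter list; it is ignored for __cdecl.
   Returns the value of snprintf(), or -1 for an unknown convention. */
static int decorate(char *out, size_t size, const char *name,
                    enum convention cc, unsigned bytes)
{
    assert(out != NULL && name != NULL);
    switch (cc) {
    case CC_CDECL:      /* leading underscore */
        return snprintf(out, size, "_%s", name);
    case CC_STDCALL:    /* leading underscore, trailing @bytes */
        return snprintf(out, size, "_%s@%u", name, bytes);
    case CC_FASTCALL:   /* leading at sign, trailing @bytes */
        return snprintf(out, size, "@%s@%u", name, bytes);
    case CC_VECTORCALL: /* two trailing at signs, then bytes */
        return snprintf(out, size, "%s@@%u", name, bytes);
    }
    return -1;
}
```

Called with "mainCRTStartup" and CC_CDECL the helper yields _mainCRTStartup; called with "_DllMainCRTStartup", CC_STDCALL and 12 bytes it yields __DllMainCRTStartup@12. These are the forms in which the linker reports unresolved entry point symbols.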
// Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
typedef	unsigned short	wchar_t;
#ifdef CONSOLE
#ifdef UNICODE
int wmain(int argc, wchar_t *argv[], wchar_t *envp[])
#else
int main(int argc, char *argv[], char *envp[])
#endif
{
    return *envp != argv[argc];
}
#else // WINDOWS
#ifdef UNICODE
int wWinMain(void *current, void *previous, wchar_t *cmdline, int show)
#else
int WinMain(void *current, void *previous, char *cmdline, int show)
#endif
{
    return cmdline[current == previous] != show;
}
#endif // WINDOWS
Save the ANSI C source presented above as i386-sys.c in an arbitrary, preferably empty directory, then execute the following 5 command lines to compile and (attempt to) link it:
SET CL=/W4 /X /Zl
CL.EXE /DUNICODE /Gz i386-sys.c
CL.EXE /Gz i386-sys.c
CL.EXE /DCONSOLE /Gd i386-sys.c
CL.EXE /DCONSOLE /DUNICODE /Gd i386-sys.c
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-sys.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/out:i386-sys.exe
i386-sys.obj
LINK : error LNK2001: unresolved external symbol _wWinMainCRTStartup
i386-sys.exe : fatal error LNK1120: 1 unresolved externals
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-sys.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/out:i386-sys.exe
i386-sys.obj
LINK : error LNK2001: unresolved external symbol _WinMainCRTStartup
i386-sys.exe : fatal error LNK1120: 1 unresolved externals
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-sys.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/out:i386-sys.exe
i386-sys.obj
LINK : error LNK2001: unresolved external symbol _mainCRTStartup
i386-sys.exe : fatal error LNK1120: 1 unresolved externals
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
i386-sys.c
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/out:i386-sys.exe
i386-sys.obj
LINK : error LNK2001: unresolved external symbol _wmainCRTStartup
i386-sys.exe : fatal error LNK1120: 1 unresolved externals
OUCH: the linker expects the 4 entry point functions for applications, mainCRTStartup(), wmainCRTStartup(), WinMainCRTStartup() and wWinMainCRTStartup(), to be defined using the __cdecl naming convention, i.e. without the decoration appended to the name of functions defined using the __stdcall naming convention!
        ; Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.386
	.model	flat, C
	.code
main	proc	public
	assume	fs :flat	; fs = address of TEB
	mov	eax, fs:[48]	; eax = address of PEB
	xor	eax, [esp+4]	; eax = 0
	ret
main	endp
	end	main		; writes "/ENTRY:main" to '.drectve' section
Save the i386 assembler source presented above as i386-sys.asm in an arbitrary, preferably empty directory, then execute the following 5 command lines to build the console application i386-sys.exe, execute it and display its exit code:
SET ML=/safeseh /W3 /X
SET LINK=
ML.EXE i386-sys.asm
.\i386-sys.exe
ECHO %ERRORLEVEL%
Microsoft (R) Macro Assembler Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: i386-sys.asm
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/OUT:i386-sys.exe
i386-sys.obj
0
; Copyleft © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
	.code
wmain	proc	public
				; gs = address of TEB
				; rcx = address of PEB
	xor	eax, eax	; rax = 0
	cmp	rcx, gs:[96]
	sete	al		; rax = 1
	ret
wmain	endp
	end
Save the AMD64 assembler source presented above as amd64-sys.asm in an arbitrary, preferably empty directory, then execute the following 5 command lines to build the console application amd64-sys.exe, execute it and display its exit code:
SET ML=/W3 /X
SET LINK=/ENTRY:wmain
ML64.EXE amd64-sys.asm
.\amd64-sys.exe
ECHO %ERRORLEVEL%
Microsoft (R) Macro Assembler (x64) Version 10.00.40219.01
Copyright (C) Microsoft Corporation. All rights reserved.
 Assembling: amd64-sys.asm
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
/ENTRY:wmain
/OUT:amd64-sys.exe
amd64-sys.obj
1
The MSDN articles listed below specify the main() and wmain() functions,
            their arguments, how to parse the command line returned from the
            GetCommandLine()
            function and how to split the environment block returned from the
            GetEnvironmentStrings()
            function to derive these arguments.
            Using wmain instead of main
            Argument Definitions
            main Function Restrictions
            Parsing C Command-Line Arguments
            WinMain function
         Note: the
            CommandLineToArgvW()
            function uses the same algorithm, but supports only
            UTF-16LE
            encoding.
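The backslash and double quote rules for command-line arguments can be sketched in portable C. This is a minimal illustration, not the implementation used by any runtime: the function name next_argument() is hypothetical, the special case of two consecutive double quotes inside a quoted string is deliberately left out, and the first argument (the program name) is treated like every other argument.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the next argument of 'line' to 'out', which
   must provide room for strlen(line) + 1 characters, and returns a pointer
   to the unprocessed remainder of the command line. */
static const char *next_argument(const char *line, char *out)
{
    int quoted = 0;                        /* inside double quotes? */

    assert(line != NULL && out != NULL);
    while (*line == ' ' || *line == '\t')  /* skip leading whitespace */
        line++;
    while (*line != '\0') {
        size_t backslashes = 0;
        size_t i;

        while (*line == '\\') {            /* count a run of backslashes */
            backslashes++;
            line++;
        }
        if (*line == '"') {
            for (i = 0; i < backslashes / 2; i++)
                *out++ = '\\';             /* keep half of the backslashes */
            if (backslashes % 2 != 0)
                *out++ = '"';              /* odd count: escaped double quote */
            else
                quoted = !quoted;          /* even count: toggle quoting */
            line++;
        } else {
            for (i = 0; i < backslashes; i++)
                *out++ = '\\';             /* backslashes are literal here */
            if (*line == '\0')
                break;
            if (!quoted && (*line == ' ' || *line == '\t'))
                break;                     /* unquoted whitespace ends it */
            *out++ = *line++;              /* regular character */
        }
    }
    *out = '\0';
    return line;
}
```

Calling the helper repeatedly until the returned pointer addresses the terminating NUL character splits a whole command line; because an argument never grows while it is unescaped, a buffer of the size of the command line is sufficient.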
        
 The following
            ANSI C
            program provides the glue
 between the
            mainCRTStartup() or wmainCRTStartup()
            entry point functions and the main() or
            wmain() functions:
        
// Copyright © 2004-2025, Stefan Kanthak <stefan.kanthak@nexgo.de>
#define STRICT
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#pragma comment(lib, "kernel32.lib")
#pragma comment(user, "(C)opyright 2004-2025, Stefan Kanthak")
extern	int	main(int argc, char const *argv[], char const *envp[]);
extern	int	wmain(int argc, wchar_t const *argv[], wchar_t const *envp[]);
__declspec(noreturn)
__declspec(safebuffers)
VOID	CDECL	mainCRTStartup(VOID)
{
	LPSTR	lpArgument;
	LPCSTR	lpCmdLine = GetCommandLineA();
	LPCSTR	lpBlock = GetEnvironmentStringsA();
	LPCSTR	lpCount = lpCmdLine;
	UINT	uiCount = 0U;
	UINT	uiQuote = 0U;
	UINT	argc;
	LPCSTR	*argv;
	LPCSTR	*envp;
	argc = (*lpCount != ' ' && *lpCount != '\t');
	while (*lpCount != '\0')	// count arguments
	{
		if (!uiQuote		// whitespace outside double quotes?
		 && (*lpCount == ' ' || *lpCount == '\t'))
		{
			do		// skip unquoted whitespace
				lpCount++;
			while (*lpCount == ' ' || *lpCount == '\t');
			argc += (*lpCount != '\0');
			uiCount = 0U;
			continue;
		}
		else if (*lpCount == '\\')
			uiCount ^= ~0U;
		else if (!uiCount	// unescaped double quote?
		      && *lpCount == '"')
			uiQuote ^= ~0U;
		else			// regular character
			uiCount = 0U;
		lpCount++;
	}
	if (uiQuote)			// unpaired double quote?
		SetLastError(ERROR_BAD_ARGUMENTS);
	argv = (LPCSTR *) _alloca((argc + 1U) * sizeof(*argv) + (lpCount + 1U - lpCmdLine) * sizeof(**argv));
	argv[0] = lpArgument = (LPSTR) (argv + argc + 1U);
	argc = uiCount = uiQuote = 0U;
	while (*lpCmdLine != '\0')	// process arguments
	{
		if (!uiQuote		// whitespace outside double quotes?
		 && (*lpCmdLine == ' ' || *lpCmdLine == '\t'))
		{			// terminate current argument
			*lpArgument = '\0';
			do		// skip unquoted whitespace
				lpCmdLine++;
			while (*lpCmdLine == ' ' || *lpCmdLine == '\t');
			if (*lpCmdLine != '\0')
					// store address of next argument
				argv[++argc] = lpArgument = (LPSTR) lpCmdLine;
			uiCount = 0U;
		}
		else if (*lpCmdLine == '\\')
		{
			*lpArgument++ = *lpCmdLine++;
			uiCount++;	// count backslash
		}
		else if (*lpCmdLine == '"')
		{
			lpArgument -= (uiCount + 1U) / 2U;
			if (uiCount & 1U)
					// double quote preceded by odd number
					//  of backslashes: keep half of them
					//   and the (escaped) double quote
				*lpArgument++ = *lpCmdLine++;
			else		// double quote preceded by even number
					//  of backslashes: keep half of them
				if (*++lpCmdLine == '"' && uiQuote)
					// double quote inside double quotes and
					//  followed by another double quote:
					//   keep one double quote
					*lpArgument++ = *lpCmdLine++;
				else	// skip double quote and toggle state
					uiQuote ^= ~0U;
			uiCount = 0U;
		}
		else			// regular character
		{
			*lpArgument++ = *lpCmdLine++;
			uiCount = 0U;
		}
	}
	*lpArgument = '\0';		// terminate (last) argument
	argv[++argc] = NULL;		// store terminating NULL pointer
	envp = argv + argc;
	if (lpBlock != NULL)
	{
		for (uiCount = 0U,	// count environment strings
		     lpCount = lpBlock; *lpCount != '\0'; lpCount += strlen(lpCount) + 1U)
			if (*lpCount != '=')
				uiCount++;
		if (uiCount > 0U)	// process environment strings
		{
			envp = (LPCSTR *) _alloca((uiCount + 1U) * sizeof(*envp));
			for (uiCount = 0U,
			     lpCount = lpBlock; *lpCount != '\0'; lpCount += strlen(lpCount) + 1U)
				if (*lpCount != '=')
					envp[uiCount++] = lpCount;
			envp[uiCount] = (LPCSTR) NULL;
		}
	}
	ExitProcess(main(argc, argv, envp));
}
__declspec(noreturn)
__declspec(safebuffers)
VOID	CDECL	wmainCRTStartup(VOID)
{
	LPWSTR	lpArgument;
	LPCWSTR	lpCmdLine = GetCommandLineW();
	LPCWSTR	lpBlock = GetEnvironmentStringsW();
	LPCWSTR	lpCount = lpCmdLine;
	UINT	uiCount = 0U;
	UINT	uiQuote = 0U;
	UINT	argc;
	LPCWSTR	*argv;
	LPCWSTR	*envp;
	argc = (*lpCount != L' ' && *lpCount != L'\t');
	while (*lpCount != L'\0')	// count arguments
	{
		if (!uiQuote		// whitespace outside double quotes?
		 && (*lpCount == L' ' || *lpCount == L'\t'))
		{
			do		// skip unquoted whitespace
				lpCount++;
			while (*lpCount == L' ' || *lpCount == L'\t');
			argc += (*lpCount != L'\0');
			uiCount = 0U;
			continue;
		}
		else if (*lpCount == L'\\')
			uiCount ^= ~0U;
		else if (!uiCount	// unescaped double quote?
		      && *lpCount == L'"')
			uiQuote ^= ~0U;
		else			// regular character
			uiCount = 0U;
		lpCount++;
	}
	if (uiQuote)			// unpaired double quote?
		SetLastError(ERROR_BAD_ARGUMENTS);
	argv = (LPCWSTR *) _alloca((argc + 1U) * sizeof(*argv) + (lpCount + 1U - lpCmdLine) * sizeof(**argv));
	argv[0] = lpArgument = (LPWSTR) (argv + argc + 1U);
	argc = uiCount = uiQuote = 0U;
	while (*lpCmdLine != L'\0')	// process arguments
	{
		if (!uiQuote		// whitespace outside double quotes?
		 && (*lpCmdLine == L' ' || *lpCmdLine == L'\t'))
		{			// terminate current argument
			*lpArgument = L'\0';
			do		// skip unquoted whitespace
				lpCmdLine++;
			while (*lpCmdLine == L' ' || *lpCmdLine == L'\t');
			if (*lpCmdLine != L'\0')
					// store address of next argument
				argv[++argc] = lpArgument = (LPWSTR) lpCmdLine;
			uiCount = 0U;
		}
		else if (*lpCmdLine == L'\\')
		{
			*lpArgument++ = *lpCmdLine++;
			uiCount++;	// count backslash
		}
		else if (*lpCmdLine == L'"')
		{
			lpArgument -= (uiCount + 1U) / 2U;
			if (uiCount & 1U)
					// double quote preceded by odd number
					//  of backslashes: keep half of them
					//   and the (escaped) double quote
				*lpArgument++ = *lpCmdLine++;
			else		// double quote preceded by even number
					//  of backslashes: keep half of them
				if (*++lpCmdLine == L'"' && uiQuote)
					// double quote inside double quotes and
					//  followed by another double quote:
					//   keep one double quote
					*lpArgument++ = *lpCmdLine++;
				else	// skip double quote and toggle state
					uiQuote ^= ~0U;
			uiCount = 0U;
		}
		else			// regular character
		{
			*lpArgument++ = *lpCmdLine++;
			uiCount = 0U;
		}
	}
	*lpArgument = L'\0';		// terminate (last) argument
	argv[++argc] = NULL;		// store terminating NULL pointer
	envp = argv + argc;
	if (lpBlock != NULL)
	{
		for (uiCount = 0U,	// count environment strings
		     lpCount = lpBlock; *lpCount != L'\0'; lpCount += wcslen(lpCount) + 1U)
			if (*lpCount != L'=')
				uiCount++;
		if (uiCount > 0U)	// process environment strings
		{
			envp = (LPCWSTR *) _alloca((uiCount + 1U) * sizeof(*envp));
			for (uiCount = 0U,
			     lpCount = lpBlock; *lpCount != L'\0'; lpCount += wcslen(lpCount) + 1U)
				if (*lpCount != L'=')
					envp[uiCount++] = lpCount;
			envp[uiCount] = (LPCWSTR) NULL;
		}
	}
	ExitProcess(wmain(argc, argv, envp));
}
The mainCRTStartup() function
            allocates up to 32768 bytes for the command line plus
            16384×4 bytes (32-bit platforms) or 16384×8 bytes
            (64-bit platforms) for the argv[] array on the stack,
            i.e. at most 96 kiB on 32-bit platforms and 160 kiB on
            64-bit platforms; the wmainCRTStartup() function
            allocates up to 65536 bytes for the command line plus 16384×4
            bytes (32-bit platforms) or 16384×8 bytes (64-bit platforms)
            for the argv[] array on the stack, i.e. at most
            128 kiB on 32-bit platforms and 192 kiB on 64-bit
            platforms.
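The totals above are plain arithmetic and can be checked with a few lines of C. The sketch assumes the limits named in the paragraph, a command line of at most 32768 characters and at most 16384 entries in the argv[] array (presumably because every argument needs at least one character plus one separator); the function name worst_case_stack() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed limits, taken from the figures in the paragraph above. */
#define MAX_CMDLINE_CHARS 32768U  /* characters of the command line */
#define MAX_ARGUMENTS     16384U  /* entries of the argv[] array */

/* Hypothetical helper: worst-case number of bytes allocated on the stack
   for the copy of the command line plus the argv[] array. */
static size_t worst_case_stack(size_t char_size, size_t pointer_size)
{
    assert(char_size > 0 && pointer_size > 0);
    return (size_t) MAX_CMDLINE_CHARS * char_size
         + (size_t) MAX_ARGUMENTS * pointer_size;
}
```

With 1-byte characters and 4-byte pointers the result is 98304 bytes (96 kiB); with 2-byte characters and 8-byte pointers it is 196608 bytes (192 kiB).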
         Save the
            ANSI C
            source presented above as sys.c in the directory where
            you created the object library i386.lib before, then
            execute the following 3 command lines to compile it and add the
            generated object file i386-sys.obj to the existing
            object library i386.lib:
        
SET CL=/c /GAFdy /J /Oxy /W4 /Zl
CL.EXE /Foi386-sys.obj sys.c
LINK.EXE /LIB /OUT:i386.lib i386.lib i386-sys.obj
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation. All rights reserved.
sys.c
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
Move the ANSI C source file
sys.c into the directory where you created
            the object library amd64.lib before, then execute the
            following 3 command lines to compile it and add the generated object
            file amd64-sys.obj to the object library
            amd64.lib:
SET CL=/c /GAFy /J /Oxy /W4 /Zl
CL.EXE /Foamd64-sys.obj sys.c
LINK.EXE /LIB /OUT:amd64.lib amd64.lib amd64-sys.obj
Microsoft (R) C/C++ Optimizing Compiler Version 16.00.40219.01 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
sys.c
Microsoft (R) Library Manager Version 10.00.40219.386
Copyright (C) Microsoft Corporation. All rights reserved.
amd64.lib respectively
            i386.lib before or instead of the
            MSVCRT
            libraries.
        
Option /LD: Creates a DLL.
Passes the /DLL option to the linker. The linker looks for, but does not require, a DllMain function. If you do not write a DllMain function, the linker inserts a DllMain function that returns TRUE.
Links the DLL startup code.
Creates an import library (.lib), if an export (.exp) file is not specified on the command line. You link the import library to applications that call your DLL.
Interprets /Fe (Name EXE File) as naming a DLL rather than an .exe file. By default, the program name becomes basename.dll instead of basename.exe.
[…]
Implies /MT unless you explicitly specify /MD.
Create the empty source file .c in an arbitrary, preferably empty directory, then compile and (attempt to) link it with the object library msvcrt.lib against the Visual C runtime DLL:
COPY NUL: .c
SET CL=/LD /W4 /X
SET LINK=/MACHINE:I386 /MAP /OPT:ICF,REF
CL.EXE /MD .c
For details and reference see the MSDN articles Compiler Options and Linker Options.
Note: if necessary, see the MSDN article Use the Microsoft C++ toolset from the command line for an introduction.
Note: the command lines can be copied and pasted as block into a Command Processor window.
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
.c(1) : warning C4206: nonstandard extension used : translation unit is empty
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/MACHINE:I386 /MAP /OPT:ICF,REF
/out:.dll
/dll
/implib:.lib
.obj
LINK : error LNK2001: unresolved external symbol __DllMainCRTStartup@12
.dll : fatal error LNK1120: 1 unresolved externals
            OUCH: the combined import and object
            library
            msvcrt.lib
            shipped with the Visual C compiler does
            not provide the entry point function
            _DllMainCRTStartup() required to build
            DLLs!
         Repeat the last command line without the compiler option
            /MD
            to link with the object library
            libcmt.lib
            shipped with the Visual C compiler, then display
            the size of the generated empty
            DLL
            .dll:
        
CL.EXE .c
DIR .dll
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 16.00.40219.01 for 80x86
Copyright (C) Microsoft Corporation.  All rights reserved.
.c(1) : warning C4206: nonstandard extension used : translation unit is empty
Microsoft (R) Incremental Linker Version 10.00.40219.386
Copyright (C) Microsoft Corporation.  All rights reserved.
/MACHINE:I386 /MAP /OPT:ICF,REF
/out:.dll
/dll
/implib:.lib
.obj
 Volume in drive C has no label.
 Volume Serial Number is 1957-0427
 Directory of C:\Users\Stefan\Desktop
04/27/2015  08:15 PM            32,256 .dll
               1 File(s)         32,256 bytes
               0 Dir(s)    9,876,543,210 bytes free
            OOPS: an empty DLL is 31.5 kiB (in words: thirty-one and a half kilobytes) small!
 Note: the inspection of the generated text file
            .map to determine what the linker included is left as
            an exercise to the reader.
        
 Note: the corresponding demonstration for console
            applications with empty main() and
            wmain() functions as well as Windows
            applications with empty
            WinMain()
            and
            wWinMain()
            functions is also left as an exercise to the reader.
        
Note: a repetition of these demonstrations using the 64-bit build environment is left as an exercise to the reader too.
.CRT Section Usage
.CRT section.
         The following table shows how the Visual C
            compiler and its runtime use the .CRT section:
        
| Section$Group | Public Name | Purpose and Usage |
|---|---|---|
| .CRT$XCA | __xc_a | NULL pointer before array of C++ constructor and initialiser function pointers. |
| .CRT$XCAA | | pre_cpp_init() function pointer. |
| .CRT$XCU | | Dynamic initialiser function pointers. |
| .CRT$XCZ | __xc_z | Terminating NULL pointer after array of C++ constructor and initialiser function pointers. |
| .CRT$XDA | __xd_a | NULL pointer before array of C++ TLS initialiser callback function pointers. |
| .CRT$XDC | | C++ TLS initialiser callback function pointers. |
| .CRT$XDL | | C++ TLS initialiser callback function pointers. |
| .CRT$XDU | | C++ TLS initialiser callback function pointers. |
| .CRT$XDZ | __xd_z | Terminating NULL pointer after array of C++ TLS initialiser callback function pointers. |
| .CRT$XIA | __xi_a | NULL pointer before array of C initialiser function pointers. |
| .CRT$XIAA | | pre_c_init() and _mixed_pre_c_init() function pointers. |
| .CRT$XIC | | __initmbctable(), __initstdio(), __inittime(), __lconv_init() and __onexitinit() function pointers. |
| .CRT$XID | | __set_emptyinvalidparamhandler, __set_loosefpmath() and _InitCPLocHash() function pointers. |
| .CRT$XIY | | __CxxSetUnhandledExceptionFilter() function pointer. |
| .CRT$XIZ | __xi_z | Terminating NULL pointer after array of C initialiser function pointers. |
| .CRT$XLA | __xl_a | NULL pointer before array of TLS callback function pointers. |
| .CRT$XLC | | __dyn_tls_dtor() function pointer. |
| .CRT$XLD | | __dyn_tls_init() function pointer. |
| .CRT$XLZ | __xl_z | Terminating NULL pointer after array of TLS callback function pointers. |
| .CRT$XPA | __xp_a | NULL pointer before array of C pre-termination function pointers. |
| .CRT$XPB | | _concrt_static_cleanup() function pointer. |
| .CRT$XPX | | __termconin(), __termconout(), _locterm() and _rmtmp() function pointers. |
| .CRT$XPXA | | __endstdio() function pointer. |
| .CRT$XPZ | __xp_z | Terminating NULL pointer after array of C pre-termination function pointers. |
| .CRT$XTA | __xt_a | NULL pointer before array of C termination function pointers. |
| .CRT$XTZ | __xt_z | Terminating NULL pointer after array of C termination function pointers. |
CAVEAT: all symbols have global scope and pollute the name space without necessity!
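How such a table of function pointers is consumed can be sketched in portable C. The sketch rests on assumptions: it emulates the contents of the section groups with an ordinary array instead of relying on the linker, which merges the groups of a section in alphabetical order of their suffix, and the names run_table(), count_call() and table are hypothetical; the runtime's own routine for this walk is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*initialiser)(void);

/* Hypothetical stand-in for the runtime's table walk: calls every
   non-NULL function pointer in the range [first, last) in ascending
   order of addresses. */
static void run_table(initialiser const *first, initialiser const *last)
{
    assert(first != NULL && last != NULL);
    for (; first < last; first++)
        if (*first != NULL)
            (*first)();
}

static int calls;                 /* demonstration state only */

static void count_call(void)
{
    calls++;
}

/* Emulates the groups .CRT$XCA, .CRT$XCU and .CRT$XCZ after merging:
   NULL pointer, two initialiser function pointers, terminating NULL. */
static initialiser const table[] = { NULL, count_call, count_call, NULL };
```

Walking the emulated table with run_table(table, table + 4) calls count_call() twice; the NULL pointers at both ends only mark the bounds of the array and are skipped.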
.rtc Section Usage
_RTC_Initialize()
_RTC_Terminate()
         The following table shows how the Visual C
            compiler and its runtime use the .rtc section:
        
| Section$Group | Public Name | Purpose and Usage |
|---|---|---|
| .rtc$IAA | __rtc_iaa | NULL pointer before array of RTC initialisation function pointers. |
| .rtc$IZZ | __rtc_izz | Terminating NULL pointer after array of RTC initialisation function pointers. |
| .rtc$TAA | __rtc_taa | NULL pointer before array of RTC termination function pointers. |
| .rtc$TZZ | __rtc_tzz | Terminating NULL pointer after array of RTC termination function pointers. |
CAVEAT: all symbols have global scope and pollute the name space without necessity!
Use the X.509 certificate to send S/MIME encrypted mail.
Note: email in weird format and without a proper sender name is likely to be discarded!
 I dislike
            HTML (and even
            weirder formats too) in email, I prefer to receive plain text.
        
I also expect to see your full (real) name as sender, not your
            nickname.
        
I abhor top posts and expect inline quotes in replies.
        
as is without any warranty, neither express nor implied.
cookies in the web browser.
The web service is operated and provided by
Telekom Deutschland GmbH The web service provider stores a session cookie
 in the web
            browser and records every visit of this web site with the following
            data in an access log on their server(s):