Diffstat (limited to 'rts/gmp/mpn/x86')
73 files changed, 0 insertions, 17763 deletions
diff --git a/rts/gmp/mpn/x86/README b/rts/gmp/mpn/x86/README deleted file mode 100644 index 3507548b8c..0000000000 --- a/rts/gmp/mpn/x86/README +++ /dev/null @@ -1,40 +0,0 @@ - - X86 MPN SUBROUTINES - - -This directory contains mpn functions for various 80x86 chips. - - -CODE ORGANIZATION - - x86 i386, i486, generic - x86/pentium Intel Pentium (P5, P54) - x86/pentium/mmx Intel Pentium with MMX (P55) - x86/p6 Intel Pentium Pro - x86/p6/mmx Intel Pentium II, III - x86/p6/p3mmx Intel Pentium III - x86/k6 AMD K6, K6-2, K6-3 - x86/k6/mmx - x86/k6/k62mmx AMD K6-2 - x86/k7 AMD Athlon - x86/k7/mmx - - -The x86 directory is also the main support for P6 at the moment, and -is something of a blended style, meant to be reasonable on all x86s. - - - -STATUS - -The code is well-optimized for AMD and Intel chips, but not so well -optimized for Cyrix chips. - - - -RELEVANT OPTIMIZATION ISSUES - -For implementations with slow double shift instructions (SHLD and -SHRD), it might be better to mimic their operation with SHL+SHR+OR. -(M2 is likely to benefit from that, but not Pentium due to its slow -plain SHL and SHR.) diff --git a/rts/gmp/mpn/x86/README.family b/rts/gmp/mpn/x86/README.family deleted file mode 100644 index 3bc73f58b0..0000000000 --- a/rts/gmp/mpn/x86/README.family +++ /dev/null @@ -1,333 +0,0 @@ - - X86 CPU FAMILY MPN SUBROUTINES - - -This file has some notes on things common to all the x86 family code. - - - -ASM FILES - -The x86 .asm files are BSD style x86 assembler code, first put through m4 -for macro processing. The generic mpn/asm-defs.m4 is used, together with -mpn/x86/x86-defs.m4. Detailed notes are in those files. - -The code is meant for use with GNU "gas" or a system "as". There's no -support for assemblers that demand Intel style, and with gas freely -available and easy to use that shouldn't be a problem. 
- - - -STACK FRAME - -m4 macros are used to define the parameters passed on the stack, and these -act like comments on what the stack frame looks like too. For example, -mpn_mul_1() has the following. - - defframe(PARAM_MULTIPLIER, 16) - defframe(PARAM_SIZE, 12) - defframe(PARAM_SRC, 8) - defframe(PARAM_DST, 4) - -Here PARAM_MULTIPLIER gets defined as `FRAME+16(%esp)', and the others -similarly. The return address is at offset 0, but there's not normally any -need to access that. - -FRAME is redefined as necessary through the code so it's the number of bytes -pushed on the stack, and hence the offsets in the parameter macros stay -correct. At the start of a routine FRAME should be zero. - - deflit(`FRAME',0) - ... - deflit(`FRAME',4) - ... - deflit(`FRAME',8) - ... - -Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and -FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions, -and can be used instead of explicit definitions if preferred. -defframe_pushl() is a combination FRAME_pushl() and defframe(). - -There's generally some slackness in redefining FRAME. If new values aren't -going to get used, then the redefinitions are omitted to keep from -cluttering up the code. This happens for instance at the end of a routine, -where there might be just four register pops and then a ret, so FRAME isn't -getting used. - -Local variables and saved registers can be similarly defined, with negative -offsets representing stack space below the initial stack pointer. For -example, - - defframe(SAVE_ESI, -4) - defframe(SAVE_EDI, -8) - defframe(VAR_COUNTER,-12) - - deflit(STACK_SPACE, 12) - -Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the -space, and that instruction must be followed by a redefinition of FRAME -(setting it equal to STACK_SPACE) to reflect the change in %esp. - -Definitions for pushed registers are only put in when they're going to be -used. 
If registers are just saved and restored with pushes and pops then -definitions aren't made. - - - -ASSEMBLER EXPRESSIONS - -Only addition and subtraction seem to be universally available, certainly -that's all the Solaris 8 "as" seems to accept. If expressions are wanted -then m4 eval() should be used. - -In particular note that a "/" anywhere in a line starts a comment in Solaris -"as", and in some configurations of gas too. - - addl $32/2, %eax <-- wrong - - addl $eval(32/2), %eax <-- right - -Binutils gas/config/tc-i386.c has a choice between "/" being a comment -anywhere in a line, or only at the start. FreeBSD patches 2.9.1 to select -the latter, and as of 2.9.5 it's the default for GNU/Linux too. - - - -ASSEMBLER COMMENTS - -Solaris "as" doesn't support "#" commenting, using /* */ instead, -unfortunately. For that reason "C" commenting is used (see asm-defs.m4) and -the intermediate ".s" files have no comments. - - - -ZERO DISPLACEMENTS - -In a couple of places addressing modes like 0(%ebx) with a byte-sized zero -displacement are wanted, rather than (%ebx) with no displacement. These are -either for computed jumps or to get desirable code alignment. Explicit -.byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into -(%ebx). The Zdisp() macro in x86-defs.m4 is used for this. - -Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas -1.92.3 changes it. In general changing would be the sort of "optimization" -an assembler might perform, hence explicit ".byte"s are used where -necessary. - - - -SHLD/SHRD INSTRUCTIONS - -The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx" -must be written "shldl %eax,%ebx" for some assemblers. gas takes either, -Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is -gas), and omits %cl elsewhere. 
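Concretely, "shldl %cl,%eax,%ebx" shifts %ebx left by %cl bits and fills the vacated low bits from the top of %eax; the top-level README notes the same result can be built from SHL+SHR+OR on chips with slow double shifts. A rough Python model of the 32-bit operation (the helper name `shld` is just for illustration):

```python
def shld(dst, src, count):
    # Double shift left: dst shifted left by count bits, vacated
    # low bits filled from the high bits of src (count in 1..31).
    assert 1 <= count <= 31
    # SHL + SHR + OR equivalent of the single SHLD instruction
    return ((dst << count) | (src >> (32 - count))) & 0xFFFFFFFF

# 64-bit cross-check: shifting the concatenation dst:src left by
# count and keeping the high 32 bits gives the same answer.
dst, src, count = 0x12345678, 0x9ABCDEF0, 8
wide = ((dst << 32) | src) << count
assert shld(dst, src, count) == (wide >> 32) & 0xFFFFFFFF
```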
- -For GMP an autoconf test is used to determine whether %cl should be used and -the macros shldl, shrdl, shldw and shrdw in mpn/x86/x86-defs.m4 then pass -through or omit %cl as necessary. See comments with those macros for usage. - - - -DIRECTION FLAG - -The x86 calling conventions say that the direction flag should be clear at -function entry and exit. (See iBCS2 and SVR4 ABI books, references below.) - -Although this has been so since the year dot, it's not absolutely clear -whether it's universally respected. Since it's better to be safe than -sorry, gmp follows glibc and does a "cld" if it depends on the direction -flag being clear. This happens only in a few places. - - - -POSITION INDEPENDENT CODE - -Defining the symbol PIC in m4 processing selects position independent code. -This mainly affects computed jumps, and these are implemented in a -self-contained fashion (without using the global offset table). The few -calls from assembly code to global functions use the normal procedure -linkage table. - -PIC is necessary for ELF shared libraries because they can be mapped into -different processes at different virtual addresses. Text relocations in -shared libraries are allowed, but that presumably means a page with such a -relocation isn't shared. The use of the PLT for PIC adds a fixed cost to -every function call, which is small but might be noticeable when working with -small operands. - -Calls from one library function to another don't need to go through the PLT, -since of course the call instruction uses a displacement, not an absolute -address, and the relative locations of object files are known when libgmp.so -is created. "ld -Bsymbolic" (or "gcc -Wl,-Bsymbolic") will resolve calls -this way, so that there's no jump through the PLT, but of course leaving -setups of the GOT address in %ebx that may be unnecessary. - -The %ebx setup could be avoided in assembly if a separate option controlled -PIC for calls as opposed to computed jumps etc. 
But there's only ever -likely to be a handful of calls out of assembler, and getting the same -optimization for C intra-library calls would be more important. There seems -no easy way to tell gcc that certain functions can be called non-PIC, and -unfortunately many gmp functions use the global memory allocation variables, -so they need the GOT anyway. Object files with no global data references -and only intra-library calls could go into the library as non-PIC under --Bsymbolic. Integrating this into libtool and automake is left as an -exercise for the reader. - - - -SIMPLE LOOPS - -The overheads in setting up for an unrolled loop can mean that at small -sizes a simple loop is faster. Making small sizes go fast is important, -even if it adds a cycle or two to bigger sizes. To this end various -routines choose between a simple loop and an unrolled loop according to -operand size. The path to the simple loop, or to special case code for -small sizes, is always as fast as possible. - -Adding a simple loop requires a conditional jump to choose between the -simple and unrolled code. The size of a branch misprediction penalty -affects whether a simple loop is worthwhile. - -The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover -point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >= -UNROLL_THRESHOLD using the unrolled loop. If position independent code adds -a couple of cycles to an unrolled loop setup, the threshold will vary with -PIC or non-PIC. Something like the following is typical. - - ifdef(`PIC',` - deflit(UNROLL_THRESHOLD, 10) - ',` - deflit(UNROLL_THRESHOLD, 8) - ') - -There's no automated way to determine the threshold. Setting it to a small -value and then to a big value makes it possible to measure the simple and -unrolled loops each over a range of sizes, from which the crossover point -can be determined. Alternately, just adjust the threshold up or down until -there's no more speedups. 
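The crossover hunt described above can be pictured with a toy cost model; the cycle constants below are invented for illustration, not measured on any chip:

```python
def crossover(simple_setup, simple_per_limb, unroll_setup, unroll_per_limb):
    # Smallest size at which the unrolled loop is no slower than
    # the simple loop; sizes below this would use the simple loop.
    n = 1
    while simple_setup + simple_per_limb * n < unroll_setup + unroll_per_limb * n:
        n += 1
    return n

# Hypothetical costs: simple loop 2 + 6n cycles, unrolled 20 + 2.25n.
threshold = crossover(2, 6.0, 20, 2.25)
```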
- - - -UNROLLED LOOP CODING - -The x86 addressing modes allow a byte displacement of -128 to +127, making -it possible to access 256 bytes, which is 64 limbs, without adjusting -pointer registers within the loop. Dword sized displacements can be used -too, but they increase code size, and unrolling to 64 ought to be enough. - -When unrolling to the full 64 limbs/loop, the limb at the top of the loop -will have a displacement of -128, so pointers have to have a corresponding -+128 added before entering the loop. When unrolling to 32 limbs/loop -displacements 0 to 127 can be used with 0 at the top of the loop and no -adjustment needed to the pointers. - -Where 64 limbs/loop is supported, the +128 adjustment is done only when 64 -limbs/loop is selected. Usually the gain in speed using 64 instead of 32 or -16 is small, so support for 64 limbs/loop is generally only for comparison. - - - -COMPUTED JUMPS - -When working from least significant limb to most significant limb (most -routines) the computed jump and pointer calculations in preparation for an -unrolled loop are as follows. - - S = operand size in limbs - N = number of limbs per loop (UNROLL_COUNT) - L = log2 of unrolling (UNROLL_LOG2) - M = mask for unrolling (UNROLL_MASK) - C = code bytes per limb in the loop - B = bytes per limb (4 for x86) - - computed jump (-S & M) * C + entrypoint - subtract from pointers (-S & M) * B - initial loop counter (S-1) >> L - displacements 0 to B*(N-1) - -The loop counter is decremented at the end of each loop, and the looping -stops when the decrement takes the counter to -1. The displacements are for -the addressing accessing each limb, eg. a load with "movl disp(%ebx), %eax". - -Usually the multiply by "C" can be handled without an imul, using instead an -leal, or a shift and subtract. - -When working from most significant to least significant limb (eg. mpn_lshift -and mpn_copyd), the calculations change as follows. 
- - add to pointers (-S & M) * B - displacements 0 to -B*(N-1) - - - -OLD GAS 1.92.3 - -This version comes with FreeBSD 2.2.8 and has a couple of gremlins that -affect gmp code. - -Firstly, an expression involving two forward references to labels comes out -as zero. For example, - - addl $bar-foo, %eax - foo: - nop - bar: - -This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax". -When only one forward reference is involved, it works correctly, as for -example, - - foo: - addl $bar-foo, %eax - nop - bar: - -Secondly, an expression involving two labels can't be used as the -displacement for an leal. For example, - - foo: - nop - bar: - leal bar-foo(%eax,%ebx,8), %ecx - -A slightly cryptic error is given, "Unimplemented segment type 0 in -parse_operand". When only one label is used it's ok, and the label can be a -forward reference too, as for example, - - leal foo(%eax,%ebx,8), %ecx - nop - foo: - -These problems only affect PIC computed jump calculations. The workarounds -are just to do an leal without a displacement and then an addl, and to make -sure the code is placed so that there's at most one forward reference in the -addl. - - - -REFERENCES - -"Intel Architecture Software Developer's Manual", volumes 1 to 3, 1999, -order numbers 243190, 243191 and 243192. Available on-line, - - ftp://download.intel.com/design/PentiumII/manuals/243190.htm - ftp://download.intel.com/design/PentiumII/manuals/243191.htm - ftp://download.intel.com/design/PentiumII/manuals/243192.htm - -"Intel386 Family Binary Compatibility Specification 2", Intel Corporation, -published by McGraw-Hill, 1991, ISBN 0-07-031219-2. - -"System V Application Binary Interface", Unix System Laboratories Inc, 1992, -published by Prentice Hall, ISBN 0-13-880410-9. And the "Intel386 Processor -Supplement", AT&T, 1991, ISBN 0-13-877689-X. (These have details of ELF -shared library PIC coding.) 
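The bookkeeping in the COMPUTED JUMPS section above can be sanity-checked numerically: for any operand size S, the partial first pass plus the full passes must cover exactly S limbs. A small model (the C value below is an arbitrary stand-in for code bytes per limb):

```python
def unroll_params(S, L, C, B=4):
    # S = operand size in limbs, L = log2 of unrolling,
    # C = code bytes per limb, B = bytes per limb (4 on x86)
    N = 1 << L                   # limbs per loop (UNROLL_COUNT)
    M = N - 1                    # mask (UNROLL_MASK)
    jump_offset = (-S & M) * C   # added to the loop entrypoint
    ptr_adjust  = (-S & M) * B   # subtracted from the pointers
    counter     = (S - 1) >> L   # initial loop counter
    return jump_offset, ptr_adjust, counter

# Counting check: the first (partial) pass handles N - (-S & M) limbs,
# each later pass handles N, and the loop body runs counter+1 times.
for S in range(1, 200):
    for L in (2, 3, 4):
        N = 1 << L
        _, adjust, counter = unroll_params(S, L, C=9)
        limbs_done = (counter + 1) * N - adjust // 4
        assert limbs_done == S
```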
- - - ----------------- -Local variables: -mode: text -fill-column: 76 -End: diff --git a/rts/gmp/mpn/x86/addsub_n.S b/rts/gmp/mpn/x86/addsub_n.S deleted file mode 100644 index fe6f648f53..0000000000 --- a/rts/gmp/mpn/x86/addsub_n.S +++ /dev/null @@ -1,174 +0,0 @@ -/* Currently not working and not used. */ - -/* -Copyright (C) 1999 Free Software Foundation, Inc. - -This file is part of the GNU MP Library. - -The GNU MP Library is free software; you can redistribute it and/or modify -it under the terms of the GNU Lesser General Public License as published by -the Free Software Foundation; either version 2.1 of the License, or (at your -option) any later version. - -The GNU MP Library is distributed in the hope that it will be useful, but -WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY -or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public -License for more details. - -You should have received a copy of the GNU Lesser General Public License -along with the GNU MP Library; see the file COPYING.LIB. If not, write to -the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, -MA 02111-1307, USA. -*/ - - -#define SAVE_BORROW_RESTORE_CARRY(r) adcl r,r; shll $31,r -#define SAVE_CARRY_RESTORE_BORROW(r) adcl r,r - - .globl mpn_addsub_n_0 - .globl mpn_addsub_n_1 - -/* Cute i386/i486/p6 addsub loop for the "full overlap" case r1==s2,r2==s1. - We let subtraction and addition alternate in being two limbs - ahead of the other, thereby avoiding some SAVE_RESTORE. 
*/ -// r1 = r2 + r1 edi = esi + edi -// r2 = r2 - r1 esi = esi - edi -// s1 s2 -// r2 r1 -// eax,ebx,ecx,edx,esi,edi,ebp -mpn_addsub_n_0: - pushl %edi - pushl %esi - pushl %ebx - pushl %ebp - - movl 20(%esp),%edi /* res_ptr */ - movl 24(%esp),%esi /* s1_ptr */ - movl 36(%esp),%ebp /* size */ - - shrl $2,%ebp - xorl %edx,%edx - .align 4 -Loop0: // L=load E=execute S=store - movl (%esi),%ebx // sub 0 L - movl 4(%esi),%ecx // sub 1 L - sbbl (%edi),%ebx // sub 0 LE - sbbl 4(%edi),%ecx // sub 1 LE -// SAVE_BORROW_RESTORE_CARRY(%edx) - movl (%esi),%eax // add 0 L - adcl %eax,(%edi) // add 0 LES - movl 4(%esi),%eax // add 1 L - adcl %eax,4(%edi) // add 1 LES - movl %ebx,(%esi) // sub 0 S - movl %ecx,4(%esi) // sub 1 S - movl 8(%esi),%ebx // add 2 L - adcl 8(%edi),%ebx // add 2 LE - movl 12(%esi),%ecx // add 3 L - adcl 12(%edi),%ecx // add 3 LE -// SAVE_CARRY_RESTORE_BORROW(%edx) - movl 8(%edi),%eax // sub 2 L - sbbl %eax,8(%esi) // sub 2 LES - movl 12(%edi),%eax // sub 3 L - sbbl %eax,12(%esi) // sub 3 LES - movl %ebx,8(%edi) // add 2 S - movl %ecx,12(%edi) // add 3 S - leal 16(%esi),%esi - leal 16(%edi),%edi - decl %ebp - jnz Loop0 - - popl %ebp - popl %ebx - popl %esi - popl %edi - ret - -/* Cute i386/i486/p6 addsub loop for the "full overlap" case r1==s1,r2==s2. - We let subtraction and addition alternate in being two limbs - ahead of the other, thereby avoiding some SAVE_RESTORE. 
*/ -// r1 = r1 + r2 edi = edi + esi -// r2 = r1 - r2 esi = edi - esi -// s2 s1 -// r2 r1 -// eax,ebx,ecx,edx,esi,edi,ebp -mpn_addsub_n_1: - pushl %edi - pushl %esi - pushl %ebx - pushl %ebp - - movl 20(%esp),%edi /* res_ptr */ - movl 24(%esp),%esi /* s1_ptr */ - movl 36(%esp),%ebp /* size */ - - shrl $2,%ebp - xorl %edx,%edx - .align 4 -Loop1: // L=load E=execute S=store - movl (%edi),%ebx // sub 0 L - sbbl (%esi),%ebx // sub 0 LE - movl 4(%edi),%ecx // sub 1 L - sbbl 4(%esi),%ecx // sub 1 LE -// SAVE_BORROW_RESTORE_CARRY(%edx) - movl (%esi),%eax // add 0 L - adcl %eax,(%edi) // add 0 LES - movl 4(%esi),%eax // add 1 L - adcl %eax,4(%edi) // add 1 LES - movl %ebx,(%esi) // sub 0 S - movl %ecx,4(%esi) // sub 1 S - movl 8(%esi),%ebx // add 2 L - adcl 8(%edi),%ebx // add 2 LE - movl 12(%esi),%ecx // add 3 L - adcl 12(%edi),%ecx // add 3 LE -// SAVE_CARRY_RESTORE_BORROW(%edx) - movl 8(%edi),%eax // sub 2 L - sbbl 8(%esi),%eax // sub 2 LES - movl %eax,8(%esi) // sub 2 S - movl 12(%edi),%eax // sub 3 L - sbbl 12(%esi),%eax // sub 3 LE - movl %eax,12(%esi) // sub 3 S - movl %ebx,8(%edi) // add 2 S - movl %ecx,12(%edi) // add 3 S - leal 16(%esi),%esi - leal 16(%edi),%edi - decl %ebp - jnz Loop1 - - popl %ebp - popl %ebx - popl %esi - popl %edi - ret - - .globl mpn_copy -mpn_copy: - pushl %edi - pushl %esi - pushl %ebx - pushl %ebp - - movl 20(%esp),%edi /* res_ptr */ - movl 24(%esp),%esi /* s1_ptr */ - movl 28(%esp),%ebp /* size */ - - shrl $2,%ebp - .align 4 -Loop2: - movl (%esi),%eax - movl 4(%esi),%ebx - movl %eax,(%edi) - movl %ebx,4(%edi) - movl 8(%esi),%eax - movl 12(%esi),%ebx - movl %eax,8(%edi) - movl %ebx,12(%edi) - leal 16(%esi),%esi - leal 16(%edi),%edi - decl %ebp - jnz Loop2 - - popl %ebp - popl %ebx - popl %esi - popl %edi - ret diff --git a/rts/gmp/mpn/x86/aors_n.asm b/rts/gmp/mpn/x86/aors_n.asm deleted file mode 100644 index 18ef816b4d..0000000000 --- a/rts/gmp/mpn/x86/aors_n.asm +++ /dev/null @@ -1,187 +0,0 @@ -dnl x86 mpn_add_n/mpn_sub_n -- mpn addition 
and subtraction. - -dnl Copyright (C) 1992, 1994, 1995, 1996, 1999, 2000 Free Software -dnl Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -ifdef(`OPERATION_add_n',` - define(M4_inst, adcl) - define(M4_function_n, mpn_add_n) - define(M4_function_nc, mpn_add_nc) - -',`ifdef(`OPERATION_sub_n',` - define(M4_inst, sbbl) - define(M4_function_n, mpn_sub_n) - define(M4_function_nc, mpn_sub_nc) - -',`m4_error(`Need OPERATION_add_n or OPERATION_sub_n -')')') - -MULFUNC_PROLOGUE(mpn_add_n mpn_add_nc mpn_sub_n mpn_sub_nc) - - -C mp_limb_t M4_function_n (mp_ptr dst, mp_srcptr src1, mp_srcptr src2, -C mp_size_t size); -C mp_limb_t M4_function_nc (mp_ptr dst, mp_srcptr src1, mp_srcptr src2, -C mp_size_t size, mp_limb_t carry); - -defframe(PARAM_CARRY,20) -defframe(PARAM_SIZE, 16) -defframe(PARAM_SRC2, 12) -defframe(PARAM_SRC1, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(8) - -PROLOGUE(M4_function_nc) -deflit(`FRAME',0) - - pushl %edi FRAME_pushl() - pushl %esi FRAME_pushl() - - movl PARAM_DST,%edi - movl PARAM_SRC1,%esi - movl PARAM_SRC2,%edx - movl PARAM_SIZE,%ecx - - movl %ecx,%eax - shrl $3,%ecx C compute count for 
unrolled loop - negl %eax - andl $7,%eax C get index where to start loop - jz LF(M4_function_n,oopgo) C necessary special case for 0 - incl %ecx C adjust loop count - shll $2,%eax C adjustment for pointers... - subl %eax,%edi C ... since they are offset ... - subl %eax,%esi C ... by a constant when we ... - subl %eax,%edx C ... enter the loop - shrl $2,%eax C restore previous value - -ifdef(`PIC',` - C Calculate start address in loop for PIC. Due to limitations in - C old gas, LF(M4_function_n,oop)-L(0a)-3 cannot be put into the leal - call L(0a) -L(0a): leal (%eax,%eax,8),%eax - addl (%esp),%eax - addl $LF(M4_function_n,oop)-L(0a)-3,%eax - addl $4,%esp -',` - C Calculate start address in loop for non-PIC. - leal LF(M4_function_n,oop)-3(%eax,%eax,8),%eax -') - - C These lines initialize carry from the 5th parameter. Should be - C possible to simplify. - pushl %ebp FRAME_pushl() - movl PARAM_CARRY,%ebp - shrl $1,%ebp C shift bit 0 into carry - popl %ebp FRAME_popl() - - jmp *%eax C jump into loop - -EPILOGUE() - - - ALIGN(8) -PROLOGUE(M4_function_n) -deflit(`FRAME',0) - - pushl %edi FRAME_pushl() - pushl %esi FRAME_pushl() - - movl PARAM_DST,%edi - movl PARAM_SRC1,%esi - movl PARAM_SRC2,%edx - movl PARAM_SIZE,%ecx - - movl %ecx,%eax - shrl $3,%ecx C compute count for unrolled loop - negl %eax - andl $7,%eax C get index where to start loop - jz L(oop) C necessary special case for 0 - incl %ecx C adjust loop count - shll $2,%eax C adjustment for pointers... - subl %eax,%edi C ... since they are offset ... - subl %eax,%esi C ... by a constant when we ... - subl %eax,%edx C ... enter the loop - shrl $2,%eax C restore previous value - -ifdef(`PIC',` - C Calculate start address in loop for PIC. Due to limitations in - C some assemblers, L(oop)-L(0b)-3 cannot be put into the leal - call L(0b) -L(0b): leal (%eax,%eax,8),%eax - addl (%esp),%eax - addl $L(oop)-L(0b)-3,%eax - addl $4,%esp -',` - C Calculate start address in loop for non-PIC. 
- leal L(oop)-3(%eax,%eax,8),%eax -') - jmp *%eax C jump into loop - -L(oopgo): - pushl %ebp FRAME_pushl() - movl PARAM_CARRY,%ebp - shrl $1,%ebp C shift bit 0 into carry - popl %ebp FRAME_popl() - - ALIGN(8) -L(oop): movl (%esi),%eax - M4_inst (%edx),%eax - movl %eax,(%edi) - movl 4(%esi),%eax - M4_inst 4(%edx),%eax - movl %eax,4(%edi) - movl 8(%esi),%eax - M4_inst 8(%edx),%eax - movl %eax,8(%edi) - movl 12(%esi),%eax - M4_inst 12(%edx),%eax - movl %eax,12(%edi) - movl 16(%esi),%eax - M4_inst 16(%edx),%eax - movl %eax,16(%edi) - movl 20(%esi),%eax - M4_inst 20(%edx),%eax - movl %eax,20(%edi) - movl 24(%esi),%eax - M4_inst 24(%edx),%eax - movl %eax,24(%edi) - movl 28(%esi),%eax - M4_inst 28(%edx),%eax - movl %eax,28(%edi) - leal 32(%edi),%edi - leal 32(%esi),%esi - leal 32(%edx),%edx - decl %ecx - jnz L(oop) - - sbbl %eax,%eax - negl %eax - - popl %esi - popl %edi - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/aorsmul_1.asm b/rts/gmp/mpn/x86/aorsmul_1.asm deleted file mode 100644 index f32ad83989..0000000000 --- a/rts/gmp/mpn/x86/aorsmul_1.asm +++ /dev/null @@ -1,134 +0,0 @@ -dnl x86 __gmpn_addmul_1 (for 386 and 486) -- Multiply a limb vector with a -dnl limb and add the result to a second limb vector. - - -dnl Copyright (C) 1992, 1994, 1997, 1999, 2000 Free Software Foundation, -dnl Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. 
-dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -ifdef(`OPERATION_addmul_1',` - define(M4_inst, addl) - define(M4_function_1, mpn_addmul_1) - -',`ifdef(`OPERATION_submul_1',` - define(M4_inst, subl) - define(M4_function_1, mpn_submul_1) - -',`m4_error(`Need OPERATION_addmul_1 or OPERATION_submul_1 -')')') - -MULFUNC_PROLOGUE(mpn_addmul_1 mpn_submul_1) - - -C mp_limb_t M4_function_1 (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t mult); - -define(PARAM_MULTIPLIER, `FRAME+16(%esp)') -define(PARAM_SIZE, `FRAME+12(%esp)') -define(PARAM_SRC, `FRAME+8(%esp)') -define(PARAM_DST, `FRAME+4(%esp)') - - TEXT - ALIGN(8) - -PROLOGUE(M4_function_1) -deflit(`FRAME',0) - - pushl %edi - pushl %esi - pushl %ebx - pushl %ebp -deflit(`FRAME',16) - - movl PARAM_DST,%edi - movl PARAM_SRC,%esi - movl PARAM_SIZE,%ecx - - xorl %ebx,%ebx - andl $3,%ecx - jz L(end0) - -L(oop0): - movl (%esi),%eax - mull PARAM_MULTIPLIER - leal 4(%esi),%esi - addl %ebx,%eax - movl $0,%ebx - adcl %ebx,%edx - M4_inst %eax,(%edi) - adcl %edx,%ebx C propagate carry into cylimb - - leal 4(%edi),%edi - decl %ecx - jnz L(oop0) - -L(end0): - movl PARAM_SIZE,%ecx - shrl $2,%ecx - jz L(end) - - ALIGN(8) -L(oop): movl (%esi),%eax - mull PARAM_MULTIPLIER - addl %eax,%ebx - movl $0,%ebp - adcl %edx,%ebp - - movl 4(%esi),%eax - mull PARAM_MULTIPLIER - M4_inst %ebx,(%edi) - adcl %eax,%ebp C new lo + cylimb - movl $0,%ebx - adcl %edx,%ebx - - movl 8(%esi),%eax - mull PARAM_MULTIPLIER - M4_inst %ebp,4(%edi) - adcl %eax,%ebx C new lo + cylimb - movl $0,%ebp - adcl %edx,%ebp - - movl 12(%esi),%eax - mull PARAM_MULTIPLIER - M4_inst %ebx,8(%edi) - adcl %eax,%ebp C new lo + cylimb - movl $0,%ebx - adcl %edx,%ebx - - M4_inst %ebp,12(%edi) - adcl $0,%ebx C propagate carry into 
cylimb - - leal 16(%esi),%esi - leal 16(%edi),%edi - decl %ecx - jnz L(oop) - -L(end): movl %ebx,%eax - - popl %ebp - popl %ebx - popl %esi - popl %edi - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/copyd.asm b/rts/gmp/mpn/x86/copyd.asm deleted file mode 100644 index 439640e836..0000000000 --- a/rts/gmp/mpn/x86/copyd.asm +++ /dev/null @@ -1,80 +0,0 @@ -dnl x86 mpn_copyd -- copy limb vector, decrementing. -dnl -dnl Future: On P6 an MMX loop should be able to go faster than this code. - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C void mpn_copyd (mp_ptr dst, mp_srcptr src, mp_size_t size); -C -C Copy src,size to dst,size, working from high to low addresses. -C -C The code here is very generic and can be expected to be reasonable on all -C the x86 family. -C -C P5 - 1.0 cycles/limb. -C -C P6 - 2.4 cycles/limb, approx 40 cycles startup. 
- -defframe(PARAM_SIZE,12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) -deflit(`FRAME',0) - - .text - ALIGN(32) - -PROLOGUE(mpn_copyd) - C eax saved esi - C ebx - C ecx counter - C edx saved edi - C esi src - C edi dst - C ebp - - movl PARAM_SIZE, %ecx - movl %esi, %eax - - movl PARAM_SRC, %esi - movl %edi, %edx - - movl PARAM_DST, %edi - leal -4(%esi,%ecx,4), %esi - - leal -4(%edi,%ecx,4), %edi - - std - - rep - movsl - - cld - - movl %eax, %esi - movl %edx, %edi - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/copyi.asm b/rts/gmp/mpn/x86/copyi.asm deleted file mode 100644 index 5bc4e36689..0000000000 --- a/rts/gmp/mpn/x86/copyi.asm +++ /dev/null @@ -1,79 +0,0 @@ -dnl x86 mpn_copyi -- copy limb vector, incrementing. - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C void mpn_copyi (mp_ptr dst, mp_srcptr src, mp_size_t size); -C -C Copy src,size to dst,size, working from low to high addresses. -C -C The code here is very generic and can be expected to be reasonable on all -C the x86 family. -C -C P5 - 1.0 cycles/limb. -C -C P6 - 0.75 cycles/limb. 
An MMX based copy was tried, but was found to be -C slower than a rep movs in all cases. The fastest MMX found was 0.8 -C cycles/limb (when fully aligned). A rep movs seems to have a startup -C time of about 15 cycles, but doing something special for small sizes -C could lead to a branch misprediction that would destroy any saving. -C For now a plain rep movs seems ok for P6. - -defframe(PARAM_SIZE,12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) -deflit(`FRAME',0) - - .text - ALIGN(32) - - C eax saved esi - C ebx - C ecx counter - C edx saved edi - C esi src - C edi dst - C ebp - -PROLOGUE(mpn_copyi) - - movl PARAM_SIZE, %ecx - movl %esi, %eax - - movl PARAM_SRC, %esi - movl %edi, %edx - - movl PARAM_DST, %edi - - cld C better safe than sorry, see mpn/x86/README.family - - rep - movsl - - movl %eax, %esi - movl %edx, %edi - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/diveby3.asm b/rts/gmp/mpn/x86/diveby3.asm deleted file mode 100644 index df879da9e1..0000000000 --- a/rts/gmp/mpn/x86/diveby3.asm +++ /dev/null @@ -1,115 +0,0 @@ -dnl x86 mpn_divexact_by3 -- mpn division by 3, expecting no remainder. - - -dnl Copyright (C) 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. 
If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -dnl The following all have their own optimized versions of this routine, -dnl but for reference the code here runs as follows. -dnl -dnl cycles/limb -dnl P54 18.0 -dnl P55 17.0 -dnl P6 14.5 -dnl K6 14.0 -dnl K7 10.0 - - -include(`../config.m4') - - -C mp_limb_t mpn_divexact_by3c (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t carry); - -defframe(PARAM_CARRY,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - -dnl multiplicative inverse of 3, modulo 2^32 -deflit(INVERSE_3, 0xAAAAAAAB) - -dnl ceil(b/3) and ceil(b*2/3) where b=2^32 -deflit(ONE_THIRD_CEIL, 0x55555556) -deflit(TWO_THIRDS_CEIL, 0xAAAAAAAB) - - .text - ALIGN(8) - -PROLOGUE(mpn_divexact_by3c) -deflit(`FRAME',0) - - movl PARAM_SRC, %ecx - pushl %ebp FRAME_pushl() - - movl PARAM_SIZE, %ebp - pushl %edi FRAME_pushl() - - movl PARAM_DST, %edi - pushl %esi FRAME_pushl() - - movl $INVERSE_3, %esi - pushl %ebx FRAME_pushl() - - leal (%ecx,%ebp,4), %ecx - movl PARAM_CARRY, %ebx - - leal (%edi,%ebp,4), %edi - negl %ebp - - - ALIGN(8) -L(top): - C eax scratch, low product - C ebx carry limb (0 to 3) - C ecx &src[size] - C edx scratch, high product - C esi multiplier - C edi &dst[size] - C ebp counter, limbs, negative - - movl (%ecx,%ebp,4), %eax - - subl %ebx, %eax - - setc %bl - - imull %esi - - cmpl $ONE_THIRD_CEIL, %eax - movl %eax, (%edi,%ebp,4) - - sbbl $-1, %ebx C +1 if eax>=ceil(b/3) - cmpl $TWO_THIRDS_CEIL, %eax - - sbbl $-1, %ebx C +1 if eax>=ceil(b*2/3) - incl %ebp - - jnz L(top) - - - movl %ebx, %eax - popl %ebx - popl %esi - popl %edi - popl %ebp - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/divrem_1.asm b/rts/gmp/mpn/x86/divrem_1.asm deleted file mode 100644 index 12f14676d6..0000000000 --- a/rts/gmp/mpn/x86/divrem_1.asm +++ /dev/null @@ -1,232 +0,0 @@ -dnl x86 mpn_divrem_1 -- mpn by limb division extending to fractional quotient. 
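The divexact-by-3 constants above (the inverse 0xAAAAAAAB and the two ceiling thresholds) can be checked with a small model of the loop in diveby3.asm. This is an illustrative Python sketch, not part of GMP; the name divexact_by3c and the little-endian list-of-limbs representation are inventions for the example.

```python
# Illustrative model of the diveby3.asm loop; divexact_by3c and the
# list-of-limbs representation are inventions for this sketch.
B = 1 << 32
INV3 = 0xAAAAAAAB             # multiplicative inverse of 3, modulo 2^32
ONE_THIRD_CEIL = 0x55555556   # ceil(B/3)
TWO_THIRDS_CEIL = 0xAAAAAAAB  # ceil(B*2/3)

def divexact_by3c(limbs, carry=0):
    """Divide a little-endian limb vector by 3, expecting no remainder."""
    out = []
    for s in limbs:
        a = s - carry                    # the subl, with setc catching borrow
        borrow = 1 if a < 0 else 0
        q = (a & (B - 1)) * INV3 & (B - 1)   # the imull: 3*q == a (mod B)
        out.append(q)
        carry = borrow                   # the two cmp / sbbl $-1 steps count
        if q >= ONE_THIRD_CEIL:          # how many times B/3 fits in q,
            carry += 1                   # i.e. floor(3*q / B)
        if q >= TWO_THIRDS_CEIL:
            carry += 1
    return out, carry
```

With a zero initial carry and an exact multiple of 3, the final carry comes out zero and the output limbs recompose to n/3, matching the routine's contract.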
- -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -dnl cycles/limb -dnl K6 20 -dnl P5 44 -dnl P6 39 -dnl 486 approx 43 maybe -dnl -dnl -dnl The following have their own optimized divrem_1 implementations, but -dnl for reference the code here runs as follows. -dnl -dnl cycles/limb -dnl P6MMX 39 -dnl K7 42 - - -include(`../config.m4') - - -C mp_limb_t mpn_divrem_1 (mp_ptr dst, mp_size_t xsize, -C mp_srcptr src, mp_size_t size, mp_limb_t divisor); -C mp_limb_t mpn_divrem_1c (mp_ptr dst, mp_size_t xsize, -C mp_srcptr src, mp_size_t size, mp_limb_t divisor); -C -C Divide src,size by divisor and store the quotient in dst+xsize,size. -C Extend the division to fractional quotient limbs in dst,xsize. Return the -C remainder. Either or both xsize and size can be 0. -C -C mpn_divrem_1c takes a carry parameter which is an initial high limb, -C effectively one extra limb at the top of src,size. Must have -C carry<divisor. -C -C -C Essentially the code is the same as the division based part of -C mpn/generic/divrem_1.c, but has the following advantages. 
-C -C - If gcc isn't being used then divrem_1.c will get the generic C -C udiv_qrnnd() and be rather slow. -C -C - On K6, using the loop instruction is a 10% speedup, but gcc doesn't -C generate that instruction (as of gcc 2.95.2 at least). -C -C A test is done to see if the high limb is less than the divisor, and if so -C one less div is done. A div is between 20 and 40 cycles on the various -C x86s, so assuming high<divisor about half the time, then this test saves -C half that amount. The branch misprediction penalty on each chip is less -C than half a div. -C -C -C K6: Back-to-back div instructions run at 20 cycles, the same as the loop -C here, so it seems there's nothing to gain by rearranging the loop. -C Pairing the mov and loop instructions was found to gain nothing. (The -C same is true of the mpn/x86/mod_1.asm loop.) -C -C With a "decl/jnz" rather than a "loop" this code runs at 22 cycles. -C The loop_or_decljnz macro is an easy way to get a 10% speedup. -C -C The fast K6 multiply might be thought to suit a multiply-by-inverse, -C but that algorithm has been found to suffer from the relatively poor -C carry handling on K6 and too many auxiliary instructions. The -C fractional part however could be done at about 13 c/l. -C -C P5: Moving the load down to pair with the store might save 1 cycle, but -C that doesn't seem worth bothering with, since it'd be only a 2.2% -C saving.
-C -C Again here the auxiliary instructions hinder a multiply-by-inverse, -C though there might be a 10-15% speedup available - - -defframe(PARAM_CARRY, 24) -defframe(PARAM_DIVISOR,20) -defframe(PARAM_SIZE, 16) -defframe(PARAM_SRC, 12) -defframe(PARAM_XSIZE, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(16) - -PROLOGUE(mpn_divrem_1c) -deflit(`FRAME',0) - - movl PARAM_SIZE, %ecx - pushl %edi FRAME_pushl() - - movl PARAM_SRC, %edi - pushl %esi FRAME_pushl() - - movl PARAM_DIVISOR, %esi - pushl %ebx FRAME_pushl() - - movl PARAM_DST, %ebx - pushl %ebp FRAME_pushl() - - movl PARAM_XSIZE, %ebp - orl %ecx, %ecx - - movl PARAM_CARRY, %edx - jz LF(mpn_divrem_1,fraction) - - leal -4(%ebx,%ebp,4), %ebx C dst one limb below integer part - jmp LF(mpn_divrem_1,integer_top) - -EPILOGUE() - - -PROLOGUE(mpn_divrem_1) -deflit(`FRAME',0) - - movl PARAM_SIZE, %ecx - pushl %edi FRAME_pushl() - - movl PARAM_SRC, %edi - pushl %esi FRAME_pushl() - - movl PARAM_DIVISOR, %esi - orl %ecx,%ecx - - jz L(size_zero) - pushl %ebx FRAME_pushl() - - movl -4(%edi,%ecx,4), %eax C src high limb - xorl %edx, %edx - - movl PARAM_DST, %ebx - pushl %ebp FRAME_pushl() - - movl PARAM_XSIZE, %ebp - cmpl %esi, %eax - - leal -4(%ebx,%ebp,4), %ebx C dst one limb below integer part - jae L(integer_entry) - - - C high<divisor, so high of dst is zero, and avoid one div - - movl %edx, (%ebx,%ecx,4) - decl %ecx - - movl %eax, %edx - jz L(fraction) - - -L(integer_top): - C eax scratch (quotient) - C ebx dst+4*xsize-4 - C ecx counter - C edx scratch (remainder) - C esi divisor - C edi src - C ebp xsize - - movl -4(%edi,%ecx,4), %eax -L(integer_entry): - - divl %esi - - movl %eax, (%ebx,%ecx,4) - loop_or_decljnz L(integer_top) - - -L(fraction): - orl %ebp, %ecx - jz L(done) - - movl PARAM_DST, %ebx - - -L(fraction_top): - C eax scratch (quotient) - C ebx dst - C ecx counter - C edx scratch (remainder) - C esi divisor - C edi - C ebp - - xorl %eax, %eax - - divl %esi - - movl %eax, -4(%ebx,%ecx,4) - loop_or_decljnz 
L(fraction_top) - - -L(done): - popl %ebp - movl %edx, %eax - popl %ebx - popl %esi - popl %edi - ret - - -L(size_zero): -deflit(`FRAME',8) - movl PARAM_XSIZE, %ecx - xorl %eax, %eax - - movl PARAM_DST, %edi - - cld C better safe than sorry, see mpn/x86/README.family - - rep - stosl - - popl %esi - popl %edi - ret -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k6/README b/rts/gmp/mpn/x86/k6/README deleted file mode 100644 index 3ad96c8b89..0000000000 --- a/rts/gmp/mpn/x86/k6/README +++ /dev/null @@ -1,237 +0,0 @@ - - AMD K6 MPN SUBROUTINES - - - -This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and -K6-3. - -The mmx and k62mmx subdirectories have routines using MMX instructions. All -K6s have MMX, the separate directories are just so that ./configure can omit -them if the assembler doesn't support MMX. - - - - -STATUS - -Times for the loops, with all code and data in L1 cache, are as follows. - - cycles/limb - - mpn_add_n/sub_n 3.25 normal, 2.75 in-place - - mpn_mul_1 6.25 - mpn_add/submul_1 7.65-8.4 (varying with data values) - - mpn_mul_basecase 9.25 cycles/crossproduct (approx) - mpn_sqr_basecase 4.7 cycles/crossproduct (approx) - or 9.2 cycles/triangleproduct (approx) - - mpn_divrem_1 20.0 - mpn_mod_1 20.0 - mpn_divexact_by3 11.0 - - mpn_l/rshift 3.0 - - mpn_copyi/copyd 1.0 - - mpn_com_n 1.5-1.85 \ - mpn_and/andn/ior/xor_n 1.5-1.75 | varying with - mpn_iorn/xnor_n 2.0-2.25 | data alignment - mpn_nand/nior_n 2.0-2.25 / - - mpn_popcount 12.5 - mpn_hamdist 13.0 - - -K6-2 and K6-3 have dual-issue MMX and get the following improvements. - - mpn_l/rshift 1.75 - - mpn_copyi/copyd 0.56 or 1.0 \ - | - mpn_com_n 1.0-1.2 | varying with - mpn_and/andn/ior/xor_n 1.2-1.5 | data alignment - mpn_iorn/xnor_n 1.5-2.0 | - mpn_nand/nior_n 1.75-2.0 / - - mpn_popcount 9.0 - mpn_hamdist 11.5 - - -Prefetching of sources hasn't yet given any joy. With the 3DNow "prefetch" -instruction, code seems to run slower, and with just "mov" loads it doesn't -seem faster. 
Results so far are inconsistent. The K6 does a hardware -prefetch of the second cache line in a sector, so the penalty for not -prefetching in software is reduced. - - - - -NOTES - -All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow. - -Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can -execute them in both X and Y (and together). - -Branch misprediction penalty is 1 to 4 cycles (Optimization Manual -chapter 6 table 12). - -Write-allocate L1 data cache means prefetching of destinations is unnecessary. -Store queue is 7 entries of 64 bits each. - -Floating point multiplications can be done in parallel with integer -multiplications, but there doesn't seem to be any way to make use of this. - - - -OPTIMIZATIONS - -Unrolled loops are used to reduce looping overhead. The unrolling is -configurable up to 32 limbs/loop for most routines, up to 64 for some. - -Sometimes computed jumps into the unrolling are used to handle sizes not a -multiple of the unrolling. An attractive feature of this is that times -smoothly increase with operand size, but an indirect jump is about 6 cycles -and the setups about another 6, so it depends on how much the unrolled code -is faster than a simple loop as to whether a computed jump ought to be used. - -Position independent code is implemented using a call to get eip for -computed jumps and a ret is always done, rather than an addl $4,%esp or a -popl, so the CPU return address branch prediction stack stays synchronised -with the actual stack in memory. Such a call however still costs 4 to 7 -cycles. - -Branch prediction, in absence of any history, will guess forward jumps are -not taken and backward jumps are taken. Where possible it's arranged that -the less likely or less important case is under a taken forward jump. - - - -MMX - -Putting emms or femms as late as possible in a routine seems to be fastest. 
-Perhaps an emms or femms stalls until all outstanding MMX instructions have -completed, so putting it later gives them a chance to complete on their own, -in parallel with other operations (like register popping). - -The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3 -at the start of a routine, in case it's been preceded by x87 floating point -operations. This isn't done because in gmp programs it's expected that x87 -floating point won't be much used and that chances are an mpn routine won't -have been preceded by any x87 code. - - - -CODING - -Instructions in general code are shown paired if they can decode and execute -together, meaning two short decode instructions with the second not -depending on the first, only the first using the shifter, no more than one -load, and no more than one store. - -K6 does some out of order execution so the pairings aren't essential, they -just show what slots might be available. When decoding is the limiting -factor things can be scheduled that might not execute until later. - - - -NOTES - -Code alignment - -- if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary, - short decode is inhibited. The cross.pl script detects this. - -- loops and branch targets should be aligned to 16 bytes, or ensure at least - 2 instructions before a 32 byte boundary. This makes use of the 16 byte - cache in the BTB. - -Addressing modes - -- (%esi) degrades decoding from short to vector. 0(%esi) doesn't have this - problem, and can be used as an equivalent, or easier is just to use a - different register, like %ebx. - -- K6 and pre-CXT core K6-2 have the following problem. (K6-2 CXT and K6-3 - have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F). - - If more than 3 bytes are needed to determine instruction length then - decoding degrades from direct to long, or from long to vector. 
This - happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since - with mod=00 the sib determines whether there's a displacement. - - This affects all MMX and 3DNow instructions, and others with an 0F prefix - like movzbl. The modes affected are anything with an index and no - displacement, or an index but no base, and this includes (%esp) which is - really (,%esp,1). - - The cross.pl script detects problem cases. The workaround is to always - use a displacement, and to do this with Zdisp if it's zero so the - assembler doesn't discard it. - - See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages - 13-14 and 36-37. - -Calls - -- indirect jumps and calls are not branch predicted, they measure about 6 - cycles. - -Various - -- adcl 2 cycles of decode, maybe 2 cycles executing in the X pipe -- bsf 12-27 cycles -- emms 5 cycles -- femms 3 cycles -- jecxz 2 cycles taken, 13 not taken (optimization manual says 7 not taken) -- divl 20 cycles back-to-back -- imull 2 decode, 2 execute -- mull 2 decode, 3 execute (optimization manual decoding sample) -- prefetch 2 cycles -- rcll/rcrl implicit by one bit: 2 cycles - immediate or %cl count: 11 + 2 per bit for dword - 13 + 4 per bit for byte -- setCC 2 cycles -- xchgl %eax,reg 1.5 cycles, back-to-back (strange) - reg,reg 2 cycles, back-to-back - - - - -REFERENCES - -"AMD-K6 Processor Code Optimization Application Note", AMD publication -number 21924, revision D amendment 0, January 2000. This describes K6-2 and -K6-3. Available on-line, - - http://www.amd.com/K6/k6docs/pdf/21924.pdf - -"AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD -publication number 21828, revision A amendment 0, August 1997. This is an -older edition of the above document, describing plain K6. Available -on-line, - - http://www.amd.com/K6/k6docs/pdf/21828.pdf - -"3DNow Technology Manual", AMD publication number 21928F/0-August 1999. 
-This describes the femms and prefetch instructions, but nothing else from -3DNow has been used. Available on-line, - - http://www.amd.com/K6/k6docs/pdf/21928.pdf - -"3DNow Instruction Porting Guide", AMD publication number 22621, revision B, -August 1999. This has some notes on general K6 optimizations as well as -3DNow. Available on-line, - - http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf - - - ----------------- -Local variables: -mode: text -fill-column: 76 -End: diff --git a/rts/gmp/mpn/x86/k6/aors_n.asm b/rts/gmp/mpn/x86/k6/aors_n.asm deleted file mode 100644 index 31b05ada51..0000000000 --- a/rts/gmp/mpn/x86/k6/aors_n.asm +++ /dev/null @@ -1,329 +0,0 @@ -dnl AMD K6 mpn_add/sub_n -- mpn addition or subtraction. -dnl -dnl K6: normal 3.25 cycles/limb, in-place 2.75 cycles/limb. - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. 
- - -include(`../config.m4') - - -ifdef(`OPERATION_add_n', ` - define(M4_inst, adcl) - define(M4_function_n, mpn_add_n) - define(M4_function_nc, mpn_add_nc) - define(M4_description, add) -',`ifdef(`OPERATION_sub_n', ` - define(M4_inst, sbbl) - define(M4_function_n, mpn_sub_n) - define(M4_function_nc, mpn_sub_nc) - define(M4_description, subtract) -',`m4_error(`Need OPERATION_add_n or OPERATION_sub_n -')')') - -MULFUNC_PROLOGUE(mpn_add_n mpn_add_nc mpn_sub_n mpn_sub_nc) - - -C mp_limb_t M4_function_n (mp_ptr dst, mp_srcptr src1, mp_srcptr src2, -C mp_size_t size); -C mp_limb_t M4_function_nc (mp_ptr dst, mp_srcptr src1, mp_srcptr src2, -C mp_size_t size, mp_limb_t carry); -C -C Calculate src1,size M4_description src2,size, and store the result in -C dst,size. The return value is the carry bit from the top of the result -C (1 or 0). -C -C The _nc version accepts 1 or 0 for an initial carry into the low limb of -C the calculation. Note values other than 1 or 0 here will lead to garbage -C results. -C -C Instruction decoding limits a normal dst=src1+src2 operation to 3 c/l, and -C an in-place dst+=src to 2.5 c/l. The unrolled loops have 1 cycle/loop of -C loop control, which with 4 limbs/loop means an extra 0.25 c/l. 
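As a reference for what the assembler computes, the add variant of this routine can be modelled in a few lines of Python. This is an illustrative sketch, not GMP code; the function name follows the prototype above, B is the limb base 2^32, and vectors are little-endian lists.

```python
# Reference model of what mpn_add_nc computes; this Python is an
# illustration of the semantics, not GMP code.
B = 1 << 32

def mpn_add_nc(src1, src2, carry=0):
    """Add two equal-length little-endian limb vectors plus an initial
    carry (0 or 1); return (dst, carry_out) like the asm's eax result."""
    dst = []
    for a, b in zip(src1, src2):
        s = a + b + carry        # the adcl chain, one limb at a time
        dst.append(s & (B - 1))
        carry = s >> 32          # carry out of the top limb, 0 or 1
    return dst, carry
```

For example, adding 1 to 0xFFFFFFFF propagates a carry into the next limb, and adding two full limbs returns carry_out 1, which is the routine's return value.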
- -define(PARAM_CARRY, `FRAME+20(%esp)') -define(PARAM_SIZE, `FRAME+16(%esp)') -define(PARAM_SRC2, `FRAME+12(%esp)') -define(PARAM_SRC1, `FRAME+8(%esp)') -define(PARAM_DST, `FRAME+4(%esp)') -deflit(`FRAME',0) - -dnl minimum 5 because the unrolled code can't handle less -deflit(UNROLL_THRESHOLD, 5) - - .text - ALIGN(32) - -PROLOGUE(M4_function_nc) - movl PARAM_CARRY, %eax - jmp LF(M4_function_n,start) -EPILOGUE() - - -PROLOGUE(M4_function_n) - xorl %eax, %eax -L(start): - movl PARAM_SIZE, %ecx - pushl %ebx -FRAME_pushl() - - movl PARAM_SRC1, %ebx - pushl %edi -FRAME_pushl() - - movl PARAM_SRC2, %edx - cmpl $UNROLL_THRESHOLD, %ecx - - movl PARAM_DST, %edi - jae L(unroll) - - - shrl %eax C initial carry flag - - C offset 0x21 here, close enough to aligned -L(simple): - C eax scratch - C ebx src1 - C ecx counter - C edx src2 - C esi - C edi dst - C ebp - C - C The store to (%edi) could be done with a stosl; it'd be smaller - C code, but there's no speed gain and a cld would have to be added - C (per mpn/x86/README.family). 
- - movl (%ebx), %eax - leal 4(%ebx), %ebx - - M4_inst (%edx), %eax - - movl %eax, (%edi) - leal 4(%edi), %edi - - leal 4(%edx), %edx - loop L(simple) - - - movl $0, %eax - popl %edi - - setc %al - - popl %ebx - ret - - -C ----------------------------------------------------------------------------- -L(unroll): - C eax carry - C ebx src1 - C ecx counter - C edx src2 - C esi - C edi dst - C ebp - - cmpl %edi, %ebx - pushl %esi - - je L(inplace) - -ifdef(`OPERATION_add_n',` - cmpl %edi, %edx - - je L(inplace_reverse) -') - - movl %ecx, %esi - - andl $-4, %ecx - andl $3, %esi - - leal (%ebx,%ecx,4), %ebx - leal (%edx,%ecx,4), %edx - leal (%edi,%ecx,4), %edi - - negl %ecx - shrl %eax - - ALIGN(32) -L(normal_top): - C eax counter, qwords, negative - C ebx src1 - C ecx scratch - C edx src2 - C esi - C edi dst - C ebp - - movl (%ebx,%ecx,4), %eax - leal 5(%ecx), %ecx - M4_inst -20(%edx,%ecx,4), %eax - movl %eax, -20(%edi,%ecx,4) - - movl 4-20(%ebx,%ecx,4), %eax - M4_inst 4-20(%edx,%ecx,4), %eax - movl %eax, 4-20(%edi,%ecx,4) - - movl 8-20(%ebx,%ecx,4), %eax - M4_inst 8-20(%edx,%ecx,4), %eax - movl %eax, 8-20(%edi,%ecx,4) - - movl 12-20(%ebx,%ecx,4), %eax - M4_inst 12-20(%edx,%ecx,4), %eax - movl %eax, 12-20(%edi,%ecx,4) - - loop L(normal_top) - - - decl %esi - jz L(normal_finish_one) - js L(normal_done) - - C two or three more limbs - - movl (%ebx), %eax - M4_inst (%edx), %eax - movl %eax, (%edi) - - movl 4(%ebx), %eax - M4_inst 4(%edx), %eax - decl %esi - movl %eax, 4(%edi) - - jz L(normal_done) - movl $2, %ecx - -L(normal_finish_one): - movl (%ebx,%ecx,4), %eax - M4_inst (%edx,%ecx,4), %eax - movl %eax, (%edi,%ecx,4) - -L(normal_done): - popl %esi - popl %edi - - movl $0, %eax - popl %ebx - - setc %al - - ret - - -C ----------------------------------------------------------------------------- - -ifdef(`OPERATION_add_n',` -L(inplace_reverse): - C dst==src2 - - movl %ebx, %edx -') - -L(inplace): - C eax initial carry - C ebx - C ecx size - C edx src - C esi - C edi dst - 
C ebp - - leal -1(%ecx), %esi - decl %ecx - - andl $-4, %ecx - andl $3, %esi - - movl (%edx), %ebx C src low limb - leal (%edx,%ecx,4), %edx - - leal (%edi,%ecx,4), %edi - negl %ecx - - shrl %eax - - - ALIGN(32) -L(inplace_top): - C eax - C ebx next src limb - C ecx size - C edx src - C esi - C edi dst - C ebp - - M4_inst %ebx, (%edi,%ecx,4) - - movl 4(%edx,%ecx,4), %eax - leal 5(%ecx), %ecx - - M4_inst %eax, 4-20(%edi,%ecx,4) - - movl 8-20(%edx,%ecx,4), %eax - movl 12-20(%edx,%ecx,4), %ebx - - M4_inst %eax, 8-20(%edi,%ecx,4) - M4_inst %ebx, 12-20(%edi,%ecx,4) - - movl 16-20(%edx,%ecx,4), %ebx - loop L(inplace_top) - - - C now %esi is 0 to 3 representing respectively 1 to 4 limbs more - - M4_inst %ebx, (%edi) - - decl %esi - jz L(inplace_finish_one) - js L(inplace_done) - - C two or three more limbs - - movl 4(%edx), %eax - movl 8(%edx), %ebx - M4_inst %eax, 4(%edi) - M4_inst %ebx, 8(%edi) - - decl %esi - movl $2, %ecx - - jz L(normal_done) - -L(inplace_finish_one): - movl 4(%edx,%ecx,4), %eax - M4_inst %eax, 4(%edi,%ecx,4) - -L(inplace_done): - popl %esi - popl %edi - - movl $0, %eax - popl %ebx - - setc %al - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k6/aorsmul_1.asm b/rts/gmp/mpn/x86/k6/aorsmul_1.asm deleted file mode 100644 index da4120fe2f..0000000000 --- a/rts/gmp/mpn/x86/k6/aorsmul_1.asm +++ /dev/null @@ -1,372 +0,0 @@ -dnl AMD K6 mpn_addmul_1/mpn_submul_1 -- add or subtract mpn multiple. -dnl -dnl K6: 7.65 to 8.5 cycles/limb (at 16 limbs/loop and depending on the data), -dnl PIC adds about 6 cycles at the start. - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. 
-dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -dnl K6: large multipliers small multipliers -dnl UNROLL_COUNT cycles/limb cycles/limb -dnl 4 9.5 7.78 -dnl 8 9.0 7.78 -dnl 16 8.4 7.65 -dnl 32 8.4 8.2 -dnl -dnl Maximum possible unrolling with the current code is 32. -dnl -dnl Unrolling to 16 limbs/loop makes the unrolled loop fit exactly in a 256 -dnl byte block, which might explain the good speed at that unrolling. - -deflit(UNROLL_COUNT, 16) - - -ifdef(`OPERATION_addmul_1', ` - define(M4_inst, addl) - define(M4_function_1, mpn_addmul_1) - define(M4_function_1c, mpn_addmul_1c) - define(M4_description, add it to) - define(M4_desc_retval, carry) -',`ifdef(`OPERATION_submul_1', ` - define(M4_inst, subl) - define(M4_function_1, mpn_submul_1) - define(M4_function_1c, mpn_submul_1c) - define(M4_description, subtract it from) - define(M4_desc_retval, borrow) -',`m4_error(`Need OPERATION_addmul_1 or OPERATION_submul_1 -')')') - -MULFUNC_PROLOGUE(mpn_addmul_1 mpn_addmul_1c mpn_submul_1 mpn_submul_1c) - - -C mp_limb_t M4_function_1 (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t mult); -C mp_limb_t M4_function_1c (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t mult, mp_limb_t carry); -C -C Calculate src,size multiplied by mult and M4_description dst,size. -C Return the M4_desc_retval limb from the top of the result. -C -C The jadcl0()s in the unrolled loop make the speed data dependent.
Small -C multipliers (most significant few bits clear) result in few carry bits and -C speeds up to 7.65 cycles/limb are attained. Large multipliers (most -C significant few bits set) make the carry bits 50/50 and lead to something -C more like 8.4 c/l. (With adcl's both of these would be 9.3 c/l.) -C -C It's important that the gains for jadcl0 on small multipliers don't come -C at the cost of slowing down other data. Tests on uniformly distributed -C random data, designed to confound branch prediction, show about a 7% -C speed-up using jadcl0 over adcl (8.93 versus 9.57 cycles/limb, with all -C overheads included). -C -C In the simple loop, jadcl0() measures slower than adcl (11.9-14.7 versus -C 11.0 cycles/limb), and hence isn't used. -C -C In the simple loop, note that running ecx from negative to zero and using -C it as an index in the two movs wouldn't help. It would save one -C instruction (2*addl+loop becoming incl+jnz), but there's nothing unpaired -C that would be collapsed by this. -C -C -C jadcl0 -C ------ -C -C jadcl0() being faster than adcl $0 seems to be an artifact of two things, -C firstly the instruction decoding and secondly the fact that there's a -C carry bit for the jadcl0 only on average about 1/4 of the time. -C -C The code in the unrolled loop decodes something like the following. -C -C decode cycles -C mull %ebp 2 -C M4_inst %esi, disp(%edi) 1 -C adcl %eax, %ecx 2 -C movl %edx, %esi \ 1 -C jnc 1f / -C incl %esi \ 1 -C 1: movl disp(%ebx), %eax / -C --- -C 7 -C -C In a back-to-back style test this measures 7 with the jnc not taken, or 8 -C with it taken (both when correctly predicted). This is opposite to the -C measurements showing small multipliers running faster than large ones. -C Watch this space for more info ... -C -C It's not clear how much branch misprediction might be costing. The K6 -C doco says it will be 1 to 4 cycles, but presumably it's near the low end -C of that range to get the measured results. 
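The figure of the jadcl0 carry turning up only about 1/4 of the time for uniformly distributed data can be checked numerically. This Python simulation is an illustration only (the function name and trial count are invented, not GMP code); it models the carry condition x*y + u*b + v >= b^2 produced by the loop's adds, with b = 2^32.

```python
import random

B = 1 << 32

def carry_fraction(trials=100000, seed=1):
    """Estimate how often x*y + u*B + v >= B^2 for uniform random limbs,
    i.e. how often jadcl0 sees a carry in the unrolled loop."""
    random.seed(seed)
    carries = 0
    for _ in range(trials):
        x, y, u, v = (random.randrange(B) for _ in range(4))
        if x * y + u * B + v >= B * B:
            carries += 1
    return carries / trials      # comes out just under 1/4
```

The estimate lands very close to 0.25, agreeing with the summation worked through below for uniform operands.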
-C -C -C In the code the two carries are more or less the preceding mul product and -C the calculation is roughly -C -C x*y + u*b+v -C -C where b=2^32 is the size of a limb, x*y is the two carry limbs, and u and -C v are the two limbs it's added to (being the low of the next mul, and a -C limb from the destination). -C -C To get a carry requires x*y+u*b+v >= b^2, which is u*b+v >= b^2-x*y, and -C there are b^2-(b^2-x*y) = x*y many such values, giving a probability of -C x*y/b^2. If x, y, u and v are random and uniformly distributed between 0 -C and b-1, then the total probability can be summed over x and y, -C -C 1 b-1 b-1 x*y 1 b*(b-1) b*(b-1) -C --- * sum sum --- = --- * ------- * ------- = 1/4 -C b^2 x=0 y=1 b^2 b^4 2 2 -C -C Actually it's a very tiny bit less than 1/4 of course. If y is fixed, -C then the probability is 1/2*y/b thus varying linearly between 0 and 1/2. - - -ifdef(`PIC',` -deflit(UNROLL_THRESHOLD, 9) -',` -deflit(UNROLL_THRESHOLD, 6) -') - -defframe(PARAM_CARRY, 20) -defframe(PARAM_MULTIPLIER,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(32) - -PROLOGUE(M4_function_1c) - pushl %esi -deflit(`FRAME',4) - movl PARAM_CARRY, %esi - jmp LF(M4_function_1,start_nc) -EPILOGUE() - -PROLOGUE(M4_function_1) - push %esi -deflit(`FRAME',4) - xorl %esi, %esi C initial carry - -L(start_nc): - movl PARAM_SIZE, %ecx - pushl %ebx -deflit(`FRAME',8) - - movl PARAM_SRC, %ebx - pushl %edi -deflit(`FRAME',12) - - cmpl $UNROLL_THRESHOLD, %ecx - movl PARAM_DST, %edi - - pushl %ebp -deflit(`FRAME',16) - jae L(unroll) - - - C simple loop - - movl PARAM_MULTIPLIER, %ebp - -L(simple): - C eax scratch - C ebx src - C ecx counter - C edx scratch - C esi carry - C edi dst - C ebp multiplier - - movl (%ebx), %eax - addl $4, %ebx - - mull %ebp - - addl $4, %edi - addl %esi, %eax - - adcl $0, %edx - - M4_inst %eax, -4(%edi) - - adcl $0, %edx - - movl %edx, %esi - loop L(simple) - - - popl %ebp - popl %edi - - popl %ebx - movl %esi, 
%eax - - popl %esi - ret - - - -C ----------------------------------------------------------------------------- -C The unrolled loop uses a "two carry limbs" scheme. At the top of the loop -C the carries are ecx=lo, esi=hi, then they swap for each limb processed. -C For the computed jump an odd size means they start one way around, an even -C size the other. -C -C VAR_JUMP holds the computed jump temporarily because there's not enough -C registers at the point of doing the mul for the initial two carry limbs. -C -C The add/adc for the initial carry in %esi is necessary only for the -C mpn_addmul/submul_1c entry points. Duplicating the startup code to -C eliminate this for the plain mpn_add/submul_1 doesn't seem like a good -C idea. - -dnl overlapping with parameters already fetched -define(VAR_COUNTER, `PARAM_SIZE') -define(VAR_JUMP, `PARAM_DST') - -L(unroll): - C eax - C ebx src - C ecx size - C edx - C esi initial carry - C edi dst - C ebp - - movl %ecx, %edx - decl %ecx - - subl $2, %edx - negl %ecx - - shrl $UNROLL_LOG2, %edx - andl $UNROLL_MASK, %ecx - - movl %edx, VAR_COUNTER - movl %ecx, %edx - - shll $4, %edx - negl %ecx - - C 15 code bytes per limb -ifdef(`PIC',` - call L(pic_calc) -L(here): -',` - leal L(entry) (%edx,%ecx,1), %edx -') - movl (%ebx), %eax C src low limb - - movl PARAM_MULTIPLIER, %ebp - movl %edx, VAR_JUMP - - mull %ebp - - addl %esi, %eax C initial carry (from _1c) - jadcl0( %edx) - - - leal 4(%ebx,%ecx,4), %ebx - movl %edx, %esi C high carry - - movl VAR_JUMP, %edx - leal (%edi,%ecx,4), %edi - - testl $1, %ecx - movl %eax, %ecx C low carry - - jz L(noswap) - movl %esi, %ecx C high,low carry other way around - - movl %eax, %esi -L(noswap): - - jmp *%edx - - -ifdef(`PIC',` -L(pic_calc): - C See README.family about old gas bugs - leal (%edx,%ecx,1), %edx - addl $L(entry)-L(here), %edx - addl (%esp), %edx - ret -') - - -C ----------------------------------------------------------- - ALIGN(32) -L(top): -deflit(`FRAME',16) - C eax scratch - C
ebx src - C ecx carry lo - C edx scratch - C esi carry hi - C edi dst - C ebp multiplier - C - C 15 code bytes per limb - - leal UNROLL_BYTES(%edi), %edi - -L(entry): -forloop(`i', 0, UNROLL_COUNT/2-1, ` - deflit(`disp0', eval(2*i*4)) - deflit(`disp1', eval(disp0 + 4)) - -Zdisp( movl, disp0,(%ebx), %eax) - mull %ebp -Zdisp( M4_inst,%ecx, disp0,(%edi)) - adcl %eax, %esi - movl %edx, %ecx - jadcl0( %ecx) - - movl disp1(%ebx), %eax - mull %ebp - M4_inst %esi, disp1(%edi) - adcl %eax, %ecx - movl %edx, %esi - jadcl0( %esi) -') - - decl VAR_COUNTER - leal UNROLL_BYTES(%ebx), %ebx - - jns L(top) - - - popl %ebp - M4_inst %ecx, UNROLL_BYTES(%edi) - - popl %edi - movl %esi, %eax - - popl %ebx - jadcl0( %eax) - - popl %esi - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k6/cross.pl b/rts/gmp/mpn/x86/k6/cross.pl deleted file mode 100644 index 21734f3e52..0000000000 --- a/rts/gmp/mpn/x86/k6/cross.pl +++ /dev/null @@ -1,141 +0,0 @@ -#! /usr/bin/perl - -# Copyright (C) 2000 Free Software Foundation, Inc. -# -# This file is part of the GNU MP Library. -# -# The GNU MP Library is free software; you can redistribute it and/or modify -# it under the terms of the GNU Lesser General Public License as published -# by the Free Software Foundation; either version 2.1 of the License, or (at -# your option) any later version. -# -# The GNU MP Library is distributed in the hope that it will be useful, but -# WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY -# or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public -# License for more details. -# -# You should have received a copy of the GNU Lesser General Public License -# along with the GNU MP Library; see the file COPYING.LIB. If not, write to -# the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, -# MA 02111-1307, USA. - - -# Usage: cross.pl [filename.o]... 
-# -# Produce an annotated disassembly of the given object files, indicating -# certain code alignment and addressing mode problems afflicting K6 chips. -# "ZZ" is used on all annotations, so this can be searched for. -# -# With no arguments, all .o files corresponding to .asm files are processed. -# This is good in the mpn object directory of a k6*-*-* build. -# -# As far as fixing problems goes, any cache line crossing problems in loops -# get attention, but as a rule it's too tedious to rearrange code or slip in -# nops to fix every problem in setup or finishup code. -# -# Bugs: -# -# Instructions without mod/rm bytes or which are already vector decoded are -# unaffected by cache line boundary crossing, but not all of these have yet -# been put in as exceptions. All that occur in practice in GMP are present -# though. -# -# There's no messages for using the vector decoded addressing mode (%esi), -# but that mode is easy to avoid when coding. - -use strict; - -sub disassemble { - my ($file) = @_; - my ($addr,$b1,$b2,$b3, $prefix,$opcode,$modrm); - - open (IN, "objdump -Srfh $file |") - || die "Cannot open pipe from objdump\n"; - while (<IN>) { - print; - - if (/^[ \t]*[0-9]+[ \t]+\.text[ \t]/ && /2\*\*([0-9]+)$/) { - if ($1 < 5) { - print "ZZ need at least 2**5 for predictable cache line crossing\n"; - } - } - - if (/^[ \t]*([0-9a-f]*):[ \t]*([0-9a-f]+)[ \t]+([0-9a-f]+)[ \t]+([0-9a-f]+)/) { - ($addr,$b1,$b2,$b3) = ($1,$2,$3,$4); - - } elsif (/^[ \t]*([0-9a-f]*):[ \t]*([0-9a-f]+)[ \t]+([0-9a-f]+)/) { - ($addr,$b1,$b2,$b3) = ($1,$2,$3,''); - - } elsif (/^[ \t]*([0-9a-f]*):[ \t]*([0-9a-f]+)/) { - ($addr,$b1,$b2,$b3) = ($1,$2,'',''); - - } else { - next; - } - - if ($b1 =~ /0f/) { - $prefix = $b1; - $opcode = $b2; - $modrm = $b3; - } else { - $prefix = ''; - $opcode = $b1; - $modrm = $b2; - } - - # modrm of the form 00-xxx-100 with an 0F prefix is the problem case - # for K6 and pre-CXT K6-2 - if ($prefix =~ /0f/ - && $opcode !~ /^8/ # jcond disp32 - && $modrm =~ 
/^[0-3][4c]/) { - print "ZZ ($file) >3 bytes to determine instruction length\n"; - } - - # with just an opcode, starting 1f mod 20h - if ($addr =~ /[13579bdf]f$/ - && $prefix !~ /0f/ - && $opcode !~ /1[012345]/ # adc - && $opcode !~ /1[89abcd]/ # sbb - && $opcode !~ /68/ # push $imm32 - && $opcode !~ /^7/ # jcond disp8 - && $opcode !~ /a[89]/ # test+imm - && $opcode !~ /a[a-f]/ # stos/lods/scas - && $opcode !~ /b8/ # movl $imm32,%eax - && $opcode !~ /e[0123]/ # loop/loopz/loopnz/jcxz - && $opcode !~ /e[b9]/ # jmp disp8/disp32 - && $opcode !~ /f[89abcd]/ # clc,stc,cli,sti,cld,std - && !($opcode =~ /f[67]/ # grp 1 - && $modrm =~ /^[2367abef]/) # mul, imul, div, idiv - && $modrm !~ /^$/) { - print "ZZ ($file) opcode/modrm cross 32-byte boundary\n"; - } - - # with an 0F prefix, anything starting at 1f mod 20h - if ($addr =~ /[13579bdf][f]$/ - && $prefix =~ /0f/) { - print "ZZ ($file) prefix/opcode cross 32-byte boundary\n"; - } - - # with an 0F prefix, anything with mod/rm starting at 1e mod 20h - if ($addr =~ /[13579bdf][e]$/ - && $prefix =~ /0f/ - && $opcode !~ /^8/ # jcond disp32 - && $modrm !~ /^$/) { - print "ZZ ($file) prefix/opcode/modrm cross 32-byte boundary\n"; - } - } - close IN || die "Error from objdump (or objdump not available)\n"; -} - - -my @files; -if ($#ARGV >= 0) { - @files = @ARGV; -} else { - @files = glob "*.asm"; - map {s/.asm/.o/} @files; -} - -foreach (@files) { - disassemble($_); -} diff --git a/rts/gmp/mpn/x86/k6/diveby3.asm b/rts/gmp/mpn/x86/k6/diveby3.asm deleted file mode 100644 index ffb97bc380..0000000000 --- a/rts/gmp/mpn/x86/k6/diveby3.asm +++ /dev/null @@ -1,110 +0,0 @@ -dnl AMD K6 mpn_divexact_by3 -- mpn division by 3, expecting no remainder. -dnl -dnl K6: 11.0 cycles/limb - - -dnl Copyright (C) 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. 
-dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_divexact_by3c (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t carry); -C -C Using %esi in (%esi,%ecx,4) or 0(%esi,%ecx,4) addressing modes doesn't -C lead to vector decoding, unlike plain (%esi) does. 
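The exact division above works because 0xAAAAAAAB is the multiplicative inverse of 3 modulo 2^32, so when x is an exact multiple of 3, q = x * 0xAAAAAAAB (mod 2^32) satisfies 3*q == x (mod 2^32). As an editorial aid, here is a C sketch of what the limb loop computes (hypothetical helper name, not GMP's actual C code): each iteration subtracts the incoming carry, multiplies by the inverse, then recovers the next carry from the borrow plus the high half of q*3, mirroring the subl/setc/imull/mull sequence. A zero return means the division was exact.

```c
#include <stddef.h>
#include <stdint.h>

uint32_t divexact_by3c(uint32_t *dst, const uint32_t *src, size_t size,
                       uint32_t carry)
{
    const uint32_t inverse_3 = 0xAAAAAAABu;   /* INVERSE_3 in the asm */
    for (size_t i = 0; i < size; i++) {
        uint32_t s = src[i];
        uint32_t l = s - carry;               /* subl: subtract carry */
        carry = l > s;                        /* setc: borrow out */
        uint32_t q = l * inverse_3;           /* imull $INVERSE_3 */
        dst[i] = q;
        carry += (uint32_t)(((uint64_t)q * 3) >> 32);  /* mull VAR_THREE */
    }
    return carry;                             /* 0 iff division was exact */
}
```

For example, {2,1} (the two-limb value 3 * 0x55555556) divides to {0x55555556,0} with a zero final carry, while a non-multiple such as 7 leaves a nonzero carry.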
- -defframe(PARAM_CARRY,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - -dnl multiplicative inverse of 3, modulo 2^32 -deflit(INVERSE_3, 0xAAAAAAAB) - - .text - ALIGN(32) - -PROLOGUE(mpn_divexact_by3c) -deflit(`FRAME',0) - - movl PARAM_SIZE, %ecx - pushl %esi defframe_pushl(SAVE_ESI) - - movl PARAM_SRC, %esi - pushl %edi defframe_pushl(SAVE_EDI) - - movl PARAM_DST, %edi - pushl %ebx defframe_pushl(SAVE_EBX) - - movl PARAM_CARRY, %ebx - leal (%esi,%ecx,4), %esi - - pushl $3 defframe_pushl(VAR_THREE) - leal (%edi,%ecx,4), %edi - - negl %ecx - - - C Need 32 alignment for claimed speed, to avoid the movl store - C opcode/modrm crossing a cache line boundary - - ALIGN(32) -L(top): - C eax scratch, low product - C ebx carry limb (0 to 3) - C ecx counter, limbs, negative - C edx scratch, high product - C esi &src[size] - C edi &dst[size] - C ebp - C - C The 0(%esi,%ecx,4) form pads so the finishup "movl %ebx, %eax" - C doesn't cross a 32 byte boundary, saving a couple of cycles - C (that's a fixed couple, not per loop). - -Zdisp( movl, 0,(%esi,%ecx,4), %eax) - subl %ebx, %eax - - setc %bl - - imull $INVERSE_3, %eax - - movl %eax, (%edi,%ecx,4) - addl $2, %ecx - - mull VAR_THREE - - addl %edx, %ebx - loop L(top) - - - movl SAVE_ESI, %esi - movl %ebx, %eax - - movl SAVE_EBX, %ebx - - movl SAVE_EDI, %edi - addl $FRAME, %esp - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k6/gmp-mparam.h b/rts/gmp/mpn/x86/k6/gmp-mparam.h deleted file mode 100644 index 77f3948d77..0000000000 --- a/rts/gmp/mpn/x86/k6/gmp-mparam.h +++ /dev/null @@ -1,97 +0,0 @@ -/* AMD K6 gmp-mparam.h -- Compiler/machine parameter header file. - -Copyright (C) 1991, 1993, 1994, 2000 Free Software Foundation, Inc. - -This file is part of the GNU MP Library. 
- -The GNU MP Library is free software; you can redistribute it and/or modify -it under the terms of the GNU Lesser General Public License as published by -the Free Software Foundation; either version 2.1 of the License, or (at your -option) any later version. - -The GNU MP Library is distributed in the hope that it will be useful, but -WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY -or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public -License for more details. - -You should have received a copy of the GNU Lesser General Public License -along with the GNU MP Library; see the file COPYING.LIB. If not, write to -the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, -MA 02111-1307, USA. */ - -#define BITS_PER_MP_LIMB 32 -#define BYTES_PER_MP_LIMB 4 -#define BITS_PER_LONGINT 32 -#define BITS_PER_INT 32 -#define BITS_PER_SHORTINT 16 -#define BITS_PER_CHAR 8 - - -#ifndef UMUL_TIME -#define UMUL_TIME 3 /* cycles */ -#endif - -#ifndef UDIV_TIME -#define UDIV_TIME 20 /* cycles */ -#endif - -/* bsfl takes 12-27 cycles, put an average for uniform random numbers */ -#ifndef COUNT_TRAILING_ZEROS_TIME -#define COUNT_TRAILING_ZEROS_TIME 14 /* cycles */ -#endif - - -/* Generated by tuneup.c, 2000-07-04. 
*/ - -#ifndef KARATSUBA_MUL_THRESHOLD -#define KARATSUBA_MUL_THRESHOLD 18 -#endif -#ifndef TOOM3_MUL_THRESHOLD -#define TOOM3_MUL_THRESHOLD 130 -#endif - -#ifndef KARATSUBA_SQR_THRESHOLD -#define KARATSUBA_SQR_THRESHOLD 34 -#endif -#ifndef TOOM3_SQR_THRESHOLD -#define TOOM3_SQR_THRESHOLD 116 -#endif - -#ifndef BZ_THRESHOLD -#define BZ_THRESHOLD 68 -#endif - -#ifndef FIB_THRESHOLD -#define FIB_THRESHOLD 98 -#endif - -#ifndef POWM_THRESHOLD -#define POWM_THRESHOLD 13 -#endif - -#ifndef GCD_ACCEL_THRESHOLD -#define GCD_ACCEL_THRESHOLD 4 -#endif -#ifndef GCDEXT_THRESHOLD -#define GCDEXT_THRESHOLD 67 -#endif - -#ifndef FFT_MUL_TABLE -#define FFT_MUL_TABLE { 528, 1184, 2176, 5632, 14336, 40960, 0 } -#endif -#ifndef FFT_MODF_MUL_THRESHOLD -#define FFT_MODF_MUL_THRESHOLD 472 -#endif -#ifndef FFT_MUL_THRESHOLD -#define FFT_MUL_THRESHOLD 4352 -#endif - -#ifndef FFT_SQR_TABLE -#define FFT_SQR_TABLE { 528, 1184, 2176, 5632, 14336, 40960, 0 } -#endif -#ifndef FFT_MODF_SQR_THRESHOLD -#define FFT_MODF_SQR_THRESHOLD 544 -#endif -#ifndef FFT_SQR_THRESHOLD -#define FFT_SQR_THRESHOLD 4352 -#endif diff --git a/rts/gmp/mpn/x86/k6/k62mmx/copyd.asm b/rts/gmp/mpn/x86/k6/k62mmx/copyd.asm deleted file mode 100644 index 20a33e6ccf..0000000000 --- a/rts/gmp/mpn/x86/k6/k62mmx/copyd.asm +++ /dev/null @@ -1,179 +0,0 @@ -dnl AMD K6-2 mpn_copyd -- copy limb vector, decrementing. -dnl -dnl K6-2: 0.56 or 1.0 cycles/limb (at 32 limbs/loop), depending on data -dnl alignment. - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. 
-dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -dnl K6-2 aligned: -dnl UNROLL_COUNT cycles/limb -dnl 8 0.75 -dnl 16 0.625 -dnl 32 0.5625 -dnl 64 0.53 -dnl Maximum possible with the current code is 64, the minimum is 2. - -deflit(UNROLL_COUNT, 32) - - -C void mpn_copyd (mp_ptr dst, mp_srcptr src, mp_size_t size); -C -C Copy src,size to dst,size, processing limbs from high to low addresses. -C -C The comments in copyi.asm apply here too. - - -defframe(PARAM_SIZE,12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) -deflit(`FRAME',0) - - .text - ALIGN(32) - -PROLOGUE(mpn_copyd) - movl PARAM_SIZE, %ecx - movl %esi, %eax - - movl PARAM_SRC, %esi - movl %edi, %edx - - std - - movl PARAM_DST, %edi - cmpl $UNROLL_COUNT, %ecx - - leal -4(%esi,%ecx,4), %esi - - leal -4(%edi,%ecx,4), %edi - ja L(unroll) - -L(simple): - rep - movsl - - cld - - movl %eax, %esi - movl %edx, %edi - - ret - - -L(unroll): - C if src and dst are different alignments mod8, then use rep movs - C if src and dst are both 4mod8 then process one limb to get 0mod8 - - pushl %ebx - leal (%esi,%edi), %ebx - - testb $4, %bl - popl %ebx - - jnz L(simple) - testl $4, %esi - - leal -UNROLL_COUNT(%ecx), %ecx - jnz L(already_aligned) - - movsl - - decl %ecx -L(already_aligned): - - -ifelse(UNROLL_BYTES,256,` - subl $128, %esi - subl $128, %edi -') - - C offset 0x3D here, but gets full speed without further alignment -L(top): - C eax saved esi - C ebx - C ecx counter, limbs - C edx saved edi - C esi src, 
incrementing
-	C edi	dst, incrementing
-	C ebp
-	C
-	C `disp' is never 0, so don't need to force 0(%esi).
-
-deflit(CHUNK_COUNT, 2)
-forloop(`i', 0, UNROLL_COUNT/CHUNK_COUNT-1, `
-	deflit(`disp', eval(-4-i*CHUNK_COUNT*4 ifelse(UNROLL_BYTES,256,+128)))
-	movq	disp(%esi), %mm0
-	movq	%mm0, disp(%edi)
-')
-
-	leal	-UNROLL_BYTES(%esi), %esi
-	subl	$UNROLL_COUNT, %ecx
-
-	leal	-UNROLL_BYTES(%edi), %edi
-	jns	L(top)
-
-
-	C now %ecx is -UNROLL_COUNT to -1 representing respectively 0 to
-	C UNROLL_COUNT-1 limbs remaining
-
-	testb	$eval(UNROLL_COUNT/2), %cl
-
-	leal	UNROLL_COUNT(%ecx), %ecx
-	jz	L(not_half)
-
-
-	C at an unroll count of 32 this block of code is 16 cycles faster than
-	C the rep movs, less 3 or 4 to test whether to do it
-
-forloop(`i', 0, UNROLL_COUNT/CHUNK_COUNT/2-1, `
-	deflit(`disp', eval(-4-i*CHUNK_COUNT*4 ifelse(UNROLL_BYTES,256,+128)))
-	movq	disp(%esi), %mm0
-	movq	%mm0, disp(%edi)
-')
-
-	subl	$eval(UNROLL_BYTES/2), %esi
-	subl	$eval(UNROLL_BYTES/2), %edi
-
-	subl	$eval(UNROLL_COUNT/2), %ecx
-L(not_half):
-
-
-ifelse(UNROLL_BYTES,256,`
-	addl	$128, %esi
-	addl	$128, %edi
-')
-
-	rep
-	movsl
-
-	cld
-
-	movl	%eax, %esi
-	movl	%edx, %edi
-
-	femms
-	ret
-
-EPILOGUE()
diff --git a/rts/gmp/mpn/x86/k6/k62mmx/copyi.asm b/rts/gmp/mpn/x86/k6/k62mmx/copyi.asm
deleted file mode 100644
index 215d805f2e..0000000000
--- a/rts/gmp/mpn/x86/k6/k62mmx/copyi.asm
+++ /dev/null
@@ -1,196 +0,0 @@
-dnl  AMD K6-2 mpn_copyi -- copy limb vector, incrementing.
-dnl
-dnl  K6-2: 0.56 or 1.0 cycles/limb (at 32 limbs/loop), depending on data
-dnl  alignment.
-
-
-dnl  Copyright (C) 1999, 2000 Free Software Foundation, Inc.
-dnl
-dnl  This file is part of the GNU MP Library.
-dnl
-dnl  The GNU MP Library is free software; you can redistribute it and/or
-dnl  modify it under the terms of the GNU Lesser General Public License as
-dnl  published by the Free Software Foundation; either version 2.1 of the
-dnl  License, or (at your option) any later version.
-dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -dnl K6-2 aligned: -dnl UNROLL_COUNT cycles/limb -dnl 8 0.75 -dnl 16 0.625 -dnl 32 0.5625 -dnl 64 0.53 -dnl Maximum possible with the current code is 64, the minimum is 2. - -deflit(UNROLL_COUNT, 32) - - -C void mpn_copyi (mp_ptr dst, mp_srcptr src, mp_size_t size); -C -C The MMX loop is faster than a rep movs when src and dst are both 0mod8. -C With one 0mod8 and one 4mod8 it's 1.056 c/l and the rep movs at 1.0 c/l is -C used instead. -C -C mod8 -C src dst -C 0 0 both aligned, use mmx -C 0 4 unaligned, use rep movs -C 4 0 unaligned, use rep movs -C 4 4 do one movs, then both aligned, use mmx -C -C The MMX code on aligned data is 0.5 c/l, plus loop overhead of 2 -C cycles/loop, which is 0.0625 c/l at 32 limbs/loop. -C -C A pattern of two movq loads and two movq stores (or four and four) was -C tried, but found to be the same speed as just one of each. -C -C Note that this code only suits K6-2 and K6-3. Plain K6 does only one mmx -C instruction per cycle, so "movq"s are no faster than the simple 1 c/l rep -C movs. -C -C Enhancement: -C -C Addressing modes like disp(%esi,%ecx,4) aren't currently used. They'd -C make it possible to avoid incrementing %esi and %edi in the loop and hence -C get loop overhead down to 1 cycle. Care would be needed to avoid bad -C cache line crossings since the "movq"s would then be 5 code bytes rather -C than 4. 
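The mod-8 dispatch described above can be expressed compactly in C. This is an illustrative sketch with a made-up name, not GMP's code: limb pointers are 4-byte aligned, so src and dst are each 0 or 4 mod 8; the 8-byte copies (the movq path) only pay off when both end up 0 mod 8, one peeled limb fixes the 4/4 case, and mixed alignment falls through to the plain word copy, standing in for "rep movs".

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

void copyi_sketch(uint32_t *dst, const uint32_t *src, size_t size)
{
    if ((((uintptr_t)src ^ (uintptr_t)dst) & 4) == 0 && size > 1) {
        if (((uintptr_t)src & 4) != 0) {   /* both 4 mod 8: peel one limb */
            *dst++ = *src++;
            size--;
        }
        /* both now 0 mod 8: copy two limbs at a time (movq in the asm) */
        for (; size >= 2; size -= 2, src += 2, dst += 2) {
            uint64_t t;
            memcpy(&t, src, sizeof t);
            memcpy(dst, &t, sizeof t);
        }
    }
    while (size--)                         /* mixed alignment, or the tail */
        *dst++ = *src++;
}
```

The asm tests the sum of the two pointers with `testb $4, %bl`; since both addresses have bits 0-1 clear, bit 2 of the sum equals the xor of the two bit 2s, which is what the sketch checks directly.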
- - -defframe(PARAM_SIZE,12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) -deflit(`FRAME',0) - - .text - ALIGN(32) - -PROLOGUE(mpn_copyi) - movl PARAM_SIZE, %ecx - movl %esi, %eax - - movl PARAM_SRC, %esi - movl %edi, %edx - - cld - - movl PARAM_DST, %edi - cmpl $UNROLL_COUNT, %ecx - - ja L(unroll) - -L(simple): - rep - movsl - - movl %eax, %esi - movl %edx, %edi - - ret - - -L(unroll): - C if src and dst are different alignments mod8, then use rep movs - C if src and dst are both 4mod8 then process one limb to get 0mod8 - - pushl %ebx - leal (%esi,%edi), %ebx - - testb $4, %bl - popl %ebx - - jnz L(simple) - testl $4, %esi - - leal -UNROLL_COUNT(%ecx), %ecx - jz L(already_aligned) - - decl %ecx - - movsl -L(already_aligned): - - -ifelse(UNROLL_BYTES,256,` - addl $128, %esi - addl $128, %edi -') - - C this is offset 0x34, no alignment needed -L(top): - C eax saved esi - C ebx - C ecx counter, limbs - C edx saved edi - C esi src, incrementing - C edi dst, incrementing - C ebp - C - C Zdisp gets 0(%esi) left that way to avoid vector decode, and with - C 0(%edi) keeps code aligned to 16 byte boundaries. 
-deflit(CHUNK_COUNT, 2)
-forloop(`i', 0, UNROLL_COUNT/CHUNK_COUNT-1, `
-	deflit(`disp', eval(i*CHUNK_COUNT*4 ifelse(UNROLL_BYTES,256,-128)))
-Zdisp(	movq,	disp,(%esi), %mm0)
-Zdisp(	movq,	%mm0, disp,(%edi))
-')
-
-	addl	$UNROLL_BYTES, %esi
-	subl	$UNROLL_COUNT, %ecx
-
-	leal	UNROLL_BYTES(%edi), %edi
-	jns	L(top)
-
-
-	C now %ecx is -UNROLL_COUNT to -1 representing respectively 0 to
-	C UNROLL_COUNT-1 limbs remaining
-
-	testb	$eval(UNROLL_COUNT/2), %cl
-
-	leal	UNROLL_COUNT(%ecx), %ecx
-	jz	L(not_half)
-
-	C at an unroll count of 32 this block of code is 16 cycles faster than
-	C the rep movs, less 3 or 4 to test whether to do it
-
-forloop(`i', 0, UNROLL_COUNT/CHUNK_COUNT/2-1, `
-	deflit(`disp', eval(i*CHUNK_COUNT*4 ifelse(UNROLL_BYTES,256,-128)))
-	movq	disp(%esi), %mm0
-	movq	%mm0, disp(%edi)
-')
-	addl	$eval(UNROLL_BYTES/2), %esi
-	addl	$eval(UNROLL_BYTES/2), %edi
-
-	subl	$eval(UNROLL_COUNT/2), %ecx
-L(not_half):
-
-
-ifelse(UNROLL_BYTES,256,`
-	subl	$128, %esi
-	subl	$128, %edi
-')
-
-	rep
-	movsl
-
-	movl	%eax, %esi
-	movl	%edx, %edi
-
-	femms
-	ret
-
-EPILOGUE()
diff --git a/rts/gmp/mpn/x86/k6/k62mmx/lshift.asm b/rts/gmp/mpn/x86/k6/k62mmx/lshift.asm
deleted file mode 100644
index f6d54f97a8..0000000000
--- a/rts/gmp/mpn/x86/k6/k62mmx/lshift.asm
+++ /dev/null
@@ -1,286 +0,0 @@
-dnl  AMD K6-2 mpn_lshift -- mpn left shift.
-dnl
-dnl  K6-2: 1.75 cycles/limb
-
-
-dnl  Copyright (C) 1999, 2000 Free Software Foundation, Inc.
-dnl
-dnl  This file is part of the GNU MP Library.
-dnl
-dnl  The GNU MP Library is free software; you can redistribute it and/or
-dnl  modify it under the terms of the GNU Lesser General Public License as
-dnl  published by the Free Software Foundation; either version 2.1 of the
-dnl  License, or (at your option) any later version.
-dnl
-dnl  The GNU MP Library is distributed in the hope that it will be useful,
-dnl  but WITHOUT ANY WARRANTY; without even the implied warranty of
-dnl  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_lshift (mp_ptr dst, mp_srcptr src, mp_size_t size, -C unsigned shift); -C - -defframe(PARAM_SHIFT,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) -deflit(`FRAME',0) - -dnl used after src has been fetched -define(VAR_RETVAL,`PARAM_SRC') - -dnl minimum 9, because unrolled loop can't handle less -deflit(UNROLL_THRESHOLD, 9) - - .text - ALIGN(32) - -PROLOGUE(mpn_lshift) -deflit(`FRAME',0) - - C The 1 limb case can be done without the push %ebx, but it's then - C still the same speed. The push is left as a free helping hand for - C the two_or_more code. - - movl PARAM_SIZE, %eax - pushl %ebx FRAME_pushl() - - movl PARAM_SRC, %ebx - decl %eax - - movl PARAM_SHIFT, %ecx - jnz L(two_or_more) - - movl (%ebx), %edx C src limb - movl PARAM_DST, %ebx - - shldl( %cl, %edx, %eax) C return value - - shll %cl, %edx - - movl %edx, (%ebx) C dst limb - popl %ebx - - ret - - -C ----------------------------------------------------------------------------- - ALIGN(16) C avoid offset 0x1f -L(two_or_more): - C eax size-1 - C ebx src - C ecx shift - C edx - - movl (%ebx,%eax,4), %edx C src high limb - negl %ecx - - movd PARAM_SHIFT, %mm6 - addl $32, %ecx C 32-shift - - shrl %cl, %edx - cmpl $UNROLL_THRESHOLD-1, %eax - - movl %edx, VAR_RETVAL - jae L(unroll) - - - movd %ecx, %mm7 - movl %eax, %ecx - - movl PARAM_DST, %eax - -L(simple): - C eax dst - C ebx src - C ecx counter, size-1 to 1 - C edx retval - C - C mm0 scratch - C mm6 shift - C mm7 32-shift - - movq -4(%ebx,%ecx,4), %mm0 - - psrlq %mm7, %mm0 - -Zdisp( movd, %mm0, 0,(%eax,%ecx,4)) - loop L(simple) - - - movd (%ebx), %mm0 - popl 
%ebx - - psllq %mm6, %mm0 - - movd %mm0, (%eax) - movl %edx, %eax - - femms - ret - - -C ----------------------------------------------------------------------------- - ALIGN(16) -L(unroll): - C eax size-1 - C ebx src - C ecx 32-shift - C edx retval (but instead VAR_RETVAL is used) - C - C mm6 shift - - addl $32, %ecx - movl PARAM_DST, %edx - - movd %ecx, %mm7 - subl $7, %eax C size-8 - - leal (%edx,%eax,4), %ecx C alignment of dst - - movq 32-8(%ebx,%eax,4), %mm2 C src high qword - testb $4, %cl - - jz L(dst_aligned) - psllq %mm6, %mm2 - - psrlq $32, %mm2 - decl %eax - - movd %mm2, 32(%edx,%eax,4) C dst high limb - movq 32-8(%ebx,%eax,4), %mm2 C new src high qword -L(dst_aligned): - - movq 32-16(%ebx,%eax,4), %mm0 C src second highest qword - - - C This loop is the important bit, the rest is just support for it. - C Four src limbs are held at the start, and four more will be read. - C Four dst limbs will be written. This schedule seems necessary for - C full speed. - C - C The use of size-8 lets the loop stop when %eax goes negative and - C leaves -4 to -1 which can be tested with test $1 and $2. - -L(top): - C eax counter, size-8 step by -4 until <0 - C ebx src - C ecx - C edx dst - C - C mm0 src next qword - C mm1 scratch - C mm2 src prev qword - C mm6 shift - C mm7 64-shift - - psllq %mm6, %mm2 - subl $4, %eax - - movq %mm0, %mm1 - psrlq %mm7, %mm0 - - por %mm0, %mm2 - movq 24(%ebx,%eax,4), %mm0 - - psllq %mm6, %mm1 - movq %mm2, 40(%edx,%eax,4) - - movq %mm0, %mm2 - psrlq %mm7, %mm0 - - por %mm0, %mm1 - movq 16(%ebx,%eax,4), %mm0 - - movq %mm1, 32(%edx,%eax,4) - jnc L(top) - - - C Now have four limbs in mm2 (prev) and mm0 (next), plus eax mod 4. - C - C 8(%ebx) is the next source, and 24(%edx) is the next destination. - C %eax is between -4 and -1, representing respectively 0 to 3 extra - C limbs that must be read. 
- - - testl $2, %eax C testl to avoid bad cache line crossing - jz L(finish_nottwo) - - C Two more limbs: lshift mm2, OR it with rshifted mm0, mm0 becomes - C new mm2 and a new mm0 is loaded. - - psllq %mm6, %mm2 - movq %mm0, %mm1 - - psrlq %mm7, %mm0 - subl $2, %eax - - por %mm0, %mm2 - movq 16(%ebx,%eax,4), %mm0 - - movq %mm2, 32(%edx,%eax,4) - movq %mm1, %mm2 -L(finish_nottwo): - - - C lshift mm2, OR with rshifted mm0, mm1 becomes lshifted mm0 - - testb $1, %al - psllq %mm6, %mm2 - - movq %mm0, %mm1 - psrlq %mm7, %mm0 - - por %mm0, %mm2 - psllq %mm6, %mm1 - - movq %mm2, 24(%edx,%eax,4) - jz L(finish_even) - - - C Size is odd, so mm1 and one extra limb to process. - - movd (%ebx), %mm0 C src[0] - popl %ebx -deflit(`FRAME',0) - - movq %mm0, %mm2 - psllq $32, %mm0 - - psrlq %mm7, %mm0 - - psllq %mm6, %mm2 - por %mm0, %mm1 - - movq %mm1, 4(%edx) C dst[1,2] - movd %mm2, (%edx) C dst[0] - - movl VAR_RETVAL, %eax - - femms - ret - - - nop C avoid bad cache line crossing -L(finish_even): -deflit(`FRAME',4) - C Size is even, so only mm1 left to process. - - movq %mm1, (%edx) C dst[0,1] - movl VAR_RETVAL, %eax - - popl %ebx - femms - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k6/k62mmx/rshift.asm b/rts/gmp/mpn/x86/k6/k62mmx/rshift.asm deleted file mode 100644 index 8a8c144241..0000000000 --- a/rts/gmp/mpn/x86/k6/k62mmx/rshift.asm +++ /dev/null @@ -1,285 +0,0 @@ -dnl AMD K6-2 mpn_rshift -- mpn right shift. -dnl -dnl K6-2: 1.75 cycles/limb - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. 
-dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_rshift (mp_ptr dst, mp_srcptr src, mp_size_t size, -C unsigned shift); -C - -defframe(PARAM_SHIFT,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) -deflit(`FRAME',0) - -dnl Minimum 9, because the unrolled loop can't handle less. -dnl -deflit(UNROLL_THRESHOLD, 9) - - .text - ALIGN(32) - -PROLOGUE(mpn_rshift) -deflit(`FRAME',0) - - C The 1 limb case can be done without the push %ebx, but it's then - C still the same speed. The push is left as a free helping hand for - C the two_or_more code. 
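Before the register-level detail, the overall computation can be stated in C. This is an illustrative sketch under the same conventions as the prototype above (1 <= shift <= 31, size >= 1), with a made-up name, not GMP's reference code: each destination limb combines a limb shifted right with the next-higher limb's low bits, and the return value is the bits shifted out of the low limb, positioned at the top of a limb, which the asm keeps as "retval".

```c
#include <stddef.h>
#include <stdint.h>

uint32_t rshift_sketch(uint32_t *dst, const uint32_t *src, size_t size,
                       unsigned shift)
{
    /* bits shifted out of the low limb, returned at the top of a limb */
    uint32_t retval = src[0] << (32 - shift);
    for (size_t i = 0; i + 1 < size; i++)
        dst[i] = (src[i] >> shift) | (src[i + 1] << (32 - shift));
    dst[size - 1] = src[size - 1] >> shift;   /* high limb: zero-filled */
    return retval;
}
```

The size==1 path in the asm computes exactly this with a single shrdl for the return value and a shrl for the stored limb; the unrolled MMX loop below does the same combine with psrlq/psllq/por on qwords.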
- - movl PARAM_SIZE, %eax - pushl %ebx FRAME_pushl() - - movl PARAM_SRC, %ebx - decl %eax - - movl PARAM_SHIFT, %ecx - jnz L(two_or_more) - - movl (%ebx), %edx C src limb - movl PARAM_DST, %ebx - - shrdl( %cl, %edx, %eax) C return value - - shrl %cl, %edx - - movl %edx, (%ebx) C dst limb - popl %ebx - - ret - - -C ----------------------------------------------------------------------------- - ALIGN(16) C avoid offset 0x1f -L(two_or_more): - C eax size-1 - C ebx src - C ecx shift - C edx - - movl (%ebx), %edx C src low limb - negl %ecx - - addl $32, %ecx - movd PARAM_SHIFT, %mm6 - - shll %cl, %edx - cmpl $UNROLL_THRESHOLD-1, %eax - - jae L(unroll) - - - C eax size-1 - C ebx src - C ecx 32-shift - C edx retval - C - C mm6 shift - - movl PARAM_DST, %ecx - leal (%ebx,%eax,4), %ebx - - leal -4(%ecx,%eax,4), %ecx - negl %eax - - C This loop runs at about 3 cycles/limb, which is the amount of - C decoding, and this is despite every second access being unaligned. - -L(simple): - C eax counter, -(size-1) to -1 - C ebx &src[size-1] - C ecx &dst[size-1] - C edx retval - C - C mm0 scratch - C mm6 shift - -Zdisp( movq, 0,(%ebx,%eax,4), %mm0) - incl %eax - - psrlq %mm6, %mm0 - -Zdisp( movd, %mm0, 0,(%ecx,%eax,4)) - jnz L(simple) - - - movq %mm0, (%ecx) - movl %edx, %eax - - popl %ebx - - femms - ret - - -C ----------------------------------------------------------------------------- - ALIGN(16) -L(unroll): - C eax size-1 - C ebx src - C ecx 32-shift - C edx retval - C - C mm6 shift - - addl $32, %ecx - subl $7, %eax C size-8 - - movd %ecx, %mm7 - movl PARAM_DST, %ecx - - movq (%ebx), %mm2 C src low qword - leal (%ebx,%eax,4), %ebx C src end - 32 - - testb $4, %cl - leal (%ecx,%eax,4), %ecx C dst end - 32 - - notl %eax C -(size-7) - jz L(dst_aligned) - - psrlq %mm6, %mm2 - incl %eax - -Zdisp( movd, %mm2, 0,(%ecx,%eax,4)) C dst low limb - movq 4(%ebx,%eax,4), %mm2 C new src low qword -L(dst_aligned): - - movq 12(%ebx,%eax,4), %mm0 C src second lowest qword - nop C avoid bad cache 
line crossing
-
-
-	C This loop is the important bit, the rest is just support for it.
-	C Four src limbs are held at the start, and four more will be read.
-	C Four dst limbs will be written.  This schedule seems necessary for
-	C full speed.
-	C
-	C The use of -(size-7) lets the loop stop when %eax becomes >= 0 and
-	C leaves 0 to 3 which can be tested with test $1 and $2.
-
-L(top):
-	C eax	counter, -(size-7) step by +4 until >=0
-	C ebx	src end - 32
-	C ecx	dst end - 32
-	C edx	retval
-	C
-	C mm0	src next qword
-	C mm1	scratch
-	C mm2	src prev qword
-	C mm6	shift
-	C mm7	64-shift
-
-	psrlq	%mm6, %mm2
-	addl	$4, %eax
-
-	movq	%mm0, %mm1
-	psllq	%mm7, %mm0
-
-	por	%mm0, %mm2
-	movq	4(%ebx,%eax,4), %mm0
-
-	psrlq	%mm6, %mm1
-	movq	%mm2, -12(%ecx,%eax,4)
-
-	movq	%mm0, %mm2
-	psllq	%mm7, %mm0
-
-	por	%mm0, %mm1
-	movq	12(%ebx,%eax,4), %mm0
-
-	movq	%mm1, -4(%ecx,%eax,4)
-	ja	L(top)		C jump if no carry and not zero
-
-
-
-	C Now have the four limbs in mm2 (low) and mm0 (high), and %eax is 0
-	C to 3 representing respectively 3 to 0 further limbs.
-
-	testl	$2, %eax		C testl to avoid bad cache line crossings
-	jnz	L(finish_nottwo)
-
-	C Two or three extra limbs: rshift mm2, OR it with lshifted mm0, mm0
-	C becomes new mm2 and a new mm0 is loaded. 
- - psrlq %mm6, %mm2 - movq %mm0, %mm1 - - psllq %mm7, %mm0 - addl $2, %eax - - por %mm0, %mm2 - movq 12(%ebx,%eax,4), %mm0 - - movq %mm2, -4(%ecx,%eax,4) - movq %mm1, %mm2 -L(finish_nottwo): - - - testb $1, %al - psrlq %mm6, %mm2 - - movq %mm0, %mm1 - psllq %mm7, %mm0 - - por %mm0, %mm2 - psrlq %mm6, %mm1 - - movq %mm2, 4(%ecx,%eax,4) - jnz L(finish_even) - - - C one further extra limb to process - - movd 32-4(%ebx), %mm0 C src[size-1], most significant limb - popl %ebx - - movq %mm0, %mm2 - psllq %mm7, %mm0 - - por %mm0, %mm1 - psrlq %mm6, %mm2 - - movq %mm1, 32-12(%ecx) C dst[size-3,size-2] - movd %mm2, 32-4(%ecx) C dst[size-1] - - movl %edx, %eax C retval - - femms - ret - - - nop C avoid bad cache line crossing -L(finish_even): - C no further extra limbs - - movq %mm1, 32-8(%ecx) C dst[size-2,size-1] - movl %edx, %eax C retval - - popl %ebx - - femms - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k6/mmx/com_n.asm b/rts/gmp/mpn/x86/k6/mmx/com_n.asm deleted file mode 100644 index 8915080f0f..0000000000 --- a/rts/gmp/mpn/x86/k6/mmx/com_n.asm +++ /dev/null @@ -1,91 +0,0 @@ -dnl AMD K6-2 mpn_com_n -- mpn bitwise one's complement. -dnl -dnl alignment dst/src, A=0mod8 N=4mod8 -dnl A/A A/N N/A N/N -dnl K6-2 1.0 1.18 1.18 1.18 cycles/limb -dnl K6 1.5 1.85 1.75 1.85 - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. 
-dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C void mpn_com_n (mp_ptr dst, mp_srcptr src, mp_size_t size); -C -C Take the bitwise ones-complement of src,size and write it to dst,size. - -defframe(PARAM_SIZE,12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(32) -PROLOGUE(mpn_com_n) -deflit(`FRAME',0) - - movl PARAM_SIZE, %ecx - movl PARAM_SRC, %eax - movl PARAM_DST, %edx - shrl %ecx - jnz L(two_or_more) - - movl (%eax), %eax - notl %eax - movl %eax, (%edx) - ret - - -L(two_or_more): - pushl %ebx -FRAME_pushl() - movl %ecx, %ebx - - pcmpeqd %mm7, %mm7 C all ones - - - ALIGN(16) -L(top): - C eax src - C ebx floor(size/2) - C ecx counter - C edx dst - C esi - C edi - C ebp - - movq -8(%eax,%ecx,8), %mm0 - pxor %mm7, %mm0 - movq %mm0, -8(%edx,%ecx,8) - loop L(top) - - - jnc L(no_extra) - movl (%eax,%ebx,8), %eax - notl %eax - movl %eax, (%edx,%ebx,8) -L(no_extra): - - popl %ebx - emms_or_femms - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k6/mmx/logops_n.asm b/rts/gmp/mpn/x86/k6/mmx/logops_n.asm deleted file mode 100644 index 46cb3b7ea5..0000000000 --- a/rts/gmp/mpn/x86/k6/mmx/logops_n.asm +++ /dev/null @@ -1,212 +0,0 @@ -dnl AMD K6-2 mpn_and_n, mpn_andn_n, mpn_nand_n, mpn_ior_n, mpn_iorn_n, -dnl mpn_nior_n, mpn_xor_n, mpn_xnor_n -- mpn bitwise logical operations. 
-dnl -dnl alignment dst/src1/src2, A=0mod8, N=4mod8 -dnl A/A/A A/A/N A/N/A A/N/N N/A/A N/A/N N/N/A N/N/N -dnl -dnl K6-2 1.2 1.5 1.5 1.2 1.2 1.5 1.5 1.2 and,andn,ior,xor -dnl K6-2 1.5 1.75 2.0 1.75 1.75 2.0 1.75 1.5 iorn,xnor -dnl K6-2 1.75 2.0 2.0 2.0 2.0 2.0 2.0 1.75 nand,nior -dnl -dnl K6 1.5 1.68 1.75 1.2 1.75 1.75 1.68 1.5 and,andn,ior,xor -dnl K6 2.0 2.0 2.25 2.25 2.25 2.25 2.0 2.0 iorn,xnor -dnl K6 2.0 2.25 2.25 2.25 2.25 2.25 2.25 2.0 nand,nior - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. 
- - -include(`../config.m4') - - -dnl M4_p and M4_i are the MMX and integer instructions -dnl M4_*_neg_dst means whether to negate the final result before writing -dnl M4_*_neg_src2 means whether to negate the src2 values before using them - -define(M4_choose_op, -m4_assert_numargs(7) -`ifdef(`OPERATION_$1',` -define(`M4_function', `mpn_$1') -define(`M4_operation', `$1') -define(`M4_p', `$2') -define(`M4_p_neg_dst', `$3') -define(`M4_p_neg_src2',`$4') -define(`M4_i', `$5') -define(`M4_i_neg_dst', `$6') -define(`M4_i_neg_src2',`$7') -')') - -dnl xnor is done in "iorn" style because it's a touch faster than "nior" -dnl style (the two are equivalent for xor). - -M4_choose_op( and_n, pand,0,0, andl,0,0) -M4_choose_op( andn_n, pandn,0,0, andl,0,1) -M4_choose_op( nand_n, pand,1,0, andl,1,0) -M4_choose_op( ior_n, por,0,0, orl,0,0) -M4_choose_op( iorn_n, por,0,1, orl,0,1) -M4_choose_op( nior_n, por,1,0, orl,1,0) -M4_choose_op( xor_n, pxor,0,0, xorl,0,0) -M4_choose_op( xnor_n, pxor,0,1, xorl,0,1) - -ifdef(`M4_function',, -`m4_error(`Unrecognised or undefined OPERATION symbol -')') - -MULFUNC_PROLOGUE(mpn_and_n mpn_andn_n mpn_nand_n mpn_ior_n mpn_iorn_n mpn_nior_n mpn_xor_n mpn_xnor_n) - - -C void M4_function (mp_ptr dst, mp_srcptr src1, mp_srcptr src2, -C mp_size_t size); -C -C Do src1,size M4_operation src2,size, storing the result in dst,size. -C -C Unaligned movq loads and stores are a bit slower than aligned ones. The -C test at the start of the routine checks the alignment of src1 and if -C necessary processes one limb separately at the low end to make it aligned. -C -C The raw speeds without this alignment switch are as follows. -C -C alignment dst/src1/src2, A=0mod8, N=4mod8 -C A/A/A A/A/N A/N/A A/N/N N/A/A N/A/N N/N/A N/N/N -C -C K6 1.5 2.0 1.5 2.0 and,andn,ior,xor -C K6 1.75 2.2 2.0 2.28 iorn,xnor -C K6 2.0 2.25 2.35 2.28 nand,nior -C -C -C Future: -C -C K6 can do one 64-bit load per cycle so each of these routines should be -C able to approach 1.0 c/l, if aligned. 
The basic and/andn/ior/xor might be -C able to get 1.0 with just a 4 limb loop, being 3 instructions per 2 limbs. -C The others are 4 instructions per 2 limbs, and so can only approach 1.0 -C because there's nowhere to hide some loop control. - -defframe(PARAM_SIZE,16) -defframe(PARAM_SRC2,12) -defframe(PARAM_SRC1,8) -defframe(PARAM_DST, 4) -deflit(`FRAME',0) - - .text - ALIGN(32) -PROLOGUE(M4_function) - movl PARAM_SIZE, %ecx - pushl %ebx - FRAME_pushl() - movl PARAM_SRC1, %eax - movl PARAM_SRC2, %ebx - cmpl $1, %ecx - movl PARAM_DST, %edx - ja L(two_or_more) - - - movl (%ebx), %ecx - popl %ebx -ifelse(M4_i_neg_src2,1,`notl %ecx') - M4_i (%eax), %ecx -ifelse(M4_i_neg_dst,1,` notl %ecx') - movl %ecx, (%edx) - - ret - - -L(two_or_more): - C eax src1 - C ebx src2 - C ecx size - C edx dst - C esi - C edi - C ebp - C - C carry bit is low of size - - pushl %esi - FRAME_pushl() - testl $4, %eax - jz L(alignment_ok) - - movl (%ebx), %esi - addl $4, %ebx -ifelse(M4_i_neg_src2,1,`notl %esi') - M4_i (%eax), %esi - addl $4, %eax -ifelse(M4_i_neg_dst,1,` notl %esi') - movl %esi, (%edx) - addl $4, %edx - decl %ecx - -L(alignment_ok): - movl %ecx, %esi - shrl %ecx - jnz L(still_two_or_more) - - movl (%ebx), %ecx - popl %esi -ifelse(M4_i_neg_src2,1,`notl %ecx') - M4_i (%eax), %ecx -ifelse(M4_i_neg_dst,1,` notl %ecx') - popl %ebx - movl %ecx, (%edx) - ret - - -L(still_two_or_more): -ifelse(eval(M4_p_neg_src2 || M4_p_neg_dst),1,` - pcmpeqd %mm7, %mm7 C all ones -') - - ALIGN(16) -L(top): - C eax src1 - C ebx src2 - C ecx counter - C edx dst - C esi - C edi - C ebp - C - C carry bit is low of size - - movq -8(%ebx,%ecx,8), %mm0 -ifelse(M4_p_neg_src2,1,`pxor %mm7, %mm0') - M4_p -8(%eax,%ecx,8), %mm0 -ifelse(M4_p_neg_dst,1,` pxor %mm7, %mm0') - movq %mm0, -8(%edx,%ecx,8) - - loop L(top) - - - jnc L(no_extra) - - movl -4(%ebx,%esi,4), %ebx -ifelse(M4_i_neg_src2,1,`notl %ebx') - M4_i -4(%eax,%esi,4), %ebx -ifelse(M4_i_neg_dst,1,` notl %ebx') - movl %ebx, -4(%edx,%esi,4) -L(no_extra): - 
- popl %esi - popl %ebx - emms_or_femms - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k6/mmx/lshift.asm b/rts/gmp/mpn/x86/k6/mmx/lshift.asm deleted file mode 100644 index f1dc83db46..0000000000 --- a/rts/gmp/mpn/x86/k6/mmx/lshift.asm +++ /dev/null @@ -1,122 +0,0 @@ -dnl AMD K6 mpn_lshift -- mpn left shift. -dnl -dnl K6: 3.0 cycles/limb - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_lshift (mp_ptr dst, mp_srcptr src, mp_size_t size, -C unsigned shift); -C -C The loop runs at 3 cycles/limb, limited by decoding and by having 3 mmx -C instructions. This is despite every second fetch being unaligned. - - -defframe(PARAM_SHIFT,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(32) - -PROLOGUE(mpn_lshift) -deflit(`FRAME',0) - - C The 1 limb case can be done without the push %ebx, but it's then - C still the same speed. The push is left as a free helping hand for - C the two_or_more code. 
- - movl PARAM_SIZE, %eax - pushl %ebx FRAME_pushl() - - movl PARAM_SRC, %ebx - decl %eax - - movl PARAM_SHIFT, %ecx - jnz L(two_or_more) - - movl (%ebx), %edx C src limb - movl PARAM_DST, %ebx - - shldl( %cl, %edx, %eax) C return value - - shll %cl, %edx - - movl %edx, (%ebx) C dst limb - popl %ebx - - ret - - - ALIGN(16) C avoid offset 0x1f - nop C avoid bad cache line crossing -L(two_or_more): - C eax size-1 - C ebx src - C ecx shift - C edx - - movl (%ebx,%eax,4), %edx C src high limb - negl %ecx - - movd PARAM_SHIFT, %mm6 - addl $32, %ecx C 32-shift - - shrl %cl, %edx - - movd %ecx, %mm7 - movl PARAM_DST, %ecx - -L(top): - C eax counter, size-1 to 1 - C ebx src - C ecx dst - C edx retval - C - C mm0 scratch - C mm6 shift - C mm7 32-shift - - movq -4(%ebx,%eax,4), %mm0 - decl %eax - - psrlq %mm7, %mm0 - - movd %mm0, 4(%ecx,%eax,4) - jnz L(top) - - - movd (%ebx), %mm0 - popl %ebx - - psllq %mm6, %mm0 - movl %edx, %eax - - movd %mm0, (%ecx) - - emms - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k6/mmx/popham.asm b/rts/gmp/mpn/x86/k6/mmx/popham.asm deleted file mode 100644 index 2c619252bb..0000000000 --- a/rts/gmp/mpn/x86/k6/mmx/popham.asm +++ /dev/null @@ -1,238 +0,0 @@ -dnl AMD K6-2 mpn_popcount, mpn_hamdist -- mpn bit population count and -dnl hamming distance. -dnl -dnl popcount hamdist -dnl K6-2: 9.0 11.5 cycles/limb -dnl K6: 12.5 13.0 - - -dnl Copyright (C) 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C unsigned long mpn_popcount (mp_srcptr src, mp_size_t size); -C unsigned long mpn_hamdist (mp_srcptr src, mp_srcptr src2, mp_size_t size); -C -C The code here isn't optimal, but it's already a 2x speedup over the plain -C integer mpn/generic/popcount.c,hamdist.c. - - -ifdef(`OPERATION_popcount',, -`ifdef(`OPERATION_hamdist',, -`m4_error(`Need OPERATION_popcount or OPERATION_hamdist -')m4exit(1)')') - -define(HAM, -m4_assert_numargs(1) -`ifdef(`OPERATION_hamdist',`$1')') - -define(POP, -m4_assert_numargs(1) -`ifdef(`OPERATION_popcount',`$1')') - -HAM(` -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC2, 8) -defframe(PARAM_SRC, 4) -define(M4_function,mpn_hamdist) -') -POP(` -defframe(PARAM_SIZE, 8) -defframe(PARAM_SRC, 4) -define(M4_function,mpn_popcount) -') - -MULFUNC_PROLOGUE(mpn_popcount mpn_hamdist) - - -ifdef(`PIC',,` - dnl non-PIC - - DATA - ALIGN(8) - -define(LS, -m4_assert_numargs(1) -`LF(M4_function,`$1')') - -LS(rodata_AAAAAAAAAAAAAAAA): - .long 0xAAAAAAAA - .long 0xAAAAAAAA - -LS(rodata_3333333333333333): - .long 0x33333333 - .long 0x33333333 - -LS(rodata_0F0F0F0F0F0F0F0F): - .long 0x0F0F0F0F - .long 0x0F0F0F0F - -LS(rodata_000000FF000000FF): - .long 0x000000FF - .long 0x000000FF -') - - .text - ALIGN(32) - -POP(`ifdef(`PIC', ` - C avoid shrl crossing a 32-byte boundary - nop')') - -PROLOGUE(M4_function) -deflit(`FRAME',0) - - movl PARAM_SIZE, %ecx - orl %ecx, %ecx - jz L(zero) - -ifdef(`PIC',` - movl $0xAAAAAAAA, %eax - movl $0x33333333, %edx - - movd %eax, %mm7 - movd %edx, %mm6 - - movl $0x0F0F0F0F, %eax - movl $0x000000FF, %edx - - punpckldq %mm7, %mm7 - punpckldq %mm6, %mm6 - - movd %eax, %mm5 - 
movd %edx, %mm4 - - punpckldq %mm5, %mm5 - punpckldq %mm4, %mm4 -',` - - movq LS(rodata_AAAAAAAAAAAAAAAA), %mm7 - movq LS(rodata_3333333333333333), %mm6 - movq LS(rodata_0F0F0F0F0F0F0F0F), %mm5 - movq LS(rodata_000000FF000000FF), %mm4 -') - -define(REG_AAAAAAAAAAAAAAAA, %mm7) -define(REG_3333333333333333, %mm6) -define(REG_0F0F0F0F0F0F0F0F, %mm5) -define(REG_000000FF000000FF, %mm4) - - - movl PARAM_SRC, %eax -HAM(` movl PARAM_SRC2, %edx') - - pxor %mm2, %mm2 C total - - shrl %ecx - jnc L(top) - -Zdisp( movd, 0,(%eax,%ecx,8), %mm1) - -HAM(` -Zdisp( movd, 0,(%edx,%ecx,8), %mm0) - pxor %mm0, %mm1 -') - - incl %ecx - jmp L(loaded) - - - ALIGN(16) -POP(` nop C alignment to avoid crossing 32-byte boundaries') - -L(top): - C eax src - C ebx - C ecx counter, qwords, decrementing - C edx [hamdist] src2 - C - C mm0 (scratch) - C mm1 (scratch) - C mm2 total (low dword) - C mm3 - C mm4 \ - C mm5 | special constants - C mm6 | - C mm7 / - - movq -8(%eax,%ecx,8), %mm1 -HAM(` pxor -8(%edx,%ecx,8), %mm1') - -L(loaded): - movq %mm1, %mm0 - pand REG_AAAAAAAAAAAAAAAA, %mm1 - - psrlq $1, %mm1 -HAM(` nop C code alignment') - - psubd %mm1, %mm0 C bit pairs -HAM(` nop C code alignment') - - - movq %mm0, %mm1 - psrlq $2, %mm0 - - pand REG_3333333333333333, %mm0 - pand REG_3333333333333333, %mm1 - - paddd %mm1, %mm0 C nibbles - - - movq %mm0, %mm1 - psrlq $4, %mm0 - - pand REG_0F0F0F0F0F0F0F0F, %mm0 - pand REG_0F0F0F0F0F0F0F0F, %mm1 - - paddd %mm1, %mm0 C bytes - - movq %mm0, %mm1 - psrlq $8, %mm0 - - - paddb %mm1, %mm0 C words - - - movq %mm0, %mm1 - psrlq $16, %mm0 - - paddd %mm1, %mm0 C dwords - - pand REG_000000FF000000FF, %mm0 - - paddd %mm0, %mm2 C low to total - psrlq $32, %mm0 - - paddd %mm0, %mm2 C high to total - loop L(top) - - - - movd %mm2, %eax - emms_or_femms - ret - -L(zero): - movl $0, %eax - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k6/mmx/rshift.asm b/rts/gmp/mpn/x86/k6/mmx/rshift.asm deleted file mode 100644 index cc5948f26c..0000000000 --- 
a/rts/gmp/mpn/x86/k6/mmx/rshift.asm +++ /dev/null @@ -1,122 +0,0 @@ -dnl AMD K6 mpn_rshift -- mpn right shift. -dnl -dnl K6: 3.0 cycles/limb - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_rshift (mp_ptr dst, mp_srcptr src, mp_size_t size, -C unsigned shift); -C -C The loop runs at 3 cycles/limb, limited by decoding and by having 3 mmx -C instructions. This is despite every second fetch being unaligned. - - -defframe(PARAM_SHIFT,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) -deflit(`FRAME',0) - - .text - ALIGN(32) - -PROLOGUE(mpn_rshift) -deflit(`FRAME',0) - - C The 1 limb case can be done without the push %ebx, but it's then - C still the same speed. The push is left as a free helping hand for - C the two_or_more code. 
- - movl PARAM_SIZE, %eax - pushl %ebx FRAME_pushl() - - movl PARAM_SRC, %ebx - decl %eax - - movl PARAM_SHIFT, %ecx - jnz L(two_or_more) - - movl (%ebx), %edx C src limb - movl PARAM_DST, %ebx - - shrdl( %cl, %edx, %eax) C return value - - shrl %cl, %edx - - movl %edx, (%ebx) C dst limb - popl %ebx - - ret - - - ALIGN(16) C avoid offset 0x1f -L(two_or_more): - C eax size-1 - C ebx src - C ecx shift - C edx - - movl (%ebx), %edx C src low limb - negl %ecx - - addl $32, %ecx C 32-shift - movd PARAM_SHIFT, %mm6 - - shll %cl, %edx C retval - movl PARAM_DST, %ecx - - leal (%ebx,%eax,4), %ebx - - leal -4(%ecx,%eax,4), %ecx - negl %eax - - -L(simple): - C eax counter (negative) - C ebx &src[size-1] - C ecx &dst[size-1] - C edx retval - C - C mm0 scratch - C mm6 shift - -Zdisp( movq, 0,(%ebx,%eax,4), %mm0) - incl %eax - - psrlq %mm6, %mm0 - -Zdisp( movd, %mm0, 0,(%ecx,%eax,4)) - jnz L(simple) - - - movq %mm0, (%ecx) - movl %edx, %eax - - popl %ebx - - emms - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k6/mul_1.asm b/rts/gmp/mpn/x86/k6/mul_1.asm deleted file mode 100644 index c2220fe4ca..0000000000 --- a/rts/gmp/mpn/x86/k6/mul_1.asm +++ /dev/null @@ -1,272 +0,0 @@ -dnl AMD K6 mpn_mul_1 -- mpn by limb multiply. -dnl -dnl K6: 6.25 cycles/limb. - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. 
-dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_mul_1 (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t multiplier); -C mp_limb_t mpn_mul_1c (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t multiplier, mp_limb_t carry); -C -C Multiply src,size by mult and store the result in dst,size. -C Return the carry limb from the top of the result. -C -C mpn_mul_1c() accepts an initial carry for the calculation, it's added into -C the low limb of the result. - -defframe(PARAM_CARRY, 20) -defframe(PARAM_MULTIPLIER,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - -dnl minimum 5 because the unrolled code can't handle less -deflit(UNROLL_THRESHOLD, 5) - - .text - ALIGN(32) - -PROLOGUE(mpn_mul_1c) - pushl %esi -deflit(`FRAME',4) - movl PARAM_CARRY, %esi - jmp LF(mpn_mul_1,start_nc) -EPILOGUE() - - -PROLOGUE(mpn_mul_1) - push %esi -deflit(`FRAME',4) - xorl %esi, %esi C initial carry - -L(start_nc): - mov PARAM_SIZE, %ecx - push %ebx -FRAME_pushl() - - movl PARAM_SRC, %ebx - push %edi -FRAME_pushl() - - movl PARAM_DST, %edi - pushl %ebp -FRAME_pushl() - - cmpl $UNROLL_THRESHOLD, %ecx - movl PARAM_MULTIPLIER, %ebp - - jae L(unroll) - - - C code offset 0x22 here, close enough to aligned -L(simple): - C eax scratch - C ebx src - C ecx counter - C edx scratch - C esi carry - C edi dst - C ebp multiplier - C - C this loop 8 cycles/limb - - movl (%ebx), %eax - addl $4, %ebx - - mull %ebp - - addl %esi, %eax - movl $0, %esi - - adcl %edx, %esi - - movl %eax, (%edi) - addl $4, %edi - - loop L(simple) - - - popl %ebp - - popl %edi - popl %ebx - - movl %esi, %eax - popl %esi - - ret - - -C ----------------------------------------------------------------------------- -C The code for each 
limb is 6 cycles, with instruction decoding being the -C limiting factor. At 4 limbs/loop and 1 cycle/loop of overhead it's 6.25 -C cycles/limb in total. -C -C The secret ingredient to get 6.25 is to start the loop with the mul and -C have the load/store pair at the end. Rotating the load/store to the top -C is an 0.5 c/l slowdown. (Some address generation effect probably.) -C -C The whole unrolled loop fits nicely in exactly 80 bytes. - - - ALIGN(16) C already aligned to 16 here actually -L(unroll): - movl (%ebx), %eax - leal -16(%ebx,%ecx,4), %ebx - - leal -16(%edi,%ecx,4), %edi - subl $4, %ecx - - negl %ecx - - - ALIGN(16) C one byte nop for this alignment -L(top): - C eax scratch - C ebx &src[size-4] - C ecx counter - C edx scratch - C esi carry - C edi &dst[size-4] - C ebp multiplier - - mull %ebp - - addl %esi, %eax - movl $0, %esi - - adcl %edx, %esi - - movl %eax, (%edi,%ecx,4) - movl 4(%ebx,%ecx,4), %eax - - - mull %ebp - - addl %esi, %eax - movl $0, %esi - - adcl %edx, %esi - - movl %eax, 4(%edi,%ecx,4) - movl 8(%ebx,%ecx,4), %eax - - - mull %ebp - - addl %esi, %eax - movl $0, %esi - - adcl %edx, %esi - - movl %eax, 8(%edi,%ecx,4) - movl 12(%ebx,%ecx,4), %eax - - - mull %ebp - - addl %esi, %eax - movl $0, %esi - - adcl %edx, %esi - - movl %eax, 12(%edi,%ecx,4) - movl 16(%ebx,%ecx,4), %eax - - - addl $4, %ecx - js L(top) - - - - C eax next src limb - C ebx &src[size-4] - C ecx 0 to 3 representing respectively 4 to 1 further limbs - C edx - C esi carry - C edi &dst[size-4] - - testb $2, %cl - jnz L(finish_not_two) - - mull %ebp - - addl %esi, %eax - movl $0, %esi - - adcl %edx, %esi - - movl %eax, (%edi,%ecx,4) - movl 4(%ebx,%ecx,4), %eax - - - mull %ebp - - addl %esi, %eax - movl $0, %esi - - adcl %edx, %esi - - movl %eax, 4(%edi,%ecx,4) - movl 8(%ebx,%ecx,4), %eax - - addl $2, %ecx -L(finish_not_two): - - - testb $1, %cl - jnz L(finish_not_one) - - mull %ebp - - addl %esi, %eax - movl $0, %esi - - adcl %edx, %esi - - movl %eax, 8(%edi) - movl 12(%ebx), 
%eax -L(finish_not_one): - - - mull %ebp - - addl %esi, %eax - popl %ebp - - adcl $0, %edx - - movl %eax, 12(%edi) - popl %edi - - popl %ebx - movl %edx, %eax - - popl %esi - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k6/mul_basecase.asm b/rts/gmp/mpn/x86/k6/mul_basecase.asm deleted file mode 100644 index 1f5a3a4b4b..0000000000 --- a/rts/gmp/mpn/x86/k6/mul_basecase.asm +++ /dev/null @@ -1,600 +0,0 @@ -dnl AMD K6 mpn_mul_basecase -- multiply two mpn numbers. -dnl -dnl K6: approx 9.0 cycles per cross product on 30x30 limbs (with 16 limbs/loop -dnl unrolling). - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -dnl K6: UNROLL_COUNT cycles/product (approx) -dnl 8 9.75 -dnl 16 9.3 -dnl 32 9.3 -dnl Maximum possible with the current code is 32. -dnl -dnl With 16 the inner unrolled loop fits exactly in a 256 byte block, which -dnl might explain its good performance. 
- -deflit(UNROLL_COUNT, 16) - - -C void mpn_mul_basecase (mp_ptr wp, -C mp_srcptr xp, mp_size_t xsize, -C mp_srcptr yp, mp_size_t ysize); -C -C Calculate xp,xsize multiplied by yp,ysize, storing the result in -C wp,xsize+ysize. -C -C This routine is essentially the same as mpn/generic/mul_basecase.c, but -C it's faster because it does most of the mpn_addmul_1() entry code only -C once. The saving is about 10-20% on typical sizes coming from the -C Karatsuba multiply code. -C -C Future: -C -C The unrolled loop could be shared by mpn_addmul_1, with some extra stack -C setups and maybe 2 or 3 wasted cycles at the end. Code saving would be -C 256 bytes. - -ifdef(`PIC',` -deflit(UNROLL_THRESHOLD, 8) -',` -deflit(UNROLL_THRESHOLD, 8) -') - -defframe(PARAM_YSIZE,20) -defframe(PARAM_YP, 16) -defframe(PARAM_XSIZE,12) -defframe(PARAM_XP, 8) -defframe(PARAM_WP, 4) - - .text - ALIGN(32) -PROLOGUE(mpn_mul_basecase) -deflit(`FRAME',0) - - movl PARAM_XSIZE, %ecx - movl PARAM_YP, %eax - - movl PARAM_XP, %edx - movl (%eax), %eax C yp low limb - - cmpl $2, %ecx - ja L(xsize_more_than_two_limbs) - je L(two_by_something) - - - C one limb by one limb - - movl (%edx), %edx C xp low limb - movl PARAM_WP, %ecx - - mull %edx - - movl %eax, (%ecx) - movl %edx, 4(%ecx) - ret - - -C ----------------------------------------------------------------------------- -L(two_by_something): - decl PARAM_YSIZE - pushl %ebx -deflit(`FRAME',4) - - movl PARAM_WP, %ebx - pushl %esi -deflit(`FRAME',8) - - movl %eax, %ecx C yp low limb - movl (%edx), %eax C xp low limb - - movl %edx, %esi C xp - jnz L(two_by_two) - - - C two limbs by one limb - - mull %ecx - - movl %eax, (%ebx) - movl 4(%esi), %eax - - movl %edx, %esi C carry - - mull %ecx - - addl %eax, %esi - movl %esi, 4(%ebx) - - adcl $0, %edx - - movl %edx, 8(%ebx) - popl %esi - - popl %ebx - ret - - - -C ----------------------------------------------------------------------------- - ALIGN(16) -L(two_by_two): - C eax xp low limb - C ebx wp - C ecx yp low 
limb - C edx - C esi xp - C edi - C ebp -deflit(`FRAME',8) - - mull %ecx C xp[0] * yp[0] - - push %edi -deflit(`FRAME',12) - movl %eax, (%ebx) - - movl 4(%esi), %eax - movl %edx, %edi C carry, for wp[1] - - mull %ecx C xp[1] * yp[0] - - addl %eax, %edi - movl PARAM_YP, %ecx - - adcl $0, %edx - - movl %edi, 4(%ebx) - movl 4(%ecx), %ecx C yp[1] - - movl 4(%esi), %eax C xp[1] - movl %edx, %edi C carry, for wp[2] - - mull %ecx C xp[1] * yp[1] - - addl %eax, %edi - - adcl $0, %edx - - movl (%esi), %eax C xp[0] - movl %edx, %esi C carry, for wp[3] - - mull %ecx C xp[0] * yp[1] - - addl %eax, 4(%ebx) - adcl %edx, %edi - adcl $0, %esi - - movl %edi, 8(%ebx) - popl %edi - - movl %esi, 12(%ebx) - popl %esi - - popl %ebx - ret - - -C ----------------------------------------------------------------------------- - ALIGN(16) -L(xsize_more_than_two_limbs): - -C The first limb of yp is processed with a simple mpn_mul_1 style loop -C inline. Unrolling this doesn't seem worthwhile since it's only run once -C (whereas the addmul below is run ysize-1 many times). A call to the -C actual mpn_mul_1 will be slowed down by the call and parameter pushing and -C popping, and doesn't seem likely to be worthwhile on the typical 10-20 -C limb operations the Karatsuba code calls here with. 
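The inline mul1 loop described above — each xp limb times yp[0], with one carry limb rippling through — can be sketched in C as follows (uint32_t limbs with a 64-bit intermediate product; GMP's real code works at the native limb size):

```c
#include <stdint.h>
#include <stddef.h>

/* wp[0..size-1] = xp[0..size-1] * y0, returning the carry-out limb,
   i.e. what the mul1 loop keeps in %esi and stores at the top. */
static uint32_t ref_mul_1(uint32_t *wp, const uint32_t *xp,
                          size_t size, uint32_t y0)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < size; i++) {
        uint64_t p = (uint64_t)xp[i] * y0 + carry; /* mull; addl %esi */
        wp[i] = (uint32_t)p;                       /* low limb to dst */
        carry = (uint32_t)(p >> 32);               /* adcl %edx, %esi */
    }
    return carry;
}
```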
- - C eax yp[0] - C ebx - C ecx xsize - C edx xp - C esi - C edi - C ebp -deflit(`FRAME',0) - - pushl %edi defframe_pushl(SAVE_EDI) - pushl %ebp defframe_pushl(SAVE_EBP) - - movl PARAM_WP, %edi - pushl %esi defframe_pushl(SAVE_ESI) - - movl %eax, %ebp - pushl %ebx defframe_pushl(SAVE_EBX) - - leal (%edx,%ecx,4), %ebx C xp end - xorl %esi, %esi - - leal (%edi,%ecx,4), %edi C wp end of mul1 - negl %ecx - - -L(mul1): - C eax scratch - C ebx xp end - C ecx counter, negative - C edx scratch - C esi carry - C edi wp end of mul1 - C ebp multiplier - - movl (%ebx,%ecx,4), %eax - - mull %ebp - - addl %esi, %eax - movl $0, %esi - - adcl %edx, %esi - - movl %eax, (%edi,%ecx,4) - incl %ecx - - jnz L(mul1) - - - movl PARAM_YSIZE, %edx - movl %esi, (%edi) C final carry - - movl PARAM_XSIZE, %ecx - decl %edx - - jnz L(ysize_more_than_one_limb) - - popl %ebx - popl %esi - popl %ebp - popl %edi - ret - - -L(ysize_more_than_one_limb): - cmpl $UNROLL_THRESHOLD, %ecx - movl PARAM_YP, %eax - - jae L(unroll) - - -C ----------------------------------------------------------------------------- -C Simple addmul loop. -C -C Using ebx and edi pointing at the ends of their respective locations saves -C a couple of instructions in the outer loop. The inner loop is still 11 -C cycles, the same as the simple loop in aorsmul_1.asm. 
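The simple addmul pass that the outer loop runs for each remaining yp limb accumulates into the destination rather than storing; a C sketch under the same uint32_t-limb assumption:

```c
#include <stdint.h>
#include <stddef.h>

/* wp[0..size-1] += xp[0..size-1] * v, returning the carry-out limb:
   the operation each pass of the simple addmul loop performs. */
static uint32_t ref_addmul_1(uint32_t *wp, const uint32_t *xp,
                             size_t size, uint32_t v)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < size; i++) {
        uint64_t p = (uint64_t)xp[i] * v + carry;   /* mull; addl %esi */
        uint64_t s = (uint64_t)wp[i] + (uint32_t)p; /* addl %eax,(%edi,...) */
        wp[i] = (uint32_t)s;
        carry = (uint32_t)(p >> 32) + (uint32_t)(s >> 32); /* adcl %edx */
    }
    return carry;
}
```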
- - C eax yp - C ebx xp end - C ecx xsize - C edx ysize-1 - C esi - C edi wp end of mul1 - C ebp - - movl 4(%eax), %ebp C multiplier - negl %ecx - - movl %ecx, PARAM_XSIZE C -xsize - xorl %esi, %esi C initial carry - - leal 4(%eax,%edx,4), %eax C yp end - negl %edx - - movl %eax, PARAM_YP - movl %edx, PARAM_YSIZE - - jmp L(simple_outer_entry) - - - C aligning here saves a couple of cycles - ALIGN(16) -L(simple_outer_top): - C edx ysize counter, negative - - movl PARAM_YP, %eax C yp end - xorl %esi, %esi C carry - - movl PARAM_XSIZE, %ecx C -xsize - movl %edx, PARAM_YSIZE - - movl (%eax,%edx,4), %ebp C yp limb multiplier -L(simple_outer_entry): - addl $4, %edi - - -L(simple_inner): - C eax scratch - C ebx xp end - C ecx counter, negative - C edx scratch - C esi carry - C edi wp end of this addmul - C ebp multiplier - - movl (%ebx,%ecx,4), %eax - - mull %ebp - - addl %esi, %eax - movl $0, %esi - - adcl $0, %edx - addl %eax, (%edi,%ecx,4) - adcl %edx, %esi - - incl %ecx - jnz L(simple_inner) - - - movl PARAM_YSIZE, %edx - movl %esi, (%edi) - - incl %edx - jnz L(simple_outer_top) - - - popl %ebx - popl %esi - popl %ebp - popl %edi - ret - - -C ----------------------------------------------------------------------------- -C Unrolled loop. -C -C The unrolled inner loop is the same as in aorsmul_1.asm, see that code for -C some comments. -C -C VAR_COUNTER is for the inner loop, running from VAR_COUNTER_INIT down to -C 0, inclusive. -C -C VAR_JMP is the computed jump into the unrolled loop. -C -C PARAM_XP and PARAM_WP get offset appropriately for where the unrolled loop -C is entered. -C -C VAR_XP_LOW is the least significant limb of xp, which is needed at the -C start of the unrolled loop. This can't just be fetched through the xp -C pointer because of the offset applied to it. -C -C PARAM_YSIZE is the outer loop counter, going from -(ysize-1) up to -1, -C inclusive. 
-C -C PARAM_YP is offset appropriately so that the PARAM_YSIZE counter can be -C added to give the location of the next limb of yp, which is the multiplier -C in the unrolled loop. -C -C PARAM_WP is similarly offset so that the PARAM_YSIZE counter can be added -C to give the starting point in the destination for each unrolled loop (this -C point is one limb upwards for each limb of yp processed). -C -C Having PARAM_YSIZE count negative to zero means it's not necessary to -C store new values of PARAM_YP and PARAM_WP on each loop. Those values on -C the stack remain constant and on each loop an leal adjusts them with the -C PARAM_YSIZE counter value. - - -defframe(VAR_COUNTER, -20) -defframe(VAR_COUNTER_INIT, -24) -defframe(VAR_JMP, -28) -defframe(VAR_XP_LOW, -32) -deflit(VAR_STACK_SPACE, 16) - -dnl For some strange reason using (%esp) instead of 0(%esp) is a touch -dnl slower in this code, hence the defframe empty-if-zero feature is -dnl disabled. -dnl -dnl If VAR_COUNTER is at (%esp), the effect is worse. In this case the -dnl unrolled loop is 255 instead of 256 bytes, but quite how this affects -dnl anything isn't clear. 
-dnl -define(`defframe_empty_if_zero_disabled',1) - -L(unroll): - C eax yp (not used) - C ebx xp end (not used) - C ecx xsize - C edx ysize-1 - C esi - C edi wp end of mul1 (not used) - C ebp -deflit(`FRAME', 16) - - leal -2(%ecx), %ebp C one limb processed at start, - decl %ecx C and ebp is one less - - shrl $UNROLL_LOG2, %ebp - negl %ecx - - subl $VAR_STACK_SPACE, %esp -deflit(`FRAME', 16+VAR_STACK_SPACE) - andl $UNROLL_MASK, %ecx - - movl %ecx, %esi - shll $4, %ecx - - movl %ebp, VAR_COUNTER_INIT - negl %esi - - C 15 code bytes per limb -ifdef(`PIC',` - call L(pic_calc) -L(unroll_here): -',` - leal L(unroll_entry) (%ecx,%esi,1), %ecx -') - - movl PARAM_XP, %ebx - movl %ebp, VAR_COUNTER - - movl PARAM_WP, %edi - movl %ecx, VAR_JMP - - movl (%ebx), %eax - leal 4(%edi,%esi,4), %edi C wp adjust for unrolling and mul1 - - leal (%ebx,%esi,4), %ebx C xp adjust for unrolling - - movl %eax, VAR_XP_LOW - - movl %ebx, PARAM_XP - movl PARAM_YP, %ebx - - leal (%edi,%edx,4), %ecx C wp adjust for ysize indexing - movl 4(%ebx), %ebp C multiplier (yp second limb) - - leal 4(%ebx,%edx,4), %ebx C yp adjust for ysize indexing - - movl %ecx, PARAM_WP - - leal 1(%esi), %ecx C adjust parity for decl %ecx above - - movl %ebx, PARAM_YP - negl %edx - - movl %edx, PARAM_YSIZE - jmp L(unroll_outer_entry) - - -ifdef(`PIC',` -L(pic_calc): - C See README.family about old gas bugs - leal (%ecx,%esi,1), %ecx - addl $L(unroll_entry)-L(unroll_here), %ecx - addl (%esp), %ecx - ret -') - - -C ----------------------------------------------------------------------------- - C Aligning here saves a couple of cycles per loop. Using 32 doesn't - C cost any extra space, since the inner unrolled loop below is - C aligned to 32. 
- ALIGN(32) -L(unroll_outer_top): - C edx ysize - - movl PARAM_YP, %eax - movl %edx, PARAM_YSIZE C incremented ysize counter - - movl PARAM_WP, %edi - - movl VAR_COUNTER_INIT, %ebx - movl (%eax,%edx,4), %ebp C next multiplier - - movl PARAM_XSIZE, %ecx - leal (%edi,%edx,4), %edi C adjust wp for where we are in yp - - movl VAR_XP_LOW, %eax - movl %ebx, VAR_COUNTER - -L(unroll_outer_entry): - mull %ebp - - C using testb is a tiny bit faster than testl - testb $1, %cl - - movl %eax, %ecx C low carry - movl VAR_JMP, %eax - - movl %edx, %esi C high carry - movl PARAM_XP, %ebx - - jnz L(unroll_noswap) - movl %ecx, %esi C high,low carry other way around - - movl %edx, %ecx -L(unroll_noswap): - - jmp *%eax - - - -C ----------------------------------------------------------------------------- - ALIGN(32) -L(unroll_top): - C eax scratch - C ebx xp - C ecx carry low - C edx scratch - C esi carry high - C edi wp - C ebp multiplier - C VAR_COUNTER loop counter - C - C 15 code bytes each limb - - leal UNROLL_BYTES(%edi), %edi - -L(unroll_entry): -deflit(CHUNK_COUNT,2) -forloop(`i', 0, UNROLL_COUNT/CHUNK_COUNT-1, ` - deflit(`disp0', eval(i*CHUNK_COUNT*4)) - deflit(`disp1', eval(disp0 + 4)) - deflit(`disp2', eval(disp1 + 4)) - - movl disp1(%ebx), %eax - mull %ebp -Zdisp( addl, %ecx, disp0,(%edi)) - adcl %eax, %esi - movl %edx, %ecx - jadcl0( %ecx) - - movl disp2(%ebx), %eax - mull %ebp - addl %esi, disp1(%edi) - adcl %eax, %ecx - movl %edx, %esi - jadcl0( %esi) -') - - decl VAR_COUNTER - leal UNROLL_BYTES(%ebx), %ebx - - jns L(unroll_top) - - - movl PARAM_YSIZE, %edx - addl %ecx, UNROLL_BYTES(%edi) - - adcl $0, %esi - - incl %edx - movl %esi, UNROLL_BYTES+4(%edi) - - jnz L(unroll_outer_top) - - - movl SAVE_ESI, %esi - movl SAVE_EBP, %ebp - movl SAVE_EDI, %edi - movl SAVE_EBX, %ebx - - addl $FRAME, %esp - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k6/sqr_basecase.asm b/rts/gmp/mpn/x86/k6/sqr_basecase.asm deleted file mode 100644 index 70d49b3e57..0000000000 --- 
a/rts/gmp/mpn/x86/k6/sqr_basecase.asm +++ /dev/null @@ -1,672 +0,0 @@ -dnl AMD K6 mpn_sqr_basecase -- square an mpn number. -dnl -dnl K6: approx 4.7 cycles per cross product, or 9.2 cycles per triangular -dnl product (measured on the speed difference between 17 and 33 limbs, -dnl which is roughly the Karatsuba recursing range). - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -dnl KARATSUBA_SQR_THRESHOLD_MAX is the maximum KARATSUBA_SQR_THRESHOLD this -dnl code supports. This value is used only by the tune program to know -dnl what it can go up to. (An attempt to compile with a bigger value will -dnl trigger some m4_assert()s in the code, making the build fail.) -dnl -dnl The value is determined by requiring the displacements in the unrolled -dnl addmul to fit in single bytes. This means a maximum UNROLL_COUNT of -dnl 63, giving a maximum KARATSUBA_SQR_THRESHOLD of 66. - -deflit(KARATSUBA_SQR_THRESHOLD_MAX, 66) - - -dnl Allow a value from the tune program to override config.m4. 
- -ifdef(`KARATSUBA_SQR_THRESHOLD_OVERRIDE', -`define(`KARATSUBA_SQR_THRESHOLD',KARATSUBA_SQR_THRESHOLD_OVERRIDE)') - - -dnl UNROLL_COUNT is the number of code chunks in the unrolled addmul. The -dnl number required is determined by KARATSUBA_SQR_THRESHOLD, since -dnl mpn_sqr_basecase only needs to handle sizes < KARATSUBA_SQR_THRESHOLD. -dnl -dnl The first addmul is the biggest, and this takes the second least -dnl significant limb and multiplies it by the third least significant and -dnl up. Hence for a maximum operand size of KARATSUBA_SQR_THRESHOLD-1 -dnl limbs, UNROLL_COUNT needs to be KARATSUBA_SQR_THRESHOLD-3. - -m4_config_gmp_mparam(`KARATSUBA_SQR_THRESHOLD') -deflit(UNROLL_COUNT, eval(KARATSUBA_SQR_THRESHOLD-3)) - - -C void mpn_sqr_basecase (mp_ptr dst, mp_srcptr src, mp_size_t size); -C -C The algorithm is essentially the same as mpn/generic/sqr_basecase.c, but a -C lot of function call overheads are avoided, especially when the given size -C is small. -C -C The code size might look a bit excessive, but not all of it is executed -C and so won't fill up the code cache. The 1x1, 2x2 and 3x3 special cases -C clearly apply only to those sizes; mid sizes like 10x10 only need part of -C the unrolled addmul; and big sizes like 35x35 that do need all of it will -C at least be getting value for money, because 35x35 spends something like -C 5780 cycles here. -C -C Different values of UNROLL_COUNT give slightly different speeds, between -C 9.0 and 9.2 c/tri-prod measured on the difference between 17 and 33 limbs. -C This isn't a big difference, but it's presumably some alignment effect -C which if understood could give a simple speedup. 
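The header comment above outlines the same staging as mpn/generic/sqr_basecase.c: a mul_1 pass for src[0], a triangle of addmul passes, a left shift to double the cross products, then the squares added on the diagonal. As an illustrative sketch only — portable C with 32-bit limbs, not the deleted asm or GMP's actual code, and `sqr_basecase_c` is a made-up name:

```c
#include <stdint.h>
#include <assert.h>

typedef uint32_t limb;

/* dst[0..2n-1] = src[0..n-1] squared, schoolbook, n >= 1.
   Same staging as the asm: mul_1, addmul triangle, double, diagonal. */
static void sqr_basecase_c(limb *dst, const limb *src, int n)
{
    uint64_t c = 0;
    int i, k;

    /* mul_1: dst[1..n] = src[0] * src[1..n-1] */
    for (i = 1; i < n; i++) {
        uint64_t p = (uint64_t)src[0] * src[i] + c;
        dst[i] = (limb)p;
        c = p >> 32;
    }
    dst[n] = (limb)c;

    /* addmul triangle: add src[k] * src[k+1..n-1] at dst[2k+1..] */
    for (k = 1; k < n - 1; k++) {
        c = 0;
        for (i = k + 1; i < n; i++) {
            uint64_t p = (uint64_t)src[k] * src[i] + dst[k + i] + c;
            dst[k + i] = (limb)p;
            c = p >> 32;
        }
        dst[k + n] = (limb)c;
    }

    /* double the cross products: shift dst[1..2n-2] left one bit,
       the bit shifted out becomes dst[2n-1] */
    limb bit = 0;
    for (i = 1; i <= 2 * n - 2; i++) {
        limb top = dst[i] >> 31;
        dst[i] = (dst[i] << 1) | bit;
        bit = top;
    }
    dst[2 * n - 1] = bit;

    /* add src[i]^2 on the diagonal; dst[0] is first set here,
       getting the low limb of src[0]^2 */
    c = 0;
    for (i = 0; i < n; i++) {
        uint64_t s = (uint64_t)src[i] * src[i];
        uint64_t lo = (uint64_t)(i ? dst[2 * i] : 0) + (limb)s + c;
        uint64_t hi = (uint64_t)dst[2 * i + 1] + (limb)(s >> 32) + (lo >> 32);
        dst[2 * i] = (limb)lo;
        dst[2 * i + 1] = (limb)hi;
        c = hi >> 32;
    }
}
```

The `i ? dst[2*i] : 0` guard reflects the note later in the file that dst[0] is untouched until the diagonal step.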
- -defframe(PARAM_SIZE,12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(32) -PROLOGUE(mpn_sqr_basecase) -deflit(`FRAME',0) - - movl PARAM_SIZE, %ecx - movl PARAM_SRC, %eax - - cmpl $2, %ecx - je L(two_limbs) - - movl PARAM_DST, %edx - ja L(three_or_more) - - -C ----------------------------------------------------------------------------- -C one limb only - C eax src - C ebx - C ecx size - C edx dst - - movl (%eax), %eax - movl %edx, %ecx - - mull %eax - - movl %eax, (%ecx) - movl %edx, 4(%ecx) - ret - - -C ----------------------------------------------------------------------------- - ALIGN(16) -L(two_limbs): - C eax src - C ebx - C ecx size - C edx dst - - pushl %ebx - movl %eax, %ebx C src -deflit(`FRAME',4) - - movl (%ebx), %eax - movl PARAM_DST, %ecx - - mull %eax C src[0]^2 - - movl %eax, (%ecx) - movl 4(%ebx), %eax - - movl %edx, 4(%ecx) - - mull %eax C src[1]^2 - - movl %eax, 8(%ecx) - movl (%ebx), %eax - - movl %edx, 12(%ecx) - movl 4(%ebx), %edx - - mull %edx C src[0]*src[1] - - addl %eax, 4(%ecx) - - adcl %edx, 8(%ecx) - adcl $0, 12(%ecx) - - popl %ebx - addl %eax, 4(%ecx) - - adcl %edx, 8(%ecx) - adcl $0, 12(%ecx) - - ret - - -C ----------------------------------------------------------------------------- -L(three_or_more): -deflit(`FRAME',0) - cmpl $4, %ecx - jae L(four_or_more) - - -C ----------------------------------------------------------------------------- -C three limbs - C eax src - C ecx size - C edx dst - - pushl %ebx - movl %eax, %ebx C src - - movl (%ebx), %eax - movl %edx, %ecx C dst - - mull %eax C src[0] ^ 2 - - movl %eax, (%ecx) - movl 4(%ebx), %eax - - movl %edx, 4(%ecx) - pushl %esi - - mull %eax C src[1] ^ 2 - - movl %eax, 8(%ecx) - movl 8(%ebx), %eax - - movl %edx, 12(%ecx) - pushl %edi - - mull %eax C src[2] ^ 2 - - movl %eax, 16(%ecx) - movl (%ebx), %eax - - movl %edx, 20(%ecx) - movl 4(%ebx), %edx - - mull %edx C src[0] * src[1] - - movl %eax, %esi - movl (%ebx), %eax - - movl %edx, %edi - movl 8(%ebx), %edx - 
- pushl %ebp - xorl %ebp, %ebp - - mull %edx C src[0] * src[2] - - addl %eax, %edi - movl 4(%ebx), %eax - - adcl %edx, %ebp - - movl 8(%ebx), %edx - - mull %edx C src[1] * src[2] - - addl %eax, %ebp - - adcl $0, %edx - - - C eax will be dst[5] - C ebx - C ecx dst - C edx dst[4] - C esi dst[1] - C edi dst[2] - C ebp dst[3] - - xorl %eax, %eax - addl %esi, %esi - adcl %edi, %edi - adcl %ebp, %ebp - adcl %edx, %edx - adcl $0, %eax - - addl %esi, 4(%ecx) - adcl %edi, 8(%ecx) - adcl %ebp, 12(%ecx) - - popl %ebp - popl %edi - - adcl %edx, 16(%ecx) - - popl %esi - popl %ebx - - adcl %eax, 20(%ecx) - ASSERT(nc) - - ret - - -C ----------------------------------------------------------------------------- - -defframe(SAVE_EBX, -4) -defframe(SAVE_ESI, -8) -defframe(SAVE_EDI, -12) -defframe(SAVE_EBP, -16) -defframe(VAR_COUNTER,-20) -defframe(VAR_JMP, -24) -deflit(STACK_SPACE, 24) - - ALIGN(16) -L(four_or_more): - - C eax src - C ebx - C ecx size - C edx dst - C esi - C edi - C ebp - -C First multiply src[0]*src[1..size-1] and store at dst[1..size]. -C -C A test was done calling mpn_mul_1 here to get the benefit of its unrolled -C loop, but this was only a tiny speedup; at 35 limbs it took 24 cycles off -C a 5780 cycle operation, which is not surprising since the loop here is 8 -C c/l and mpn_mul_1 is 6.25 c/l. 
- - subl $STACK_SPACE, %esp deflit(`FRAME',STACK_SPACE) - - movl %edi, SAVE_EDI - leal 4(%edx), %edi - - movl %ebx, SAVE_EBX - leal 4(%eax), %ebx - - movl %esi, SAVE_ESI - xorl %esi, %esi - - movl %ebp, SAVE_EBP - - C eax - C ebx src+4 - C ecx size - C edx - C esi - C edi dst+4 - C ebp - - movl (%eax), %ebp C multiplier - leal -1(%ecx), %ecx C size-1, and pad to a 16 byte boundary - - - ALIGN(16) -L(mul_1): - C eax scratch - C ebx src ptr - C ecx counter - C edx scratch - C esi carry - C edi dst ptr - C ebp multiplier - - movl (%ebx), %eax - addl $4, %ebx - - mull %ebp - - addl %esi, %eax - movl $0, %esi - - adcl %edx, %esi - - movl %eax, (%edi) - addl $4, %edi - - loop L(mul_1) - - -C Addmul src[n]*src[n+1..size-1] at dst[2*n-1...], for each n=1..size-2. -C -C The last two addmuls, which are the bottom right corner of the product -C triangle, are left to the end. These are src[size-3]*src[size-2,size-1] -C and src[size-2]*src[size-1]. If size is 4 then it's only these corner -C cases that need to be done. -C -C The unrolled code is the same as mpn_addmul_1(), see that routine for some -C comments. -C -C VAR_COUNTER is the outer loop, running from -(size-4) to -1, inclusive. -C -C VAR_JMP is the computed jump into the unrolled code, stepped by one code -C chunk each outer loop. -C -C K6 doesn't do any branch prediction on indirect jumps, which is good -C actually because it's a different target each time. The unrolled addmul -C is about 3 cycles/limb faster than a simple loop, so the 6 cycle cost of -C the indirect jump is quickly recovered. - - -dnl This value is also implicitly encoded in a shift and add. -dnl -deflit(CODE_BYTES_PER_LIMB, 15) - -dnl With the unmodified &src[size] and &dst[size] pointers, the -dnl displacements in the unrolled code fit in a byte for UNROLL_COUNT -dnl values up to 31. Above that an offset must be added to them. 
-dnl -deflit(OFFSET, -ifelse(eval(UNROLL_COUNT>31),1, -eval((UNROLL_COUNT-31)*4), -0)) - - C eax - C ebx &src[size] - C ecx - C edx - C esi carry - C edi &dst[size] - C ebp - - movl PARAM_SIZE, %ecx - movl %esi, (%edi) - - subl $4, %ecx - jz L(corner) - - movl %ecx, %edx -ifelse(OFFSET,0,, -` subl $OFFSET, %ebx') - - shll $4, %ecx -ifelse(OFFSET,0,, -` subl $OFFSET, %edi') - - negl %ecx - -ifdef(`PIC',` - call L(pic_calc) -L(here): -',` - leal L(unroll_inner_end)-eval(2*CODE_BYTES_PER_LIMB)(%ecx,%edx), %ecx -') - negl %edx - - - C The calculated jump mustn't be before the start of the available - C code. This is the limitation UNROLL_COUNT puts on the src operand - C size, but checked here using the jump address directly. - C - ASSERT(ae,` - movl_text_address( L(unroll_inner_start), %eax) - cmpl %eax, %ecx - ') - - -C ----------------------------------------------------------------------------- - ALIGN(16) -L(unroll_outer_top): - C eax - C ebx &src[size], constant - C ecx VAR_JMP - C edx VAR_COUNTER, limbs, negative - C esi high limb to store - C edi dst ptr, high of last addmul - C ebp - - movl -12+OFFSET(%ebx,%edx,4), %ebp C multiplier - movl %edx, VAR_COUNTER - - movl -8+OFFSET(%ebx,%edx,4), %eax C first limb of multiplicand - - mull %ebp - - testb $1, %cl - - movl %edx, %esi C high carry - movl %ecx, %edx C jump - - movl %eax, %ecx C low carry - leal CODE_BYTES_PER_LIMB(%edx), %edx - - movl %edx, VAR_JMP - leal 4(%edi), %edi - - C A branch-free version of this using some xors was found to be a - C touch slower than just a conditional jump, despite the jump - C switching between taken and not taken on every loop. - -ifelse(eval(UNROLL_COUNT%2),0, - jz,jnz) L(unroll_noswap) - movl %esi, %eax C high,low carry other way around - - movl %ecx, %esi - movl %eax, %ecx -L(unroll_noswap): - - jmp *%edx - - - C Must be on an even address here so the low bit of the jump address - C will indicate which way around ecx/esi should start. 
- C - C An attempt was made at padding here to get the end of the unrolled - C code to come out on a good alignment, to save padding before - C L(corner). This worked, but turned out to run slower than just an - C ALIGN(2). The reason for this is not clear, it might be related - C to the different speeds on different UNROLL_COUNTs noted above. - - ALIGN(2) - -L(unroll_inner_start): - C eax scratch - C ebx src - C ecx carry low - C edx scratch - C esi carry high - C edi dst - C ebp multiplier - C - C 15 code bytes each limb - C ecx/esi swapped on each chunk - -forloop(`i', UNROLL_COUNT, 1, ` - deflit(`disp_src', eval(-i*4 + OFFSET)) - deflit(`disp_dst', eval(disp_src - 4)) - - m4_assert(`disp_src>=-128 && disp_src<128') - m4_assert(`disp_dst>=-128 && disp_dst<128') - -ifelse(eval(i%2),0,` -Zdisp( movl, disp_src,(%ebx), %eax) - mull %ebp -Zdisp( addl, %esi, disp_dst,(%edi)) - adcl %eax, %ecx - movl %edx, %esi - jadcl0( %esi) -',` - dnl this one comes out last -Zdisp( movl, disp_src,(%ebx), %eax) - mull %ebp -Zdisp( addl, %ecx, disp_dst,(%edi)) - adcl %eax, %esi - movl %edx, %ecx - jadcl0( %ecx) -') -') -L(unroll_inner_end): - - addl %esi, -4+OFFSET(%edi) - - movl VAR_COUNTER, %edx - jadcl0( %ecx) - - movl %ecx, m4_empty_if_zero(OFFSET)(%edi) - movl VAR_JMP, %ecx - - incl %edx - jnz L(unroll_outer_top) - - -ifelse(OFFSET,0,,` - addl $OFFSET, %ebx - addl $OFFSET, %edi -') - - -C ----------------------------------------------------------------------------- - ALIGN(16) -L(corner): - C ebx &src[size] - C edi &dst[2*size-5] - - movl -12(%ebx), %ebp - - movl -8(%ebx), %eax - movl %eax, %ecx - - mull %ebp - - addl %eax, -4(%edi) - adcl $0, %edx - - movl -4(%ebx), %eax - movl %edx, %esi - movl %eax, %ebx - - mull %ebp - - addl %esi, %eax - adcl $0, %edx - - addl %eax, (%edi) - adcl $0, %edx - - movl %edx, %esi - movl %ebx, %eax - - mull %ecx - - addl %esi, %eax - movl %eax, 4(%edi) - - adcl $0, %edx - - movl %edx, 8(%edi) - - -C 
----------------------------------------------------------------------------- -C Left shift of dst[1..2*size-2], the bit shifted out becomes dst[2*size-1]. -C The loop measures about 6 cycles/iteration, though it looks like it should -C decode in 5. - -L(lshift_start): - movl PARAM_SIZE, %ecx - - movl PARAM_DST, %edi - subl $1, %ecx C size-1 and clear carry - - movl PARAM_SRC, %ebx - movl %ecx, %edx - - xorl %eax, %eax C ready for adcl - - - ALIGN(16) -L(lshift): - C eax - C ebx src (for later use) - C ecx counter, decrementing - C edx size-1 (for later use) - C esi - C edi dst, incrementing - C ebp - - rcll 4(%edi) - rcll 8(%edi) - leal 8(%edi), %edi - loop L(lshift) - - - adcl %eax, %eax - - movl %eax, 4(%edi) C dst most significant limb - movl (%ebx), %eax C src[0] - - leal 4(%ebx,%edx,4), %ebx C &src[size] - subl %edx, %ecx C -(size-1) - - -C ----------------------------------------------------------------------------- -C Now add in the squares on the diagonal, src[0]^2, src[1]^2, ..., -C src[size-1]^2. dst[0] hasn't been set yet, and just gets the -C low limb of src[0]^2. 
- - - mull %eax - - movl %eax, (%edi,%ecx,8) C dst[0] - - - ALIGN(16) -L(diag): - C eax scratch - C ebx &src[size] - C ecx counter, negative - C edx carry - C esi scratch - C edi dst[2*size-2] - C ebp - - movl (%ebx,%ecx,4), %eax - movl %edx, %esi - - mull %eax - - addl %esi, 4(%edi,%ecx,8) - adcl %eax, 8(%edi,%ecx,8) - adcl $0, %edx - - incl %ecx - jnz L(diag) - - - movl SAVE_EBX, %ebx - movl SAVE_ESI, %esi - - addl %edx, 4(%edi) C dst most significant limb - - movl SAVE_EDI, %edi - movl SAVE_EBP, %ebp - addl $FRAME, %esp - ret - - - -C ----------------------------------------------------------------------------- -ifdef(`PIC',` -L(pic_calc): - C See README.family about old gas bugs - addl (%esp), %ecx - addl $L(unroll_inner_end)-L(here)-eval(2*CODE_BYTES_PER_LIMB), %ecx - addl %edx, %ecx - ret -') - - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k7/README b/rts/gmp/mpn/x86/k7/README deleted file mode 100644 index c34315c401..0000000000 --- a/rts/gmp/mpn/x86/k7/README +++ /dev/null @@ -1,145 +0,0 @@ - - AMD K7 MPN SUBROUTINES - - -This directory contains code optimized for the AMD Athlon CPU. - -The mmx subdirectory has routines using MMX instructions. All Athlons have -MMX, the separate directory is just so that configure can omit it if the -assembler doesn't support MMX. - - - -STATUS - -Times for the loops, with all code and data in L1 cache. - - cycles/limb - mpn_add/sub_n 1.6 - - mpn_copyi 0.75 or 1.0 \ varying with data alignment - mpn_copyd 0.75 or 1.0 / - - mpn_divrem_1 17.0 integer part, 15.0 fractional part - mpn_mod_1 17.0 - mpn_divexact_by3 8.0 - - mpn_l/rshift 1.2 - - mpn_mul_1 3.4 - mpn_addmul/submul_1 3.9 - - mpn_mul_basecase 4.42 cycles/crossproduct (approx) - - mpn_popcount 5.0 - mpn_hamdist 6.0 - -Prefetching of sources hasn't yet been tried. - - - -NOTES - -cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available. - -Write-allocate L1 data cache means prefetching of destinations is unnecessary. 
- -Floating point multiplications can be done in parallel with integer -multiplications, but there doesn't seem to be any way to make use of this. - -Unsigned "mul"s can be issued every 3 cycles. This suggests 3 is a limit on -the speed of the multiplication routines. The documentation shows mul -executing in IEU0 (or maybe in IEU0 and IEU1 together), so it might be that, -to get near 3 cycles code has to be arranged so that nothing else is issued -to IEU0. A busy IEU0 could explain why some code takes 4 cycles and other -apparently equivalent code takes 5. - - - -OPTIMIZATIONS - -Unrolled loops are used to reduce looping overhead. The unrolling is -configurable up to 32 limbs/loop for most routines and up to 64 for some. -The K7 has 64k L1 code cache so quite big unrolling is allowable. - -Computed jumps into the unrolling are used to handle sizes not a multiple of -the unrolling. An attractive feature of this is that times increase -smoothly with operand size, but it may be that some routines should just -have simple loops to finish up, especially when PIC adds between 2 and 16 -cycles to get %eip. - -Position independent code is implemented using a call to get %eip for the -computed jumps and a ret is always done, rather than an addl $4,%esp or a -popl, so the CPU return address branch prediction stack stays synchronised -with the actual stack in memory. - -Branch prediction, in absence of any history, will guess forward jumps are -not taken and backward jumps are taken. Where possible it's arranged that -the less likely or less important case is under a taken forward jump. - - - -CODING - -Instructions in general code have been shown grouped if they can execute -together, which means up to three direct-path instructions which have no -successive dependencies. K7 always decodes three and has out-of-order -execution, but the groupings show what slots might be available and what -dependency chains exist. 
- -When there's vector-path instructions an effort is made to get triplets of -direct-path instructions in between them, even if there's dependencies, -since this maximizes decoding throughput and might save a cycle or two if -decoding is the limiting factor. - - - -INSTRUCTIONS - -adcl direct -divl 39 cycles back-to-back -lodsl,etc vector -loop 1 cycle vector (decl/jnz opens up one decode slot) -movd reg vector -movd mem direct -mull issue every 3 cycles, latency 4 cycles low word, 6 cycles high word -popl vector (use movl for more than one pop) -pushl direct, will pair with a load -shrdl %cl vector, 3 cycles, seems to be 3 decode too -xorl r,r false read dependency recognised - - - -REFERENCES - -"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number -22007, revision E, November 1999. Available on-line, - - http://www.amd.com/products/cpg/athlon/techdocs/pdf/22007.pdf - -"3DNow Technology Manual", AMD publication number 21928F/0-August 1999. -This describes the femms and prefetch instructions. Available on-line, - - http://www.amd.com/K6/k6docs/pdf/21928.pdf - -"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD -publication number 22466, revision B, August 1999. This describes -instructions added in the Athlon processor, such as pswapd and the extra -prefetch forms. Available on-line, - - http://www.amd.com/products/cpg/athlon/techdocs/pdf/22466.pdf - -"3DNow Instruction Porting Guide", AMD publication number 22621, revision B, -August 1999. This has some notes on general Athlon optimizations as well as -3DNow. Available on-line, - - http://www.amd.com/products/cpg/athlon/techdocs/pdf/22621.pdf - - - - ----------------- -Local variables: -mode: text -fill-column: 76 -End: diff --git a/rts/gmp/mpn/x86/k7/aors_n.asm b/rts/gmp/mpn/x86/k7/aors_n.asm deleted file mode 100644 index 85fa9d3036..0000000000 --- a/rts/gmp/mpn/x86/k7/aors_n.asm +++ /dev/null @@ -1,250 +0,0 @@ -dnl AMD K7 mpn_add_n/mpn_sub_n -- mpn add or subtract. 
-dnl -dnl K7: 1.64 cycles/limb (at 16 limb/loop). - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -dnl K7: UNROLL_COUNT cycles/limb -dnl 8 1.9 -dnl 16 1.64 -dnl 32 1.7 -dnl 64 2.0 -dnl Maximum possible with the current code is 64. - -deflit(UNROLL_COUNT, 16) - - -ifdef(`OPERATION_add_n', ` - define(M4_inst, adcl) - define(M4_function_n, mpn_add_n) - define(M4_function_nc, mpn_add_nc) - define(M4_description, add) -',`ifdef(`OPERATION_sub_n', ` - define(M4_inst, sbbl) - define(M4_function_n, mpn_sub_n) - define(M4_function_nc, mpn_sub_nc) - define(M4_description, subtract) -',`m4_error(`Need OPERATION_add_n or OPERATION_sub_n -')')') - -MULFUNC_PROLOGUE(mpn_add_n mpn_add_nc mpn_sub_n mpn_sub_nc) - - -C mp_limb_t M4_function_n (mp_ptr dst, mp_srcptr src1, mp_srcptr src2, -C mp_size_t size); -C mp_limb_t M4_function_nc (mp_ptr dst, mp_srcptr src1, mp_srcptr src2, -C mp_size_t size, mp_limb_t carry); -C -C Calculate src1,size M4_description src2,size, and store the result in -C dst,size. The return value is the carry bit from the top of the result (1 -C or 0). 
-C -C The _nc version accepts 1 or 0 for an initial carry into the low limb of -C the calculation. Note values other than 1 or 0 here will lead to garbage -C results. -C -C This code runs at 1.64 cycles/limb, which is probably the best possible -C with plain integer operations. Each limb is 2 loads and 1 store, and in -C one cycle the K7 can do two loads, or a load and a store, leading to 1.5 -C c/l. - -dnl Must have UNROLL_THRESHOLD >= 2, since the unrolled loop can't handle 1. -ifdef(`PIC',` -deflit(UNROLL_THRESHOLD, 8) -',` -deflit(UNROLL_THRESHOLD, 8) -') - -defframe(PARAM_CARRY,20) -defframe(PARAM_SIZE, 16) -defframe(PARAM_SRC2, 12) -defframe(PARAM_SRC1, 8) -defframe(PARAM_DST, 4) - -defframe(SAVE_EBP, -4) -defframe(SAVE_ESI, -8) -defframe(SAVE_EBX, -12) -defframe(SAVE_EDI, -16) -deflit(STACK_SPACE, 16) - - .text - ALIGN(32) -deflit(`FRAME',0) - -PROLOGUE(M4_function_nc) - movl PARAM_CARRY, %eax - jmp LF(M4_function_n,start) -EPILOGUE() - -PROLOGUE(M4_function_n) - - xorl %eax, %eax C carry -L(start): - movl PARAM_SIZE, %ecx - subl $STACK_SPACE, %esp -deflit(`FRAME',STACK_SPACE) - - movl %edi, SAVE_EDI - movl %ebx, SAVE_EBX - cmpl $UNROLL_THRESHOLD, %ecx - - movl PARAM_SRC2, %edx - movl PARAM_SRC1, %ebx - jae L(unroll) - - movl PARAM_DST, %edi - leal (%ebx,%ecx,4), %ebx - leal (%edx,%ecx,4), %edx - - leal (%edi,%ecx,4), %edi - negl %ecx - shrl %eax - - C This loop is in a single 16 byte code block already, so no - C alignment necessary. -L(simple): - C eax scratch - C ebx src1 - C ecx counter - C edx src2 - C esi - C edi dst - C ebp - - movl (%ebx,%ecx,4), %eax - M4_inst (%edx,%ecx,4), %eax - movl %eax, (%edi,%ecx,4) - incl %ecx - jnz L(simple) - - movl $0, %eax - movl SAVE_EDI, %edi - - movl SAVE_EBX, %ebx - setc %al - addl $STACK_SPACE, %esp - - ret - - -C ----------------------------------------------------------------------------- - C This is at 0x55, close enough to aligned. 
-L(unroll): -deflit(`FRAME',STACK_SPACE) - movl %ebp, SAVE_EBP - andl $-2, %ecx C size low bit masked out - andl $1, PARAM_SIZE C size low bit kept - - movl %ecx, %edi - decl %ecx - movl PARAM_DST, %ebp - - shrl $UNROLL_LOG2, %ecx - negl %edi - movl %esi, SAVE_ESI - - andl $UNROLL_MASK, %edi - -ifdef(`PIC',` - call L(pic_calc) -L(here): -',` - leal L(entry) (%edi,%edi,8), %esi C 9 bytes per -') - negl %edi - shrl %eax - - leal ifelse(UNROLL_BYTES,256,128) (%ebx,%edi,4), %ebx - leal ifelse(UNROLL_BYTES,256,128) (%edx,%edi,4), %edx - leal ifelse(UNROLL_BYTES,256,128) (%ebp,%edi,4), %edi - - jmp *%esi - - -ifdef(`PIC',` -L(pic_calc): - C See README.family about old gas bugs - leal (%edi,%edi,8), %esi - addl $L(entry)-L(here), %esi - addl (%esp), %esi - ret -') - - -C ----------------------------------------------------------------------------- - ALIGN(32) -L(top): - C eax zero - C ebx src1 - C ecx counter - C edx src2 - C esi scratch (was computed jump) - C edi dst - C ebp scratch - - leal UNROLL_BYTES(%edx), %edx - -L(entry): -deflit(CHUNK_COUNT, 2) -forloop(i, 0, UNROLL_COUNT/CHUNK_COUNT-1, ` - deflit(`disp0', eval(i*CHUNK_COUNT*4 ifelse(UNROLL_BYTES,256,-128))) - deflit(`disp1', eval(disp0 + 4)) - -Zdisp( movl, disp0,(%ebx), %esi) - movl disp1(%ebx), %ebp -Zdisp( M4_inst,disp0,(%edx), %esi) -Zdisp( movl, %esi, disp0,(%edi)) - M4_inst disp1(%edx), %ebp - movl %ebp, disp1(%edi) -') - - decl %ecx - leal UNROLL_BYTES(%ebx), %ebx - leal UNROLL_BYTES(%edi), %edi - jns L(top) - - - mov PARAM_SIZE, %esi - movl SAVE_EBP, %ebp - movl $0, %eax - - decl %esi - js L(even) - - movl (%ebx), %ecx - M4_inst UNROLL_BYTES(%edx), %ecx - movl %ecx, (%edi) -L(even): - - movl SAVE_EDI, %edi - movl SAVE_EBX, %ebx - setc %al - - movl SAVE_ESI, %esi - addl $STACK_SPACE, %esp - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k7/aorsmul_1.asm b/rts/gmp/mpn/x86/k7/aorsmul_1.asm deleted file mode 100644 index 9f9c3daaf4..0000000000 --- a/rts/gmp/mpn/x86/k7/aorsmul_1.asm +++ /dev/null @@ -1,364 
+0,0 @@ -dnl AMD K7 mpn_addmul_1/mpn_submul_1 -- add or subtract mpn multiple. -dnl -dnl K7: 3.9 cycles/limb. -dnl -dnl Future: It should be possible to avoid the separate mul after the -dnl unrolled loop by moving the movl/adcl to the top. - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -dnl K7: UNROLL_COUNT cycles/limb -dnl 4 4.42 -dnl 8 4.16 -dnl 16 3.9 -dnl 32 3.9 -dnl 64 3.87 -dnl Maximum possible with the current code is 64. 
- -deflit(UNROLL_COUNT, 16) - - -ifdef(`OPERATION_addmul_1',` - define(M4_inst, addl) - define(M4_function_1, mpn_addmul_1) - define(M4_function_1c, mpn_addmul_1c) - define(M4_description, add it to) - define(M4_desc_retval, carry) -',`ifdef(`OPERATION_submul_1',` - define(M4_inst, subl) - define(M4_function_1, mpn_submul_1) - define(M4_function_1c, mpn_submul_1c) - define(M4_description, subtract it from) - define(M4_desc_retval, borrow) -',`m4_error(`Need OPERATION_addmul_1 or OPERATION_submul_1 -')')') - -MULFUNC_PROLOGUE(mpn_addmul_1 mpn_addmul_1c mpn_submul_1 mpn_submul_1c) - - -C mp_limb_t M4_function_1 (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t mult); -C mp_limb_t M4_function_1c (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t mult, mp_limb_t carry); -C -C Calculate src,size multiplied by mult and M4_description dst,size. -C Return the M4_desc_retval limb from the top of the result. - -ifdef(`PIC',` -deflit(UNROLL_THRESHOLD, 9) -',` -deflit(UNROLL_THRESHOLD, 6) -') - -defframe(PARAM_CARRY, 20) -defframe(PARAM_MULTIPLIER,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) -deflit(`FRAME',0) - -defframe(SAVE_EBX, -4) -defframe(SAVE_ESI, -8) -defframe(SAVE_EDI, -12) -defframe(SAVE_EBP, -16) -deflit(SAVE_SIZE, 16) - - .text - ALIGN(32) -PROLOGUE(M4_function_1) - movl PARAM_SIZE, %edx - movl PARAM_SRC, %eax - xorl %ecx, %ecx - - decl %edx - jnz LF(M4_function_1c,start_1) - - movl (%eax), %eax - movl PARAM_DST, %ecx - - mull PARAM_MULTIPLIER - - M4_inst %eax, (%ecx) - adcl $0, %edx - movl %edx, %eax - - ret -EPILOGUE() - - ALIGN(16) -PROLOGUE(M4_function_1c) - movl PARAM_SIZE, %edx - movl PARAM_SRC, %eax - - decl %edx - jnz L(more_than_one_limb) - - movl (%eax), %eax - movl PARAM_DST, %ecx - - mull PARAM_MULTIPLIER - - addl PARAM_CARRY, %eax - - adcl $0, %edx - M4_inst %eax, (%ecx) - - adcl $0, %edx - movl %edx, %eax - - ret - - - C offset 0x44 so close enough to aligned -L(more_than_one_limb): - movl PARAM_CARRY, 
%ecx -L(start_1): - C eax src - C ecx initial carry - C edx size-1 - subl $SAVE_SIZE, %esp -deflit(`FRAME',16) - - movl %ebx, SAVE_EBX - movl %esi, SAVE_ESI - movl %edx, %ebx C size-1 - - movl PARAM_SRC, %esi - movl %ebp, SAVE_EBP - cmpl $UNROLL_THRESHOLD, %edx - - movl PARAM_MULTIPLIER, %ebp - movl %edi, SAVE_EDI - - movl (%esi), %eax C src low limb - movl PARAM_DST, %edi - ja L(unroll) - - - C simple loop - - leal 4(%esi,%ebx,4), %esi C point one limb past last - leal (%edi,%ebx,4), %edi C point at last limb - negl %ebx - - C The movl to load the next source limb is done well ahead of the - C mul. This is necessary for full speed, and leads to one limb - C handled separately at the end. - -L(simple): - C eax src limb - C ebx loop counter - C ecx carry limb - C edx scratch - C esi src - C edi dst - C ebp multiplier - - mull %ebp - - addl %eax, %ecx - adcl $0, %edx - - M4_inst %ecx, (%edi,%ebx,4) - movl (%esi,%ebx,4), %eax - adcl $0, %edx - - incl %ebx - movl %edx, %ecx - jnz L(simple) - - - mull %ebp - - movl SAVE_EBX, %ebx - movl SAVE_ESI, %esi - movl SAVE_EBP, %ebp - - addl %eax, %ecx - adcl $0, %edx - - M4_inst %ecx, (%edi) - adcl $0, %edx - movl SAVE_EDI, %edi - - addl $SAVE_SIZE, %esp - movl %edx, %eax - ret - - - -C ----------------------------------------------------------------------------- - ALIGN(16) -L(unroll): - C eax src low limb - C ebx size-1 - C ecx carry - C edx size-1 - C esi src - C edi dst - C ebp multiplier - -dnl overlapping with parameters no longer needed -define(VAR_COUNTER,`PARAM_SIZE') -define(VAR_JUMP, `PARAM_MULTIPLIER') - - subl $2, %ebx C (size-2)-1 - decl %edx C size-2 - - shrl $UNROLL_LOG2, %ebx - negl %edx - - movl %ebx, VAR_COUNTER - andl $UNROLL_MASK, %edx - - movl %edx, %ebx - shll $4, %edx - -ifdef(`PIC',` - call L(pic_calc) -L(here): -',` - leal L(entry) (%edx,%ebx,1), %edx -') - negl %ebx - movl %edx, VAR_JUMP - - mull %ebp - - addl %eax, %ecx C initial carry, becomes low carry - adcl $0, %edx - testb $1, %bl - - movl 
4(%esi), %eax C src second limb - leal ifelse(UNROLL_BYTES,256,128+) 8(%esi,%ebx,4), %esi - leal ifelse(UNROLL_BYTES,256,128) (%edi,%ebx,4), %edi - - movl %edx, %ebx C high carry - cmovnz( %ecx, %ebx) C high,low carry other way around - cmovnz( %edx, %ecx) - - jmp *VAR_JUMP - - -ifdef(`PIC',` -L(pic_calc): - C See README.family about old gas bugs - leal (%edx,%ebx,1), %edx - addl $L(entry)-L(here), %edx - addl (%esp), %edx - ret -') - - -C ----------------------------------------------------------------------------- -C This code uses a "two carry limbs" scheme. At the top of the loop the -C carries are ebx=lo, ecx=hi, then they swap for each limb processed. For -C the computed jump an odd size means they start one way around, an even -C size the other. Either way one limb is handled separately at the start of -C the loop. -C -C The positioning of the movl to load the next source limb is important. -C Moving it after the adcl with a view to avoiding a separate mul at the end -C of the loop slows the code down. 
- - ALIGN(32) -L(top): - C eax src limb - C ebx carry high - C ecx carry low - C edx scratch - C esi src+8 - C edi dst - C ebp multiplier - C - C VAR_COUNTER loop counter - C - C 17 bytes each limb - -L(entry): -deflit(CHUNK_COUNT,2) -forloop(`i', 0, UNROLL_COUNT/CHUNK_COUNT-1, ` - deflit(`disp0', eval(i*CHUNK_COUNT*4 ifelse(UNROLL_BYTES,256,-128))) - deflit(`disp1', eval(disp0 + 4)) - - mull %ebp - -Zdisp( M4_inst,%ecx, disp0,(%edi)) - movl $0, %ecx - - adcl %eax, %ebx - -Zdisp( movl, disp0,(%esi), %eax) - adcl %edx, %ecx - - - mull %ebp - - M4_inst %ebx, disp1(%edi) - movl $0, %ebx - - adcl %eax, %ecx - - movl disp1(%esi), %eax - adcl %edx, %ebx -') - - decl VAR_COUNTER - leal UNROLL_BYTES(%esi), %esi - leal UNROLL_BYTES(%edi), %edi - - jns L(top) - - - C eax src limb - C ebx carry high - C ecx carry low - C edx - C esi - C edi dst (points at second last limb) - C ebp multiplier -deflit(`disp0', ifelse(UNROLL_BYTES,256,-128)) -deflit(`disp1', eval(disp0-0 + 4)) - - mull %ebp - - M4_inst %ecx, disp0(%edi) - movl SAVE_EBP, %ebp - - adcl %ebx, %eax - movl SAVE_EBX, %ebx - movl SAVE_ESI, %esi - - adcl $0, %edx - M4_inst %eax, disp1(%edi) - movl SAVE_EDI, %edi - - adcl $0, %edx - addl $SAVE_SIZE, %esp - - movl %edx, %eax - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k7/diveby3.asm b/rts/gmp/mpn/x86/k7/diveby3.asm deleted file mode 100644 index 57684958a5..0000000000 --- a/rts/gmp/mpn/x86/k7/diveby3.asm +++ /dev/null @@ -1,131 +0,0 @@ -dnl AMD K7 mpn_divexact_by3 -- mpn division by 3, expecting no remainder. -dnl -dnl K7: 8.0 cycles/limb - - -dnl Copyright (C) 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. 
-dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_divexact_by3c (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t carry); - -defframe(PARAM_CARRY,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - -dnl multiplicative inverse of 3, modulo 2^32 -deflit(INVERSE_3, 0xAAAAAAAB) - -dnl ceil(b/3) and floor(b*2/3) where b=2^32 -deflit(ONE_THIRD_CEIL, 0x55555556) -deflit(TWO_THIRDS_FLOOR, 0xAAAAAAAA) - - .text - ALIGN(32) - -PROLOGUE(mpn_divexact_by3c) -deflit(`FRAME',0) - - movl PARAM_SRC, %ecx - pushl %ebx defframe_pushl(SAVE_EBX) - - movl PARAM_CARRY, %ebx - pushl %ebp defframe_pushl(SAVE_EBP) - - movl PARAM_SIZE, %ebp - pushl %edi defframe_pushl(SAVE_EDI) - - movl (%ecx), %eax C src low limb - pushl %esi defframe_pushl(SAVE_ESI) - - movl PARAM_DST, %edi - movl $TWO_THIRDS_FLOOR, %esi - leal -4(%ecx,%ebp,4), %ecx C &src[size-1] - - subl %ebx, %eax - - setc %bl - decl %ebp - jz L(last) - - leal (%edi,%ebp,4), %edi C &dst[size-1] - negl %ebp - - - ALIGN(16) -L(top): - C eax src limb, carry subtracted - C ebx carry limb (0 or 1) - C ecx &src[size-1] - C edx scratch - C esi TWO_THIRDS_FLOOR - C edi &dst[size-1] - C ebp counter, limbs, negative - - imull $INVERSE_3, %eax, %edx - - movl 4(%ecx,%ebp,4), %eax C next src limb - cmpl $ONE_THIRD_CEIL, %edx - - sbbl $-1, %ebx C +1 if result>=ceil(b/3) - cmpl %edx, %esi - - sbbl %ebx, %eax C and further 1 if result>=ceil(b*2/3) - movl %edx, (%edi,%ebp,4) - incl %ebp - - 
setc %bl C new carry - jnz L(top) - - - -L(last): - C eax src limb, carry subtracted - C ebx carry limb (0 or 1) - C ecx &src[size-1] - C edx scratch - C esi multiplier - C edi &dst[size-1] - C ebp - - imull $INVERSE_3, %eax - - cmpl $ONE_THIRD_CEIL, %eax - movl %eax, (%edi) - movl SAVE_EBP, %ebp - - sbbl $-1, %ebx C +1 if eax>=ceil(b/3) - cmpl %eax, %esi - movl $0, %eax - - adcl %ebx, %eax C further +1 if eax>=ceil(b*2/3) - movl SAVE_EDI, %edi - movl SAVE_ESI, %esi - - movl SAVE_EBX, %ebx - addl $FRAME, %esp - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k7/gmp-mparam.h b/rts/gmp/mpn/x86/k7/gmp-mparam.h deleted file mode 100644 index c3bba0afc4..0000000000 --- a/rts/gmp/mpn/x86/k7/gmp-mparam.h +++ /dev/null @@ -1,100 +0,0 @@ -/* AMD K7 gmp-mparam.h -- Compiler/machine parameter header file. - -Copyright (C) 1991, 1993, 1994, 2000 Free Software Foundation, Inc. - -This file is part of the GNU MP Library. - -The GNU MP Library is free software; you can redistribute it and/or modify -it under the terms of the GNU Lesser General Public License as published by -the Free Software Foundation; either version 2.1 of the License, or (at your -option) any later version. - -The GNU MP Library is distributed in the hope that it will be useful, but -WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY -or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public -License for more details. - -You should have received a copy of the GNU Lesser General Public License -along with the GNU MP Library; see the file COPYING.LIB. If not, write to -the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, -MA 02111-1307, USA. 
*/ - -#define BITS_PER_MP_LIMB 32 -#define BYTES_PER_MP_LIMB 4 -#define BITS_PER_LONGINT 32 -#define BITS_PER_INT 32 -#define BITS_PER_SHORTINT 16 -#define BITS_PER_CHAR 8 - - -/* the low limb is ready after 4 cycles, but normally it's the high limb - which is of interest, and that comes out after 6 cycles */ -#ifndef UMUL_TIME -#define UMUL_TIME 6 /* cycles */ -#endif - -/* AMD doco says 40, but it measures 39 back-to-back */ -#ifndef UDIV_TIME -#define UDIV_TIME 39 /* cycles */ -#endif - -/* using bsf */ -#ifndef COUNT_TRAILING_ZEROS_TIME -#define COUNT_TRAILING_ZEROS_TIME 7 /* cycles */ -#endif - - -/* Generated by tuneup.c, 2000-07-06. */ - -#ifndef KARATSUBA_MUL_THRESHOLD -#define KARATSUBA_MUL_THRESHOLD 26 -#endif -#ifndef TOOM3_MUL_THRESHOLD -#define TOOM3_MUL_THRESHOLD 177 -#endif - -#ifndef KARATSUBA_SQR_THRESHOLD -#define KARATSUBA_SQR_THRESHOLD 52 -#endif -#ifndef TOOM3_SQR_THRESHOLD -#define TOOM3_SQR_THRESHOLD 173 -#endif - -#ifndef BZ_THRESHOLD -#define BZ_THRESHOLD 76 -#endif - -#ifndef FIB_THRESHOLD -#define FIB_THRESHOLD 114 -#endif - -#ifndef POWM_THRESHOLD -#define POWM_THRESHOLD 34 -#endif - -#ifndef GCD_ACCEL_THRESHOLD -#define GCD_ACCEL_THRESHOLD 5 -#endif -#ifndef GCDEXT_THRESHOLD -#define GCDEXT_THRESHOLD 54 -#endif - -#ifndef FFT_MUL_TABLE -#define FFT_MUL_TABLE { 720, 1440, 2944, 7680, 18432, 57344, 0 } -#endif -#ifndef FFT_MODF_MUL_THRESHOLD -#define FFT_MODF_MUL_THRESHOLD 736 -#endif -#ifndef FFT_MUL_THRESHOLD -#define FFT_MUL_THRESHOLD 6912 -#endif - -#ifndef FFT_SQR_TABLE -#define FFT_SQR_TABLE { 784, 1696, 3200, 7680, 18432, 57344, 0 } -#endif -#ifndef FFT_MODF_SQR_THRESHOLD -#define FFT_MODF_SQR_THRESHOLD 800 -#endif -#ifndef FFT_SQR_THRESHOLD -#define FFT_SQR_THRESHOLD 8448 -#endif diff --git a/rts/gmp/mpn/x86/k7/mmx/copyd.asm b/rts/gmp/mpn/x86/k7/mmx/copyd.asm deleted file mode 100644 index 33214daa1f..0000000000 --- a/rts/gmp/mpn/x86/k7/mmx/copyd.asm +++ /dev/null @@ -1,136 +0,0 @@ -dnl AMD K7 mpn_copyd -- copy limb vector, 
decrementing. -dnl -dnl alignment dst/src, A=0mod8 N=4mod8 -dnl A/A A/N N/A N/N -dnl K7 0.75 1.0 1.0 0.75 - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C void mpn_copyd (mp_ptr dst, mp_srcptr src, mp_size_t size); -C -C The various comments in mpn/x86/k7/copyi.asm apply here too. 
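Arithmetically the routine is just a limb copy done from the top down; a C sketch of the semantics (`copyd_ref` is an illustrative name — the point of the decrementing order is that it is also safe when the regions overlap with dst above src):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* mpn_copyd semantics: copy size limbs from src to dst, highest limb
   first.  With dst >= src the source limbs are read before they are
   overwritten, which an incrementing copy would not guarantee. */
void copyd_ref(uint32_t *dst, const uint32_t *src, size_t size)
{
    for (size_t i = size; i-- > 0; )
        dst[i] = src[i];
}
```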
- -defframe(PARAM_SIZE,12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) -deflit(`FRAME',0) - -dnl parameter space reused -define(SAVE_EBX,`PARAM_SIZE') -define(SAVE_ESI,`PARAM_SRC') - -dnl minimum 5 since the unrolled code can't handle less than 5 -deflit(UNROLL_THRESHOLD, 5) - - .text - ALIGN(32) -PROLOGUE(mpn_copyd) - - movl PARAM_SIZE, %ecx - movl %ebx, SAVE_EBX - - movl PARAM_SRC, %eax - movl PARAM_DST, %edx - - cmpl $UNROLL_THRESHOLD, %ecx - jae L(unroll) - - orl %ecx, %ecx - jz L(simple_done) - -L(simple): - C eax src - C ebx scratch - C ecx counter - C edx dst - C - C this loop is 2 cycles/limb - - movl -4(%eax,%ecx,4), %ebx - movl %ebx, -4(%edx,%ecx,4) - decl %ecx - jnz L(simple) - -L(simple_done): - movl SAVE_EBX, %ebx - ret - - -L(unroll): - movl %esi, SAVE_ESI - leal (%eax,%ecx,4), %ebx - leal (%edx,%ecx,4), %esi - - andl %esi, %ebx - movl SAVE_ESI, %esi - subl $4, %ecx C size-4 - - testl $4, %ebx C testl to pad code closer to 16 bytes for L(top) - jz L(aligned) - - C both src and dst unaligned, process one limb to align them - movl 12(%eax,%ecx,4), %ebx - movl %ebx, 12(%edx,%ecx,4) - decl %ecx -L(aligned): - - - ALIGN(16) -L(top): - C eax src - C ebx - C ecx counter, limbs - C edx dst - - movq 8(%eax,%ecx,4), %mm0 - movq (%eax,%ecx,4), %mm1 - subl $4, %ecx - movq %mm0, 16+8(%edx,%ecx,4) - movq %mm1, 16(%edx,%ecx,4) - jns L(top) - - - C now %ecx is -4 to -1 representing respectively 0 to 3 limbs remaining - - testb $2, %cl - jz L(finish_not_two) - - movq 8(%eax,%ecx,4), %mm0 - movq %mm0, 8(%edx,%ecx,4) -L(finish_not_two): - - testb $1, %cl - jz L(done) - - movl (%eax), %ebx - movl %ebx, (%edx) - -L(done): - movl SAVE_EBX, %ebx - emms - ret - - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k7/mmx/copyi.asm b/rts/gmp/mpn/x86/k7/mmx/copyi.asm deleted file mode 100644 index b234a1628c..0000000000 --- a/rts/gmp/mpn/x86/k7/mmx/copyi.asm +++ /dev/null @@ -1,147 +0,0 @@ -dnl AMD K7 mpn_copyi -- copy limb vector, incrementing. 
-dnl -dnl alignment dst/src, A=0mod8 N=4mod8 -dnl A/A A/N N/A N/N -dnl K7 0.75 1.0 1.0 0.75 - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C void mpn_copyi (mp_ptr dst, mp_srcptr src, mp_size_t size); -C -C Copy src,size to dst,size. -C -C This code at 0.75 or 1.0 c/l is always faster than a plain rep movsl at -C 1.33 c/l. -C -C The K7 can do two loads, or two stores, or a load and a store, in one -C cycle, so if those are 64-bit operations then 0.5 c/l should be possible, -C however nothing under 0.7 c/l is known. -C -C If both source and destination are unaligned then one limb is processed at -C the start to make them aligned and so get 0.75 c/l, whereas if they'd been -C used unaligned it would be 1.5 c/l. 
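The one-limb priming step described above can be sketched in C; `copyi_aligned` and the pointer tests are illustrative only, with a plain limb loop standing in for the 64-bit movq pairs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of the copyi alignment strategy: when src and dst are both
   4mod8, copying one limb first leaves both 8-byte aligned, so the
   remainder can go through 64-bit (movq) transfers at 0.75 c/l rather
   than running them unaligned at 1.5 c/l. */
void copyi_aligned(uint32_t *dst, const uint32_t *src, size_t size)
{
    if (size != 0 && ((uintptr_t)dst & 4) && ((uintptr_t)src & 4)) {
        *dst++ = *src++;          /* align both pointers to 0mod8 */
        size--;
    }
    for (size_t i = 0; i < size; i++)   /* stands in for the movq loop */
        dst[i] = src[i];
}
```

If only one of the two pointers is 4mod8 no single fixup can align both, which is why the mixed A/N and N/A cases stay at 1.0 c/l in the table above.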
- -defframe(PARAM_SIZE,12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - -dnl parameter space reused -define(SAVE_EBX,`PARAM_SIZE') - -dnl minimum 5 since the unrolled code can't handle less than 5 -deflit(UNROLL_THRESHOLD, 5) - - .text - ALIGN(32) -PROLOGUE(mpn_copyi) -deflit(`FRAME',0) - - movl PARAM_SIZE, %ecx - movl %ebx, SAVE_EBX - - movl PARAM_SRC, %eax - movl PARAM_DST, %edx - - cmpl $UNROLL_THRESHOLD, %ecx - jae L(unroll) - - orl %ecx, %ecx - jz L(simple_done) - -L(simple): - C eax src, incrementing - C ebx scratch - C ecx counter - C edx dst, incrementing - C - C this loop is 2 cycles/limb - - movl (%eax), %ebx - movl %ebx, (%edx) - decl %ecx - leal 4(%eax), %eax - leal 4(%edx), %edx - jnz L(simple) - -L(simple_done): - movl SAVE_EBX, %ebx - ret - - -L(unroll): - movl %eax, %ebx - leal -12(%eax,%ecx,4), %eax C src end - 12 - subl $3, %ecx C size-3 - - andl %edx, %ebx - leal (%edx,%ecx,4), %edx C dst end - 12 - negl %ecx - - testl $4, %ebx C testl to pad code closer to 16 bytes for L(top) - jz L(aligned) - - C both src and dst unaligned, process one limb to align them - movl (%eax,%ecx,4), %ebx - movl %ebx, (%edx,%ecx,4) - incl %ecx -L(aligned): - - - ALIGN(16) -L(top): - C eax src end - 12 - C ebx - C ecx counter, negative, limbs - C edx dst end - 12 - - movq (%eax,%ecx,4), %mm0 - movq 8(%eax,%ecx,4), %mm1 - addl $4, %ecx - movq %mm0, -16(%edx,%ecx,4) - movq %mm1, -16+8(%edx,%ecx,4) - ja L(top) C jump no carry and not zero - - - C now %ecx is 0 to 3 representing respectively 3 to 0 limbs remaining - - testb $2, %cl - jnz L(finish_not_two) - - movq (%eax,%ecx,4), %mm0 - movq %mm0, (%edx,%ecx,4) -L(finish_not_two): - - testb $1, %cl - jnz L(done) - - movl 8(%eax), %ebx - movl %ebx, 8(%edx) - -L(done): - movl SAVE_EBX, %ebx - emms - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k7/mmx/divrem_1.asm b/rts/gmp/mpn/x86/k7/mmx/divrem_1.asm deleted file mode 100644 index 483ad6a9a1..0000000000 --- a/rts/gmp/mpn/x86/k7/mmx/divrem_1.asm +++ /dev/null @@ -1,718 
+0,0 @@ -dnl AMD K7 mpn_divrem_1 -- mpn by limb division. -dnl -dnl K7: 17.0 cycles/limb integer part, 15.0 cycles/limb fraction part. - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_divrem_1 (mp_ptr dst, mp_size_t xsize, -C mp_srcptr src, mp_size_t size, -C mp_limb_t divisor); -C mp_limb_t mpn_divrem_1c (mp_ptr dst, mp_size_t xsize, -C mp_srcptr src, mp_size_t size, -C mp_limb_t divisor, mp_limb_t carry); -C -C The method and nomenclature follow part 8 of "Division by Invariant -C Integers using Multiplication" by Granlund and Montgomery, reference in -C gmp.texi. -C -C The "and"s shown in the paper are done here with "cmov"s. "m" is written -C for m', and "d" for d_norm, which won't cause any confusion since it's -C only the normalized divisor that's of any use in the code. "b" is written -C for 2^N, the size of a limb, N being 32 here. -C -C mpn_divrem_1 avoids one division if the src high limb is less than the -C divisor. mpn_divrem_1c doesn't check for a zero carry, since in normal -C circumstances that will be a very rare event. 
-C -C There's a small bias towards expecting xsize==0, by having code for -C xsize==0 in a straight line and xsize!=0 under forward jumps. - - -dnl MUL_THRESHOLD is the value of xsize+size at which the multiply by -dnl inverse method is used, rather than plain "divl"s. Minimum value 1. -dnl -dnl The inverse takes about 50 cycles to calculate, but after that the -dnl multiply is 17 c/l versus division at 42 c/l. -dnl -dnl At 3 limbs the mul is a touch faster than div on the integer part, and -dnl even more so on the fractional part. - -deflit(MUL_THRESHOLD, 3) - - -defframe(PARAM_CARRY, 24) -defframe(PARAM_DIVISOR,20) -defframe(PARAM_SIZE, 16) -defframe(PARAM_SRC, 12) -defframe(PARAM_XSIZE, 8) -defframe(PARAM_DST, 4) - -defframe(SAVE_EBX, -4) -defframe(SAVE_ESI, -8) -defframe(SAVE_EDI, -12) -defframe(SAVE_EBP, -16) - -defframe(VAR_NORM, -20) -defframe(VAR_INVERSE, -24) -defframe(VAR_SRC, -28) -defframe(VAR_DST, -32) -defframe(VAR_DST_STOP,-36) - -deflit(STACK_SPACE, 36) - - .text - ALIGN(32) - -PROLOGUE(mpn_divrem_1c) -deflit(`FRAME',0) - movl PARAM_CARRY, %edx - movl PARAM_SIZE, %ecx - subl $STACK_SPACE, %esp -deflit(`FRAME',STACK_SPACE) - - movl %ebx, SAVE_EBX - movl PARAM_XSIZE, %ebx - - movl %edi, SAVE_EDI - movl PARAM_DST, %edi - - movl %ebp, SAVE_EBP - movl PARAM_DIVISOR, %ebp - - movl %esi, SAVE_ESI - movl PARAM_SRC, %esi - - leal -4(%edi,%ebx,4), %edi - jmp LF(mpn_divrem_1,start_1c) - -EPILOGUE() - - - C offset 0x31, close enough to aligned -PROLOGUE(mpn_divrem_1) -deflit(`FRAME',0) - - movl PARAM_SIZE, %ecx - movl $0, %edx C initial carry (if can't skip a div) - subl $STACK_SPACE, %esp -deflit(`FRAME',STACK_SPACE) - - movl %ebp, SAVE_EBP - movl PARAM_DIVISOR, %ebp - - movl %ebx, SAVE_EBX - movl PARAM_XSIZE, %ebx - - movl %esi, SAVE_ESI - movl PARAM_SRC, %esi - orl %ecx, %ecx - - movl %edi, SAVE_EDI - movl PARAM_DST, %edi - leal -4(%edi,%ebx,4), %edi C &dst[xsize-1] - - jz L(no_skip_div) - movl -4(%esi,%ecx,4), %eax C src high limb - - cmpl %ebp, %eax C one 
less div if high<divisor - jnb L(no_skip_div) - - movl $0, (%edi,%ecx,4) C dst high limb - decl %ecx C size-1 - movl %eax, %edx C src high limb as initial carry -L(no_skip_div): - - -L(start_1c): - C eax - C ebx xsize - C ecx size - C edx carry - C esi src - C edi &dst[xsize-1] - C ebp divisor - - leal (%ebx,%ecx), %eax C size+xsize - cmpl $MUL_THRESHOLD, %eax - jae L(mul_by_inverse) - - -C With MUL_THRESHOLD set to 3, the simple loops here only do 0 to 2 limbs. -C It'd be possible to write them out without the looping, but no speedup -C would be expected. -C -C Using PARAM_DIVISOR instead of %ebp measures 1 cycle/loop faster on the -C integer part, but curiously not on the fractional part, where %ebp is a -C (fixed) couple of cycles faster. - - orl %ecx, %ecx - jz L(divide_no_integer) - -L(divide_integer): - C eax scratch (quotient) - C ebx xsize - C ecx counter - C edx scratch (remainder) - C esi src - C edi &dst[xsize-1] - C ebp divisor - - movl -4(%esi,%ecx,4), %eax - - divl PARAM_DIVISOR - - movl %eax, (%edi,%ecx,4) - decl %ecx - jnz L(divide_integer) - - -L(divide_no_integer): - movl PARAM_DST, %edi - orl %ebx, %ebx - jnz L(divide_fraction) - -L(divide_done): - movl SAVE_ESI, %esi - movl SAVE_EDI, %edi - movl %edx, %eax - - movl SAVE_EBX, %ebx - movl SAVE_EBP, %ebp - addl $STACK_SPACE, %esp - - ret - - -L(divide_fraction): - C eax scratch (quotient) - C ebx counter - C ecx - C edx scratch (remainder) - C esi - C edi dst - C ebp divisor - - movl $0, %eax - - divl %ebp - - movl %eax, -4(%edi,%ebx,4) - decl %ebx - jnz L(divide_fraction) - - jmp L(divide_done) - - - -C ----------------------------------------------------------------------------- - -L(mul_by_inverse): - C eax - C ebx xsize - C ecx size - C edx carry - C esi src - C edi &dst[xsize-1] - C ebp divisor - - bsrl %ebp, %eax C 31-l - - leal 12(%edi), %ebx - leal 4(%edi,%ecx,4), %edi C &dst[xsize+size] - - movl %edi, VAR_DST - movl %ebx, VAR_DST_STOP - - movl %ecx, %ebx C size - movl $31, %ecx - - movl 
%edx, %edi C carry - movl $-1, %edx - - C - - xorl %eax, %ecx C l - incl %eax C 32-l - - shll %cl, %ebp C d normalized - movl %ecx, VAR_NORM - - movd %eax, %mm7 - - movl $-1, %eax - subl %ebp, %edx C (b-d)-1 giving edx:eax = b*(b-d)-1 - - divl %ebp C floor (b*(b-d)-1) / d - - orl %ebx, %ebx C size - movl %eax, VAR_INVERSE - leal -12(%esi,%ebx,4), %eax C &src[size-3] - - jz L(start_zero) - movl %eax, VAR_SRC - cmpl $1, %ebx - - movl 8(%eax), %esi C src high limb - jz L(start_one) - -L(start_two_or_more): - movl 4(%eax), %edx C src second highest limb - - shldl( %cl, %esi, %edi) C n2 = carry,high << l - - shldl( %cl, %edx, %esi) C n10 = high,second << l - - cmpl $2, %ebx - je L(integer_two_left) - jmp L(integer_top) - - -L(start_one): - shldl( %cl, %esi, %edi) C n2 = carry,high << l - - shll %cl, %esi C n10 = high << l - movl %eax, VAR_SRC - jmp L(integer_one_left) - - -L(start_zero): - shll %cl, %edi C n2 = carry << l - movl $0, %esi C n10 = 0 - - C we're here because xsize+size>=MUL_THRESHOLD, so with size==0 then - C must have xsize!=0 - jmp L(fraction_some) - - - -C ----------------------------------------------------------------------------- -C -C The multiply by inverse loop is 17 cycles, and relies on some out-of-order -C execution. The instruction scheduling is important, with various -C apparently equivalent forms running 1 to 5 cycles slower. -C -C A lower bound for the time would seem to be 16 cycles, based on the -C following successive dependencies. -C -C cycles -C n2+n1 1 -C mul 6 -C q1+1 1 -C mul 6 -C sub 1 -C addback 1 -C --- -C 16 -C -C This chain is what the loop has already, but 16 cycles isn't achieved. -C K7 has enough decode, and probably enough execute (depending maybe on what -C a mul actually consumes), but nothing running under 17 has been found. -C -C In theory n2+n1 could be done in the sub and addback stages (by -C calculating both n2 and n2+n1 there), but lack of registers makes this an -C unlikely proposition. 
-C -C The jz in the loop keeps the q1+1 stage to 1 cycle. Handling an overflow -C from q1+1 with an "sbbl $0, %ebx" would add a cycle to the dependent -C chain, and nothing better than 18 cycles has been found when using it. -C The jump is taken only when q1 is 0xFFFFFFFF, and on random data this will -C be an extremely rare event. -C -C Branch mispredictions will hit random occurrences of q1==0xFFFFFFFF, but -C if some special data is coming out with this always, the q1_ff special -C case actually runs at 15 c/l. 0x2FFF...FFFD divided by 3 is a good way to -C induce the q1_ff case, for speed measurements or testing. Note that -C 0xFFF...FFF divided by 1 or 2 doesn't induce it. -C -C The instruction groupings and empty comments show the cycles for a naive -C in-order view of the code (conveniently ignoring the load latency on -C VAR_INVERSE). This shows some of where the time is going, but is nonsense -C to the extent that out-of-order execution rearranges it. In this case -C there's 19 cycles shown, but it executes at 17. 
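Stripped of the scheduling, each iteration of this loop performs the following division step. This C rendering (helper names hypothetical, 32-bit limbs assumed) mirrors the nadj / q1+1 / addback sequence, with the inverse computed the same way the setup code fills VAR_INVERSE:

```c
#include <assert.h>
#include <stdint.h>

/* Inverse as set up before the loop: m = floor((b*(b-d)-1)/d), b=2^32,
   for a normalized divisor d (high bit set).  (b-d)-1 mod b is ~d. */
uint32_t preinv(uint32_t d)
{
    return (uint32_t)((((uint64_t)(uint32_t)~d << 32) | 0xFFFFFFFFu) / d);
}

/* One step: divide n2*2^32 + n10 by d (requires n2 < d), forming q1,
   trying q = q1+1 and adding back d on underflow, as the cmovc does. */
uint32_t div_step(uint32_t n2, uint32_t n10, uint32_t d, uint32_t m,
                  uint32_t *rem)
{
    uint32_t n1 = n10 >> 31;                 /* top bit of n10 */
    uint32_t nadj = n10 + (n1 ? d : 0);      /* n10 + (-n1 & d), mod 2^32 */
    uint64_t t = (uint64_t)m * (n2 + n1) + nadj;
    uint64_t q1 = n2 + (t >> 32);            /* fits a limb, per the paper */
    uint64_t n = ((uint64_t)n2 << 32) | n10;
    uint64_t qd = (q1 + 1) * (uint64_t)d;    /* (q1+1)*d */
    if (n < qd) {                            /* overshot: q = q1, add back d */
        *rem = (uint32_t)(n - q1 * (uint64_t)d);
        return (uint32_t)q1;
    }
    *rem = (uint32_t)(n - qd);
    return (uint32_t)q1 + 1;
}
```

The q1_ff path below corresponds to q1+1 wrapping to zero in 32 bits; computing in 64 bits, as here, avoids treating it specially.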
 - - ALIGN(16) -L(integer_top): - C eax scratch - C ebx scratch (nadj, q1) - C ecx scratch (src, dst) - C edx scratch - C esi n10 - C edi n2 - C ebp divisor - C - C mm0 scratch (src qword) - C mm7 rshift for normalization - - cmpl $0x80000000, %esi C n1 as 0=c, 1=nc - movl %edi, %eax C n2 - movl VAR_SRC, %ecx - - leal (%ebp,%esi), %ebx - cmovc( %esi, %ebx) C nadj = n10 + (-n1 & d), ignoring overflow - sbbl $-1, %eax C n2+n1 - - mull VAR_INVERSE C m*(n2+n1) - - movq (%ecx), %mm0 C next limb and the one below it - subl $4, %ecx - - movl %ecx, VAR_SRC - - C - - addl %ebx, %eax C m*(n2+n1) + nadj, low giving carry flag - leal 1(%edi), %ebx C n2<<32 + m*(n2+n1)) - movl %ebp, %eax C d - - C - - adcl %edx, %ebx C 1 + high(n2<<32 + m*(n2+n1) + nadj) = q1+1 - jz L(q1_ff) - movl VAR_DST, %ecx - - mull %ebx C (q1+1)*d - - psrlq %mm7, %mm0 - - leal -4(%ecx), %ecx - - C - - subl %eax, %esi - movl VAR_DST_STOP, %eax - - C - - sbbl %edx, %edi C n - (q1+1)*d - movl %esi, %edi C remainder -> n2 - leal (%ebp,%esi), %edx - - movd %mm0, %esi - - cmovc( %edx, %edi) C n - q1*d if underflow from using q1+1 - sbbl $0, %ebx C q - cmpl %eax, %ecx - - movl %ebx, (%ecx) - movl %ecx, VAR_DST - jne L(integer_top) - - -L(integer_loop_done): - - -C ----------------------------------------------------------------------------- -C -C Here, and in integer_one_left below, an sbbl $0 is used rather than a jz -q1_ff special case. This makes the code a bit smaller and simpler, and -costs only 1 cycle (each). 
- -L(integer_two_left): - C eax scratch - C ebx scratch (nadj, q1) - C ecx scratch (src, dst) - C edx scratch - C esi n10 - C edi n2 - C ebp divisor - C - C mm0 src limb, shifted - C mm7 rshift - - cmpl $0x80000000, %esi C n1 as 0=c, 1=nc - movl %edi, %eax C n2 - movl PARAM_SRC, %ecx - - leal (%ebp,%esi), %ebx - cmovc( %esi, %ebx) C nadj = n10 + (-n1 & d), ignoring overflow - sbbl $-1, %eax C n2+n1 - - mull VAR_INVERSE C m*(n2+n1) - - movd (%ecx), %mm0 C src low limb - - movl VAR_DST_STOP, %ecx - - C - - addl %ebx, %eax C m*(n2+n1) + nadj, low giving carry flag - leal 1(%edi), %ebx C n2<<32 + m*(n2+n1)) - movl %ebp, %eax C d - - adcl %edx, %ebx C 1 + high(n2<<32 + m*(n2+n1) + nadj) = q1+1 - - sbbl $0, %ebx - - mull %ebx C (q1+1)*d - - psllq $32, %mm0 - - psrlq %mm7, %mm0 - - C - - subl %eax, %esi - - C - - sbbl %edx, %edi C n - (q1+1)*d - movl %esi, %edi C remainder -> n2 - leal (%ebp,%esi), %edx - - movd %mm0, %esi - - cmovc( %edx, %edi) C n - q1*d if underflow from using q1+1 - sbbl $0, %ebx C q - - movl %ebx, -4(%ecx) - - -C ----------------------------------------------------------------------------- -L(integer_one_left): - C eax scratch - C ebx scratch (nadj, q1) - C ecx dst - C edx scratch - C esi n10 - C edi n2 - C ebp divisor - C - C mm0 src limb, shifted - C mm7 rshift - - movl VAR_DST_STOP, %ecx - cmpl $0x80000000, %esi C n1 as 0=c, 1=nc - movl %edi, %eax C n2 - - leal (%ebp,%esi), %ebx - cmovc( %esi, %ebx) C nadj = n10 + (-n1 & d), ignoring overflow - sbbl $-1, %eax C n2+n1 - - mull VAR_INVERSE C m*(n2+n1) - - C - - C - - C - - addl %ebx, %eax C m*(n2+n1) + nadj, low giving carry flag - leal 1(%edi), %ebx C n2<<32 + m*(n2+n1)) - movl %ebp, %eax C d - - C - - adcl %edx, %ebx C 1 + high(n2<<32 + m*(n2+n1) + nadj) = q1+1 - - sbbl $0, %ebx C q1 if q1+1 overflowed - - mull %ebx - - C - - C - - C - - subl %eax, %esi - - C - - sbbl %edx, %edi C n - (q1+1)*d - movl %esi, %edi C remainder -> n2 - leal (%ebp,%esi), %edx - - cmovc( %edx, %edi) C n - q1*d if 
underflow from using q1+1 - sbbl $0, %ebx C q - - movl %ebx, -8(%ecx) - subl $8, %ecx - - - -L(integer_none): - cmpl $0, PARAM_XSIZE - jne L(fraction_some) - - movl %edi, %eax -L(fraction_done): - movl VAR_NORM, %ecx - movl SAVE_EBP, %ebp - - movl SAVE_EDI, %edi - movl SAVE_ESI, %esi - - movl SAVE_EBX, %ebx - addl $STACK_SPACE, %esp - - shrl %cl, %eax - emms - - ret - - -C ----------------------------------------------------------------------------- -C -C Special case for q1=0xFFFFFFFF, giving q=0xFFFFFFFF meaning the low dword -C of q*d is simply -d and the remainder n-q*d = n10+d - -L(q1_ff): - C eax (divisor) - C ebx (q1+1 == 0) - C ecx - C edx - C esi n10 - C edi n2 - C ebp divisor - - movl VAR_DST, %ecx - movl VAR_DST_STOP, %edx - subl $4, %ecx - - psrlq %mm7, %mm0 - leal (%ebp,%esi), %edi C n-q*d remainder -> next n2 - movl %ecx, VAR_DST - - movd %mm0, %esi C next n10 - - movl $-1, (%ecx) - cmpl %ecx, %edx - jne L(integer_top) - - jmp L(integer_loop_done) - - - -C ----------------------------------------------------------------------------- -C -C Being the fractional part, the "source" limbs are all zero, meaning -C n10=0, n1=0, and hence nadj=0, leading to many instructions eliminated. -C -C The loop runs at 15 cycles. The dependent chain is the same as the -C general case above, but without the n2+n1 stage (due to n1==0), so 15 -C would seem to be the lower bound. -C -C A not entirely obvious simplification is that q1+1 never overflows a limb, -C and so there's no need for the sbbl $0 or jz q1_ff from the general case. -C q1 is the high word of m*n2+b*n2 and the following shows q1<=b-2 always. -C rnd() means rounding down to a multiple of d. 
-C -C m*n2 + b*n2 <= m*(d-1) + b*(d-1) -C = m*d + b*d - m - b -C = floor((b(b-d)-1)/d)*d + b*d - m - b -C = rnd(b(b-d)-1) + b*d - m - b -C = rnd(b(b-d)-1 + b*d) - m - b -C = rnd(b*b-1) - m - b -C <= (b-2)*b -C -C Unchanged from the general case is that the final quotient limb q can be -C either q1 or q1+1, and the q1+1 case occurs often. This can be seen from -C equation 8.4 of the paper which simplifies as follows when n1==0 and -C n0==0. -C -C n-q1*d = (n2*k+q0*d)/b <= d + (d*d-2d)/b -C -C As before, the instruction groupings and empty comments show a naive -C in-order view of the code, which is made a nonsense by out of order -C execution. There's 17 cycles shown, but it executes at 15. -C -C Rotating the store q and remainder->n2 instructions up to the top of the -C loop gets the run time down from 16 to 15. - - ALIGN(16) -L(fraction_some): - C eax - C ebx - C ecx - C edx - C esi - C edi carry - C ebp divisor - - movl PARAM_DST, %esi - movl VAR_DST_STOP, %ecx - movl %edi, %eax - - subl $8, %ecx - - jmp L(fraction_entry) - - - ALIGN(16) -L(fraction_top): - C eax n2 carry, then scratch - C ebx scratch (nadj, q1) - C ecx dst, decrementing - C edx scratch - C esi dst stop point - C edi (will be n2) - C ebp divisor - - movl %ebx, (%ecx) C previous q - movl %eax, %edi C remainder->n2 - -L(fraction_entry): - mull VAR_INVERSE C m*n2 - - movl %ebp, %eax C d - subl $4, %ecx C dst - leal 1(%edi), %ebx - - C - - C - - C - - C - - addl %edx, %ebx C 1 + high(n2<<32 + m*n2) = q1+1 - - mull %ebx C (q1+1)*d - - C - - C - - C - - negl %eax C low of n - (q1+1)*d - - C - - sbbl %edx, %edi C high of n - (q1+1)*d, caring only about carry - leal (%ebp,%eax), %edx - - cmovc( %edx, %eax) C n - q1*d if underflow from using q1+1 - sbbl $0, %ebx C q - cmpl %esi, %ecx - - jne L(fraction_top) - - - movl %ebx, (%ecx) - jmp L(fraction_done) - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k7/mmx/lshift.asm b/rts/gmp/mpn/x86/k7/mmx/lshift.asm deleted file mode 100644 index 4d17c881ec..0000000000 --- 
a/rts/gmp/mpn/x86/k7/mmx/lshift.asm +++ /dev/null @@ -1,472 +0,0 @@ -dnl AMD K7 mpn_lshift -- mpn left shift. -dnl -dnl K7: 1.21 cycles/limb (at 16 limbs/loop). - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -dnl K7: UNROLL_COUNT cycles/limb -dnl 4 1.51 -dnl 8 1.26 -dnl 16 1.21 -dnl 32 1.2 -dnl Maximum possible with the current code is 64. - -deflit(UNROLL_COUNT, 16) - - -C mp_limb_t mpn_lshift (mp_ptr dst, mp_srcptr src, mp_size_t size, -C unsigned shift); -C -C Shift src,size left by shift many bits and store the result in dst,size. -C Zeros are shifted in at the right. The bits shifted out at the left are -C the return value. -C -C The comments in mpn_rshift apply here too. 
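Functionally mpn_lshift is simple; the complication below is all alignment handling and MMX scheduling. A plain C reference for the semantics, assuming 1 <= shift <= 31 and at least one limb (`lshift_ref` is an illustrative name):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* mpn_lshift semantics: dst = src << shift over size limbs, zeros
   shifted in at the low end, the bits shifted out the top returned.
   Works high-to-low, so overlapping dst >= src would also be safe. */
uint32_t lshift_ref(uint32_t *dst, const uint32_t *src, size_t size,
                    unsigned shift)
{
    uint32_t ret = src[size - 1] >> (32 - shift);   /* bits shifted out */
    for (size_t i = size - 1; i > 0; i--)
        dst[i] = (src[i] << shift) | (src[i - 1] >> (32 - shift));
    dst[0] = src[0] << shift;
    return ret;
}
```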
- -ifdef(`PIC',` -deflit(UNROLL_THRESHOLD, 10) -',` -deflit(UNROLL_THRESHOLD, 10) -') - -defframe(PARAM_SHIFT,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - -defframe(SAVE_EDI, -4) -defframe(SAVE_ESI, -8) -defframe(SAVE_EBX, -12) -deflit(SAVE_SIZE, 12) - - .text - ALIGN(32) - -PROLOGUE(mpn_lshift) -deflit(`FRAME',0) - - movl PARAM_SIZE, %eax - movl PARAM_SRC, %edx - subl $SAVE_SIZE, %esp -deflit(`FRAME',SAVE_SIZE) - - movl PARAM_SHIFT, %ecx - movl %edi, SAVE_EDI - - movl PARAM_DST, %edi - decl %eax - jnz L(more_than_one_limb) - - movl (%edx), %edx - - shldl( %cl, %edx, %eax) C eax was decremented to zero - - shll %cl, %edx - - movl %edx, (%edi) - movl SAVE_EDI, %edi - addl $SAVE_SIZE, %esp - - ret - - -C ----------------------------------------------------------------------------- -L(more_than_one_limb): - C eax size-1 - C ebx - C ecx shift - C edx src - C esi - C edi dst - C ebp - - movd PARAM_SHIFT, %mm6 - movd (%edx,%eax,4), %mm5 C src high limb - cmp $UNROLL_THRESHOLD-1, %eax - - jae L(unroll) - negl %ecx - movd (%edx), %mm4 C src low limb - - addl $32, %ecx - - movd %ecx, %mm7 - -L(simple_top): - C eax loop counter, limbs - C ebx - C ecx - C edx src - C esi - C edi dst - C ebp - C - C mm0 scratch - C mm4 src low limb - C mm5 src high limb - C mm6 shift - C mm7 32-shift - - movq -4(%edx,%eax,4), %mm0 - decl %eax - - psrlq %mm7, %mm0 - - movd %mm0, 4(%edi,%eax,4) - jnz L(simple_top) - - - psllq %mm6, %mm5 - psllq %mm6, %mm4 - - psrlq $32, %mm5 - movd %mm4, (%edi) C dst low limb - - movd %mm5, %eax C return value - - movl SAVE_EDI, %edi - addl $SAVE_SIZE, %esp - emms - - ret - - -C ----------------------------------------------------------------------------- - ALIGN(16) -L(unroll): - C eax size-1 - C ebx (saved) - C ecx shift - C edx src - C esi - C edi dst - C ebp - C - C mm5 src high limb, for return value - C mm6 lshift - - movl %esi, SAVE_ESI - movl %ebx, SAVE_EBX - leal -4(%edx,%eax,4), %edx C &src[size-2] - - testb $4, %dl - 
movq (%edx), %mm1 C src high qword - - jz L(start_src_aligned) - - - C src isn't aligned, process high limb (marked xxx) separately to - C make it so - C - C source -4(edx,%eax,4) - C | - C +-------+-------+-------+-- - C | xxx | - C +-------+-------+-------+-- - C 0mod8 4mod8 0mod8 - C - C dest -4(edi,%eax,4) - C | - C +-------+-------+-- - C | xxx | | - C +-------+-------+-- - - psllq %mm6, %mm1 - subl $4, %edx - movl %eax, PARAM_SIZE C size-1 - - psrlq $32, %mm1 - decl %eax C size-2 is new size-1 - - movd %mm1, 4(%edi,%eax,4) - movq (%edx), %mm1 C new src high qword -L(start_src_aligned): - - - leal -4(%edi,%eax,4), %edi C &dst[size-2] - psllq %mm6, %mm5 - - testl $4, %edi - psrlq $32, %mm5 C return value - - jz L(start_dst_aligned) - - - C dst isn't aligned, subtract 4 bytes to make it so, and pretend the - C shift is 32 bits extra. High limb of dst (marked xxx) handled - C here separately. - C - C source %edx - C +-------+-------+-- - C | mm1 | - C +-------+-------+-- - C 0mod8 4mod8 - C - C dest %edi - C +-------+-------+-------+-- - C | xxx | - C +-------+-------+-------+-- - C 0mod8 4mod8 0mod8 - - movq %mm1, %mm0 - psllq %mm6, %mm1 - addl $32, %ecx C shift+32 - - psrlq $32, %mm1 - - movd %mm1, 4(%edi) - movq %mm0, %mm1 - subl $4, %edi - - movd %ecx, %mm6 C new lshift -L(start_dst_aligned): - - decl %eax C size-2, two last limbs handled at end - movq %mm1, %mm2 C copy of src high qword - negl %ecx - - andl $-2, %eax C round size down to even - addl $64, %ecx - - movl %eax, %ebx - negl %eax - - andl $UNROLL_MASK, %eax - decl %ebx - - shll %eax - - movd %ecx, %mm7 C rshift = 64-lshift - -ifdef(`PIC',` - call L(pic_calc) -L(here): -',` - leal L(entry) (%eax,%eax,4), %esi -') - shrl $UNROLL_LOG2, %ebx C loop counter - - leal ifelse(UNROLL_BYTES,256,128) -8(%edx,%eax,2), %edx - leal ifelse(UNROLL_BYTES,256,128) (%edi,%eax,2), %edi - movl PARAM_SIZE, %eax C for use at end - jmp *%esi - - -ifdef(`PIC',` -L(pic_calc): - C See README.family about old gas bugs - leal 
(%eax,%eax,4), %esi - addl $L(entry)-L(here), %esi - addl (%esp), %esi - - ret -') - - -C ----------------------------------------------------------------------------- - ALIGN(32) -L(top): - C eax size (for use at end) - C ebx loop counter - C ecx rshift - C edx src - C esi computed jump - C edi dst - C ebp - C - C mm0 scratch - C mm1 \ carry (alternating, mm2 first) - C mm2 / - C mm6 lshift - C mm7 rshift - C - C 10 code bytes/limb - C - C The two chunks differ in whether mm1 or mm2 hold the carry. - C The computed jump puts the initial carry in both mm1 and mm2. - -L(entry): -deflit(CHUNK_COUNT, 4) -forloop(i, 0, UNROLL_COUNT/CHUNK_COUNT-1, ` - deflit(`disp0', eval(-i*CHUNK_COUNT*4 ifelse(UNROLL_BYTES,256,-128))) - deflit(`disp1', eval(disp0 - 8)) - - movq disp0(%edx), %mm0 - psllq %mm6, %mm2 - - movq %mm0, %mm1 - psrlq %mm7, %mm0 - - por %mm2, %mm0 - movq %mm0, disp0(%edi) - - - movq disp1(%edx), %mm0 - psllq %mm6, %mm1 - - movq %mm0, %mm2 - psrlq %mm7, %mm0 - - por %mm1, %mm0 - movq %mm0, disp1(%edi) -') - - subl $UNROLL_BYTES, %edx - subl $UNROLL_BYTES, %edi - decl %ebx - - jns L(top) - - - -define(`disp', `m4_empty_if_zero(eval($1 ifelse(UNROLL_BYTES,256,-128)))') - -L(end): - testb $1, %al - movl SAVE_EBX, %ebx - psllq %mm6, %mm2 C wanted left shifted in all cases below - - movd %mm5, %eax - - movl SAVE_ESI, %esi - jz L(end_even) - - -L(end_odd): - - C Size odd, destination was aligned. - C - C source edx+8 edx+4 - C --+---------------+-------+ - C | mm2 | | - C --+---------------+-------+ - C - C dest edi - C --+---------------+---------------+-------+ - C | written | | | - C --+---------------+---------------+-------+ - C - C mm6 = shift - C mm7 = ecx = 64-shift - - - C Size odd, destination was unaligned. 
- C - C source edx+8 edx+4 - C --+---------------+-------+ - C | mm2 | | - C --+---------------+-------+ - C - C dest edi - C --+---------------+---------------+ - C | written | | - C --+---------------+---------------+ - C - C mm6 = shift+32 - C mm7 = ecx = 64-(shift+32) - - - C In both cases there's one extra limb of src to fetch and combine - C with mm2 to make a qword at (%edi), and in the aligned case - C there's an extra limb of dst to be formed from that extra src limb - C left shifted. - - movd disp(4) (%edx), %mm0 - testb $32, %cl - - movq %mm0, %mm1 - psllq $32, %mm0 - - psrlq %mm7, %mm0 - psllq %mm6, %mm1 - - por %mm2, %mm0 - - movq %mm0, disp(0) (%edi) - jz L(end_odd_unaligned) - movd %mm1, disp(-4) (%edi) -L(end_odd_unaligned): - - movl SAVE_EDI, %edi - addl $SAVE_SIZE, %esp - emms - - ret - - -L(end_even): - - C Size even, destination was aligned. - C - C source edx+8 - C --+---------------+ - C | mm2 | - C --+---------------+ - C - C dest edi - C --+---------------+---------------+ - C | written | | - C --+---------------+---------------+ - C - C mm6 = shift - C mm7 = ecx = 64-shift - - - C Size even, destination was unaligned. - C - C source edx+8 - C --+---------------+ - C | mm2 | - C --+---------------+ - C - C dest edi+4 - C --+---------------+-------+ - C | written | | - C --+---------------+-------+ - C - C mm6 = shift+32 - C mm7 = ecx = 64-(shift+32) - - - C The movq for the aligned case overwrites the movd for the - C unaligned case. - - movq %mm2, %mm0 - psrlq $32, %mm2 - - testb $32, %cl - movd %mm2, disp(4) (%edi) - - jz L(end_even_unaligned) - movq %mm0, disp(0) (%edi) -L(end_even_unaligned): - - movl SAVE_EDI, %edi - addl $SAVE_SIZE, %esp - emms - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k7/mmx/mod_1.asm b/rts/gmp/mpn/x86/k7/mmx/mod_1.asm deleted file mode 100644 index 545ca56ddf..0000000000 --- a/rts/gmp/mpn/x86/k7/mmx/mod_1.asm +++ /dev/null @@ -1,457 +0,0 @@ -dnl AMD K7 mpn_mod_1 -- mpn by limb remainder. 
-dnl -dnl K7: 17.0 cycles/limb. - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_mod_1 (mp_srcptr src, mp_size_t size, mp_limb_t divisor); -C mp_limb_t mpn_mod_1c (mp_srcptr src, mp_size_t size, mp_limb_t divisor, -C mp_limb_t carry); -C -C The code here is the same as mpn_divrem_1, but with the quotient -C discarded. See mpn/x86/k7/mmx/divrem_1.c for some comments. - - -dnl MUL_THRESHOLD is the size at which the multiply by inverse method is -dnl used, rather than plain "divl"s. Minimum value 2. -dnl -dnl The inverse takes about 50 cycles to calculate, but after that the -dnl multiply is 17 c/l versus division at 41 c/l. -dnl -dnl Using mul or div is about the same speed at 3 limbs, so the threshold -dnl is set to 4 to get the smaller div code used at 3. 
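The inverse described above can be sketched in scalar C. This is a hypothetical helper (the name `invert_limb` is ours, not from this source): for a normalized divisor d (high bit set) it forms edx:eax = b*(b-d)-1 with b = 2^32 and divides by d, as the `movl $-1, %edx` / `subl %ebp, %edx` / `divl %ebp` sequence in the code does.

```c
#include <stdint.h>

/* Hypothetical sketch (our name, not from the source) of the inverse
   setup: for a normalized divisor d (high bit set), compute
   floor((b*(b-d) - 1) / d) with b = 2^32.  Note ~d == (b-1)-d ==
   (b-d)-1, so the 64-bit numerator below is exactly b*(b-d) - 1. */
static uint32_t invert_limb(uint32_t d)   /* requires d >= 0x80000000 */
{
    uint64_t n = ((uint64_t)~d << 32) | 0xFFFFFFFFu;
    return (uint32_t)(n / d);
}
```

b + invert_limb(d) is then floor((b^2 - 1) / d), the usual pre-inverse for multiply-by-inverse division.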
- -deflit(MUL_THRESHOLD, 4) - - -defframe(PARAM_CARRY, 16) -defframe(PARAM_DIVISOR,12) -defframe(PARAM_SIZE, 8) -defframe(PARAM_SRC, 4) - -defframe(SAVE_EBX, -4) -defframe(SAVE_ESI, -8) -defframe(SAVE_EDI, -12) -defframe(SAVE_EBP, -16) - -defframe(VAR_NORM, -20) -defframe(VAR_INVERSE, -24) -defframe(VAR_SRC_STOP,-28) - -deflit(STACK_SPACE, 28) - - .text - ALIGN(32) - -PROLOGUE(mpn_mod_1c) -deflit(`FRAME',0) - movl PARAM_CARRY, %edx - movl PARAM_SIZE, %ecx - subl $STACK_SPACE, %esp -deflit(`FRAME',STACK_SPACE) - - movl %ebp, SAVE_EBP - movl PARAM_DIVISOR, %ebp - - movl %esi, SAVE_ESI - movl PARAM_SRC, %esi - jmp LF(mpn_mod_1,start_1c) - -EPILOGUE() - - - ALIGN(32) -PROLOGUE(mpn_mod_1) -deflit(`FRAME',0) - - movl PARAM_SIZE, %ecx - movl $0, %edx C initial carry (if can't skip a div) - subl $STACK_SPACE, %esp -deflit(`FRAME',STACK_SPACE) - - movl %esi, SAVE_ESI - movl PARAM_SRC, %esi - - movl %ebp, SAVE_EBP - movl PARAM_DIVISOR, %ebp - - orl %ecx, %ecx - jz L(divide_done) - - movl -4(%esi,%ecx,4), %eax C src high limb - - cmpl %ebp, %eax C carry flag if high<divisor - - cmovc( %eax, %edx) C src high limb as initial carry - sbbl $0, %ecx C size-1 to skip one div - jz L(divide_done) - - - ALIGN(16) -L(start_1c): - C eax - C ebx - C ecx size - C edx carry - C esi src - C edi - C ebp divisor - - cmpl $MUL_THRESHOLD, %ecx - jae L(mul_by_inverse) - - - -C With a MUL_THRESHOLD of 4, this "loop" only ever does 1 to 3 iterations, -C but it's already fast and compact, and there's nothing to gain by -C expanding it out. -C -C Using PARAM_DIVISOR in the divl is a couple of cycles faster than %ebp. 
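For sizes below MUL_THRESHOLD the code falls through to the plain divl loop just described. As a reference point, a hedged C equivalent (helper name ours): each iteration divides the 64-bit value carry:limb by the divisor and keeps only the remainder, working from the high limb down.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C equivalent (our name) of the small-size divl loop:
   limbs are little-endian (src[0] least significant); each step
   divides edx:eax = carry:limb by the divisor, keeping the remainder. */
static uint32_t mod_1_simple(const uint32_t *src, size_t size,
                             uint32_t divisor)
{
    uint32_t carry = 0;
    while (size-- > 0) {
        uint64_t n = ((uint64_t)carry << 32) | src[size];
        carry = (uint32_t)(n % divisor);   /* remainder feeds next step */
    }
    return carry;
}
```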
- - orl %ecx, %ecx - jz L(divide_done) - - -L(divide_top): - C eax scratch (quotient) - C ebx - C ecx counter, limbs, decrementing - C edx scratch (remainder) - C esi src - C edi - C ebp - - movl -4(%esi,%ecx,4), %eax - - divl PARAM_DIVISOR - - decl %ecx - jnz L(divide_top) - - -L(divide_done): - movl SAVE_ESI, %esi - movl SAVE_EBP, %ebp - addl $STACK_SPACE, %esp - - movl %edx, %eax - - ret - - - -C ----------------------------------------------------------------------------- - -L(mul_by_inverse): - C eax - C ebx - C ecx size - C edx carry - C esi src - C edi - C ebp divisor - - bsrl %ebp, %eax C 31-l - - movl %ebx, SAVE_EBX - leal -4(%esi), %ebx - - movl %ebx, VAR_SRC_STOP - movl %edi, SAVE_EDI - - movl %ecx, %ebx C size - movl $31, %ecx - - movl %edx, %edi C carry - movl $-1, %edx - - C - - xorl %eax, %ecx C l - incl %eax C 32-l - - shll %cl, %ebp C d normalized - movl %ecx, VAR_NORM - - movd %eax, %mm7 - - movl $-1, %eax - subl %ebp, %edx C (b-d)-1 so edx:eax = b*(b-d)-1 - - divl %ebp C floor (b*(b-d)-1) / d - - C - - movl %eax, VAR_INVERSE - leal -12(%esi,%ebx,4), %eax C &src[size-3] - - movl 8(%eax), %esi C src high limb - movl 4(%eax), %edx C src second highest limb - - shldl( %cl, %esi, %edi) C n2 = carry,high << l - - shldl( %cl, %edx, %esi) C n10 = high,second << l - - movl %eax, %ecx C &src[size-3] - - -ifelse(MUL_THRESHOLD,2,` - cmpl $2, %ebx - je L(inverse_two_left) -') - - -C The dependent chain here is the same as in mpn_divrem_1, but a few -C instructions are saved by not needing to store the quotient limbs. -C Unfortunately this doesn't get the code down to the theoretical 16 c/l. -C -C There's four dummy instructions in the loop, all of which are necessary -C for the claimed 17 c/l. It's a 1 to 3 cycle slowdown if any are removed, -C or changed from load to store or vice versa. They're not completely -C random, since they correspond to what mpn_divrem_1 has, but there's no -C obvious reason why they're necessary. 
Presumably they induce something -C good in the out of order execution, perhaps through some load/store -C ordering and/or decoding effects. -C -C The q1==0xFFFFFFFF case is handled here the same as in mpn_divrem_1. On -C on special data that comes out as q1==0xFFFFFFFF always, the loop runs at -C about 13.5 c/l. - - ALIGN(32) -L(inverse_top): - C eax scratch - C ebx scratch (nadj, q1) - C ecx src pointer, decrementing - C edx scratch - C esi n10 - C edi n2 - C ebp divisor - C - C mm0 scratch (src qword) - C mm7 rshift for normalization - - cmpl $0x80000000, %esi C n1 as 0=c, 1=nc - movl %edi, %eax C n2 - movl PARAM_SIZE, %ebx C dummy - - leal (%ebp,%esi), %ebx - cmovc( %esi, %ebx) C nadj = n10 + (-n1 & d), ignoring overflow - sbbl $-1, %eax C n2+n1 - - mull VAR_INVERSE C m*(n2+n1) - - movq (%ecx), %mm0 C next src limb and the one below it - subl $4, %ecx - - movl %ecx, PARAM_SIZE C dummy - - C - - addl %ebx, %eax C m*(n2+n1) + nadj, low giving carry flag - leal 1(%edi), %ebx C n2<<32 + m*(n2+n1)) - movl %ebp, %eax C d - - C - - adcl %edx, %ebx C 1 + high(n2<<32 + m*(n2+n1) + nadj) = q1+1 - jz L(q1_ff) - nop C dummy - - mull %ebx C (q1+1)*d - - psrlq %mm7, %mm0 - leal 0(%ecx), %ecx C dummy - - C - - C - - subl %eax, %esi - movl VAR_SRC_STOP, %eax - - C - - sbbl %edx, %edi C n - (q1+1)*d - movl %esi, %edi C remainder -> n2 - leal (%ebp,%esi), %edx - - movd %mm0, %esi - - cmovc( %edx, %edi) C n - q1*d if underflow from using q1+1 - cmpl %eax, %ecx - jne L(inverse_top) - - -L(inverse_loop_done): - - -C ----------------------------------------------------------------------------- - -L(inverse_two_left): - C eax scratch - C ebx scratch (nadj, q1) - C ecx &src[-1] - C edx scratch - C esi n10 - C edi n2 - C ebp divisor - C - C mm0 scratch (src dword) - C mm7 rshift - - cmpl $0x80000000, %esi C n1 as 0=c, 1=nc - movl %edi, %eax C n2 - - leal (%ebp,%esi), %ebx - cmovc( %esi, %ebx) C nadj = n10 + (-n1 & d), ignoring overflow - sbbl $-1, %eax C n2+n1 - - mull VAR_INVERSE C 
m*(n2+n1) - - movd 4(%ecx), %mm0 C src low limb - - C - - C - - addl %ebx, %eax C m*(n2+n1) + nadj, low giving carry flag - leal 1(%edi), %ebx C n2<<32 + m*(n2+n1)) - movl %ebp, %eax C d - - adcl %edx, %ebx C 1 + high(n2<<32 + m*(n2+n1) + nadj) = q1+1 - - sbbl $0, %ebx - - mull %ebx C (q1+1)*d - - psllq $32, %mm0 - - psrlq %mm7, %mm0 - - C - - subl %eax, %esi - - C - - sbbl %edx, %edi C n - (q1+1)*d - movl %esi, %edi C remainder -> n2 - leal (%ebp,%esi), %edx - - movd %mm0, %esi - - cmovc( %edx, %edi) C n - q1*d if underflow from using q1+1 - - -C One limb left - - C eax scratch - C ebx scratch (nadj, q1) - C ecx - C edx scratch - C esi n10 - C edi n2 - C ebp divisor - C - C mm0 src limb, shifted - C mm7 rshift - - cmpl $0x80000000, %esi C n1 as 0=c, 1=nc - movl %edi, %eax C n2 - - leal (%ebp,%esi), %ebx - cmovc( %esi, %ebx) C nadj = n10 + (-n1 & d), ignoring overflow - sbbl $-1, %eax C n2+n1 - - mull VAR_INVERSE C m*(n2+n1) - - movl VAR_NORM, %ecx C for final denorm - - C - - C - - addl %ebx, %eax C m*(n2+n1) + nadj, low giving carry flag - leal 1(%edi), %ebx C n2<<32 + m*(n2+n1)) - movl %ebp, %eax C d - - C - - adcl %edx, %ebx C 1 + high(n2<<32 + m*(n2+n1) + nadj) = q1+1 - - sbbl $0, %ebx - - mull %ebx C (q1+1)*d - - movl SAVE_EBX, %ebx - - C - - C - - subl %eax, %esi - - movl %esi, %eax C remainder - movl SAVE_ESI, %esi - - sbbl %edx, %edi C n - (q1+1)*d - leal (%ebp,%eax), %edx - movl SAVE_EBP, %ebp - - cmovc( %edx, %eax) C n - q1*d if underflow from using q1+1 - movl SAVE_EDI, %edi - - shrl %cl, %eax C denorm remainder - addl $STACK_SPACE, %esp - emms - - ret - - -C ----------------------------------------------------------------------------- -C -C Special case for q1=0xFFFFFFFF, giving q=0xFFFFFFFF meaning the low dword -C of q*d is simply -d and the remainder n-q*d = n10+d - -L(q1_ff): - C eax (divisor) - C ebx (q1+1 == 0) - C ecx src pointer - C edx - C esi n10 - C edi (n2) - C ebp divisor - - movl VAR_SRC_STOP, %edx - leal (%ebp,%esi), %edi C n-q*d 
remainder -> next n2 - psrlq %mm7, %mm0 - - movd %mm0, %esi C next n10 - - cmpl %ecx, %edx - jne L(inverse_top) - jmp L(inverse_loop_done) - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k7/mmx/popham.asm b/rts/gmp/mpn/x86/k7/mmx/popham.asm deleted file mode 100644 index fa7c8c04a5..0000000000 --- a/rts/gmp/mpn/x86/k7/mmx/popham.asm +++ /dev/null @@ -1,239 +0,0 @@ -dnl AMD K7 mpn_popcount, mpn_hamdist -- population count and hamming -dnl distance. -dnl -dnl K7: popcount 5.0 cycles/limb, hamdist 6.0 cycles/limb - - -dnl Copyright (C) 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -dnl Only recent versions of gas know psadbw, in particular gas 2.9.1 on -dnl FreeBSD 3.3 and 3.4 doesn't recognise it. 
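The macro that follows emulates psadbw for assemblers that lack it. With the second operand zero, as in this file, the sum of absolute byte differences reduces to the sum of the eight bytes of a qword; here that sum is a popcount partial total of at most 64, so it fits in one byte, which is why the shift-and-paddb emulation is "only partly functional" yet sufficient. A hypothetical C model (name ours):

```c
#include <stdint.h>

/* Hypothetical C model (our name) of psadbw against a zero register:
   |b - 0| = b for each byte, so the result is just the sum of the
   eight bytes of the qword. */
static unsigned byte_sum(uint64_t x)
{
    unsigned s = 0;
    for (int i = 0; i < 8; i++, x >>= 8)
        s += (unsigned)(x & 0xFF);
    return s;
}
```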
- -define(psadbw_mm4_mm0, -`ifelse(m4_ifdef_anyof_p(`HAVE_TARGET_CPU_athlon', - `HAVE_TARGET_CPU_pentium3'),1, - `.byte 0x0f,0xf6,0xc4 C psadbw %mm4, %mm0', - -`m4_warning(`warning, using simulated and only partly functional psadbw, use for testing only -') C this works enough for the sum of bytes done below, making it - C possible to test on an older cpu - leal -8(%esp), %esp - movq %mm4, (%esp) - movq %mm0, %mm4 -forloop(i,1,7, -` psrlq $ 8, %mm4 - paddb %mm4, %mm0 -') - pushl $ 0 - pushl $ 0xFF - pand (%esp), %mm0 - movq 8(%esp), %mm4 - leal 16(%esp), %esp -')') - - -C unsigned long mpn_popcount (mp_srcptr src, mp_size_t size); -C unsigned long mpn_hamdist (mp_srcptr src, mp_srcptr src2, mp_size_t size); -C -C The code here is almost certainly not optimal, but is already a 3x speedup -C over the generic C code. The main improvement would be to interleave -C processing of two qwords in the loop so as to fully exploit the available -C execution units, possibly leading to 3.25 c/l (13 cycles for 4 limbs). -C -C The loop is based on the example "Efficient 64-bit population count using -C MMX instructions" in the Athlon Optimization Guide, AMD document 22007, -C page 158 of rev E (reference in mpn/x86/k7/README). 
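The mask cascade described above can be written in scalar C. This is a hedged sketch (helper name ours) of the per-qword steps applied to one 32-bit limb: subtract the masked-and-shifted AAAA... pattern to get 2-bit pair counts, fold to nibbles with 3333..., fold to bytes with 0F0F..., then sum the bytes (the psadbw step).

```c
#include <stdint.h>

/* Hypothetical scalar version (our name) of the loop's mask cascade
   on one 32-bit limb.  The constants are the 32-bit halves of the
   64-bit masks declared in the .asm. */
static unsigned popcount_limb(uint32_t x)
{
    x -= (x & 0xAAAAAAAAu) >> 1;                      /* 2-bit pair counts */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* nibble counts */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* byte counts */
    return (x * 0x01010101u) >> 24;                   /* sum of the bytes */
}
```

mpn_hamdist is the same computation after first XORing the two source operands, just as the pxor in the loop does.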
- -ifdef(`OPERATION_popcount',, -`ifdef(`OPERATION_hamdist',, -`m4_error(`Need OPERATION_popcount or OPERATION_hamdist defined -')')') - -define(HAM, -m4_assert_numargs(1) -`ifdef(`OPERATION_hamdist',`$1')') - -define(POP, -m4_assert_numargs(1) -`ifdef(`OPERATION_popcount',`$1')') - -HAM(` -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC2, 8) -defframe(PARAM_SRC, 4) -define(M4_function,mpn_hamdist) -') -POP(` -defframe(PARAM_SIZE, 8) -defframe(PARAM_SRC, 4) -define(M4_function,mpn_popcount) -') - -MULFUNC_PROLOGUE(mpn_popcount mpn_hamdist) - - -ifdef(`PIC',,` - dnl non-PIC - - DATA - ALIGN(8) - -define(LS, -m4_assert_numargs(1) -`LF(M4_function,`$1')') - -LS(rodata_AAAAAAAAAAAAAAAA): - .long 0xAAAAAAAA - .long 0xAAAAAAAA - -LS(rodata_3333333333333333): - .long 0x33333333 - .long 0x33333333 - -LS(rodata_0F0F0F0F0F0F0F0F): - .long 0x0F0F0F0F - .long 0x0F0F0F0F -') - - .text - ALIGN(32) - -PROLOGUE(M4_function) -deflit(`FRAME',0) - - movl PARAM_SIZE, %ecx - orl %ecx, %ecx - jz L(zero) - -ifdef(`PIC',` - movl $0xAAAAAAAA, %eax - movl $0x33333333, %edx - - movd %eax, %mm7 - movd %edx, %mm6 - - movl $0x0F0F0F0F, %eax - - punpckldq %mm7, %mm7 - punpckldq %mm6, %mm6 - - movd %eax, %mm5 - movd %edx, %mm4 - - punpckldq %mm5, %mm5 - -',` - movq LS(rodata_AAAAAAAAAAAAAAAA), %mm7 - movq LS(rodata_3333333333333333), %mm6 - movq LS(rodata_0F0F0F0F0F0F0F0F), %mm5 -') - pxor %mm4, %mm4 - -define(REG_AAAAAAAAAAAAAAAA,%mm7) -define(REG_3333333333333333,%mm6) -define(REG_0F0F0F0F0F0F0F0F,%mm5) -define(REG_0000000000000000,%mm4) - - - movl PARAM_SRC, %eax -HAM(` movl PARAM_SRC2, %edx') - - pxor %mm2, %mm2 C total - - shrl %ecx - jnc L(top) - - movd (%eax,%ecx,8), %mm1 - -HAM(` movd 0(%edx,%ecx,8), %mm0 - pxor %mm0, %mm1 -') - orl %ecx, %ecx - jmp L(loaded) - - - ALIGN(16) -L(top): - C eax src - C ebx - C ecx counter, qwords, decrementing - C edx [hamdist] src2 - C - C mm0 (scratch) - C mm1 (scratch) - C mm2 total (low dword) - C mm3 - C mm4 \ - C mm5 | special constants - C mm6 | - C mm7 
/ - - movq -8(%eax,%ecx,8), %mm1 - -HAM(` pxor -8(%edx,%ecx,8), %mm1') - decl %ecx - -L(loaded): - movq %mm1, %mm0 - pand REG_AAAAAAAAAAAAAAAA, %mm1 - - psrlq $1, %mm1 - - psubd %mm1, %mm0 C bit pairs - - - movq %mm0, %mm1 - psrlq $2, %mm0 - - pand REG_3333333333333333, %mm0 - pand REG_3333333333333333, %mm1 - - paddd %mm1, %mm0 C nibbles - - - movq %mm0, %mm1 - psrlq $4, %mm0 - - pand REG_0F0F0F0F0F0F0F0F, %mm0 - pand REG_0F0F0F0F0F0F0F0F, %mm1 - - paddd %mm1, %mm0 C bytes - - - psadbw_mm4_mm0 - - paddd %mm0, %mm2 C add to total - jnz L(top) - - - movd %mm2, %eax - emms - ret - - -L(zero): - movl $0, %eax - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k7/mmx/rshift.asm b/rts/gmp/mpn/x86/k7/mmx/rshift.asm deleted file mode 100644 index abb546cd5b..0000000000 --- a/rts/gmp/mpn/x86/k7/mmx/rshift.asm +++ /dev/null @@ -1,471 +0,0 @@ -dnl AMD K7 mpn_rshift -- mpn right shift. -dnl -dnl K7: 1.21 cycles/limb (at 16 limbs/loop). - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. 
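Before the MMX code proper, a plain-C reference for the operation mpn_rshift performs (this helper and its name are ours, for illustration): shift the size-limb operand right by 1 to 31 bits, zero-filling at the top, and return the bits shifted out at the bottom, left-justified in a limb.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plain-C reference (our name): dst = src >> shift over
   "size" little-endian limbs, zeros shifted in at the top; the bits
   shifted out at the bottom are returned left-justified in a limb.
   Requires size >= 1 and 1 <= shift <= 31. */
static uint32_t rshift_ref(uint32_t *dst, const uint32_t *src,
                           size_t size, unsigned shift)
{
    uint32_t ret = src[0] << (32 - shift);   /* bits shifted out */
    for (size_t i = 0; i + 1 < size; i++)
        dst[i] = (src[i] >> shift) | (src[i + 1] << (32 - shift));
    dst[size - 1] = src[size - 1] >> shift;
    return ret;
}
```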
- - -include(`../config.m4') - - -dnl K7: UNROLL_COUNT cycles/limb -dnl 4 1.51 -dnl 8 1.26 -dnl 16 1.21 -dnl 32 1.2 -dnl Maximum possible with the current code is 64. - -deflit(UNROLL_COUNT, 16) - - -C mp_limb_t mpn_rshift (mp_ptr dst, mp_srcptr src, mp_size_t size, -C unsigned shift); -C -C Shift src,size right by shift many bits and store the result in dst,size. -C Zeros are shifted in at the left. The bits shifted out at the right are -C the return value. -C -C This code uses 64-bit MMX operations, which makes it possible to handle -C two limbs at a time, for a theoretical 1.0 cycles/limb. Plain integer -C code, on the other hand, suffers from shrd being a vector path decode and -C running at 3 cycles back-to-back. -C -C Full speed depends on source and destination being aligned, and some hairy -C setups and finish-ups are done to arrange this for the loop. - -ifdef(`PIC',` -deflit(UNROLL_THRESHOLD, 10) -',` -deflit(UNROLL_THRESHOLD, 10) -') - -defframe(PARAM_SHIFT,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - -defframe(SAVE_EDI, -4) -defframe(SAVE_ESI, -8) -defframe(SAVE_EBX, -12) -deflit(SAVE_SIZE, 12) - - .text - ALIGN(32) - -PROLOGUE(mpn_rshift) -deflit(`FRAME',0) - - movl PARAM_SIZE, %eax - movl PARAM_SRC, %edx - subl $SAVE_SIZE, %esp -deflit(`FRAME',SAVE_SIZE) - - movl PARAM_SHIFT, %ecx - movl %edi, SAVE_EDI - - movl PARAM_DST, %edi - decl %eax - jnz L(more_than_one_limb) - - movl (%edx), %edx C src limb - - shrdl( %cl, %edx, %eax) C eax was decremented to zero - - shrl %cl, %edx - - movl %edx, (%edi) C dst limb - movl SAVE_EDI, %edi - addl $SAVE_SIZE, %esp - - ret - - -C ----------------------------------------------------------------------------- -L(more_than_one_limb): - C eax size-1 - C ebx - C ecx shift - C edx src - C esi - C edi dst - C ebp - - movd PARAM_SHIFT, %mm6 C rshift - movd (%edx), %mm5 C src low limb - cmp $UNROLL_THRESHOLD-1, %eax - - jae L(unroll) - leal (%edx,%eax,4), %edx C &src[size-1] - leal 
-4(%edi,%eax,4), %edi C &dst[size-2] - - movd (%edx), %mm4 C src high limb - negl %eax - - -L(simple_top): - C eax loop counter, limbs, negative - C ebx - C ecx shift - C edx carry - C edx &src[size-1] - C edi &dst[size-2] - C ebp - C - C mm0 scratch - C mm4 src high limb - C mm5 src low limb - C mm6 shift - - movq (%edx,%eax,4), %mm0 - incl %eax - - psrlq %mm6, %mm0 - - movd %mm0, (%edi,%eax,4) - jnz L(simple_top) - - - psllq $32, %mm5 - psrlq %mm6, %mm4 - - psrlq %mm6, %mm5 - movd %mm4, 4(%edi) C dst high limb - - movd %mm5, %eax C return value - - movl SAVE_EDI, %edi - addl $SAVE_SIZE, %esp - emms - - ret - - -C ----------------------------------------------------------------------------- - ALIGN(16) -L(unroll): - C eax size-1 - C ebx - C ecx shift - C edx src - C esi - C edi dst - C ebp - C - C mm5 src low limb - C mm6 rshift - - testb $4, %dl - movl %esi, SAVE_ESI - movl %ebx, SAVE_EBX - - psllq $32, %mm5 - jz L(start_src_aligned) - - - C src isn't aligned, process low limb separately (marked xxx) and - C step src and dst by one limb, making src aligned. - C - C source edx - C --+-------+-------+-------+ - C | xxx | - C --+-------+-------+-------+ - C 4mod8 0mod8 4mod8 - C - C dest edi - C --+-------+-------+ - C | | xxx | - C --+-------+-------+ - - movq (%edx), %mm0 C src low two limbs - addl $4, %edx - movl %eax, PARAM_SIZE C size-1 - - addl $4, %edi - decl %eax C size-2 is new size-1 - - psrlq %mm6, %mm0 - movl %edi, PARAM_DST C new dst - - movd %mm0, -4(%edi) -L(start_src_aligned): - - - movq (%edx), %mm1 C src low two limbs - decl %eax C size-2, two last limbs handled at end - testl $4, %edi - - psrlq %mm6, %mm5 - jz L(start_dst_aligned) - - - C dst isn't aligned, add 4 to make it so, and pretend the shift is - C 32 bits extra. Low limb of dst (marked xxx) handled here separately. 
- C - C source edx - C --+-------+-------+ - C | mm1 | - C --+-------+-------+ - C 4mod8 0mod8 - C - C dest edi - C --+-------+-------+-------+ - C | xxx | - C --+-------+-------+-------+ - C 4mod8 0mod8 4mod8 - - movq %mm1, %mm0 - psrlq %mm6, %mm1 - addl $32, %ecx C shift+32 - - movd %mm1, (%edi) - movq %mm0, %mm1 - addl $4, %edi C new dst - - movd %ecx, %mm6 -L(start_dst_aligned): - - - movq %mm1, %mm2 C copy of src low two limbs - negl %ecx - andl $-2, %eax C round size down to even - - movl %eax, %ebx - negl %eax - addl $64, %ecx - - andl $UNROLL_MASK, %eax - decl %ebx - - shll %eax - - movd %ecx, %mm7 C lshift = 64-rshift - -ifdef(`PIC',` - call L(pic_calc) -L(here): -',` - leal L(entry) (%eax,%eax,4), %esi - negl %eax -') - shrl $UNROLL_LOG2, %ebx C loop counter - - leal ifelse(UNROLL_BYTES,256,128+) 8(%edx,%eax,2), %edx - leal ifelse(UNROLL_BYTES,256,128) (%edi,%eax,2), %edi - movl PARAM_SIZE, %eax C for use at end - - jmp *%esi - - -ifdef(`PIC',` -L(pic_calc): - C See README.family about old gas bugs - leal (%eax,%eax,4), %esi - addl $L(entry)-L(here), %esi - addl (%esp), %esi - negl %eax - - ret -') - - -C ----------------------------------------------------------------------------- - ALIGN(64) -L(top): - C eax size, for use at end - C ebx loop counter - C ecx lshift - C edx src - C esi was computed jump - C edi dst - C ebp - C - C mm0 scratch - C mm1 \ carry (alternating) - C mm2 / - C mm6 rshift - C mm7 lshift - C - C 10 code bytes/limb - C - C The two chunks differ in whether mm1 or mm2 hold the carry. - C The computed jump puts the initial carry in both mm1 and mm2. 
- -L(entry): -deflit(CHUNK_COUNT, 4) -forloop(i, 0, UNROLL_COUNT/CHUNK_COUNT-1, ` - deflit(`disp0', eval(i*CHUNK_COUNT*4 ifelse(UNROLL_BYTES,256,-128))) - deflit(`disp1', eval(disp0 + 8)) - - movq disp0(%edx), %mm0 - psrlq %mm6, %mm2 - - movq %mm0, %mm1 - psllq %mm7, %mm0 - - por %mm2, %mm0 - movq %mm0, disp0(%edi) - - - movq disp1(%edx), %mm0 - psrlq %mm6, %mm1 - - movq %mm0, %mm2 - psllq %mm7, %mm0 - - por %mm1, %mm0 - movq %mm0, disp1(%edi) -') - - addl $UNROLL_BYTES, %edx - addl $UNROLL_BYTES, %edi - decl %ebx - - jns L(top) - - -deflit(`disp0', ifelse(UNROLL_BYTES,256,-128)) -deflit(`disp1', eval(disp0-0 + 8)) - - testb $1, %al - psrlq %mm6, %mm2 C wanted rshifted in all cases below - movl SAVE_ESI, %esi - - movd %mm5, %eax C return value - - movl SAVE_EBX, %ebx - jz L(end_even) - - - C Size odd, destination was aligned. - C - C source - C edx - C +-------+---------------+-- - C | | mm2 | - C +-------+---------------+-- - C - C dest edi - C +-------+---------------+---------------+-- - C | | | written | - C +-------+---------------+---------------+-- - C - C mm6 = shift - C mm7 = ecx = 64-shift - - - C Size odd, destination was unaligned. - C - C source - C edx - C +-------+---------------+-- - C | | mm2 | - C +-------+---------------+-- - C - C dest edi - C +---------------+---------------+-- - C | | written | - C +---------------+---------------+-- - C - C mm6 = shift+32 - C mm7 = ecx = 64-(shift+32) - - - C In both cases there's one extra limb of src to fetch and combine - C with mm2 to make a qword to store, and in the aligned case there's - C a further extra limb of dst to be formed. - - - movd disp0(%edx), %mm0 - movq %mm0, %mm1 - - psllq %mm7, %mm0 - testb $32, %cl - - por %mm2, %mm0 - psrlq %mm6, %mm1 - - movq %mm0, disp0(%edi) - jz L(finish_odd_unaligned) - - movd %mm1, disp1(%edi) -L(finish_odd_unaligned): - - movl SAVE_EDI, %edi - addl $SAVE_SIZE, %esp - emms - - ret - - -L(end_even): - - C Size even, destination was aligned. 
- C - C source - C +---------------+-- - C | mm2 | - C +---------------+-- - C - C dest edi - C +---------------+---------------+-- - C | | mm3 | - C +---------------+---------------+-- - C - C mm6 = shift - C mm7 = ecx = 64-shift - - - C Size even, destination was unaligned. - C - C source - C +---------------+-- - C | mm2 | - C +---------------+-- - C - C dest edi - C +-------+---------------+-- - C | | mm3 | - C +-------+---------------+-- - C - C mm6 = shift+32 - C mm7 = 64-(shift+32) - - - C The movd for the unaligned case is the same data as the movq for - C the aligned case, it's just a choice between whether one or two - C limbs should be written. - - - testb $32, %cl - movd %mm2, disp0(%edi) - - jz L(end_even_unaligned) - - movq %mm2, disp0(%edi) -L(end_even_unaligned): - - movl SAVE_EDI, %edi - addl $SAVE_SIZE, %esp - emms - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k7/mul_1.asm b/rts/gmp/mpn/x86/k7/mul_1.asm deleted file mode 100644 index 07f7085b10..0000000000 --- a/rts/gmp/mpn/x86/k7/mul_1.asm +++ /dev/null @@ -1,265 +0,0 @@ -dnl AMD K7 mpn_mul_1 -- mpn by limb multiply. -dnl -dnl K7: 3.4 cycles/limb (at 16 limbs/loop). - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. 
If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -dnl K7: UNROLL_COUNT cycles/limb -dnl 8 3.9 -dnl 16 3.4 -dnl 32 3.4 -dnl 64 3.35 -dnl Maximum possible with the current code is 64. - -deflit(UNROLL_COUNT, 16) - - -C mp_limb_t mpn_mul_1 (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t multiplier); -C mp_limb_t mpn_mul_1c (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t multiplier, mp_limb_t carry); -C -C Multiply src,size by mult and store the result in dst,size. -C Return the carry limb from the top of the result. -C -C mpn_mul_1c() accepts an initial carry for the calculation, it's added into -C the low limb of the destination. -C -C Variations on the unrolled loop have been tried, with the current -C registers or with the counter on the stack to free up ecx. The current -C code is the fastest found. -C -C An interesting effect is that removing the stores "movl %ebx, disp0(%edi)" -C from the unrolled loop actually slows it down to 5.0 cycles/limb. Code -C with this change can be tested on sizes of the form UNROLL_COUNT*n+1 -C without having to change the computed jump. There's obviously something -C fishy going on, perhaps with what execution units the mul needs. - -defframe(PARAM_CARRY, 20) -defframe(PARAM_MULTIPLIER,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - -defframe(SAVE_EBP, -4) -defframe(SAVE_EDI, -8) -defframe(SAVE_ESI, -12) -defframe(SAVE_EBX, -16) -deflit(STACK_SPACE, 16) - -dnl Must have UNROLL_THRESHOLD >= 2, since the unrolled loop can't handle 1. 
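A hedged C model of what mpn_mul_1 computes (helper name ours), matching the simple non-unrolled loop: one widening multiply per limb, with the previous product's high word added in as carry; mpn_mul_1c is the same with the carry seeded from its extra argument.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model (our name) of the operation: dst = src * mult
   over "size" little-endian limbs, returning the carry limb off the
   top.  A nonzero initial carry models mpn_mul_1c. */
static uint32_t mul_1_ref(uint32_t *dst, const uint32_t *src,
                          size_t size, uint32_t mult, uint32_t carry)
{
    for (size_t i = 0; i < size; i++) {
        uint64_t p = (uint64_t)src[i] * mult + carry;  /* mull; addl; adcl */
        dst[i] = (uint32_t)p;
        carry = (uint32_t)(p >> 32);
    }
    return carry;
}
```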
-ifdef(`PIC',` -deflit(UNROLL_THRESHOLD, 7) -',` -deflit(UNROLL_THRESHOLD, 5) -') - - .text - ALIGN(32) -PROLOGUE(mpn_mul_1c) -deflit(`FRAME',0) - movl PARAM_CARRY, %edx - jmp LF(mpn_mul_1,start_nc) -EPILOGUE() - - -PROLOGUE(mpn_mul_1) -deflit(`FRAME',0) - xorl %edx, %edx C initial carry -L(start_nc): - movl PARAM_SIZE, %ecx - subl $STACK_SPACE, %esp -deflit(`FRAME', STACK_SPACE) - - movl %edi, SAVE_EDI - movl %ebx, SAVE_EBX - movl %edx, %ebx - - movl %esi, SAVE_ESI - movl PARAM_SRC, %esi - cmpl $UNROLL_THRESHOLD, %ecx - - movl PARAM_DST, %edi - movl %ebp, SAVE_EBP - jae L(unroll) - - leal (%esi,%ecx,4), %esi - leal (%edi,%ecx,4), %edi - negl %ecx - - movl PARAM_MULTIPLIER, %ebp - -L(simple): - C eax scratch - C ebx carry - C ecx counter (negative) - C edx scratch - C esi src - C edi dst - C ebp multiplier - - movl (%esi,%ecx,4), %eax - - mull %ebp - - addl %ebx, %eax - movl %eax, (%edi,%ecx,4) - movl $0, %ebx - - adcl %edx, %ebx - incl %ecx - jnz L(simple) - - movl %ebx, %eax - movl SAVE_EBX, %ebx - movl SAVE_ESI, %esi - - movl SAVE_EDI, %edi - movl SAVE_EBP, %ebp - addl $STACK_SPACE, %esp - - ret - - -C ----------------------------------------------------------------------------- -C The mov to load the next source limb is done well ahead of the mul, this -C is necessary for full speed. It leads to one limb handled separately -C after the loop. -C -C When unrolling to 32 or more, an offset of +4 is used on the src pointer, -C to avoid having an 0x80 displacement in the code for the last limb in the -C unrolled loop. This is for a fair comparison between 16 and 32 unrolling. 
- -ifelse(eval(UNROLL_COUNT >= 32),1,` -deflit(SRC_OFFSET,4) -',` -deflit(SRC_OFFSET,) -') - - C this is offset 0x62, so close enough to aligned -L(unroll): - C eax - C ebx initial carry - C ecx size - C edx - C esi src - C edi dst - C ebp -deflit(`FRAME', STACK_SPACE) - - leal -1(%ecx), %edx C one limb handled at end - leal -2(%ecx), %ecx C and ecx is one less than edx - movl %ebp, SAVE_EBP - - negl %edx - shrl $UNROLL_LOG2, %ecx C unrolled loop counter - movl (%esi), %eax C src low limb - - andl $UNROLL_MASK, %edx - movl PARAM_DST, %edi - - movl %edx, %ebp - shll $4, %edx - - C 17 code bytes per limb -ifdef(`PIC',` - call L(add_eip_to_edx) -L(here): -',` - leal L(entry) (%edx,%ebp), %edx -') - negl %ebp - - leal ifelse(UNROLL_BYTES,256,128+) SRC_OFFSET(%esi,%ebp,4), %esi - leal ifelse(UNROLL_BYTES,256,128) (%edi,%ebp,4), %edi - movl PARAM_MULTIPLIER, %ebp - - jmp *%edx - - -ifdef(`PIC',` -L(add_eip_to_edx): - C See README.family about old gas bugs - leal (%edx,%ebp), %edx - addl $L(entry)-L(here), %edx - addl (%esp), %edx - ret -') - - -C ---------------------------------------------------------------------------- - ALIGN(32) -L(top): - C eax next src limb - C ebx carry - C ecx counter - C edx scratch - C esi src+4 - C edi dst - C ebp multiplier - C - C 17 code bytes per limb processed - -L(entry): -forloop(i, 0, UNROLL_COUNT-1, ` - deflit(`disp_dst', eval(i*4 ifelse(UNROLL_BYTES,256,-128))) - deflit(`disp_src', eval(disp_dst + 4-(SRC_OFFSET-0))) - - mull %ebp - - addl %eax, %ebx -Zdisp( movl, disp_src,(%esi), %eax) -Zdisp( movl, %ebx, disp_dst,(%edi)) - - movl $0, %ebx - adcl %edx, %ebx -') - - decl %ecx - - leal UNROLL_BYTES(%esi), %esi - leal UNROLL_BYTES(%edi), %edi - jns L(top) - - -deflit(`disp0', ifelse(UNROLL_BYTES,256,-128)) - - mull %ebp - - addl %eax, %ebx - movl $0, %eax - movl SAVE_ESI, %esi - - movl %ebx, disp0(%edi) - movl SAVE_EBX, %ebx - movl SAVE_EDI, %edi - - adcl %edx, %eax - movl SAVE_EBP, %ebp - addl $STACK_SPACE, %esp - - ret - -EPILOGUE() 
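For reference, what the mpn_mul_1 routine above computes can be written as a short C sketch. This is our own illustrative version, not the GMP source: limbs are assumed 32-bit to match the x86 code, and the name `ref_mul_1` is ours. The 32x32-to-64-bit product mirrors what `mull` leaves in edx:eax.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Reference sketch of mpn_mul_1: multiply src[0..size-1] by mult, store
   the low limbs in dst, and return the carry limb out of the top.
   Assumes 32-bit limbs; all names here are illustrative, not GMP's. */
uint32_t ref_mul_1(uint32_t *dst, const uint32_t *src, size_t size,
                   uint32_t mult)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < size; i++) {
        /* 32x32->64 product, as mull produces in edx:eax */
        uint64_t p = (uint64_t)src[i] * mult + carry;
        dst[i] = (uint32_t)p;        /* low limb stored */
        carry = (uint32_t)(p >> 32); /* high limb becomes next carry */
    }
    return carry;
}
```

The assembly's simple loop and unrolled loop both implement exactly this recurrence; the unrolling only changes how the carry is threaded through registers.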
diff --git a/rts/gmp/mpn/x86/k7/mul_basecase.asm b/rts/gmp/mpn/x86/k7/mul_basecase.asm deleted file mode 100644 index c4be62e633..0000000000 --- a/rts/gmp/mpn/x86/k7/mul_basecase.asm +++ /dev/null @@ -1,593 +0,0 @@ -dnl AMD K7 mpn_mul_basecase -- multiply two mpn numbers. -dnl -dnl K7: approx 4.42 cycles per cross product at around 20x20 limbs (16 -dnl limbs/loop unrolling). - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -dnl K7 UNROLL_COUNT cycles/product (at around 20x20) -dnl 8 4.67 -dnl 16 4.59 -dnl 32 4.42 -dnl Maximum possible with the current code is 32. -dnl -dnl At 32 the typical 13-26 limb sizes from the karatsuba code will get -dnl done with a straight run through a block of code, no inner loop. Using -dnl 32 gives 1k of code, but the k7 has a 64k L1 code cache. - -deflit(UNROLL_COUNT, 32) - - -C void mpn_mul_basecase (mp_ptr wp, -C mp_srcptr xp, mp_size_t xsize, -C mp_srcptr yp, mp_size_t ysize); -C -C Calculate xp,xsize multiplied by yp,ysize, storing the result in -C wp,xsize+ysize. 
-C -C This routine is essentially the same as mpn/generic/mul_basecase.c, but -C it's faster because it does most of the mpn_addmul_1() startup -C calculations only once. The saving is 15-25% on typical sizes coming from -C the Karatsuba multiply code. - -ifdef(`PIC',` -deflit(UNROLL_THRESHOLD, 5) -',` -deflit(UNROLL_THRESHOLD, 5) -') - -defframe(PARAM_YSIZE,20) -defframe(PARAM_YP, 16) -defframe(PARAM_XSIZE,12) -defframe(PARAM_XP, 8) -defframe(PARAM_WP, 4) - - .text - ALIGN(32) -PROLOGUE(mpn_mul_basecase) -deflit(`FRAME',0) - - movl PARAM_XSIZE, %ecx - movl PARAM_YP, %eax - - movl PARAM_XP, %edx - movl (%eax), %eax C yp low limb - - cmpl $2, %ecx - ja L(xsize_more_than_two) - je L(two_by_something) - - - C one limb by one limb - - mull (%edx) - - movl PARAM_WP, %ecx - movl %eax, (%ecx) - movl %edx, 4(%ecx) - ret - - -C ----------------------------------------------------------------------------- -L(two_by_something): -deflit(`FRAME',0) - decl PARAM_YSIZE - pushl %ebx defframe_pushl(`SAVE_EBX') - movl %eax, %ecx C yp low limb - - movl PARAM_WP, %ebx - pushl %esi defframe_pushl(`SAVE_ESI') - movl %edx, %esi C xp - - movl (%edx), %eax C xp low limb - jnz L(two_by_two) - - - C two limbs by one limb - - mull %ecx - - movl %eax, (%ebx) - movl 4(%esi), %eax - movl %edx, %esi C carry - - mull %ecx - - addl %eax, %esi - - movl %esi, 4(%ebx) - movl SAVE_ESI, %esi - - adcl $0, %edx - - movl %edx, 8(%ebx) - movl SAVE_EBX, %ebx - addl $FRAME, %esp - - ret - - - -C ----------------------------------------------------------------------------- -C Could load yp earlier into another register. 
- - ALIGN(16) -L(two_by_two): - C eax xp low limb - C ebx wp - C ecx yp low limb - C edx - C esi xp - C edi - C ebp - -dnl FRAME carries on from previous - - mull %ecx C xp[0] * yp[0] - - push %edi defframe_pushl(`SAVE_EDI') - movl %edx, %edi C carry, for wp[1] - - movl %eax, (%ebx) - movl 4(%esi), %eax - - mull %ecx C xp[1] * yp[0] - - addl %eax, %edi - movl PARAM_YP, %ecx - - adcl $0, %edx - movl 4(%ecx), %ecx C yp[1] - movl %edi, 4(%ebx) - - movl 4(%esi), %eax C xp[1] - movl %edx, %edi C carry, for wp[2] - - mull %ecx C xp[1] * yp[1] - - addl %eax, %edi - - adcl $0, %edx - movl (%esi), %eax C xp[0] - - movl %edx, %esi C carry, for wp[3] - - mull %ecx C xp[0] * yp[1] - - addl %eax, 4(%ebx) - adcl %edx, %edi - movl %edi, 8(%ebx) - - adcl $0, %esi - movl SAVE_EDI, %edi - movl %esi, 12(%ebx) - - movl SAVE_ESI, %esi - movl SAVE_EBX, %ebx - addl $FRAME, %esp - - ret - - -C ----------------------------------------------------------------------------- - ALIGN(16) -L(xsize_more_than_two): - -C The first limb of yp is processed with a simple mpn_mul_1 style loop -C inline. Unrolling this doesn't seem worthwhile since it's only run once -C (whereas the addmul below is run ysize-1 many times). A call to the -C actual mpn_mul_1 will be slowed down by the call and parameter pushing and -C popping, and doesn't seem likely to be worthwhile on the typical 13-26 -C limb operations the Karatsuba code calls here with. 
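The overall mul_basecase strategy described above (an inline mul_1-style pass for yp[0], then addmul-style passes for the remaining yp limbs) can be sketched in C. This is a plain reference version under our own names, assuming 32-bit limbs; it is not the GMP generic code itself.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* One row of the schoolbook product: multiply xp[0..n-1] by m.  When add
   is 0 the row is stored (the mul_1 first row); when add is 1 it is added
   into the partial product (the addmul rows).  Names are illustrative. */
static uint32_t row_addmul(uint32_t *wp, const uint32_t *xp, size_t n,
                           uint32_t m, int add)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t p = (uint64_t)xp[i] * m + carry;
        if (add)
            p += wp[i];
        wp[i] = (uint32_t)p;
        carry = (uint32_t)(p >> 32);
    }
    return carry;
}

/* wp[0..xn+yn-1] = xp[0..xn-1] * yp[0..yn-1], basecase schoolbook style */
void ref_mul_basecase(uint32_t *wp, const uint32_t *xp, size_t xn,
                      const uint32_t *yp, size_t yn)
{
    wp[xn] = row_addmul(wp, xp, xn, yp[0], 0);             /* mul_1 row */
    for (size_t j = 1; j < yn; j++)
        wp[xn + j] = row_addmul(wp + j, xp, xn, yp[j], 1); /* addmul rows */
}
```

The speedup the comment claims comes from doing the addmul startup work (pointer setup, computed jump) once per row rather than via a full mpn_addmul_1 call.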
- - C eax yp[0] - C ebx - C ecx xsize - C edx xp - C esi - C edi - C ebp - -dnl FRAME doesn't carry on from previous, no pushes yet here -defframe(`SAVE_EBX',-4) -defframe(`SAVE_ESI',-8) -defframe(`SAVE_EDI',-12) -defframe(`SAVE_EBP',-16) -deflit(`FRAME',0) - - subl $16, %esp -deflit(`FRAME',16) - - movl %edi, SAVE_EDI - movl PARAM_WP, %edi - - movl %ebx, SAVE_EBX - movl %ebp, SAVE_EBP - movl %eax, %ebp - - movl %esi, SAVE_ESI - xorl %ebx, %ebx - leal (%edx,%ecx,4), %esi C xp end - - leal (%edi,%ecx,4), %edi C wp end of mul1 - negl %ecx - - -L(mul1): - C eax scratch - C ebx carry - C ecx counter, negative - C edx scratch - C esi xp end - C edi wp end of mul1 - C ebp multiplier - - movl (%esi,%ecx,4), %eax - - mull %ebp - - addl %ebx, %eax - movl %eax, (%edi,%ecx,4) - movl $0, %ebx - - adcl %edx, %ebx - incl %ecx - jnz L(mul1) - - - movl PARAM_YSIZE, %edx - movl PARAM_XSIZE, %ecx - - movl %ebx, (%edi) C final carry - decl %edx - - jnz L(ysize_more_than_one) - - - movl SAVE_EDI, %edi - movl SAVE_EBX, %ebx - - movl SAVE_EBP, %ebp - movl SAVE_ESI, %esi - addl $FRAME, %esp - - ret - - -L(ysize_more_than_one): - cmpl $UNROLL_THRESHOLD, %ecx - movl PARAM_YP, %eax - - jae L(unroll) - - -C ----------------------------------------------------------------------------- - C simple addmul looping - C - C eax yp - C ebx - C ecx xsize - C edx ysize-1 - C esi xp end - C edi wp end of mul1 - C ebp - - leal 4(%eax,%edx,4), %ebp C yp end - negl %ecx - negl %edx - - movl (%esi,%ecx,4), %eax C xp low limb - movl %edx, PARAM_YSIZE C -(ysize-1) - incl %ecx - - xorl %ebx, %ebx C initial carry - movl %ecx, PARAM_XSIZE C -(xsize-1) - movl %ebp, PARAM_YP - - movl (%ebp,%edx,4), %ebp C yp second lowest limb - multiplier - jmp L(simple_outer_entry) - - - C this is offset 0x121 so close enough to aligned -L(simple_outer_top): - C ebp ysize counter, negative - - movl PARAM_YP, %edx - movl PARAM_XSIZE, %ecx C -(xsize-1) - xorl %ebx, %ebx C carry - - movl %ebp, PARAM_YSIZE - addl $4, %edi C next 
position in wp - - movl (%edx,%ebp,4), %ebp C yp limb - multiplier - movl -4(%esi,%ecx,4), %eax C xp low limb - - -L(simple_outer_entry): - -L(simple_inner): - C eax xp limb - C ebx carry limb - C ecx loop counter (negative) - C edx scratch - C esi xp end - C edi wp end - C ebp multiplier - - mull %ebp - - addl %eax, %ebx - adcl $0, %edx - - addl %ebx, (%edi,%ecx,4) - movl (%esi,%ecx,4), %eax - adcl $0, %edx - - incl %ecx - movl %edx, %ebx - jnz L(simple_inner) - - - mull %ebp - - movl PARAM_YSIZE, %ebp - addl %eax, %ebx - - adcl $0, %edx - addl %ebx, (%edi) - - adcl $0, %edx - incl %ebp - - movl %edx, 4(%edi) - jnz L(simple_outer_top) - - - movl SAVE_EBX, %ebx - movl SAVE_ESI, %esi - - movl SAVE_EDI, %edi - movl SAVE_EBP, %ebp - addl $FRAME, %esp - - ret - - - -C ----------------------------------------------------------------------------- -C -C The unrolled loop is the same as in mpn_addmul_1(), see that code for some -C comments. -C -C VAR_ADJUST is the negative of how many limbs the leals in the inner loop -C increment xp and wp. This is used to adjust back xp and wp, and rshifted -C to given an initial VAR_COUNTER at the top of the outer loop. -C -C VAR_COUNTER is for the unrolled loop, running from VAR_ADJUST/UNROLL_COUNT -C up to -1, inclusive. -C -C VAR_JMP is the computed jump into the unrolled loop. -C -C VAR_XP_LOW is the least significant limb of xp, which is needed at the -C start of the unrolled loop. -C -C PARAM_YSIZE is the outer loop counter, going from -(ysize-1) up to -1, -C inclusive. -C -C PARAM_YP is offset appropriately so that the PARAM_YSIZE counter can be -C added to give the location of the next limb of yp, which is the multiplier -C in the unrolled loop. -C -C The trick with VAR_ADJUST means it's only necessary to do one fetch in the -C outer loop to take care of xp, wp and the inner loop counter. 
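Since the comment above says the unrolled loop is the same as in mpn_addmul_1(), here is a C sketch of what one such inner pass computes over the whole row. Again this is our own reference version with 32-bit limbs and illustrative names, not the GMP source.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Reference sketch of mpn_addmul_1: add src*mult into dst and return the
   carry out of the top limb.  The unrolled asm splits the carry across
   two registers (ebx/ecx) purely for scheduling; the arithmetic is this. */
uint32_t ref_addmul_1(uint32_t *dst, const uint32_t *src, size_t size,
                      uint32_t mult)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < size; i++) {
        uint64_t p = (uint64_t)src[i] * mult + carry + dst[i];
        dst[i] = (uint32_t)p;
        carry = (uint32_t)(p >> 32);
    }
    return carry;
}
```

Note the sum cannot overflow 64 bits: (2^32-1)^2 + 2*(2^32-1) = 2^64-1, which is why a single 64-bit accumulator (or one carry limb in the asm) suffices.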
- -defframe(VAR_COUNTER, -20) -defframe(VAR_ADJUST, -24) -defframe(VAR_JMP, -28) -defframe(VAR_XP_LOW, -32) -deflit(VAR_EXTRA_SPACE, 16) - - -L(unroll): - C eax yp - C ebx - C ecx xsize - C edx ysize-1 - C esi xp end - C edi wp end of mul1 - C ebp - - movl PARAM_XP, %esi - movl 4(%eax), %ebp C multiplier (yp second limb) - leal 4(%eax,%edx,4), %eax C yp adjust for ysize indexing - - movl PARAM_WP, %edi - movl %eax, PARAM_YP - negl %edx - - movl %edx, PARAM_YSIZE - leal UNROLL_COUNT-2(%ecx), %ebx C (xsize-1)+UNROLL_COUNT-1 - decl %ecx C xsize-1 - - movl (%esi), %eax C xp low limb - andl $-UNROLL_MASK-1, %ebx - negl %ecx - - subl $VAR_EXTRA_SPACE, %esp -deflit(`FRAME',16+VAR_EXTRA_SPACE) - negl %ebx - andl $UNROLL_MASK, %ecx - - movl %ebx, VAR_ADJUST - movl %ecx, %edx - shll $4, %ecx - - sarl $UNROLL_LOG2, %ebx - - C 17 code bytes per limb -ifdef(`PIC',` - call L(pic_calc) -L(unroll_here): -',` - leal L(unroll_entry) (%ecx,%edx,1), %ecx -') - negl %edx - - movl %eax, VAR_XP_LOW - movl %ecx, VAR_JMP - leal 4(%edi,%edx,4), %edi C wp and xp, adjust for unrolling, - leal 4(%esi,%edx,4), %esi C and start at second limb - jmp L(unroll_outer_entry) - - -ifdef(`PIC',` -L(pic_calc): - C See README.family about old gas bugs - leal (%ecx,%edx,1), %ecx - addl $L(unroll_entry)-L(unroll_here), %ecx - addl (%esp), %ecx - ret -') - - -C -------------------------------------------------------------------------- - ALIGN(32) -L(unroll_outer_top): - C ebp ysize counter, negative - - movl VAR_ADJUST, %ebx - movl PARAM_YP, %edx - - movl VAR_XP_LOW, %eax - movl %ebp, PARAM_YSIZE C store incremented ysize counter - - leal 4(%edi,%ebx,4), %edi - leal (%esi,%ebx,4), %esi - sarl $UNROLL_LOG2, %ebx - - movl (%edx,%ebp,4), %ebp C yp next multiplier - movl VAR_JMP, %ecx - -L(unroll_outer_entry): - mull %ebp - - testb $1, %cl C and clear carry bit - movl %ebx, VAR_COUNTER - movl $0, %ebx - - movl $0, %ecx - cmovz( %eax, %ecx) C eax into low carry, zero into high carry limb - cmovnz( %eax, %ebx) - 
- C Extra fetch of VAR_JMP is bad, but registers are tight - jmp *VAR_JMP - - -C ----------------------------------------------------------------------------- - ALIGN(32) -L(unroll_top): - C eax xp limb - C ebx carry high - C ecx carry low - C edx scratch - C esi xp+8 - C edi wp - C ebp yp multiplier limb - C - C VAR_COUNTER loop counter, negative - C - C 17 bytes each limb - -L(unroll_entry): - -deflit(CHUNK_COUNT,2) -forloop(`i', 0, UNROLL_COUNT/CHUNK_COUNT-1, ` - deflit(`disp0', eval(i*CHUNK_COUNT*4 ifelse(UNROLL_BYTES,256,-128))) - deflit(`disp1', eval(disp0 + 4)) - -Zdisp( movl, disp0,(%esi), %eax) - adcl %edx, %ebx - - mull %ebp - -Zdisp( addl, %ecx, disp0,(%edi)) - movl $0, %ecx - - adcl %eax, %ebx - - - movl disp1(%esi), %eax - adcl %edx, %ecx - - mull %ebp - - addl %ebx, disp1(%edi) - movl $0, %ebx - - adcl %eax, %ecx -') - - - incl VAR_COUNTER - leal UNROLL_BYTES(%esi), %esi - leal UNROLL_BYTES(%edi), %edi - - jnz L(unroll_top) - - - C eax - C ebx zero - C ecx low - C edx high - C esi - C edi wp, pointing at second last limb) - C ebp - C - C carry flag to be added to high - -deflit(`disp0', ifelse(UNROLL_BYTES,256,-128)) -deflit(`disp1', eval(disp0-0 + 4)) - - movl PARAM_YSIZE, %ebp - adcl $0, %edx - addl %ecx, disp0(%edi) - - adcl $0, %edx - incl %ebp - - movl %edx, disp1(%edi) - jnz L(unroll_outer_top) - - - movl SAVE_ESI, %esi - movl SAVE_EBP, %ebp - - movl SAVE_EDI, %edi - movl SAVE_EBX, %ebx - addl $FRAME, %esp - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/k7/sqr_basecase.asm b/rts/gmp/mpn/x86/k7/sqr_basecase.asm deleted file mode 100644 index 84861ea66b..0000000000 --- a/rts/gmp/mpn/x86/k7/sqr_basecase.asm +++ /dev/null @@ -1,627 +0,0 @@ -dnl AMD K7 mpn_sqr_basecase -- square an mpn number. -dnl -dnl K7: approx 2.3 cycles/crossproduct, or 4.55 cycles/triangular product -dnl (measured on the speed difference between 25 and 50 limbs, which is -dnl roughly the Karatsuba recursing range). 
- - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -dnl These are the same as mpn/x86/k6/sqr_basecase.asm, see that code for -dnl some comments. - -deflit(KARATSUBA_SQR_THRESHOLD_MAX, 66) - -ifdef(`KARATSUBA_SQR_THRESHOLD_OVERRIDE', -`define(`KARATSUBA_SQR_THRESHOLD',KARATSUBA_SQR_THRESHOLD_OVERRIDE)') - -m4_config_gmp_mparam(`KARATSUBA_SQR_THRESHOLD') -deflit(UNROLL_COUNT, eval(KARATSUBA_SQR_THRESHOLD-3)) - - -C void mpn_sqr_basecase (mp_ptr dst, mp_srcptr src, mp_size_t size); -C -C With a KARATSUBA_SQR_THRESHOLD around 50 this code is about 1500 bytes, -C which is quite a bit, but is considered good value since squares big -C enough to use most of the code will be spending quite a few cycles in it. 
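The squaring strategy this file implements (visible in the later sections of the code: cross products, a doubling left shift, then the diagonal squares) can be sketched in C. This is an illustrative reference under our own names, assuming 32-bit limbs and size >= 1; it is not the GMP code.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* dst[0..2n-1] = square of src[0..n-1], basecase style: compute each
   off-diagonal product src[i]*src[j] (i<j) once, double the strip with a
   one-bit left shift, then add the diagonal squares.  Assumes n >= 1. */
void ref_sqr_basecase(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i, j;
    for (i = 0; i < 2 * n; i++)
        dst[i] = 0;

    /* step 1: off-diagonal products summed at dst[i+j] */
    for (i = 0; i + 1 < n; i++) {
        uint32_t c = 0;
        for (j = i + 1; j < n; j++) {
            uint64_t p = (uint64_t)src[i] * src[j] + dst[i + j] + c;
            dst[i + j] = (uint32_t)p;
            c = (uint32_t)(p >> 32);
        }
        dst[i + n] = c;
    }

    /* step 2: double dst[1..2n-2]; the bit shifted out of the top limb
       becomes dst[2n-1] (the rcl loop in the asm) */
    uint32_t bit = 0;
    for (i = 1; i <= 2 * n - 2; i++) {
        uint32_t next = dst[i] >> 31;
        dst[i] = (dst[i] << 1) | bit;
        bit = next;
    }
    dst[2 * n - 1] = bit;

    /* step 3: add the diagonal squares src[i]^2 at dst[2i], dst[2i+1] */
    uint32_t carry = 0;
    for (i = 0; i < n; i++) {
        uint64_t s = (uint64_t)src[i] * src[i];
        uint64_t lo = (uint64_t)dst[2 * i] + (uint32_t)s + carry;
        uint64_t hi = (uint64_t)dst[2 * i + 1] + (uint32_t)(s >> 32)
                      + (uint32_t)(lo >> 32);
        dst[2 * i] = (uint32_t)lo;
        dst[2 * i + 1] = (uint32_t)hi;
        carry = (uint32_t)(hi >> 32);
    }
}
```

Computing each cross product once and doubling afterwards is what makes squaring roughly half the cost per product of a general basecase multiply.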
- - -defframe(PARAM_SIZE,12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(32) -PROLOGUE(mpn_sqr_basecase) -deflit(`FRAME',0) - - movl PARAM_SIZE, %ecx - movl PARAM_SRC, %eax - cmpl $2, %ecx - - movl PARAM_DST, %edx - je L(two_limbs) - ja L(three_or_more) - - -C------------------------------------------------------------------------------ -C one limb only - C eax src - C ecx size - C edx dst - - movl (%eax), %eax - movl %edx, %ecx - - mull %eax - - movl %edx, 4(%ecx) - movl %eax, (%ecx) - ret - - -C------------------------------------------------------------------------------ -C -C Using the read/modify/write "add"s seems to be faster than saving and -C restoring registers. Perhaps the loads for the first set hide under the -C mul latency and the second gets store to load forwarding. - - ALIGN(16) -L(two_limbs): - C eax src - C ebx - C ecx size - C edx dst -deflit(`FRAME',0) - - pushl %ebx FRAME_pushl() - movl %eax, %ebx C src - movl (%eax), %eax - - movl %edx, %ecx C dst - - mull %eax C src[0]^2 - - movl %eax, (%ecx) C dst[0] - movl 4(%ebx), %eax - - movl %edx, 4(%ecx) C dst[1] - - mull %eax C src[1]^2 - - movl %eax, 8(%ecx) C dst[2] - movl (%ebx), %eax - - movl %edx, 12(%ecx) C dst[3] - - mull 4(%ebx) C src[0]*src[1] - - popl %ebx - - addl %eax, 4(%ecx) - adcl %edx, 8(%ecx) - adcl $0, 12(%ecx) - ASSERT(nc) - - addl %eax, 4(%ecx) - adcl %edx, 8(%ecx) - adcl $0, 12(%ecx) - ASSERT(nc) - - ret - - -C------------------------------------------------------------------------------ -defframe(SAVE_EBX, -4) -defframe(SAVE_ESI, -8) -defframe(SAVE_EDI, -12) -defframe(SAVE_EBP, -16) -deflit(STACK_SPACE, 16) - -L(three_or_more): - subl $STACK_SPACE, %esp - cmpl $4, %ecx - jae L(four_or_more) -deflit(`FRAME',STACK_SPACE) - - -C------------------------------------------------------------------------------ -C Three limbs -C -C Writing out the loads and stores separately at the end of this code comes -C out about 10 cycles faster than using adcls to memory. 
- - C eax src - C ecx size - C edx dst - - movl %ebx, SAVE_EBX - movl %eax, %ebx C src - movl (%eax), %eax - - movl %edx, %ecx C dst - movl %esi, SAVE_ESI - movl %edi, SAVE_EDI - - mull %eax C src[0] ^ 2 - - movl %eax, (%ecx) - movl 4(%ebx), %eax - movl %edx, 4(%ecx) - - mull %eax C src[1] ^ 2 - - movl %eax, 8(%ecx) - movl 8(%ebx), %eax - movl %edx, 12(%ecx) - - mull %eax C src[2] ^ 2 - - movl %eax, 16(%ecx) - movl (%ebx), %eax - movl %edx, 20(%ecx) - - mull 4(%ebx) C src[0] * src[1] - - movl %eax, %esi - movl (%ebx), %eax - movl %edx, %edi - - mull 8(%ebx) C src[0] * src[2] - - addl %eax, %edi - movl %ebp, SAVE_EBP - movl $0, %ebp - - movl 4(%ebx), %eax - adcl %edx, %ebp - - mull 8(%ebx) C src[1] * src[2] - - xorl %ebx, %ebx - addl %eax, %ebp - - adcl $0, %edx - - C eax - C ebx zero, will be dst[5] - C ecx dst - C edx dst[4] - C esi dst[1] - C edi dst[2] - C ebp dst[3] - - adcl $0, %edx - addl %esi, %esi - - adcl %edi, %edi - movl 4(%ecx), %eax - - adcl %ebp, %ebp - - adcl %edx, %edx - - adcl $0, %ebx - addl %eax, %esi - movl 8(%ecx), %eax - - adcl %eax, %edi - movl 12(%ecx), %eax - movl %esi, 4(%ecx) - - adcl %eax, %ebp - movl 16(%ecx), %eax - movl %edi, 8(%ecx) - - movl SAVE_ESI, %esi - movl SAVE_EDI, %edi - - adcl %eax, %edx - movl 20(%ecx), %eax - movl %ebp, 12(%ecx) - - adcl %ebx, %eax - ASSERT(nc) - movl SAVE_EBX, %ebx - movl SAVE_EBP, %ebp - - movl %edx, 16(%ecx) - movl %eax, 20(%ecx) - addl $FRAME, %esp - - ret - - -C------------------------------------------------------------------------------ -L(four_or_more): - -C First multiply src[0]*src[1..size-1] and store at dst[1..size]. -C Further products are added in rather than stored. 
- - C eax src - C ebx - C ecx size - C edx dst - C esi - C edi - C ebp - -defframe(`VAR_COUNTER',-20) -defframe(`VAR_JMP', -24) -deflit(EXTRA_STACK_SPACE, 8) - - movl %ebx, SAVE_EBX - movl %edi, SAVE_EDI - leal (%edx,%ecx,4), %edi C &dst[size] - - movl %esi, SAVE_ESI - movl %ebp, SAVE_EBP - leal (%eax,%ecx,4), %esi C &src[size] - - movl (%eax), %ebp C multiplier - movl $0, %ebx - decl %ecx - - negl %ecx - subl $EXTRA_STACK_SPACE, %esp -FRAME_subl_esp(EXTRA_STACK_SPACE) - -L(mul_1): - C eax scratch - C ebx carry - C ecx counter - C edx scratch - C esi &src[size] - C edi &dst[size] - C ebp multiplier - - movl (%esi,%ecx,4), %eax - - mull %ebp - - addl %ebx, %eax - movl %eax, (%edi,%ecx,4) - movl $0, %ebx - - adcl %edx, %ebx - incl %ecx - jnz L(mul_1) - - -C Add products src[n]*src[n+1..size-1] at dst[2*n-1...], for each n=1..size-2. -C -C The last two products, which are the bottom right corner of the product -C triangle, are left to the end. These are src[size-3]*src[size-2,size-1] -C and src[size-2]*src[size-1]. If size is 4 then it's only these corner -C cases that need to be done. -C -C The unrolled code is the same as in mpn_addmul_1, see that routine for -C some comments. -C -C VAR_COUNTER is the outer loop, running from -size+4 to -1, inclusive. -C -C VAR_JMP is the computed jump into the unrolled code, stepped by one code -C chunk each outer loop. -C -C K7 does branch prediction on indirect jumps, which is bad since it's a -C different target each time. There seems no way to avoid this. - -dnl This value also hard coded in some shifts and adds -deflit(CODE_BYTES_PER_LIMB, 17) - -dnl With the unmodified &src[size] and &dst[size] pointers, the -dnl displacements in the unrolled code fit in a byte for UNROLL_COUNT -dnl values up to 31, but above that an offset must be added to them. 
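The computed jump into the unrolled code, stepped by CODE_BYTES_PER_LIMB per chunk, plays the same role as Duff's device in C: entry lands partway into the unrolled body so the first pass handles the leftover size modulo the unroll count. A minimal C analogue with a 4-way unrolled copy (our own illustrative example, not from this file):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* 4-way unrolled copy entered Duff's-device style: the switch jumps into
   the middle of the loop body so the first pass copies size % 4 items and
   every later pass copies exactly 4 -- analogous to the asm computing a
   byte offset into its unrolled block and doing "jmp *%edx". */
void unrolled_copy(uint32_t *dst, const uint32_t *src, size_t size)
{
    if (size == 0)
        return;
    size_t n = (size + 3) / 4;       /* number of passes */
    switch (size % 4) {
    case 0: do { *dst++ = *src++;    /* fall through */
    case 3:      *dst++ = *src++;    /* fall through */
    case 2:      *dst++ = *src++;    /* fall through */
    case 1:      *dst++ = *src++;
            } while (--n > 0);
    }
}
```

The assembly version has to do by hand what the compiler does for a switch: scale the remainder by the per-chunk code size, and (for PIC) add the instruction pointer to form the target address.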
- -deflit(OFFSET, -ifelse(eval(UNROLL_COUNT>31),1, -eval((UNROLL_COUNT-31)*4), -0)) - -dnl Because the last chunk of code is generated differently, a label placed -dnl at the end doesn't work. Instead calculate the implied end using the -dnl start and how many chunks of code there are. - -deflit(UNROLL_INNER_END, -`L(unroll_inner_start)+eval(UNROLL_COUNT*CODE_BYTES_PER_LIMB)') - - C eax - C ebx carry - C ecx - C edx - C esi &src[size] - C edi &dst[size] - C ebp - - movl PARAM_SIZE, %ecx - movl %ebx, (%edi) - - subl $4, %ecx - jz L(corner) - - negl %ecx -ifelse(OFFSET,0,,`subl $OFFSET, %edi') -ifelse(OFFSET,0,,`subl $OFFSET, %esi') - - movl %ecx, %edx - shll $4, %ecx - -ifdef(`PIC',` - call L(pic_calc) -L(here): -',` - leal UNROLL_INNER_END-eval(2*CODE_BYTES_PER_LIMB)(%ecx,%edx), %ecx -') - - - C The calculated jump mustn't come out to before the start of the - C code available. This is the limit UNROLL_COUNT puts on the src - C operand size, but checked here directly using the jump address. 
- ASSERT(ae, - `movl_text_address(L(unroll_inner_start), %eax) - cmpl %eax, %ecx') - - -C------------------------------------------------------------------------------ - ALIGN(16) -L(unroll_outer_top): - C eax - C ebx high limb to store - C ecx VAR_JMP - C edx VAR_COUNTER, limbs, negative - C esi &src[size], constant - C edi dst ptr, high of last addmul - C ebp - - movl -12+OFFSET(%esi,%edx,4), %ebp C next multiplier - movl -8+OFFSET(%esi,%edx,4), %eax C first of multiplicand - - movl %edx, VAR_COUNTER - - mull %ebp - -define(cmovX,`ifelse(eval(UNROLL_COUNT%2),0,`cmovz($@)',`cmovnz($@)')') - - testb $1, %cl - movl %edx, %ebx C high carry - movl %ecx, %edx C jump - - movl %eax, %ecx C low carry - cmovX( %ebx, %ecx) C high carry reverse - cmovX( %eax, %ebx) C low carry reverse - - leal CODE_BYTES_PER_LIMB(%edx), %eax - xorl %edx, %edx - leal 4(%edi), %edi - - movl %eax, VAR_JMP - - jmp *%eax - - -ifdef(`PIC',` -L(pic_calc): - addl (%esp), %ecx - addl $UNROLL_INNER_END-eval(2*CODE_BYTES_PER_LIMB)-L(here), %ecx - addl %edx, %ecx - ret -') - - - C Must be an even address to preserve the significance of the low - C bit of the jump address indicating which way around ecx/ebx should - C start. 
- ALIGN(2) - -L(unroll_inner_start): - C eax next limb - C ebx carry high - C ecx carry low - C edx scratch - C esi src - C edi dst - C ebp multiplier - -forloop(`i', UNROLL_COUNT, 1, ` - deflit(`disp_src', eval(-i*4 + OFFSET)) - deflit(`disp_dst', eval(disp_src - 4)) - - m4_assert(`disp_src>=-128 && disp_src<128') - m4_assert(`disp_dst>=-128 && disp_dst<128') - -ifelse(eval(i%2),0,` -Zdisp( movl, disp_src,(%esi), %eax) - adcl %edx, %ebx - - mull %ebp - -Zdisp( addl, %ecx, disp_dst,(%edi)) - movl $0, %ecx - - adcl %eax, %ebx - -',` - dnl this bit comes out last -Zdisp( movl, disp_src,(%esi), %eax) - adcl %edx, %ecx - - mull %ebp - -dnl Zdisp( addl %ebx, disp_src,(%edi)) - addl %ebx, disp_dst(%edi) -ifelse(forloop_last,0, -` movl $0, %ebx') - - adcl %eax, %ecx -') -') - - C eax next limb - C ebx carry high - C ecx carry low - C edx scratch - C esi src - C edi dst - C ebp multiplier - - adcl $0, %edx - addl %ecx, -4+OFFSET(%edi) - movl VAR_JMP, %ecx - - adcl $0, %edx - - movl %edx, m4_empty_if_zero(OFFSET) (%edi) - movl VAR_COUNTER, %edx - - incl %edx - jnz L(unroll_outer_top) - - -ifelse(OFFSET,0,,` - addl $OFFSET, %esi - addl $OFFSET, %edi -') - - -C------------------------------------------------------------------------------ -L(corner): - C esi &src[size] - C edi &dst[2*size-5] - - movl -12(%esi), %ebp - movl -8(%esi), %eax - movl %eax, %ecx - - mull %ebp - - addl %eax, -4(%edi) - movl -4(%esi), %eax - - adcl $0, %edx - movl %edx, %ebx - movl %eax, %esi - - mull %ebp - - addl %ebx, %eax - - adcl $0, %edx - addl %eax, (%edi) - movl %esi, %eax - - adcl $0, %edx - movl %edx, %ebx - - mull %ecx - - addl %ebx, %eax - movl %eax, 4(%edi) - - adcl $0, %edx - movl %edx, 8(%edi) - - - -C Left shift of dst[1..2*size-2], high bit shifted out becomes dst[2*size-1]. 
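The doubling step just described, done with rcl in the assembly, is a one-bit left shift across the limb vector where the carry flag threads the bit from limb to limb. A C sketch of that primitive (our own names, 32-bit limbs assumed):

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Shift the limb vector p[0..n-1] left one bit, returning the bit shifted
   out of the top limb -- what the rcl loop does via the carry flag. */
uint32_t lshift1(uint32_t *p, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t next = p[i] >> 31;        /* bit that will shift out */
        p[i] = (p[i] << 1) | carry;        /* shift in the previous bit */
        carry = next;
    }
    return carry;
}
```

In the squaring code the returned bit becomes the most significant destination limb, exactly as the comment says: the high bit shifted out of dst[2*size-2] lands in dst[2*size-1].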
- -L(lshift_start): - movl PARAM_SIZE, %eax - movl PARAM_DST, %edi - xorl %ecx, %ecx C clear carry - - leal (%edi,%eax,8), %edi - notl %eax C -size-1, preserve carry - - leal 2(%eax), %eax C -(size-1) - -L(lshift): - C eax counter, negative - C ebx - C ecx - C edx - C esi - C edi dst, pointing just after last limb - C ebp - - rcll -4(%edi,%eax,8) - rcll (%edi,%eax,8) - incl %eax - jnz L(lshift) - - setc %al - - movl PARAM_SRC, %esi - movl %eax, -4(%edi) C dst most significant limb - - movl PARAM_SIZE, %ecx - - -C Now add in the squares on the diagonal, src[0]^2, src[1]^2, ..., -C src[size-1]^2. dst[0] hasn't yet been set at all yet, and just gets the -C low limb of src[0]^2. - - movl (%esi), %eax C src[0] - - mull %eax - - leal (%esi,%ecx,4), %esi C src point just after last limb - negl %ecx - - movl %eax, (%edi,%ecx,8) C dst[0] - incl %ecx - -L(diag): - C eax scratch - C ebx scratch - C ecx counter, negative - C edx carry - C esi src just after last limb - C edi dst just after last limb - C ebp - - movl (%esi,%ecx,4), %eax - movl %edx, %ebx - - mull %eax - - addl %ebx, -4(%edi,%ecx,8) - adcl %eax, (%edi,%ecx,8) - adcl $0, %edx - - incl %ecx - jnz L(diag) - - - movl SAVE_ESI, %esi - movl SAVE_EBX, %ebx - - addl %edx, -4(%edi) C dst most significant limb - movl SAVE_EDI, %edi - - movl SAVE_EBP, %ebp - addl $FRAME, %esp - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/lshift.asm b/rts/gmp/mpn/x86/lshift.asm deleted file mode 100644 index 4735335cbe..0000000000 --- a/rts/gmp/mpn/x86/lshift.asm +++ /dev/null @@ -1,90 +0,0 @@ -dnl x86 mpn_lshift -- mpn left shift. - -dnl Copyright (C) 1992, 1994, 1996, 1999, 2000 Free Software Foundation, -dnl Inc. -dnl -dnl This file is part of the GNU MP Library. 
-dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_lshift (mp_ptr dst, mp_srcptr src, mp_size_t size, -C unsigned shift); - -defframe(PARAM_SHIFT,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(8) -PROLOGUE(mpn_lshift) - - pushl %edi - pushl %esi - pushl %ebx -deflit(`FRAME',12) - - movl PARAM_DST,%edi - movl PARAM_SRC,%esi - movl PARAM_SIZE,%edx - movl PARAM_SHIFT,%ecx - - subl $4,%esi C adjust src - - movl (%esi,%edx,4),%ebx C read most significant limb - xorl %eax,%eax - shldl( %cl, %ebx, %eax) C compute carry limb - decl %edx - jz L(end) - pushl %eax C push carry limb onto stack - testb $1,%dl - jnz L(1) C enter loop in the middle - movl %ebx,%eax - - ALIGN(8) -L(oop): movl (%esi,%edx,4),%ebx C load next lower limb - shldl( %cl, %ebx, %eax) C compute result limb - movl %eax,(%edi,%edx,4) C store it - decl %edx -L(1): movl (%esi,%edx,4),%eax - shldl( %cl, %eax, %ebx) - movl %ebx,(%edi,%edx,4) - decl %edx - jnz L(oop) - - shll %cl,%eax C compute least significant limb - movl %eax,(%edi) C store it - - popl %eax C pop carry limb - - popl %ebx - popl %esi - popl %edi - ret - -L(end): shll %cl,%ebx C compute least 
significant limb - movl %ebx,(%edi) C store it - - popl %ebx - popl %esi - popl %edi - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/mod_1.asm b/rts/gmp/mpn/x86/mod_1.asm deleted file mode 100644 index 3908161b3e..0000000000 --- a/rts/gmp/mpn/x86/mod_1.asm +++ /dev/null @@ -1,141 +0,0 @@ -dnl x86 mpn_mod_1 -- mpn by limb remainder. - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -dnl cycles/limb -dnl K6 20 -dnl P5 44 -dnl P6 39 -dnl 486 approx 42 maybe -dnl -dnl The following have their own optimized mod_1 implementations, but for -dnl reference the code here runs as follows. -dnl -dnl P6MMX 39 -dnl K7 41 - - -include(`../config.m4') - - -C mp_limb_t mpn_mod_1 (mp_srcptr src, mp_size_t size, mp_limb_t divisor); -C mp_limb_t mpn_mod_1c (mp_srcptr src, mp_size_t size, mp_limb_t divisor, -C mp_limb_t carry); -C -C Divide src,size by divisor and return the remainder. The quotient is -C discarded. -C -C See mpn/x86/divrem_1.asm for some comments. 
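What the divl loop in mpn_mod_1 computes can be expressed compactly in C: one 64/32 division per limb, working from the most significant limb down, with the remainder carried in the high half of the next dividend (the edx register in the asm). This is our own reference sketch with 32-bit limbs, including the "skip one division if the high limb is already less than the divisor" shortcut used above.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Reference sketch of mpn_mod_1: remainder of src[0..size-1] mod divisor.
   Each step divides (rem:src[i]) -- the edx:eax pair -- by the divisor and
   keeps only the remainder.  Assumes divisor != 0; names are ours. */
uint32_t ref_mod_1(const uint32_t *src, size_t size, uint32_t divisor)
{
    if (size == 0)
        return 0;
    uint32_t rem = 0;
    size_t i = size;
    if (src[size - 1] < divisor)
        rem = src[--i];        /* high limb is already a valid remainder */
    while (i-- > 0) {
        uint64_t n = ((uint64_t)rem << 32) | src[i];
        rem = (uint32_t)(n % divisor);
    }
    return rem;
}
```

The shortcut matters because each division is valid only when the running remainder is below the divisor (otherwise divl would fault with a quotient overflow), and it also saves one expensive division when the top limb is small.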
- -defframe(PARAM_CARRY, 16) -defframe(PARAM_DIVISOR,12) -defframe(PARAM_SIZE, 8) -defframe(PARAM_SRC, 4) - - .text - ALIGN(16) - -PROLOGUE(mpn_mod_1c) -deflit(`FRAME',0) - - movl PARAM_SIZE, %ecx - pushl %ebx FRAME_pushl() - - movl PARAM_SRC, %ebx - pushl %esi FRAME_pushl() - - movl PARAM_DIVISOR, %esi - orl %ecx, %ecx - - movl PARAM_CARRY, %edx - jnz LF(mpn_mod_1,top) - - popl %esi - movl %edx, %eax - - popl %ebx - - ret - -EPILOGUE() - - -PROLOGUE(mpn_mod_1) -deflit(`FRAME',0) - - movl PARAM_SIZE, %ecx - pushl %ebx FRAME_pushl() - - movl PARAM_SRC, %ebx - pushl %esi FRAME_pushl() - - orl %ecx, %ecx - jz L(done_zero) - - movl PARAM_DIVISOR, %esi - movl -4(%ebx,%ecx,4), %eax C src high limb - - cmpl %esi, %eax - - sbbl %edx, %edx C -1 if high<divisor - - addl %edx, %ecx C skip one division if high<divisor - jz L(done_eax) - - andl %eax, %edx C carry if high<divisor - - -L(top): - C eax scratch (quotient) - C ebx src - C ecx counter - C edx carry (remainder) - C esi divisor - C edi - C ebp - - movl -4(%ebx,%ecx,4), %eax - - divl %esi - - loop_or_decljnz L(top) - - - movl %edx, %eax -L(done_eax): - popl %esi - - popl %ebx - - ret - - -L(done_zero): - popl %esi - xorl %eax, %eax - - popl %ebx - - ret - - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/mul_1.asm b/rts/gmp/mpn/x86/mul_1.asm deleted file mode 100644 index 8817f291bc..0000000000 --- a/rts/gmp/mpn/x86/mul_1.asm +++ /dev/null @@ -1,130 +0,0 @@ -dnl x86 mpn_mul_1 (for 386, 486, and Pentium Pro) -- Multiply a limb vector -dnl with a limb and store the result in a second limb vector. -dnl -dnl cycles/limb -dnl P6: 5.5 -dnl -dnl The following CPUs have their own optimized code, but for reference the -dnl code here runs as follows. -dnl -dnl cycles/limb -dnl P5: 12.5 -dnl K6: 10.5 -dnl K7: 4.5 - - -dnl Copyright (C) 1992, 1994, 1997, 1998, 1999, 2000 Free Software -dnl Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. 
-dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_mul_1 (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t multiplier); - -defframe(PARAM_MULTIPLIER,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - TEXT - ALIGN(8) -PROLOGUE(mpn_mul_1) -deflit(`FRAME',0) - - pushl %edi - pushl %esi - pushl %ebx - pushl %ebp -deflit(`FRAME',16) - - movl PARAM_DST,%edi - movl PARAM_SRC,%esi - movl PARAM_SIZE,%ecx - - xorl %ebx,%ebx - andl $3,%ecx - jz L(end0) - -L(oop0): - movl (%esi),%eax - mull PARAM_MULTIPLIER - leal 4(%esi),%esi - addl %ebx,%eax - movl $0,%ebx - adcl %ebx,%edx - movl %eax,(%edi) - movl %edx,%ebx C propagate carry into cylimb - - leal 4(%edi),%edi - decl %ecx - jnz L(oop0) - -L(end0): - movl PARAM_SIZE,%ecx - shrl $2,%ecx - jz L(end) - - - ALIGN(8) -L(oop): movl (%esi),%eax - mull PARAM_MULTIPLIER - addl %eax,%ebx - movl $0,%ebp - adcl %edx,%ebp - - movl 4(%esi),%eax - mull PARAM_MULTIPLIER - movl %ebx,(%edi) - addl %eax,%ebp C new lo + cylimb - movl $0,%ebx - adcl %edx,%ebx - - movl 8(%esi),%eax - mull PARAM_MULTIPLIER - movl %ebp,4(%edi) - addl %eax,%ebx C new lo + cylimb - movl $0,%ebp - adcl %edx,%ebp - - movl 
12(%esi),%eax - mull PARAM_MULTIPLIER - movl %ebx,8(%edi) - addl %eax,%ebp C new lo + cylimb - movl $0,%ebx - adcl %edx,%ebx - - movl %ebp,12(%edi) - - leal 16(%esi),%esi - leal 16(%edi),%edi - decl %ecx - jnz L(oop) - -L(end): movl %ebx,%eax - - popl %ebp - popl %ebx - popl %esi - popl %edi - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/mul_basecase.asm b/rts/gmp/mpn/x86/mul_basecase.asm deleted file mode 100644 index 3a9b73895b..0000000000 --- a/rts/gmp/mpn/x86/mul_basecase.asm +++ /dev/null @@ -1,209 +0,0 @@ -dnl x86 mpn_mul_basecase -- Multiply two limb vectors and store the result -dnl in a third limb vector. - - -dnl Copyright (C) 1996, 1997, 1998, 1999, 2000 Free Software Foundation, -dnl Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C void mpn_mul_basecase (mp_ptr wp, -C mp_srcptr xp, mp_size_t xsize, -C mp_srcptr yp, mp_size_t ysize); -C -C This was written in a haste since the Pentium optimized code that was used -C for all x86 machines was slow for the Pentium II. This code would benefit -C from some cleanup. 
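[Editor's note] The mpn_mul_1 loop that ends above and the mul_basecase inner loops below implement the same limb-by-limb product recurrence; a hedged plain C model of mpn_mul_1 (assumed names and 32-bit limb type, not GMP's actual source):

```c
#include <stdint.h>

/* Hypothetical C model of mpn_mul_1: dst = src * multiplier, returning
   the high "cylimb" that does not fit in dst[size-1..0].  One iteration
   matches one mull/addl/adcl group of the 4-way unrolled loop. */
static uint32_t ref_mul_1(uint32_t *dst, const uint32_t *src, int size,
                          uint32_t multiplier)
{
    uint32_t carry = 0;
    for (int i = 0; i < size; i++) {
        /* (2^32-1)^2 + (2^32-1) < 2^64, so this never overflows */
        uint64_t p = (uint64_t)src[i] * multiplier + carry;
        dst[i] = (uint32_t)p;
        carry  = (uint32_t)(p >> 32);
    }
    return carry;
}
```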
-C -C To shave off some percentage of the run-time, one should make 4 variants -C of the Louter loop, for the four different outcomes of un mod 4. That -C would avoid Loop0 altogether. Code expansion would be > 4-fold for that -C part of the function, but since it is not very large, that would be -C acceptable. -C -C The mul loop (at L(oopM)) might need some tweaking. It's current speed is -C unknown. - -defframe(PARAM_YSIZE,20) -defframe(PARAM_YP, 16) -defframe(PARAM_XSIZE,12) -defframe(PARAM_XP, 8) -defframe(PARAM_WP, 4) - -defframe(VAR_MULTIPLIER, -4) -defframe(VAR_COUNTER, -8) -deflit(VAR_STACK_SPACE, 8) - - .text - ALIGN(8) - -PROLOGUE(mpn_mul_basecase) -deflit(`FRAME',0) - - subl $VAR_STACK_SPACE,%esp - pushl %esi - pushl %ebp - pushl %edi -deflit(`FRAME',eval(VAR_STACK_SPACE+12)) - - movl PARAM_XP,%esi - movl PARAM_WP,%edi - movl PARAM_YP,%ebp - - movl (%esi),%eax C load xp[0] - mull (%ebp) C multiply by yp[0] - movl %eax,(%edi) C store to wp[0] - movl PARAM_XSIZE,%ecx C xsize - decl %ecx C If xsize = 1, ysize = 1 too - jz L(done) - - pushl %ebx -FRAME_pushl() - movl %edx,%ebx - - leal 4(%esi),%esi - leal 4(%edi),%edi - -L(oopM): - movl (%esi),%eax C load next limb at xp[j] - leal 4(%esi),%esi - mull (%ebp) - addl %ebx,%eax - movl %edx,%ebx - adcl $0,%ebx - movl %eax,(%edi) - leal 4(%edi),%edi - decl %ecx - jnz L(oopM) - - movl %ebx,(%edi) C most significant limb of product - addl $4,%edi C increment wp - movl PARAM_XSIZE,%eax - shll $2,%eax - subl %eax,%edi - subl %eax,%esi - - movl PARAM_YSIZE,%eax C ysize - decl %eax - jz L(skip) - movl %eax,VAR_COUNTER C set index i to ysize - -L(outer): - movl PARAM_YP,%ebp C yp - addl $4,%ebp C make ebp point to next v limb - movl %ebp,PARAM_YP - movl (%ebp),%eax C copy y limb ... - movl %eax,VAR_MULTIPLIER C ... 
to stack slot - movl PARAM_XSIZE,%ecx - - xorl %ebx,%ebx - andl $3,%ecx - jz L(end0) - -L(oop0): - movl (%esi),%eax - mull VAR_MULTIPLIER - leal 4(%esi),%esi - addl %ebx,%eax - movl $0,%ebx - adcl %ebx,%edx - addl %eax,(%edi) - adcl %edx,%ebx C propagate carry into cylimb - - leal 4(%edi),%edi - decl %ecx - jnz L(oop0) - -L(end0): - movl PARAM_XSIZE,%ecx - shrl $2,%ecx - jz L(endX) - - ALIGN(8) -L(oopX): - movl (%esi),%eax - mull VAR_MULTIPLIER - addl %eax,%ebx - movl $0,%ebp - adcl %edx,%ebp - - movl 4(%esi),%eax - mull VAR_MULTIPLIER - addl %ebx,(%edi) - adcl %eax,%ebp C new lo + cylimb - movl $0,%ebx - adcl %edx,%ebx - - movl 8(%esi),%eax - mull VAR_MULTIPLIER - addl %ebp,4(%edi) - adcl %eax,%ebx C new lo + cylimb - movl $0,%ebp - adcl %edx,%ebp - - movl 12(%esi),%eax - mull VAR_MULTIPLIER - addl %ebx,8(%edi) - adcl %eax,%ebp C new lo + cylimb - movl $0,%ebx - adcl %edx,%ebx - - addl %ebp,12(%edi) - adcl $0,%ebx C propagate carry into cylimb - - leal 16(%esi),%esi - leal 16(%edi),%edi - decl %ecx - jnz L(oopX) - -L(endX): - movl %ebx,(%edi) - addl $4,%edi - - C we incremented wp and xp in the loop above; compensate - movl PARAM_XSIZE,%eax - shll $2,%eax - subl %eax,%edi - subl %eax,%esi - - movl VAR_COUNTER,%eax - decl %eax - movl %eax,VAR_COUNTER - jnz L(outer) - -L(skip): - popl %ebx - popl %edi - popl %ebp - popl %esi - addl $8,%esp - ret - -L(done): - movl %edx,4(%edi) C store to wp[1] - popl %edi - popl %ebp - popl %esi - addl $8,%esp - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/p6/README b/rts/gmp/mpn/x86/p6/README deleted file mode 100644 index 7dbc905a0d..0000000000 --- a/rts/gmp/mpn/x86/p6/README +++ /dev/null @@ -1,95 +0,0 @@ - - INTEL P6 MPN SUBROUTINES - - - -This directory contains code optimized for Intel P6 class CPUs, meaning -PentiumPro, Pentium II and Pentium III. The mmx and p3mmx subdirectories -have routines using MMX instructions. - - - -STATUS - -Times for the loops, with all code and data in L1 cache, are as follows. 
-Some of these might be able to be improved. - - cycles/limb - - mpn_add_n/sub_n 3.7 - - mpn_copyi 0.75 - mpn_copyd 2.4 - - mpn_divrem_1 39.0 - mpn_mod_1 39.0 - mpn_divexact_by3 8.5 - - mpn_mul_1 5.5 - mpn_addmul/submul_1 6.35 - - mpn_l/rshift 2.5 - - mpn_mul_basecase 8.2 cycles/crossproduct (approx) - mpn_sqr_basecase 4.0 cycles/crossproduct (approx) - or 7.75 cycles/triangleproduct (approx) - -Pentium II and III have MMX and get the following improvements. - - mpn_divrem_1 25.0 integer part, 17.5 fractional part - mpn_mod_1 24.0 - - mpn_l/rshift 1.75 - - - - -NOTES - -Write-allocate L1 data cache means prefetching of destinations is unnecessary. - -Mispredicted branches have a penalty of between 9 and 15 cycles, and even up -to 26 cycles depending how far speculative execution has gone. The 9 cycle -minimum penalty comes from the issue pipeline being 9 stages. - -A copy with rep movs seems to copy 16 bytes at a time, since speeds for 4, -5, 6 or 7 limb operations are all the same. The 0.75 cycles/limb would be 3 -cycles per 16 byte block. - - - - -CODING - -Instructions in general code have been shown grouped if they can execute -together, which means up to three instructions with no successive -dependencies, and with only the first being a multiple micro-op. - -P6 has out-of-order execution, so the groupings are really only showing -dependent paths where some shuffling might allow some latencies to be -hidden. - - - - -REFERENCES - -"Intel Architecture Optimization Reference Manual", 1999, revision 001 dated -02/99, order number 245127 (order number 730795-001 is in the document too). -Available on-line: - - http://download.intel.com/design/PentiumII/manuals/245127.htm - -"Intel Architecture Optimization Manual", 1997, order number 242816. This -is an older document mostly about P5 and not as good as the above. 
-Available on-line: - - http://download.intel.com/design/PentiumII/manuals/242816.htm - - - ----------------- -Local variables: -mode: text -fill-column: 76 -End: diff --git a/rts/gmp/mpn/x86/p6/aorsmul_1.asm b/rts/gmp/mpn/x86/p6/aorsmul_1.asm deleted file mode 100644 index feb364ec0b..0000000000 --- a/rts/gmp/mpn/x86/p6/aorsmul_1.asm +++ /dev/null @@ -1,300 +0,0 @@ -dnl Intel P6 mpn_addmul_1/mpn_submul_1 -- add or subtract mpn multiple. -dnl -dnl P6: 6.35 cycles/limb (at 16 limbs/loop). - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -dnl P6 UNROLL_COUNT cycles/limb -dnl 8 6.7 -dnl 16 6.35 -dnl 32 6.3 -dnl 64 6.3 -dnl Maximum possible with the current code is 64. 
- -deflit(UNROLL_COUNT, 16) - - -ifdef(`OPERATION_addmul_1', ` - define(M4_inst, addl) - define(M4_function_1, mpn_addmul_1) - define(M4_function_1c, mpn_addmul_1c) - define(M4_description, add it to) - define(M4_desc_retval, carry) -',`ifdef(`OPERATION_submul_1', ` - define(M4_inst, subl) - define(M4_function_1, mpn_submul_1) - define(M4_function_1c, mpn_submul_1c) - define(M4_description, subtract it from) - define(M4_desc_retval, borrow) -',`m4_error(`Need OPERATION_addmul_1 or OPERATION_submul_1 -')')') - -MULFUNC_PROLOGUE(mpn_addmul_1 mpn_addmul_1c mpn_submul_1 mpn_submul_1c) - - -C mp_limb_t M4_function_1 (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t mult); -C mp_limb_t M4_function_1c (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t mult, mp_limb_t carry); -C -C Calculate src,size multiplied by mult and M4_description dst,size. -C Return the M4_desc_retval limb from the top of the result. -C -C This code is pretty much the same as the K6 code. The unrolled loop is -C the same, but there's just a few scheduling tweaks in the setups and the -C simple loop. -C -C A number of variations have been tried for the unrolled loop, with one or -C two carries, and with loads scheduled earlier, but nothing faster than 6 -C cycles/limb has been found. 
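[Editor's note] The comment above defines the operation precisely: multiply src,size by mult, add it to (or subtract it from) dst,size, and return the carry (or borrow) limb off the top. A hedged C model of the addmul_1 case, with assumed names for illustration only:

```c
#include <stdint.h>

/* Hypothetical C model of mpn_addmul_1: dst += src * mult, returning the
   carry limb from the top of the result.  One iteration corresponds to
   one mull / M4_inst / adcl group in the unrolled loop. */
static uint32_t ref_addmul_1(uint32_t *dst, const uint32_t *src, int size,
                             uint32_t mult)
{
    uint32_t carry = 0;
    for (int i = 0; i < size; i++) {
        uint64_t p = (uint64_t)src[i] * mult + carry;  /* product + carry-in */
        uint64_t t = (uint64_t)dst[i] + (uint32_t)p;   /* add low into dst */
        dst[i] = (uint32_t)t;
        carry  = (uint32_t)(p >> 32) + (uint32_t)(t >> 32);  /* fits a limb */
    }
    return carry;
}
```

The submul_1 variant only flips the middle addition to a subtraction, exactly as M4_inst selects addl versus subl above.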
- -ifdef(`PIC',` -deflit(UNROLL_THRESHOLD, 5) -',` -deflit(UNROLL_THRESHOLD, 5) -') - -defframe(PARAM_CARRY, 20) -defframe(PARAM_MULTIPLIER,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(32) - -PROLOGUE(M4_function_1c) - pushl %ebx -deflit(`FRAME',4) - movl PARAM_CARRY, %ebx - jmp LF(M4_function_1,start_nc) -EPILOGUE() - -PROLOGUE(M4_function_1) - push %ebx -deflit(`FRAME',4) - xorl %ebx, %ebx C initial carry - -L(start_nc): - movl PARAM_SIZE, %ecx - pushl %esi -deflit(`FRAME',8) - - movl PARAM_SRC, %esi - pushl %edi -deflit(`FRAME',12) - - movl PARAM_DST, %edi - pushl %ebp -deflit(`FRAME',16) - cmpl $UNROLL_THRESHOLD, %ecx - - movl PARAM_MULTIPLIER, %ebp - jae L(unroll) - - - C simple loop - C this is offset 0x22, so close enough to aligned -L(simple): - C eax scratch - C ebx carry - C ecx counter - C edx scratch - C esi src - C edi dst - C ebp multiplier - - movl (%esi), %eax - addl $4, %edi - - mull %ebp - - addl %ebx, %eax - adcl $0, %edx - - M4_inst %eax, -4(%edi) - movl %edx, %ebx - - adcl $0, %ebx - decl %ecx - - leal 4(%esi), %esi - jnz L(simple) - - - popl %ebp - popl %edi - - popl %esi - movl %ebx, %eax - - popl %ebx - ret - - - -C------------------------------------------------------------------------------ -C VAR_JUMP holds the computed jump temporarily because there's not enough -C registers when doing the mul for the initial two carry limbs. -C -C The add/adc for the initial carry in %ebx is necessary only for the -C mpn_add/submul_1c entry points. Duplicating the startup code to -C eliminiate this for the plain mpn_add/submul_1 doesn't seem like a good -C idea. 
- -dnl overlapping with parameters already fetched -define(VAR_COUNTER,`PARAM_SIZE') -define(VAR_JUMP, `PARAM_DST') - - C this is offset 0x43, so close enough to aligned -L(unroll): - C eax - C ebx initial carry - C ecx size - C edx - C esi src - C edi dst - C ebp - - movl %ecx, %edx - decl %ecx - - subl $2, %edx - negl %ecx - - shrl $UNROLL_LOG2, %edx - andl $UNROLL_MASK, %ecx - - movl %edx, VAR_COUNTER - movl %ecx, %edx - - C 15 code bytes per limb -ifdef(`PIC',` - call L(pic_calc) -L(here): -',` - shll $4, %edx - negl %ecx - - leal L(entry) (%edx,%ecx,1), %edx -') - movl (%esi), %eax C src low limb - - movl %edx, VAR_JUMP - leal ifelse(UNROLL_BYTES,256,128+) 4(%esi,%ecx,4), %esi - - mull %ebp - - addl %ebx, %eax C initial carry (from _1c) - adcl $0, %edx - - movl %edx, %ebx C high carry - leal ifelse(UNROLL_BYTES,256,128) (%edi,%ecx,4), %edi - - movl VAR_JUMP, %edx - testl $1, %ecx - movl %eax, %ecx C low carry - - cmovnz( %ebx, %ecx) C high,low carry other way around - cmovnz( %eax, %ebx) - - jmp *%edx - - -ifdef(`PIC',` -L(pic_calc): - shll $4, %edx - negl %ecx - - C See README.family about old gas bugs - leal (%edx,%ecx,1), %edx - addl $L(entry)-L(here), %edx - - addl (%esp), %edx - - ret -') - - -C ----------------------------------------------------------- - ALIGN(32) -L(top): -deflit(`FRAME',16) - C eax scratch - C ebx carry hi - C ecx carry lo - C edx scratch - C esi src - C edi dst - C ebp multiplier - C - C VAR_COUNTER loop counter - C - C 15 code bytes per limb - - addl $UNROLL_BYTES, %edi - -L(entry): -deflit(CHUNK_COUNT,2) -forloop(`i', 0, UNROLL_COUNT/CHUNK_COUNT-1, ` - deflit(`disp0', eval(i*4*CHUNK_COUNT ifelse(UNROLL_BYTES,256,-128))) - deflit(`disp1', eval(disp0 + 4)) - -Zdisp( movl, disp0,(%esi), %eax) - mull %ebp -Zdisp( M4_inst,%ecx, disp0,(%edi)) - adcl %eax, %ebx - movl %edx, %ecx - adcl $0, %ecx - - movl disp1(%esi), %eax - mull %ebp - M4_inst %ebx, disp1(%edi) - adcl %eax, %ecx - movl %edx, %ebx - adcl $0, %ebx -') - - decl VAR_COUNTER - 
leal UNROLL_BYTES(%esi), %esi - - jns L(top) - - -deflit(`disp0', eval(UNROLL_BYTES ifelse(UNROLL_BYTES,256,-128))) - - M4_inst %ecx, disp0(%edi) - movl %ebx, %eax - - popl %ebp - popl %edi - - popl %esi - popl %ebx - adcl $0, %eax - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/p6/diveby3.asm b/rts/gmp/mpn/x86/p6/diveby3.asm deleted file mode 100644 index a77703ea89..0000000000 --- a/rts/gmp/mpn/x86/p6/diveby3.asm +++ /dev/null @@ -1,37 +0,0 @@ -dnl Intel P6 mpn_divexact_by3 -- mpn division by 3, expecting no remainder. -dnl -dnl P6: 8.5 cycles/limb - - -dnl Copyright (C) 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -dnl The P5 code runs well on P6, in fact better than anything else found so -dnl far. An imul is 4 cycles, meaning the two cmp/sbbl pairs on the -dnl dependent path are taking 4.5 cycles. -dnl -dnl The destination cache line prefetching is unnecessary on P6, but -dnl removing it is a 2 cycle slowdown (approx), so it must be inducing -dnl something good in the out of order execution. 
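[Editor's note] The imul the comment above refers to is a multiply by the modular inverse of 3 (3 * 0xAAAAAAAB == 1 mod 2^32). As a hedged sketch of the underlying exact-division technique (assumed names; this mirrors GMP's generic algorithm, not the Pentium code included below):

```c
#include <stdint.h>

/* Hypothetical reference for divexact-by-3: src,size must be an exact
   multiple of 3; returns 0 on exact division.  Per limb: subtract the
   running borrow, multiply by the inverse of 3 mod 2^32, and count how
   many times 3*q wrapped past 2^32 to form the next borrow. */
static uint32_t divexact_by3(uint32_t *dst, const uint32_t *src, int size)
{
    const uint32_t inv3 = 0xAAAAAAABu;   /* 3 * 0xAAAAAAAB == 1 (mod 2^32) */
    uint32_t c = 0;                      /* running borrow, 0..3 */
    for (int i = 0; i < size; i++) {
        uint32_t s = src[i];
        uint32_t t = s - c;
        c = (t > s);                     /* borrow out of the subtract */
        uint32_t q = t * inv3;           /* q = t/3 mod 2^32 */
        dst[i] = q;
        c += (q > 0x55555555u);          /* +1 if 3*q crossed 2^32 once  */
        c += (q > 0xAAAAAAAAu);          /* +1 more if it crossed twice  */
    }
    return c;
}
```

The two compares on q are the cmp/sbbl pairs on the dependent path that the cycle analysis above is counting.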
- -include(`../config.m4') - -MULFUNC_PROLOGUE(mpn_divexact_by3c) -include_mpn(`x86/pentium/diveby3.asm') diff --git a/rts/gmp/mpn/x86/p6/gmp-mparam.h b/rts/gmp/mpn/x86/p6/gmp-mparam.h deleted file mode 100644 index d7bfb6d60c..0000000000 --- a/rts/gmp/mpn/x86/p6/gmp-mparam.h +++ /dev/null @@ -1,96 +0,0 @@ -/* Intel P6 gmp-mparam.h -- Compiler/machine parameter header file. - -Copyright (C) 1991, 1993, 1994, 1999, 2000 Free Software Foundation, Inc. - -This file is part of the GNU MP Library. - -The GNU MP Library is free software; you can redistribute it and/or modify -it under the terms of the GNU Lesser General Public License as published by -the Free Software Foundation; either version 2.1 of the License, or (at your -option) any later version. - -The GNU MP Library is distributed in the hope that it will be useful, but -WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY -or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public -License for more details. - -You should have received a copy of the GNU Lesser General Public License -along with the GNU MP Library; see the file COPYING.LIB. If not, write to -the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, -MA 02111-1307, USA. */ - - -#define BITS_PER_MP_LIMB 32 -#define BYTES_PER_MP_LIMB 4 -#define BITS_PER_LONGINT 32 -#define BITS_PER_INT 32 -#define BITS_PER_SHORTINT 16 -#define BITS_PER_CHAR 8 - - -#ifndef UMUL_TIME -#define UMUL_TIME 5 /* cycles */ -#endif -#ifndef UDIV_TIME -#define UDIV_TIME 39 /* cycles */ -#endif - -#ifndef COUNT_TRAILING_ZEROS_TIME -#define COUNT_TRAILING_ZEROS_TIME 2 /* cycles */ -#endif - - -/* Generated by tuneup.c, 2000-07-06. 
*/ - -#ifndef KARATSUBA_MUL_THRESHOLD -#define KARATSUBA_MUL_THRESHOLD 23 -#endif -#ifndef TOOM3_MUL_THRESHOLD -#define TOOM3_MUL_THRESHOLD 139 -#endif - -#ifndef KARATSUBA_SQR_THRESHOLD -#define KARATSUBA_SQR_THRESHOLD 52 -#endif -#ifndef TOOM3_SQR_THRESHOLD -#define TOOM3_SQR_THRESHOLD 166 -#endif - -#ifndef BZ_THRESHOLD -#define BZ_THRESHOLD 116 -#endif - -#ifndef FIB_THRESHOLD -#define FIB_THRESHOLD 66 -#endif - -#ifndef POWM_THRESHOLD -#define POWM_THRESHOLD 20 -#endif - -#ifndef GCD_ACCEL_THRESHOLD -#define GCD_ACCEL_THRESHOLD 4 -#endif -#ifndef GCDEXT_THRESHOLD -#define GCDEXT_THRESHOLD 54 -#endif - -#ifndef FFT_MUL_TABLE -#define FFT_MUL_TABLE { 592, 1440, 2688, 5632, 14336, 40960, 0 } -#endif -#ifndef FFT_MODF_MUL_THRESHOLD -#define FFT_MODF_MUL_THRESHOLD 608 -#endif -#ifndef FFT_MUL_THRESHOLD -#define FFT_MUL_THRESHOLD 5888 -#endif - -#ifndef FFT_SQR_TABLE -#define FFT_SQR_TABLE { 656, 1504, 2944, 6656, 18432, 57344, 0 } -#endif -#ifndef FFT_MODF_SQR_THRESHOLD -#define FFT_MODF_SQR_THRESHOLD 672 -#endif -#ifndef FFT_SQR_THRESHOLD -#define FFT_SQR_THRESHOLD 5888 -#endif diff --git a/rts/gmp/mpn/x86/p6/mmx/divrem_1.asm b/rts/gmp/mpn/x86/p6/mmx/divrem_1.asm deleted file mode 100644 index f1b011b623..0000000000 --- a/rts/gmp/mpn/x86/p6/mmx/divrem_1.asm +++ /dev/null @@ -1,677 +0,0 @@ -dnl Intel Pentium-II mpn_divrem_1 -- mpn by limb division. -dnl -dnl P6MMX: 25.0 cycles/limb integer part, 17.5 cycles/limb fraction part. - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. 
-dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_divrem_1 (mp_ptr dst, mp_size_t xsize, -C mp_srcptr src, mp_size_t size, -C mp_limb_t divisor); -C mp_limb_t mpn_divrem_1c (mp_ptr dst, mp_size_t xsize, -C mp_srcptr src, mp_size_t size, -C mp_limb_t divisor, mp_limb_t carry); -C -C This code is a lightly reworked version of mpn/x86/k7/mmx/divrem_1.asm, -C see that file for some comments. It's likely what's here can be improved. - - -dnl MUL_THRESHOLD is the value of xsize+size at which the multiply by -dnl inverse method is used, rather than plain "divl"s. Minimum value 1. -dnl -dnl The different speeds of the integer and fraction parts means that using -dnl xsize+size isn't quite right. The threshold wants to be a bit higher -dnl for the integer part and a bit lower for the fraction part. (Or what's -dnl really wanted is to speed up the integer part!) -dnl -dnl The threshold is set to make the integer part right. At 4 limbs the -dnl div and mul are about the same there, but on the fractional part the -dnl mul is much faster. 
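[Editor's note] The MUL_THRESHOLD discussion above is about when to switch from plain divl steps to the multiply-by-inverse method; either way the values computed are the ones in this hedged plain C model (assumed names and layout, shown only to pin down the integer/fraction split):

```c
#include <stdint.h>

/* Hypothetical C model of mpn_divrem_1: returns the remainder; writes
   size integer quotient limbs at dst[xsize..xsize+size-1], then xsize
   further "fraction" limbs at dst[0..xsize-1] produced by shifting
   zero limbs into the remainder. */
static uint32_t ref_divrem_1(uint32_t *dst, int xsize,
                             const uint32_t *src, int size, uint32_t d)
{
    uint64_t r = 0;
    for (int i = size - 1; i >= 0; i--) {      /* integer part */
        uint64_t n = (r << 32) | src[i];
        dst[xsize + i] = (uint32_t)(n / d);
        r = n % d;
    }
    for (int i = xsize - 1; i >= 0; i--) {     /* fraction part: zeros in */
        uint64_t n = r << 32;
        dst[i] = (uint32_t)(n / d);
        r = n % d;
    }
    return (uint32_t)r;
}
```

The fraction steps have no source limb to fetch, which is one reason the fraction loop below runs faster per limb than the integer loop.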
- -deflit(MUL_THRESHOLD, 4) - - -defframe(PARAM_CARRY, 24) -defframe(PARAM_DIVISOR,20) -defframe(PARAM_SIZE, 16) -defframe(PARAM_SRC, 12) -defframe(PARAM_XSIZE, 8) -defframe(PARAM_DST, 4) - -defframe(SAVE_EBX, -4) -defframe(SAVE_ESI, -8) -defframe(SAVE_EDI, -12) -defframe(SAVE_EBP, -16) - -defframe(VAR_NORM, -20) -defframe(VAR_INVERSE, -24) -defframe(VAR_SRC, -28) -defframe(VAR_DST, -32) -defframe(VAR_DST_STOP,-36) - -deflit(STACK_SPACE, 36) - - .text - ALIGN(16) - -PROLOGUE(mpn_divrem_1c) -deflit(`FRAME',0) - movl PARAM_CARRY, %edx - - movl PARAM_SIZE, %ecx - subl $STACK_SPACE, %esp -deflit(`FRAME',STACK_SPACE) - - movl %ebx, SAVE_EBX - movl PARAM_XSIZE, %ebx - - movl %edi, SAVE_EDI - movl PARAM_DST, %edi - - movl %ebp, SAVE_EBP - movl PARAM_DIVISOR, %ebp - - movl %esi, SAVE_ESI - movl PARAM_SRC, %esi - - leal -4(%edi,%ebx,4), %edi - jmp LF(mpn_divrem_1,start_1c) - -EPILOGUE() - - - C offset 0x31, close enough to aligned -PROLOGUE(mpn_divrem_1) -deflit(`FRAME',0) - - movl PARAM_SIZE, %ecx - movl $0, %edx C initial carry (if can't skip a div) - subl $STACK_SPACE, %esp -deflit(`FRAME',STACK_SPACE) - - movl %ebp, SAVE_EBP - movl PARAM_DIVISOR, %ebp - - movl %ebx, SAVE_EBX - movl PARAM_XSIZE, %ebx - - movl %esi, SAVE_ESI - movl PARAM_SRC, %esi - orl %ecx, %ecx - - movl %edi, SAVE_EDI - movl PARAM_DST, %edi - - leal -4(%edi,%ebx,4), %edi C &dst[xsize-1] - jz L(no_skip_div) - - movl -4(%esi,%ecx,4), %eax C src high limb - cmpl %ebp, %eax C one less div if high<divisor - jnb L(no_skip_div) - - movl $0, (%edi,%ecx,4) C dst high limb - decl %ecx C size-1 - movl %eax, %edx C src high limb as initial carry -L(no_skip_div): - - -L(start_1c): - C eax - C ebx xsize - C ecx size - C edx carry - C esi src - C edi &dst[xsize-1] - C ebp divisor - - leal (%ebx,%ecx), %eax C size+xsize - cmpl $MUL_THRESHOLD, %eax - jae L(mul_by_inverse) - - orl %ecx, %ecx - jz L(divide_no_integer) - -L(divide_integer): - C eax scratch (quotient) - C ebx xsize - C ecx counter - C edx scratch 
(remainder) - C esi src - C edi &dst[xsize-1] - C ebp divisor - - movl -4(%esi,%ecx,4), %eax - - divl %ebp - - movl %eax, (%edi,%ecx,4) - decl %ecx - jnz L(divide_integer) - - -L(divide_no_integer): - movl PARAM_DST, %edi - orl %ebx, %ebx - jnz L(divide_fraction) - -L(divide_done): - movl SAVE_ESI, %esi - - movl SAVE_EDI, %edi - - movl SAVE_EBX, %ebx - movl %edx, %eax - - movl SAVE_EBP, %ebp - addl $STACK_SPACE, %esp - - ret - - -L(divide_fraction): - C eax scratch (quotient) - C ebx counter - C ecx - C edx scratch (remainder) - C esi - C edi dst - C ebp divisor - - movl $0, %eax - - divl %ebp - - movl %eax, -4(%edi,%ebx,4) - decl %ebx - jnz L(divide_fraction) - - jmp L(divide_done) - - - -C ----------------------------------------------------------------------------- - -L(mul_by_inverse): - C eax - C ebx xsize - C ecx size - C edx carry - C esi src - C edi &dst[xsize-1] - C ebp divisor - - leal 12(%edi), %ebx - - movl %ebx, VAR_DST_STOP - leal 4(%edi,%ecx,4), %edi C &dst[xsize+size] - - movl %edi, VAR_DST - movl %ecx, %ebx C size - - bsrl %ebp, %ecx C 31-l - movl %edx, %edi C carry - - leal 1(%ecx), %eax C 32-l - xorl $31, %ecx C l - - movl %ecx, VAR_NORM - movl $-1, %edx - - shll %cl, %ebp C d normalized - movd %eax, %mm7 - - movl $-1, %eax - subl %ebp, %edx C (b-d)-1 giving edx:eax = b*(b-d)-1 - - divl %ebp C floor (b*(b-d)-1) / d - - movl %eax, VAR_INVERSE - orl %ebx, %ebx C size - leal -12(%esi,%ebx,4), %eax C &src[size-3] - - movl %eax, VAR_SRC - jz L(start_zero) - - movl 8(%eax), %esi C src high limb - cmpl $1, %ebx - jz L(start_one) - -L(start_two_or_more): - movl 4(%eax), %edx C src second highest limb - - shldl( %cl, %esi, %edi) C n2 = carry,high << l - - shldl( %cl, %edx, %esi) C n10 = high,second << l - - cmpl $2, %ebx - je L(integer_two_left) - jmp L(integer_top) - - -L(start_one): - shldl( %cl, %esi, %edi) C n2 = carry,high << l - - shll %cl, %esi C n10 = high << l - jmp L(integer_one_left) - - -L(start_zero): - shll %cl, %edi C n2 = carry << l - movl 
$0, %esi C n10 = 0 - - C we're here because xsize+size>=MUL_THRESHOLD, so with size==0 then - C must have xsize!=0 - jmp L(fraction_some) - - - -C ----------------------------------------------------------------------------- -C -C This loop runs at about 25 cycles, which is probably sub-optimal, and -C certainly more than the dependent chain would suggest. A better loop, or -C a better rough analysis of what's possible, would be welcomed. -C -C In the current implementation, the following successively dependent -C micro-ops seem to exist. -C -C uops -C n2+n1 1 (addl) -C mul 5 -C q1+1 3 (addl/adcl) -C mul 5 -C sub 3 (subl/sbbl) -C addback 2 (cmov) -C --- -C 19 -C -C Lack of registers hinders explicit scheduling and it might be that the -C normal out of order execution isn't able to hide enough under the mul -C latencies. -C -C Using sarl/negl to pick out n1 for the n2+n1 stage is a touch faster than -C cmov (and takes one uop off the dependent chain). A sarl/andl/addl -C combination was tried for the addback (despite the fact it would lengthen -C the dependent chain) but found to be no faster. 
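[Editor's note] The dependent chain listed above is one division step by a precomputed inverse: the nadj, m*(n2+n1), q1+1, subtract, and cmov-addback micro-ops in the table. A hedged C model of one step, with assumed names; m is what the setup code computes as floor((b*(b-d)-1)/d) with b = 2^32 and d normalized (top bit set):

```c
#include <stdint.h>

/* One hypothetical integer-part step: divide the two-limb value n2:n10
   by the normalized divisor d, given m = floor((2^32*(2^32-d)-1)/d).
   Requires n2 < d, which the remainder invariant guarantees.  Using a
   64-bit q1+1 also absorbs the asm's special q1=0xFFFFFFFF case. */
static uint32_t div_step(uint32_t *rem, uint32_t n2, uint32_t n10,
                         uint32_t d, uint32_t m)
{
    uint32_t n1   = n10 >> 31;                      /* sign bit of n10 */
    uint32_t nadj = n10 + (n1 ? d : 0);             /* mod 2^32, as in the asm */
    uint64_t p    = (uint64_t)m * (n2 + n1) + nadj; /* m*(n2+n1) + nadj */
    uint64_t q1p1 = (uint64_t)n2 + (uint32_t)(p >> 32) + 1;  /* q1 + 1 */
    uint64_t n    = ((uint64_t)n2 << 32) | n10;
    uint64_t sub  = (uint64_t)q1p1 * d;             /* (q1+1)*d */
    if (n >= sub) {                                 /* no underflow: q = q1+1 */
        *rem = (uint32_t)(n - sub);
        return (uint32_t)q1p1;
    }
    *rem = (uint32_t)(n - sub + d);                 /* the cmovc addback */
    return (uint32_t)(q1p1 - 1);                    /* q = q1 */
}
```

The guarantee that the true quotient is q1 or q1+1 is the invariant-divisor property the setup division establishes; the loop only has to test for one unit of overshoot.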
- - - ALIGN(16) -L(integer_top): - C eax scratch - C ebx scratch (nadj, q1) - C ecx scratch (src, dst) - C edx scratch - C esi n10 - C edi n2 - C ebp d - C - C mm0 scratch (src qword) - C mm7 rshift for normalization - - movl %esi, %eax - movl %ebp, %ebx - - sarl $31, %eax C -n1 - movl VAR_SRC, %ecx - - andl %eax, %ebx C -n1 & d - negl %eax C n1 - - addl %esi, %ebx C nadj = n10 + (-n1 & d), ignoring overflow - addl %edi, %eax C n2+n1 - movq (%ecx), %mm0 C next src limb and the one below it - - mull VAR_INVERSE C m*(n2+n1) - - subl $4, %ecx - - movl %ecx, VAR_SRC - - C - - C - - addl %ebx, %eax C m*(n2+n1) + nadj, low giving carry flag - movl %ebp, %eax C d - leal 1(%edi), %ebx C n2<<32 + m*(n2+n1)) - - adcl %edx, %ebx C 1 + high(n2<<32 + m*(n2+n1) + nadj) = q1+1 - jz L(q1_ff) - - mull %ebx C (q1+1)*d - - movl VAR_DST, %ecx - psrlq %mm7, %mm0 - - C - - C - - C - - subl %eax, %esi - movl VAR_DST_STOP, %eax - - sbbl %edx, %edi C n - (q1+1)*d - movl %esi, %edi C remainder -> n2 - leal (%ebp,%esi), %edx - - cmovc( %edx, %edi) C n - q1*d if underflow from using q1+1 - movd %mm0, %esi - - sbbl $0, %ebx C q - subl $4, %ecx - - movl %ebx, (%ecx) - cmpl %eax, %ecx - - movl %ecx, VAR_DST - jne L(integer_top) - - -L(integer_loop_done): - - -C ----------------------------------------------------------------------------- -C -C Here, and in integer_one_left below, an sbbl $0 is used rather than a jz -C q1_ff special case. This make the code a bit smaller and simpler, and -C costs only 2 cycles (each). 
- -L(integer_two_left): - C eax scratch - C ebx scratch (nadj, q1) - C ecx scratch (src, dst) - C edx scratch - C esi n10 - C edi n2 - C ebp divisor - C - C mm0 src limb, shifted - C mm7 rshift - - - movl %esi, %eax - movl %ebp, %ebx - - sarl $31, %eax C -n1 - movl PARAM_SRC, %ecx - - andl %eax, %ebx C -n1 & d - negl %eax C n1 - - addl %esi, %ebx C nadj = n10 + (-n1 & d), ignoring overflow - addl %edi, %eax C n2+n1 - - mull VAR_INVERSE C m*(n2+n1) - - movd (%ecx), %mm0 C src low limb - - movl VAR_DST_STOP, %ecx - - C - - C - - addl %ebx, %eax C m*(n2+n1) + nadj, low giving carry flag - leal 1(%edi), %ebx C n2<<32 + m*(n2+n1)) - movl %ebp, %eax C d - - adcl %edx, %ebx C 1 + high(n2<<32 + m*(n2+n1) + nadj) = q1+1 - - sbbl $0, %ebx - - mull %ebx C (q1+1)*d - - psllq $32, %mm0 - - psrlq %mm7, %mm0 - - C - - C - - subl %eax, %esi - - sbbl %edx, %edi C n - (q1+1)*d - movl %esi, %edi C remainder -> n2 - leal (%ebp,%esi), %edx - - cmovc( %edx, %edi) C n - q1*d if underflow from using q1+1 - movd %mm0, %esi - - sbbl $0, %ebx C q - - movl %ebx, -4(%ecx) - - -C ----------------------------------------------------------------------------- -L(integer_one_left): - C eax scratch - C ebx scratch (nadj, q1) - C ecx scratch (dst) - C edx scratch - C esi n10 - C edi n2 - C ebp divisor - C - C mm0 src limb, shifted - C mm7 rshift - - - movl %esi, %eax - movl %ebp, %ebx - - sarl $31, %eax C -n1 - movl VAR_DST_STOP, %ecx - - andl %eax, %ebx C -n1 & d - negl %eax C n1 - - addl %esi, %ebx C nadj = n10 + (-n1 & d), ignoring overflow - addl %edi, %eax C n2+n1 - - mull VAR_INVERSE C m*(n2+n1) - - C - - C - - C - - addl %ebx, %eax C m*(n2+n1) + nadj, low giving carry flag - leal 1(%edi), %ebx C n2<<32 + m*(n2+n1)) - movl %ebp, %eax C d - - C - - adcl %edx, %ebx C 1 + high(n2<<32 + m*(n2+n1) + nadj) = q1+1 - - sbbl $0, %ebx C q1 if q1+1 overflowed - - mull %ebx - - C - - C - - C - - C - - subl %eax, %esi - movl PARAM_XSIZE, %eax - - sbbl %edx, %edi C n - (q1+1)*d - movl %esi, %edi C remainder 
-> n2 - leal (%ebp,%esi), %edx - - cmovc( %edx, %edi) C n - q1*d if underflow from using q1+1 - - sbbl $0, %ebx C q - - movl %ebx, -8(%ecx) - subl $8, %ecx - - - - orl %eax, %eax C xsize - jnz L(fraction_some) - - movl %edi, %eax -L(fraction_done): - movl VAR_NORM, %ecx - movl SAVE_EBP, %ebp - - movl SAVE_EDI, %edi - - movl SAVE_ESI, %esi - - movl SAVE_EBX, %ebx - addl $STACK_SPACE, %esp - - shrl %cl, %eax - emms - - ret - - -C ----------------------------------------------------------------------------- -C -C Special case for q1=0xFFFFFFFF, giving q=0xFFFFFFFF meaning the low dword -C of q*d is simply -d and the remainder n-q*d = n10+d - -L(q1_ff): - C eax (divisor) - C ebx (q1+1 == 0) - C ecx - C edx - C esi n10 - C edi n2 - C ebp divisor - - movl VAR_DST, %ecx - movl VAR_DST_STOP, %edx - subl $4, %ecx - - movl %ecx, VAR_DST - psrlq %mm7, %mm0 - leal (%ebp,%esi), %edi C n-q*d remainder -> next n2 - - movl $-1, (%ecx) - movd %mm0, %esi C next n10 - - cmpl %ecx, %edx - jne L(integer_top) - - jmp L(integer_loop_done) - - - -C ----------------------------------------------------------------------------- -C -C In the current implementation, the following successively dependent -C micro-ops seem to exist. -C -C uops -C mul 5 -C q1+1 1 (addl) -C mul 5 -C sub 3 (negl/sbbl) -C addback 2 (cmov) -C --- -C 16 -C -C The loop in fact runs at about 17.5 cycles. Using a sarl/andl/addl for -C the addback was found to be a touch slower. 
- - - ALIGN(16) -L(fraction_some): - C eax - C ebx - C ecx - C edx - C esi - C edi carry - C ebp divisor - - movl PARAM_DST, %esi - movl VAR_DST_STOP, %ecx - movl %edi, %eax - - subl $8, %ecx - - - ALIGN(16) -L(fraction_top): - C eax n2, then scratch - C ebx scratch (nadj, q1) - C ecx dst, decrementing - C edx scratch - C esi dst stop point - C edi n2 - C ebp divisor - - mull VAR_INVERSE C m*n2 - - movl %ebp, %eax C d - subl $4, %ecx C dst - leal 1(%edi), %ebx - - C - - C - - C - - addl %edx, %ebx C 1 + high(n2<<32 + m*n2) = q1+1 - - mull %ebx C (q1+1)*d - - C - - C - - C - - C - - negl %eax C low of n - (q1+1)*d - - sbbl %edx, %edi C high of n - (q1+1)*d, caring only about carry - leal (%ebp,%eax), %edx - - cmovc( %edx, %eax) C n - q1*d if underflow from using q1+1 - - sbbl $0, %ebx C q - movl %eax, %edi C remainder->n2 - cmpl %esi, %ecx - - movl %ebx, (%ecx) C previous q - jne L(fraction_top) - - - jmp L(fraction_done) - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/p6/mmx/mod_1.asm b/rts/gmp/mpn/x86/p6/mmx/mod_1.asm deleted file mode 100644 index e7d8d94d33..0000000000 --- a/rts/gmp/mpn/x86/p6/mmx/mod_1.asm +++ /dev/null @@ -1,444 +0,0 @@ -dnl Intel Pentium-II mpn_mod_1 -- mpn by limb remainder. -dnl -dnl P6MMX: 24.0 cycles/limb. - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. 
-dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_mod_1 (mp_srcptr src, mp_size_t size, mp_limb_t divisor); -C mp_limb_t mpn_mod_1c (mp_srcptr src, mp_size_t size, mp_limb_t divisor, -C mp_limb_t carry); -C -C The code here very similar to mpn_divrem_1, but with the quotient -C discarded. What's here probably isn't optimal. -C -C See mpn/x86/p6/mmx/divrem_1.c and mpn/x86/k7/mmx/mod_1.asm for some -C comments. - - -dnl MUL_THRESHOLD is the size at which the multiply by inverse method is -dnl used, rather than plain "divl"s. Minimum value 2. - -deflit(MUL_THRESHOLD, 4) - - -defframe(PARAM_CARRY, 16) -defframe(PARAM_DIVISOR,12) -defframe(PARAM_SIZE, 8) -defframe(PARAM_SRC, 4) - -defframe(SAVE_EBX, -4) -defframe(SAVE_ESI, -8) -defframe(SAVE_EDI, -12) -defframe(SAVE_EBP, -16) - -defframe(VAR_NORM, -20) -defframe(VAR_INVERSE, -24) -defframe(VAR_SRC_STOP,-28) - -deflit(STACK_SPACE, 28) - - .text - ALIGN(16) - -PROLOGUE(mpn_mod_1c) -deflit(`FRAME',0) - movl PARAM_CARRY, %edx - movl PARAM_SIZE, %ecx - subl $STACK_SPACE, %esp -deflit(`FRAME',STACK_SPACE) - - movl %ebp, SAVE_EBP - movl PARAM_DIVISOR, %ebp - - movl %esi, SAVE_ESI - movl PARAM_SRC, %esi - jmp LF(mpn_mod_1,start_1c) - -EPILOGUE() - - - ALIGN(16) -PROLOGUE(mpn_mod_1) -deflit(`FRAME',0) - - movl $0, %edx C initial carry (if can't skip a div) - movl PARAM_SIZE, %ecx - subl $STACK_SPACE, %esp -deflit(`FRAME',STACK_SPACE) - - movl %esi, SAVE_ESI - movl PARAM_SRC, %esi - - movl %ebp, SAVE_EBP - movl PARAM_DIVISOR, %ebp - - orl %ecx, %ecx - jz L(divide_done) - - movl -4(%esi,%ecx,4), %eax C src high limb - - cmpl %ebp, %eax C carry flag if high<divisor - - cmovc( %eax, %edx) C src high limb as initial carry - sbbl $0, %ecx C size-1 to skip one div - jz 
L(divide_done) - - - ALIGN(16) -L(start_1c): - C eax - C ebx - C ecx size - C edx carry - C esi src - C edi - C ebp divisor - - cmpl $MUL_THRESHOLD, %ecx - jae L(mul_by_inverse) - - - orl %ecx, %ecx - jz L(divide_done) - - -L(divide_top): - C eax scratch (quotient) - C ebx - C ecx counter, limbs, decrementing - C edx scratch (remainder) - C esi src - C edi - C ebp - - movl -4(%esi,%ecx,4), %eax - - divl %ebp - - decl %ecx - jnz L(divide_top) - - -L(divide_done): - movl SAVE_ESI, %esi - movl %edx, %eax - - movl SAVE_EBP, %ebp - addl $STACK_SPACE, %esp - - ret - - - -C ----------------------------------------------------------------------------- - -L(mul_by_inverse): - C eax - C ebx - C ecx size - C edx carry - C esi src - C edi - C ebp divisor - - movl %ebx, SAVE_EBX - leal -4(%esi), %ebx - - movl %ebx, VAR_SRC_STOP - movl %ecx, %ebx C size - - movl %edi, SAVE_EDI - movl %edx, %edi C carry - - bsrl %ebp, %ecx C 31-l - movl $-1, %edx - - leal 1(%ecx), %eax C 32-l - xorl $31, %ecx C l - - movl %ecx, VAR_NORM - shll %cl, %ebp C d normalized - - movd %eax, %mm7 - movl $-1, %eax - subl %ebp, %edx C (b-d)-1 so edx:eax = b*(b-d)-1 - - divl %ebp C floor (b*(b-d)-1) / d - - C - - movl %eax, VAR_INVERSE - leal -12(%esi,%ebx,4), %eax C &src[size-3] - - movl 8(%eax), %esi C src high limb - movl 4(%eax), %edx C src second highest limb - - shldl( %cl, %esi, %edi) C n2 = carry,high << l - - shldl( %cl, %edx, %esi) C n10 = high,second << l - - movl %eax, %ecx C &src[size-3] - - -ifelse(MUL_THRESHOLD,2,` - cmpl $2, %ebx - je L(inverse_two_left) -') - - -C The dependent chain here is the same as in mpn_divrem_1, but a few -C instructions are saved by not needing to store the quotient limbs. This -C gets it down to 24 c/l, which is still a bit away from a theoretical 19 -C c/l. 
- - ALIGN(16) -L(inverse_top): - C eax scratch - C ebx scratch (nadj, q1) - C ecx src pointer, decrementing - C edx scratch - C esi n10 - C edi n2 - C ebp divisor - C - C mm0 scratch (src qword) - C mm7 rshift for normalization - - - movl %esi, %eax - movl %ebp, %ebx - - sarl $31, %eax C -n1 - - andl %eax, %ebx C -n1 & d - negl %eax C n1 - - addl %esi, %ebx C nadj = n10 + (-n1 & d), ignoring overflow - addl %edi, %eax C n2+n1 - - mull VAR_INVERSE C m*(n2+n1) - - movq (%ecx), %mm0 C next src limb and the one below it - subl $4, %ecx - - C - - C - - C - - addl %ebx, %eax C m*(n2+n1) + nadj, low giving carry flag - leal 1(%edi), %ebx C n2<<32 + m*(n2+n1)) - movl %ebp, %eax C d - - adcl %edx, %ebx C 1 + high(n2<<32 + m*(n2+n1) + nadj) = q1+1 - jz L(q1_ff) - - mull %ebx C (q1+1)*d - - psrlq %mm7, %mm0 - movl VAR_SRC_STOP, %ebx - - C - - C - - C - - subl %eax, %esi - - sbbl %edx, %edi C n - (q1+1)*d - movl %esi, %edi C remainder -> n2 - leal (%ebp,%esi), %edx - - cmovc( %edx, %edi) C n - q1*d if underflow from using q1+1 - movd %mm0, %esi - cmpl %ebx, %ecx - - jne L(inverse_top) - - -L(inverse_loop_done): - - -C ----------------------------------------------------------------------------- - -L(inverse_two_left): - C eax scratch - C ebx scratch (nadj, q1) - C ecx &src[-1] - C edx scratch - C esi n10 - C edi n2 - C ebp divisor - C - C mm0 scratch (src dword) - C mm7 rshift - - movl %esi, %eax - movl %ebp, %ebx - - sarl $31, %eax C -n1 - - andl %eax, %ebx C -n1 & d - negl %eax C n1 - - addl %esi, %ebx C nadj = n10 + (-n1 & d), ignoring overflow - addl %edi, %eax C n2+n1 - - mull VAR_INVERSE C m*(n2+n1) - - movd 4(%ecx), %mm0 C src low limb - - C - - C - - C - - addl %ebx, %eax C m*(n2+n1) + nadj, low giving carry flag - leal 1(%edi), %ebx C n2<<32 + m*(n2+n1)) - - adcl %edx, %ebx C 1 + high(n2<<32 + m*(n2+n1) + nadj) = q1+1 - - sbbl $0, %ebx - movl %ebp, %eax C d - - mull %ebx C (q1+1)*d - - psllq $32, %mm0 - - psrlq %mm7, %mm0 - - C - - C - - subl %eax, %esi - - sbbl %edx, 
%edi C n - (q1+1)*d - movl %esi, %edi C remainder -> n2 - leal (%ebp,%esi), %edx - - cmovc( %edx, %edi) C n - q1*d if underflow from using q1+1 - movd %mm0, %esi - - -C One limb left - - C eax scratch - C ebx scratch (nadj, q1) - C ecx - C edx scratch - C esi n10 - C edi n2 - C ebp divisor - C - C mm0 src limb, shifted - C mm7 rshift - - movl %esi, %eax - movl %ebp, %ebx - - sarl $31, %eax C -n1 - - andl %eax, %ebx C -n1 & d - negl %eax C n1 - - addl %esi, %ebx C nadj = n10 + (-n1 & d), ignoring overflow - addl %edi, %eax C n2+n1 - - mull VAR_INVERSE C m*(n2+n1) - - movl VAR_NORM, %ecx C for final denorm - - C - - C - - C - - addl %ebx, %eax C m*(n2+n1) + nadj, low giving carry flag - leal 1(%edi), %ebx C n2<<32 + m*(n2+n1)) - - adcl %edx, %ebx C 1 + high(n2<<32 + m*(n2+n1) + nadj) = q1+1 - - sbbl $0, %ebx - movl %ebp, %eax C d - - mull %ebx C (q1+1)*d - - movl SAVE_EBX, %ebx - - C - - C - - C - - subl %eax, %esi - - sbbl %edx, %edi C n - (q1+1)*d - leal (%ebp,%esi), %edx - movl SAVE_EBP, %ebp - - movl %esi, %eax C remainder - movl SAVE_ESI, %esi - - cmovc( %edx, %eax) C n - q1*d if underflow from using q1+1 - movl SAVE_EDI, %edi - - shrl %cl, %eax C denorm remainder - addl $STACK_SPACE, %esp - emms - - ret - - -C ----------------------------------------------------------------------------- -C -C Special case for q1=0xFFFFFFFF, giving q=0xFFFFFFFF meaning the low dword -C of q*d is simply -d and the remainder n-q*d = n10+d - -L(q1_ff): - C eax (divisor) - C ebx (q1+1 == 0) - C ecx src pointer - C edx - C esi n10 - C edi (n2) - C ebp divisor - - leal (%ebp,%esi), %edi C n-q*d remainder -> next n2 - movl VAR_SRC_STOP, %edx - psrlq %mm7, %mm0 - - movd %mm0, %esi C next n10 - cmpl %ecx, %edx - jne L(inverse_top) - - jmp L(inverse_loop_done) - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/p6/mmx/popham.asm b/rts/gmp/mpn/x86/p6/mmx/popham.asm deleted file mode 100644 index 50f9a11218..0000000000 --- a/rts/gmp/mpn/x86/p6/mmx/popham.asm +++ /dev/null @@ -1,31 +0,0 @@ -dnl Intel 
Pentium-II mpn_popcount, mpn_hamdist -- population count and -dnl hamming distance. -dnl -dnl P6MMX: popcount 11 cycles/limb (approx), hamdist 11.5 cycles/limb -dnl (approx) - - -dnl Copyright (C) 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - -MULFUNC_PROLOGUE(mpn_popcount mpn_hamdist) -include_mpn(`x86/k6/mmx/popham.asm') diff --git a/rts/gmp/mpn/x86/p6/p3mmx/popham.asm b/rts/gmp/mpn/x86/p6/p3mmx/popham.asm deleted file mode 100644 index e63fbf334b..0000000000 --- a/rts/gmp/mpn/x86/p6/p3mmx/popham.asm +++ /dev/null @@ -1,30 +0,0 @@ -dnl Intel Pentium-III mpn_popcount, mpn_hamdist -- population count and -dnl hamming distance. - -dnl Copyright (C) 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. 
-dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -dnl Haven't actually measured it, but the K7 code with the psadbw should be -dnl good on P-III. - -include(`../config.m4') - -MULFUNC_PROLOGUE(mpn_popcount mpn_hamdist) -include_mpn(`x86/k7/mmx/popham.asm') diff --git a/rts/gmp/mpn/x86/p6/sqr_basecase.asm b/rts/gmp/mpn/x86/p6/sqr_basecase.asm deleted file mode 100644 index 174c78406a..0000000000 --- a/rts/gmp/mpn/x86/p6/sqr_basecase.asm +++ /dev/null @@ -1,641 +0,0 @@ -dnl Intel P6 mpn_sqr_basecase -- square an mpn number. -dnl -dnl P6: approx 4.0 cycles per cross product, or 7.75 cycles per triangular -dnl product (measured on the speed difference between 20 and 40 limbs, -dnl which is the Karatsuba recursing range). - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. 
-dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -dnl These are the same as in mpn/x86/k6/sqr_basecase.asm, see that file for -dnl a description. The only difference here is that UNROLL_COUNT can go up -dnl to 64 (not 63) making KARATSUBA_SQR_THRESHOLD_MAX 67. - -deflit(KARATSUBA_SQR_THRESHOLD_MAX, 67) - -ifdef(`KARATSUBA_SQR_THRESHOLD_OVERRIDE', -`define(`KARATSUBA_SQR_THRESHOLD',KARATSUBA_SQR_THRESHOLD_OVERRIDE)') - -m4_config_gmp_mparam(`KARATSUBA_SQR_THRESHOLD') -deflit(UNROLL_COUNT, eval(KARATSUBA_SQR_THRESHOLD-3)) - - -C void mpn_sqr_basecase (mp_ptr dst, mp_srcptr src, mp_size_t size); -C -C The algorithm is basically the same as mpn/generic/sqr_basecase.c, but a -C lot of function call overheads are avoided, especially when the given size -C is small. -C -C The code size might look a bit excessive, but not all of it is executed so -C it won't all get into the code cache. The 1x1, 2x2 and 3x3 special cases -C clearly apply only to those sizes; mid sizes like 10x10 only need part of -C the unrolled addmul; and big sizes like 40x40 that do use the full -C unrolling will least be making good use of it, because 40x40 will take -C something like 7000 cycles. 
- -defframe(PARAM_SIZE,12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(32) -PROLOGUE(mpn_sqr_basecase) -deflit(`FRAME',0) - - movl PARAM_SIZE, %edx - - movl PARAM_SRC, %eax - - cmpl $2, %edx - movl PARAM_DST, %ecx - je L(two_limbs) - - movl (%eax), %eax - ja L(three_or_more) - - -C ----------------------------------------------------------------------------- -C one limb only - C eax src limb - C ebx - C ecx dst - C edx - - mull %eax - - movl %eax, (%ecx) - movl %edx, 4(%ecx) - - ret - - -C ----------------------------------------------------------------------------- -L(two_limbs): - C eax src - C ebx - C ecx dst - C edx - -defframe(SAVE_ESI, -4) -defframe(SAVE_EBX, -8) -defframe(SAVE_EDI, -12) -defframe(SAVE_EBP, -16) -deflit(`STACK_SPACE',16) - - subl $STACK_SPACE, %esp -deflit(`FRAME',STACK_SPACE) - - movl %esi, SAVE_ESI - movl %eax, %esi - movl (%eax), %eax - - mull %eax C src[0]^2 - - movl %eax, (%ecx) C dst[0] - movl 4(%esi), %eax - - movl %ebx, SAVE_EBX - movl %edx, %ebx C dst[1] - - mull %eax C src[1]^2 - - movl %edi, SAVE_EDI - movl %eax, %edi C dst[2] - movl (%esi), %eax - - movl %ebp, SAVE_EBP - movl %edx, %ebp C dst[3] - - mull 4(%esi) C src[0]*src[1] - - addl %eax, %ebx - movl SAVE_ESI, %esi - - adcl %edx, %edi - - adcl $0, %ebp - addl %ebx, %eax - movl SAVE_EBX, %ebx - - adcl %edi, %edx - movl SAVE_EDI, %edi - - adcl $0, %ebp - - movl %eax, 4(%ecx) - - movl %ebp, 12(%ecx) - movl SAVE_EBP, %ebp - - movl %edx, 8(%ecx) - addl $FRAME, %esp - - ret - - -C ----------------------------------------------------------------------------- -L(three_or_more): - C eax src low limb - C ebx - C ecx dst - C edx size -deflit(`FRAME',0) - - pushl %esi defframe_pushl(`SAVE_ESI') - cmpl $4, %edx - - movl PARAM_SRC, %esi - jae L(four_or_more) - - -C ----------------------------------------------------------------------------- -C three limbs - - C eax src low limb - C ebx - C ecx dst - C edx - C esi src - C edi - C ebp - - pushl %ebp 
defframe_pushl(`SAVE_EBP') - pushl %edi defframe_pushl(`SAVE_EDI') - - mull %eax C src[0] ^ 2 - - movl %eax, (%ecx) - movl %edx, 4(%ecx) - - movl 4(%esi), %eax - xorl %ebp, %ebp - - mull %eax C src[1] ^ 2 - - movl %eax, 8(%ecx) - movl %edx, 12(%ecx) - movl 8(%esi), %eax - - pushl %ebx defframe_pushl(`SAVE_EBX') - - mull %eax C src[2] ^ 2 - - movl %eax, 16(%ecx) - movl %edx, 20(%ecx) - - movl (%esi), %eax - - mull 4(%esi) C src[0] * src[1] - - movl %eax, %ebx - movl %edx, %edi - - movl (%esi), %eax - - mull 8(%esi) C src[0] * src[2] - - addl %eax, %edi - movl %edx, %ebp - - adcl $0, %ebp - movl 4(%esi), %eax - - mull 8(%esi) C src[1] * src[2] - - xorl %esi, %esi - addl %eax, %ebp - - C eax - C ebx dst[1] - C ecx dst - C edx dst[4] - C esi zero, will be dst[5] - C edi dst[2] - C ebp dst[3] - - adcl $0, %edx - addl %ebx, %ebx - - adcl %edi, %edi - - adcl %ebp, %ebp - - adcl %edx, %edx - movl 4(%ecx), %eax - - adcl $0, %esi - addl %ebx, %eax - - movl %eax, 4(%ecx) - movl 8(%ecx), %eax - - adcl %edi, %eax - movl 12(%ecx), %ebx - - adcl %ebp, %ebx - movl 16(%ecx), %edi - - movl %eax, 8(%ecx) - movl SAVE_EBP, %ebp - - movl %ebx, 12(%ecx) - movl SAVE_EBX, %ebx - - adcl %edx, %edi - movl 20(%ecx), %eax - - movl %edi, 16(%ecx) - movl SAVE_EDI, %edi - - adcl %esi, %eax C no carry out of this - movl SAVE_ESI, %esi - - movl %eax, 20(%ecx) - addl $FRAME, %esp - - ret - - - -C ----------------------------------------------------------------------------- -defframe(VAR_COUNTER,-20) -defframe(VAR_JMP, -24) -deflit(`STACK_SPACE',24) - -L(four_or_more): - C eax src low limb - C ebx - C ecx - C edx size - C esi src - C edi - C ebp -deflit(`FRAME',4) dnl %esi already pushed - -C First multiply src[0]*src[1..size-1] and store at dst[1..size]. 
- - subl $STACK_SPACE-FRAME, %esp -deflit(`FRAME',STACK_SPACE) - movl $1, %ecx - - movl %edi, SAVE_EDI - movl PARAM_DST, %edi - - movl %ebx, SAVE_EBX - subl %edx, %ecx C -(size-1) - - movl %ebp, SAVE_EBP - movl $0, %ebx C initial carry - - leal (%esi,%edx,4), %esi C &src[size] - movl %eax, %ebp C multiplier - - leal -4(%edi,%edx,4), %edi C &dst[size-1] - - -C This loop runs at just over 6 c/l. - -L(mul_1): - C eax scratch - C ebx carry - C ecx counter, limbs, negative, -(size-1) to -1 - C edx scratch - C esi &src[size] - C edi &dst[size-1] - C ebp multiplier - - movl %ebp, %eax - - mull (%esi,%ecx,4) - - addl %ebx, %eax - movl $0, %ebx - - adcl %edx, %ebx - movl %eax, 4(%edi,%ecx,4) - - incl %ecx - jnz L(mul_1) - - - movl %ebx, 4(%edi) - - -C Addmul src[n]*src[n+1..size-1] at dst[2*n-1...], for each n=1..size-2. -C -C The last two addmuls, which are the bottom right corner of the product -C triangle, are left to the end. These are src[size-3]*src[size-2,size-1] -C and src[size-2]*src[size-1]. If size is 4 then it's only these corner -C cases that need to be done. -C -C The unrolled code is the same as mpn_addmul_1(), see that routine for some -C comments. -C -C VAR_COUNTER is the outer loop, running from -(size-4) to -1, inclusive. -C -C VAR_JMP is the computed jump into the unrolled code, stepped by one code -C chunk each outer loop. - -dnl This is also hard-coded in the address calculation below. -deflit(CODE_BYTES_PER_LIMB, 15) - -dnl With &src[size] and &dst[size-1] pointers, the displacements in the -dnl unrolled code fit in a byte for UNROLL_COUNT values up to 32, but above -dnl that an offset must be added to them. 
-deflit(OFFSET, -ifelse(eval(UNROLL_COUNT>32),1, -eval((UNROLL_COUNT-32)*4), -0)) - - C eax - C ebx carry - C ecx - C edx - C esi &src[size] - C edi &dst[size-1] - C ebp - - movl PARAM_SIZE, %ecx - - subl $4, %ecx - jz L(corner) - - movl %ecx, %edx - negl %ecx - - shll $4, %ecx -ifelse(OFFSET,0,,`subl $OFFSET, %esi') - -ifdef(`PIC',` - call L(pic_calc) -L(here): -',` - leal L(unroll_inner_end)-eval(2*CODE_BYTES_PER_LIMB)(%ecx,%edx), %ecx -') - negl %edx - -ifelse(OFFSET,0,,`subl $OFFSET, %edi') - - C The calculated jump mustn't be before the start of the available - C code. This is the limit that UNROLL_COUNT puts on the src operand - C size, but checked here using the jump address directly. - - ASSERT(ae, - `movl_text_address( L(unroll_inner_start), %eax) - cmpl %eax, %ecx') - - -C ----------------------------------------------------------------------------- - ALIGN(16) -L(unroll_outer_top): - C eax - C ebx high limb to store - C ecx VAR_JMP - C edx VAR_COUNTER, limbs, negative - C esi &src[size], constant - C edi dst ptr, second highest limb of last addmul - C ebp - - movl -12+OFFSET(%esi,%edx,4), %ebp C multiplier - movl %edx, VAR_COUNTER - - movl -8+OFFSET(%esi,%edx,4), %eax C first limb of multiplicand - - mull %ebp - -define(cmovX,`ifelse(eval(UNROLL_COUNT%2),1,`cmovz($@)',`cmovnz($@)')') - - testb $1, %cl - - movl %edx, %ebx C high carry - leal 4(%edi), %edi - - movl %ecx, %edx C jump - - movl %eax, %ecx C low carry - leal CODE_BYTES_PER_LIMB(%edx), %edx - - cmovX( %ebx, %ecx) C high carry reverse - cmovX( %eax, %ebx) C low carry reverse - movl %edx, VAR_JMP - jmp *%edx - - - C Must be on an even address here so the low bit of the jump address - C will indicate which way around ecx/ebx should start. 
- - ALIGN(2) - -L(unroll_inner_start): - C eax scratch - C ebx carry high - C ecx carry low - C edx scratch - C esi src pointer - C edi dst pointer - C ebp multiplier - C - C 15 code bytes each limb - C ecx/ebx reversed on each chunk - -forloop(`i', UNROLL_COUNT, 1, ` - deflit(`disp_src', eval(-i*4 + OFFSET)) - deflit(`disp_dst', eval(disp_src)) - - m4_assert(`disp_src>=-128 && disp_src<128') - m4_assert(`disp_dst>=-128 && disp_dst<128') - -ifelse(eval(i%2),0,` -Zdisp( movl, disp_src,(%esi), %eax) - mull %ebp -Zdisp( addl, %ebx, disp_dst,(%edi)) - adcl %eax, %ecx - movl %edx, %ebx - adcl $0, %ebx -',` - dnl this one comes out last -Zdisp( movl, disp_src,(%esi), %eax) - mull %ebp -Zdisp( addl, %ecx, disp_dst,(%edi)) - adcl %eax, %ebx - movl %edx, %ecx - adcl $0, %ecx -') -') -L(unroll_inner_end): - - addl %ebx, m4_empty_if_zero(OFFSET)(%edi) - - movl VAR_COUNTER, %edx - adcl $0, %ecx - - movl %ecx, m4_empty_if_zero(OFFSET+4)(%edi) - movl VAR_JMP, %ecx - - incl %edx - jnz L(unroll_outer_top) - - -ifelse(OFFSET,0,,` - addl $OFFSET, %esi - addl $OFFSET, %edi -') - - -C ----------------------------------------------------------------------------- - ALIGN(16) -L(corner): - C eax - C ebx - C ecx - C edx - C esi &src[size] - C edi &dst[2*size-5] - C ebp - - movl -12(%esi), %eax - - mull -8(%esi) - - addl %eax, (%edi) - movl -12(%esi), %eax - movl $0, %ebx - - adcl %edx, %ebx - - mull -4(%esi) - - addl %eax, %ebx - movl -8(%esi), %eax - - adcl $0, %edx - - addl %ebx, 4(%edi) - movl $0, %ebx - - adcl %edx, %ebx - - mull -4(%esi) - - movl PARAM_SIZE, %ecx - addl %ebx, %eax - - adcl $0, %edx - - movl %eax, 8(%edi) - - movl %edx, 12(%edi) - movl PARAM_DST, %edi - - -C Left shift of dst[1..2*size-2], the bit shifted out becomes dst[2*size-1]. 
- - subl $1, %ecx C size-1 - xorl %eax, %eax C ready for final adcl, and clear carry - - movl %ecx, %edx - movl PARAM_SRC, %esi - - -L(lshift): - C eax - C ebx - C ecx counter, size-1 to 1 - C edx size-1 (for later use) - C esi src (for later use) - C edi dst, incrementing - C ebp - - rcll 4(%edi) - rcll 8(%edi) - - leal 8(%edi), %edi - decl %ecx - jnz L(lshift) - - - adcl %eax, %eax - - movl %eax, 4(%edi) C dst most significant limb - movl (%esi), %eax C src[0] - - leal 4(%esi,%edx,4), %esi C &src[size] - subl %edx, %ecx C -(size-1) - - -C Now add in the squares on the diagonal, src[0]^2, src[1]^2, ..., -C src[size-1]^2. dst[0] hasn't yet been set at all yet, and just gets the -C low limb of src[0]^2. - - - mull %eax - - movl %eax, (%edi,%ecx,8) C dst[0] - - -L(diag): - C eax scratch - C ebx scratch - C ecx counter, negative - C edx carry - C esi &src[size] - C edi dst[2*size-2] - C ebp - - movl (%esi,%ecx,4), %eax - movl %edx, %ebx - - mull %eax - - addl %ebx, 4(%edi,%ecx,8) - adcl %eax, 8(%edi,%ecx,8) - adcl $0, %edx - - incl %ecx - jnz L(diag) - - - movl SAVE_ESI, %esi - movl SAVE_EBX, %ebx - - addl %edx, 4(%edi) C dst most significant limb - - movl SAVE_EDI, %edi - movl SAVE_EBP, %ebp - addl $FRAME, %esp - ret - - - -C ----------------------------------------------------------------------------- -ifdef(`PIC',` -L(pic_calc): - addl (%esp), %ecx - addl $L(unroll_inner_end)-L(here)-eval(2*CODE_BYTES_PER_LIMB), %ecx - addl %edx, %ecx - ret -') - - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/pentium/README b/rts/gmp/mpn/x86/pentium/README deleted file mode 100644 index 3b9ec8ac6f..0000000000 --- a/rts/gmp/mpn/x86/pentium/README +++ /dev/null @@ -1,77 +0,0 @@ - - INTEL PENTIUM P5 MPN SUBROUTINES - - -This directory contains mpn functions optimized for Intel Pentium (P5,P54) -processors. The mmx subdirectory has code for Pentium with MMX (P55). 
- - -STATUS - - cycles/limb - - mpn_add_n/sub_n 2.375 - - mpn_copyi/copyd 1.0 - - mpn_divrem_1 44.0 - mpn_mod_1 44.0 - mpn_divexact_by3 15.0 - - mpn_l/rshift 5.375 normal (6.0 on P54) - 1.875 special shift by 1 bit - - mpn_mul_1 13.0 - mpn_add/submul_1 14.0 - - mpn_mul_basecase 14.2 cycles/crossproduct (approx) - - mpn_sqr_basecase 8 cycles/crossproduct (approx) - or 15.5 cycles/triangleproduct (approx) - -Pentium MMX gets the following improvements - - mpn_l/rshift 1.75 - - -1. mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the -documentation indicates that they should take only 43/8 = 5.375 cycles/limb, -or 5 cycles/limb asymptotically. The P55 runs them at the expected speed. - -2. mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb. Due to loop -overhead and other delays (cache refill?), they run at or near 2.5 cycles/limb. - -3. mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they -should. Intel documentation says a mul instruction is 10 cycles, but it -measures 9 and the routines using it run with it as 9. - - - -RELEVANT OPTIMIZATION ISSUES - -1. Pentium doesn't allocate cache lines on writes, unlike most other modern -processors. Since the functions in the mpn class do array writes, we have to -handle allocating the destination cache lines by reading a word from it in the -loops, to achieve the best performance. - -2. Pairing of memory operations requires that the two issued operations refer -to different cache banks. The simplest way to insure this is to read/write -two words from the same object. If we make operations on different objects, -they might or might not be to the same cache bank. - - - -REFERENCES - -"Intel Architecture Optimization Manual", 1997, order number 242816. This -is mostly about P5, the parts about P6 aren't relevant. 
Available on-line: - - http://download.intel.com/design/PentiumII/manuals/242816.htm - - - ----------------- -Local variables: -mode: text -fill-column: 76 -End: diff --git a/rts/gmp/mpn/x86/pentium/aors_n.asm b/rts/gmp/mpn/x86/pentium/aors_n.asm deleted file mode 100644 index a61082a456..0000000000 --- a/rts/gmp/mpn/x86/pentium/aors_n.asm +++ /dev/null @@ -1,196 +0,0 @@ -dnl Intel Pentium mpn_add_n/mpn_sub_n -- mpn addition and subtraction. -dnl -dnl P5: 2.375 cycles/limb - - -dnl Copyright (C) 1992, 1994, 1995, 1996, 1999, 2000 Free Software -dnl Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. 
- - -include(`../config.m4') - - -ifdef(`OPERATION_add_n',` - define(M4_inst, adcl) - define(M4_function_n, mpn_add_n) - define(M4_function_nc, mpn_add_nc) - -',`ifdef(`OPERATION_sub_n',` - define(M4_inst, sbbl) - define(M4_function_n, mpn_sub_n) - define(M4_function_nc, mpn_sub_nc) - -',`m4_error(`Need OPERATION_add_n or OPERATION_sub_n -')')') - -MULFUNC_PROLOGUE(mpn_add_n mpn_add_nc mpn_sub_n mpn_sub_nc) - - -C mp_limb_t M4_function_n (mp_ptr dst, mp_srcptr src1, mp_srcptr src2, -C mp_size_t size); -C mp_limb_t M4_function_nc (mp_ptr dst, mp_srcptr src1, mp_srcptr src2, -C mp_size_t size, mp_limb_t carry); - -defframe(PARAM_CARRY,20) -defframe(PARAM_SIZE, 16) -defframe(PARAM_SRC2, 12) -defframe(PARAM_SRC1, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(8) -PROLOGUE(M4_function_nc) - - pushl %edi - pushl %esi - pushl %ebx - pushl %ebp -deflit(`FRAME',16) - - movl PARAM_DST,%edi - movl PARAM_SRC1,%esi - movl PARAM_SRC2,%ebp - movl PARAM_SIZE,%ecx - - movl (%ebp),%ebx - - decl %ecx - movl %ecx,%edx - shrl $3,%ecx - andl $7,%edx - testl %ecx,%ecx C zero carry flag - jz L(endgo) - - pushl %edx -FRAME_pushl() - movl PARAM_CARRY,%eax - shrl $1,%eax C shift bit 0 into carry - jmp LF(M4_function_n,oop) - -L(endgo): -deflit(`FRAME',16) - movl PARAM_CARRY,%eax - shrl $1,%eax C shift bit 0 into carry - jmp LF(M4_function_n,end) - -EPILOGUE() - - - ALIGN(8) -PROLOGUE(M4_function_n) - - pushl %edi - pushl %esi - pushl %ebx - pushl %ebp -deflit(`FRAME',16) - - movl PARAM_DST,%edi - movl PARAM_SRC1,%esi - movl PARAM_SRC2,%ebp - movl PARAM_SIZE,%ecx - - movl (%ebp),%ebx - - decl %ecx - movl %ecx,%edx - shrl $3,%ecx - andl $7,%edx - testl %ecx,%ecx C zero carry flag - jz L(end) - pushl %edx -FRAME_pushl() - - ALIGN(8) -L(oop): movl 28(%edi),%eax C fetch destination cache line - leal 32(%edi),%edi - -L(1): movl (%esi),%eax - movl 4(%esi),%edx - M4_inst %ebx,%eax - movl 4(%ebp),%ebx - M4_inst %ebx,%edx - movl 8(%ebp),%ebx - movl %eax,-32(%edi) - movl %edx,-28(%edi) - -L(2): movl 
8(%esi),%eax - movl 12(%esi),%edx - M4_inst %ebx,%eax - movl 12(%ebp),%ebx - M4_inst %ebx,%edx - movl 16(%ebp),%ebx - movl %eax,-24(%edi) - movl %edx,-20(%edi) - -L(3): movl 16(%esi),%eax - movl 20(%esi),%edx - M4_inst %ebx,%eax - movl 20(%ebp),%ebx - M4_inst %ebx,%edx - movl 24(%ebp),%ebx - movl %eax,-16(%edi) - movl %edx,-12(%edi) - -L(4): movl 24(%esi),%eax - movl 28(%esi),%edx - M4_inst %ebx,%eax - movl 28(%ebp),%ebx - M4_inst %ebx,%edx - movl 32(%ebp),%ebx - movl %eax,-8(%edi) - movl %edx,-4(%edi) - - leal 32(%esi),%esi - leal 32(%ebp),%ebp - decl %ecx - jnz L(oop) - - popl %edx -FRAME_popl() -L(end): - decl %edx C test %edx w/o clobbering carry - js L(end2) - incl %edx -L(oop2): - leal 4(%edi),%edi - movl (%esi),%eax - M4_inst %ebx,%eax - movl 4(%ebp),%ebx - movl %eax,-4(%edi) - leal 4(%esi),%esi - leal 4(%ebp),%ebp - decl %edx - jnz L(oop2) -L(end2): - movl (%esi),%eax - M4_inst %ebx,%eax - movl %eax,(%edi) - - sbbl %eax,%eax - negl %eax - - popl %ebp - popl %ebx - popl %esi - popl %edi - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/pentium/aorsmul_1.asm b/rts/gmp/mpn/x86/pentium/aorsmul_1.asm deleted file mode 100644 index 147b55610f..0000000000 --- a/rts/gmp/mpn/x86/pentium/aorsmul_1.asm +++ /dev/null @@ -1,99 +0,0 @@ -dnl Intel Pentium mpn_addmul_1 -- mpn by limb multiplication. -dnl -dnl P5: 14.0 cycles/limb - - -dnl Copyright (C) 1992, 1994, 1996, 1999, 2000 Free Software Foundation, -dnl Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. */ - - -include(`../config.m4') - - -ifdef(`OPERATION_addmul_1', ` - define(M4_inst, addl) - define(M4_function_1, mpn_addmul_1) - -',`ifdef(`OPERATION_submul_1', ` - define(M4_inst, subl) - define(M4_function_1, mpn_submul_1) - -',`m4_error(`Need OPERATION_addmul_1 or OPERATION_submul_1 -')')') - -MULFUNC_PROLOGUE(mpn_addmul_1 mpn_submul_1) - - -C mp_limb_t M4_function_1 (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t mult); - -defframe(PARAM_MULTIPLIER,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(8) - -PROLOGUE(M4_function_1) - - pushl %edi - pushl %esi - pushl %ebx - pushl %ebp -deflit(`FRAME',16) - - movl PARAM_DST, %edi - movl PARAM_SRC, %esi - movl PARAM_SIZE, %ecx - movl PARAM_MULTIPLIER, %ebp - - leal (%edi,%ecx,4), %edi - leal (%esi,%ecx,4), %esi - negl %ecx - xorl %ebx, %ebx - ALIGN(8) - -L(oop): adcl $0, %ebx - movl (%esi,%ecx,4), %eax - - mull %ebp - - addl %ebx, %eax - movl (%edi,%ecx,4), %ebx - - adcl $0, %edx - M4_inst %eax, %ebx - - movl %ebx, (%edi,%ecx,4) - incl %ecx - - movl %edx, %ebx - jnz L(oop) - - adcl $0, %ebx - movl %ebx, %eax - popl %ebp - popl %ebx - popl %esi - popl %edi - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/pentium/diveby3.asm b/rts/gmp/mpn/x86/pentium/diveby3.asm deleted file mode 100644 index dbac81642f..0000000000 --- a/rts/gmp/mpn/x86/pentium/diveby3.asm +++ /dev/null @@ -1,183 +0,0 @@ -dnl Intel P5 mpn_divexact_by3 -- mpn division by 3, expecting no remainder. -dnl -dnl P5: 15.0 cycles/limb - - -dnl Copyright (C) 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. 
-dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_divexact_by3c (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t carry); - -defframe(PARAM_CARRY,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - -dnl multiplicative inverse of 3, modulo 2^32 -deflit(INVERSE_3, 0xAAAAAAAB) - -dnl ceil(b/3), ceil(b*2/3) and floor(b*2/3) where b=2^32 -deflit(ONE_THIRD_CEIL, 0x55555556) -deflit(TWO_THIRDS_CEIL, 0xAAAAAAAB) -deflit(TWO_THIRDS_FLOOR, 0xAAAAAAAA) - - .text - ALIGN(8) - -PROLOGUE(mpn_divexact_by3c) -deflit(`FRAME',0) - - movl PARAM_SRC, %ecx - movl PARAM_SIZE, %edx - - decl %edx - jnz L(two_or_more) - - movl (%ecx), %edx - movl PARAM_CARRY, %eax C risk of cache bank clash here - - movl PARAM_DST, %ecx - subl %eax, %edx - - sbbl %eax, %eax C 0 or -1 - - imull $INVERSE_3, %edx, %edx - - negl %eax C 0 or 1 - cmpl $ONE_THIRD_CEIL, %edx - - sbbl $-1, %eax C +1 if edx>=ceil(b/3) - cmpl $TWO_THIRDS_CEIL, %edx - - sbbl $-1, %eax C +1 if edx>=ceil(b*2/3) - movl %edx, (%ecx) - - ret - - -L(two_or_more): - C eax - C ebx - C ecx src - C edx size-1 - C esi - C edi - C ebp - - pushl %ebx FRAME_pushl() - pushl %esi FRAME_pushl() - - pushl %edi 
FRAME_pushl() - pushl %ebp FRAME_pushl() - - movl PARAM_DST, %edi - movl PARAM_CARRY, %esi - - movl (%ecx), %eax C src low limb - xorl %ebx, %ebx - - sub %esi, %eax - movl $TWO_THIRDS_FLOOR, %esi - - leal (%ecx,%edx,4), %ecx C &src[size-1] - leal (%edi,%edx,4), %edi C &dst[size-1] - - adcl $0, %ebx C carry, 0 or 1 - negl %edx C -(size-1) - - -C The loop needs a source limb ready at the top, which leads to one limb -C handled separately at the end, and the special case above for size==1. -C There doesn't seem to be any scheduling that would keep the speed but move -C the source load and carry subtract up to the top. -C -C The destination cache line prefetching adds 1 cycle to the loop but is -C considered worthwhile. The slowdown is a factor of 1.07, but will prevent -C repeated write-throughs if the destination isn't in L1. A version using -C an outer loop to prefetch only every 8 limbs (a cache line) proved to be -C no faster, due to unavoidable branch mispredictions in the inner loop. -C -C setc is 2 cycles on P54, so an adcl is used instead. If the movl $0,%ebx -C could be avoided then the src limb fetch could pair up and save a cycle. -C This would probably mean going to a two limb loop with the carry limb -C alternately positive or negative, since an sbbl %ebx,%ebx will leave a -C value which is in the opposite sense to the preceding sbbl/adcl %ebx,%eax. -C -C A register is used for TWO_THIRDS_FLOOR because a cmp can't be done as -C "cmpl %edx, $n" with the immediate as the second operand. -C -C The "4" source displacement is in the loop rather than the setup because -C this gets L(top) aligned to 8 bytes at no cost.
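The recurrence this loop implements — subtract the incoming carry, multiply by the inverse of 3 mod 2^32, and derive the next carry from comparisons against ceil(b/3) and ceil(b*2/3) — can be modeled in portable C. A sketch of the arithmetic with illustrative names, not a line-by-line transcription of the scheduled assembly:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVERSE_3        0xAAAAAAABu  /* 3^-1 mod 2^32 */
#define ONE_THIRD_CEIL   0x55555556u  /* ceil(2^32 / 3) */
#define TWO_THIRDS_CEIL  0xAAAAAAABu  /* ceil(2^32 * 2/3) */

/* Reference model of mpn_divexact_by3c: divide a little-endian limb
   array, known to be an exact multiple of 3 once the incoming carry
   is subtracted, returning the carry out (0 for an exact division). */
static uint32_t ref_divexact_by3c(uint32_t *dst, const uint32_t *src,
                                  size_t size, uint32_t carry)
{
    for (size_t i = 0; i < size; i++) {
        uint32_t l = src[i];
        uint32_t borrow = (l < carry);        /* sbbl in the asm */
        uint32_t q = (l - carry) * INVERSE_3; /* exact quotient limb */
        dst[i] = q;
        carry = borrow;
        carry += (q >= ONE_THIRD_CEIL);       /* first cmpl/sbbl pair */
        carry += (q >= TWO_THIRDS_CEIL);      /* second cmpl/sbbl pair */
    }
    return carry;
}
```

For example, the two-limb value 0x100000002 is 3 * 0x55555556, and the model produces that quotient limb by limb with a carry of 1 propagated across the limb boundary.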
- - ALIGN(8) -L(top): - C eax source limb, carry subtracted - C ebx carry (0 or 1) - C ecx &src[size-1] - C edx counter, limbs, negative - C esi TWO_THIRDS_FLOOR - C edi &dst[size-1] - C ebp scratch (result limb) - - imull $INVERSE_3, %eax, %ebp - - cmpl $ONE_THIRD_CEIL, %ebp - movl (%edi,%edx,4), %eax C dst cache line prefetch - - sbbl $-1, %ebx C +1 if ebp>=ceil(b/3) - cmpl %ebp, %esi - - movl 4(%ecx,%edx,4), %eax C next src limb - - sbbl %ebx, %eax C and further -1 if ebp>=ceil(b*2/3) - movl $0, %ebx - - adcl $0, %ebx C new carry - movl %ebp, (%edi,%edx,4) - - incl %edx - jnz L(top) - - - - imull $INVERSE_3, %eax, %edx - - cmpl $ONE_THIRD_CEIL, %edx - movl %edx, (%edi) - - sbbl $-1, %ebx C +1 if edx>=ceil(b/3) - cmpl $TWO_THIRDS_CEIL, %edx - - sbbl $-1, %ebx C +1 if edx>=ceil(b*2/3) - popl %ebp - - movl %ebx, %eax - popl %edi - - popl %esi - popl %ebx - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/pentium/gmp-mparam.h b/rts/gmp/mpn/x86/pentium/gmp-mparam.h deleted file mode 100644 index d3ed3d73ce..0000000000 --- a/rts/gmp/mpn/x86/pentium/gmp-mparam.h +++ /dev/null @@ -1,97 +0,0 @@ -/* Intel P54 gmp-mparam.h -- Compiler/machine parameter header file. - -Copyright (C) 1991, 1993, 1994, 1999, 2000 Free Software Foundation, Inc. - -This file is part of the GNU MP Library. - -The GNU MP Library is free software; you can redistribute it and/or modify -it under the terms of the GNU Lesser General Public License as published by -the Free Software Foundation; either version 2.1 of the License, or (at your -option) any later version. - -The GNU MP Library is distributed in the hope that it will be useful, but -WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY -or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public -License for more details. - -You should have received a copy of the GNU Lesser General Public License -along with the GNU MP Library; see the file COPYING.LIB. 
If not, write to -the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, -MA 02111-1307, USA. */ - - -#define BITS_PER_MP_LIMB 32 -#define BYTES_PER_MP_LIMB 4 -#define BITS_PER_LONGINT 32 -#define BITS_PER_INT 32 -#define BITS_PER_SHORTINT 16 -#define BITS_PER_CHAR 8 - - -#ifndef UMUL_TIME -#define UMUL_TIME 9 /* cycles */ -#endif -#ifndef UDIV_TIME -#define UDIV_TIME 41 /* cycles */ -#endif - -/* bsf takes 18-42 cycles, put an average for uniform random numbers */ -#ifndef COUNT_TRAILING_ZEROS_TIME -#define COUNT_TRAILING_ZEROS_TIME 20 /* cycles */ -#endif - - -/* Generated by tuneup.c, 2000-07-06. */ - -#ifndef KARATSUBA_MUL_THRESHOLD -#define KARATSUBA_MUL_THRESHOLD 14 -#endif -#ifndef TOOM3_MUL_THRESHOLD -#define TOOM3_MUL_THRESHOLD 179 -#endif - -#ifndef KARATSUBA_SQR_THRESHOLD -#define KARATSUBA_SQR_THRESHOLD 22 -#endif -#ifndef TOOM3_SQR_THRESHOLD -#define TOOM3_SQR_THRESHOLD 153 -#endif - -#ifndef BZ_THRESHOLD -#define BZ_THRESHOLD 46 -#endif - -#ifndef FIB_THRESHOLD -#define FIB_THRESHOLD 110 -#endif - -#ifndef POWM_THRESHOLD -#define POWM_THRESHOLD 13 -#endif - -#ifndef GCD_ACCEL_THRESHOLD -#define GCD_ACCEL_THRESHOLD 4 -#endif -#ifndef GCDEXT_THRESHOLD -#define GCDEXT_THRESHOLD 25 -#endif - -#ifndef FFT_MUL_TABLE -#define FFT_MUL_TABLE { 496, 928, 1920, 4608, 14336, 40960, 0 } -#endif -#ifndef FFT_MODF_MUL_THRESHOLD -#define FFT_MODF_MUL_THRESHOLD 512 -#endif -#ifndef FFT_MUL_THRESHOLD -#define FFT_MUL_THRESHOLD 3840 -#endif - -#ifndef FFT_SQR_TABLE -#define FFT_SQR_TABLE { 496, 1184, 1920, 5632, 14336, 40960, 0 } -#endif -#ifndef FFT_MODF_SQR_THRESHOLD -#define FFT_MODF_SQR_THRESHOLD 512 -#endif -#ifndef FFT_SQR_THRESHOLD -#define FFT_SQR_THRESHOLD 3840 -#endif diff --git a/rts/gmp/mpn/x86/pentium/lshift.asm b/rts/gmp/mpn/x86/pentium/lshift.asm deleted file mode 100644 index e1e35d4c57..0000000000 --- a/rts/gmp/mpn/x86/pentium/lshift.asm +++ /dev/null @@ -1,236 +0,0 @@ -dnl Intel Pentium mpn_lshift -- mpn left shift. 
-dnl -dnl cycles/limb -dnl P5,P54: 6.0 -dnl P55: 5.375 - - -dnl Copyright (C) 1992, 1994, 1995, 1996, 1999, 2000 Free Software -dnl Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_lshift (mp_ptr dst, mp_srcptr src, mp_size_t size, -C unsigned shift); -C -C The main shift-by-N loop should run at 5.375 c/l and that's what P55 does, -C but P5 and P54 run only at 6.0 c/l, which is 4 cycles lost somewhere. - -defframe(PARAM_SHIFT,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(8) -PROLOGUE(mpn_lshift) - - pushl %edi - pushl %esi - pushl %ebx - pushl %ebp -deflit(`FRAME',16) - - movl PARAM_DST,%edi - movl PARAM_SRC,%esi - movl PARAM_SIZE,%ebp - movl PARAM_SHIFT,%ecx - -C We can use faster code for shift-by-1 under certain conditions. 
- cmp $1,%ecx - jne L(normal) - leal 4(%esi),%eax - cmpl %edi,%eax - jnc L(special) C jump if s_ptr + 1 >= res_ptr - leal (%esi,%ebp,4),%eax - cmpl %eax,%edi - jnc L(special) C jump if res_ptr >= s_ptr + size - -L(normal): - leal -4(%edi,%ebp,4),%edi - leal -4(%esi,%ebp,4),%esi - - movl (%esi),%edx - subl $4,%esi - xorl %eax,%eax - shldl( %cl, %edx, %eax) C compute carry limb - pushl %eax C push carry limb onto stack - - decl %ebp - pushl %ebp - shrl $3,%ebp - jz L(end) - - movl (%edi),%eax C fetch destination cache line - - ALIGN(4) -L(oop): movl -28(%edi),%eax C fetch destination cache line - movl %edx,%ebx - - movl (%esi),%eax - movl -4(%esi),%edx - shldl( %cl, %eax, %ebx) - shldl( %cl, %edx, %eax) - movl %ebx,(%edi) - movl %eax,-4(%edi) - - movl -8(%esi),%ebx - movl -12(%esi),%eax - shldl( %cl, %ebx, %edx) - shldl( %cl, %eax, %ebx) - movl %edx,-8(%edi) - movl %ebx,-12(%edi) - - movl -16(%esi),%edx - movl -20(%esi),%ebx - shldl( %cl, %edx, %eax) - shldl( %cl, %ebx, %edx) - movl %eax,-16(%edi) - movl %edx,-20(%edi) - - movl -24(%esi),%eax - movl -28(%esi),%edx - shldl( %cl, %eax, %ebx) - shldl( %cl, %edx, %eax) - movl %ebx,-24(%edi) - movl %eax,-28(%edi) - - subl $32,%esi - subl $32,%edi - decl %ebp - jnz L(oop) - -L(end): popl %ebp - andl $7,%ebp - jz L(end2) -L(oop2): - movl (%esi),%eax - shldl( %cl,%eax,%edx) - movl %edx,(%edi) - movl %eax,%edx - subl $4,%esi - subl $4,%edi - decl %ebp - jnz L(oop2) - -L(end2): - shll %cl,%edx C compute least significant limb - movl %edx,(%edi) C store it - - popl %eax C pop carry limb - - popl %ebp - popl %ebx - popl %esi - popl %edi - ret - - -C We loop from least significant end of the arrays, which is only -C permissible if the source and destination don't overlap, since the -C function is documented to work for overlapping source and destination.
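The L(special) path that follows exploits the identity underlying this shift-by-1 case: a 1-bit left shift of a limb array is just a self-addition with carry, so plain adcl can replace the slow (on P5) shldl. A hedged sketch of that identity in portable C, with illustrative names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a 1-bit left shift done as dst[i] = 2*src[i] + carry-in,
   walking from the least significant limb, as the adcl %eax,%eax /
   adcl %edx,%edx pairs in L(Loop) do.  Returns the bit shifted out
   of the top limb.  Valid only for the non-overlap layouts the
   entry checks above establish.  */
static uint32_t ref_lshift1(uint32_t *dst, const uint32_t *src, size_t size)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < size; i++) {
        uint32_t limb = src[i];
        dst[i] = (limb << 1) | carry;  /* 2*limb + carry-in */
        carry = limb >> 31;            /* bit shifted out */
    }
    return carry;
}
```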
- -L(special): - movl (%esi),%edx - addl $4,%esi - - decl %ebp - pushl %ebp - shrl $3,%ebp - - addl %edx,%edx - incl %ebp - decl %ebp - jz L(Lend) - - movl (%edi),%eax C fetch destination cache line - - ALIGN(4) -L(Loop): - movl 28(%edi),%eax C fetch destination cache line - movl %edx,%ebx - - movl (%esi),%eax - movl 4(%esi),%edx - adcl %eax,%eax - movl %ebx,(%edi) - adcl %edx,%edx - movl %eax,4(%edi) - - movl 8(%esi),%ebx - movl 12(%esi),%eax - adcl %ebx,%ebx - movl %edx,8(%edi) - adcl %eax,%eax - movl %ebx,12(%edi) - - movl 16(%esi),%edx - movl 20(%esi),%ebx - adcl %edx,%edx - movl %eax,16(%edi) - adcl %ebx,%ebx - movl %edx,20(%edi) - - movl 24(%esi),%eax - movl 28(%esi),%edx - adcl %eax,%eax - movl %ebx,24(%edi) - adcl %edx,%edx - movl %eax,28(%edi) - - leal 32(%esi),%esi C use leal not to clobber carry - leal 32(%edi),%edi - decl %ebp - jnz L(Loop) - -L(Lend): - popl %ebp - sbbl %eax,%eax C save carry in %eax - andl $7,%ebp - jz L(Lend2) - addl %eax,%eax C restore carry from eax -L(Loop2): - movl %edx,%ebx - movl (%esi),%edx - adcl %edx,%edx - movl %ebx,(%edi) - - leal 4(%esi),%esi C use leal not to clobber carry - leal 4(%edi),%edi - decl %ebp - jnz L(Loop2) - - jmp L(L1) -L(Lend2): - addl %eax,%eax C restore carry from eax -L(L1): movl %edx,(%edi) C store last limb - - sbbl %eax,%eax - negl %eax - - popl %ebp - popl %ebx - popl %esi - popl %edi - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/pentium/mmx/gmp-mparam.h b/rts/gmp/mpn/x86/pentium/mmx/gmp-mparam.h deleted file mode 100644 index 2379077d0c..0000000000 --- a/rts/gmp/mpn/x86/pentium/mmx/gmp-mparam.h +++ /dev/null @@ -1,97 +0,0 @@ -/* Intel P55 gmp-mparam.h -- Compiler/machine parameter header file. - -Copyright (C) 1991, 1993, 1994, 1999, 2000 Free Software Foundation, Inc. - -This file is part of the GNU MP Library. 
- -The GNU MP Library is free software; you can redistribute it and/or modify -it under the terms of the GNU Lesser General Public License as published by -the Free Software Foundation; either version 2.1 of the License, or (at your -option) any later version. - -The GNU MP Library is distributed in the hope that it will be useful, but -WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY -or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public -License for more details. - -You should have received a copy of the GNU Lesser General Public License -along with the GNU MP Library; see the file COPYING.LIB. If not, write to -the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, -MA 02111-1307, USA. */ - - -#define BITS_PER_MP_LIMB 32 -#define BYTES_PER_MP_LIMB 4 -#define BITS_PER_LONGINT 32 -#define BITS_PER_INT 32 -#define BITS_PER_SHORTINT 16 -#define BITS_PER_CHAR 8 - - -#ifndef UMUL_TIME -#define UMUL_TIME 9 /* cycles */ -#endif -#ifndef UDIV_TIME -#define UDIV_TIME 41 /* cycles */ -#endif - -/* bsf takes 18-42 cycles, put an average for uniform random numbers */ -#ifndef COUNT_TRAILING_ZEROS_TIME -#define COUNT_TRAILING_ZEROS_TIME 20 /* cycles */ -#endif - - -/* Generated by tuneup.c, 2000-07-06. 
*/ - -#ifndef KARATSUBA_MUL_THRESHOLD -#define KARATSUBA_MUL_THRESHOLD 14 -#endif -#ifndef TOOM3_MUL_THRESHOLD -#define TOOM3_MUL_THRESHOLD 99 -#endif - -#ifndef KARATSUBA_SQR_THRESHOLD -#define KARATSUBA_SQR_THRESHOLD 22 -#endif -#ifndef TOOM3_SQR_THRESHOLD -#define TOOM3_SQR_THRESHOLD 89 -#endif - -#ifndef BZ_THRESHOLD -#define BZ_THRESHOLD 40 -#endif - -#ifndef FIB_THRESHOLD -#define FIB_THRESHOLD 98 -#endif - -#ifndef POWM_THRESHOLD -#define POWM_THRESHOLD 13 -#endif - -#ifndef GCD_ACCEL_THRESHOLD -#define GCD_ACCEL_THRESHOLD 5 -#endif -#ifndef GCDEXT_THRESHOLD -#define GCDEXT_THRESHOLD 25 -#endif - -#ifndef FFT_MUL_TABLE -#define FFT_MUL_TABLE { 496, 1056, 1920, 4608, 14336, 40960, 0 } -#endif -#ifndef FFT_MODF_MUL_THRESHOLD -#define FFT_MODF_MUL_THRESHOLD 512 -#endif -#ifndef FFT_MUL_THRESHOLD -#define FFT_MUL_THRESHOLD 3840 -#endif - -#ifndef FFT_SQR_TABLE -#define FFT_SQR_TABLE { 496, 1184, 2176, 5632, 14336, 40960, 0 } -#endif -#ifndef FFT_MODF_SQR_THRESHOLD -#define FFT_MODF_SQR_THRESHOLD 512 -#endif -#ifndef FFT_SQR_THRESHOLD -#define FFT_SQR_THRESHOLD 4352 -#endif diff --git a/rts/gmp/mpn/x86/pentium/mmx/lshift.asm b/rts/gmp/mpn/x86/pentium/mmx/lshift.asm deleted file mode 100644 index 2225438658..0000000000 --- a/rts/gmp/mpn/x86/pentium/mmx/lshift.asm +++ /dev/null @@ -1,455 +0,0 @@ -dnl Intel P5 mpn_lshift -- mpn left shift. -dnl -dnl P5: 1.75 cycles/limb. - - -dnl Copyright (C) 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_lshift (mp_ptr dst, mp_srcptr src, mp_size_t size, -C unsigned shift); -C -C Shift src,size left by shift many bits and store the result in dst,size. -C Zeros are shifted in at the right. Return the bits shifted out at the -C left. -C -C The comments in mpn_rshift apply here too. - -defframe(PARAM_SHIFT,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) -deflit(`FRAME',0) - -dnl minimum 5, because the unrolled loop can't handle less -deflit(UNROLL_THRESHOLD, 5) - - .text - ALIGN(8) - -PROLOGUE(mpn_lshift) - - pushl %ebx - pushl %edi -deflit(`FRAME',8) - - movl PARAM_SIZE, %eax - movl PARAM_DST, %edx - - movl PARAM_SRC, %ebx - movl PARAM_SHIFT, %ecx - - cmp $UNROLL_THRESHOLD, %eax - jae L(unroll) - - movl -4(%ebx,%eax,4), %edi C src high limb - decl %eax - - jnz L(simple) - - shldl( %cl, %edi, %eax) C eax was decremented to zero - - shll %cl, %edi - - movl %edi, (%edx) C dst low limb - popl %edi C risk of data cache bank clash - - popl %ebx - - ret - - -C ----------------------------------------------------------------------------- -L(simple): - C eax size-1 - C ebx src - C ecx shift - C edx dst - C esi - C edi - C ebp -deflit(`FRAME',8) - - movd (%ebx,%eax,4), %mm5 C src high limb - - movd %ecx, %mm6 C lshift - negl %ecx - - psllq %mm6, %mm5 - addl $32, %ecx - - movd %ecx, %mm7 - psrlq $32, %mm5 C retval - - -L(simple_top): - C eax counter, limbs, negative - C ebx src - C ecx - C edx dst - C esi - C edi - C - C mm0 scratch - C mm5 return value - C mm6 shift - C mm7 32-shift - - movq -4(%ebx,%eax,4), %mm0 - decl %eax - - psrlq %mm7, %mm0 - - C - - movd %mm0, 
4(%edx,%eax,4) - jnz L(simple_top) - - - movd (%ebx), %mm0 - - movd %mm5, %eax - psllq %mm6, %mm0 - - popl %edi - popl %ebx - - movd %mm0, (%edx) - - emms - - ret - - -C ----------------------------------------------------------------------------- - ALIGN(8) -L(unroll): - C eax size - C ebx src - C ecx shift - C edx dst - C esi - C edi - C ebp -deflit(`FRAME',8) - - movd -4(%ebx,%eax,4), %mm5 C src high limb - leal (%ebx,%eax,4), %edi - - movd %ecx, %mm6 C lshift - andl $4, %edi - - psllq %mm6, %mm5 - jz L(start_src_aligned) - - - C src isn't aligned, process high limb separately (marked xxx) to - C make it so. - C - C source -8(ebx,%eax,4) - C | - C +-------+-------+-------+-- - C | | - C +-------+-------+-------+-- - C 0mod8 4mod8 0mod8 - C - C dest - C -4(edx,%eax,4) - C | - C +-------+-------+-- - C | xxx | | - C +-------+-------+-- - - movq -8(%ebx,%eax,4), %mm0 C unaligned load - - psllq %mm6, %mm0 - decl %eax - - psrlq $32, %mm0 - - C - - movd %mm0, (%edx,%eax,4) -L(start_src_aligned): - - movq -8(%ebx,%eax,4), %mm1 C src high qword - leal (%edx,%eax,4), %edi - - andl $4, %edi - psrlq $32, %mm5 C return value - - movq -16(%ebx,%eax,4), %mm3 C src second highest qword - jz L(start_dst_aligned) - - C dst isn't aligned, subtract 4 to make it so, and pretend the shift - C is 32 bits extra. High limb of dst (marked xxx) handled here - C separately. 
- C - C source -8(ebx,%eax,4) - C | - C +-------+-------+-- - C | mm1 | - C +-------+-------+-- - C 0mod8 4mod8 - C - C dest - C -4(edx,%eax,4) - C | - C +-------+-------+-------+-- - C | xxx | | - C +-------+-------+-------+-- - C 0mod8 4mod8 0mod8 - - movq %mm1, %mm0 - addl $32, %ecx C new shift - - psllq %mm6, %mm0 - - movd %ecx, %mm6 - psrlq $32, %mm0 - - C wasted cycle here waiting for %mm0 - - movd %mm0, -4(%edx,%eax,4) - subl $4, %edx -L(start_dst_aligned): - - - psllq %mm6, %mm1 - negl %ecx C -shift - - addl $64, %ecx C 64-shift - movq %mm3, %mm2 - - movd %ecx, %mm7 - subl $8, %eax C size-8 - - psrlq %mm7, %mm3 - - por %mm1, %mm3 C mm3 ready to store - jc L(finish) - - - C The comments in mpn_rshift apply here too. - - ALIGN(8) -L(unroll_loop): - C eax counter, limbs - C ebx src - C ecx - C edx dst - C esi - C edi - C - C mm0 - C mm1 - C mm2 src qword from 48(%ebx,%eax,4) - C mm3 dst qword ready to store to 56(%edx,%eax,4) - C - C mm5 return value - C mm6 lshift - C mm7 rshift - - movq 8(%ebx,%eax,4), %mm0 - psllq %mm6, %mm2 - - movq %mm0, %mm1 - psrlq %mm7, %mm0 - - movq %mm3, 24(%edx,%eax,4) C prev - por %mm2, %mm0 - - movq (%ebx,%eax,4), %mm3 C - psllq %mm6, %mm1 C - - movq %mm0, 16(%edx,%eax,4) - movq %mm3, %mm2 C - - psrlq %mm7, %mm3 C - subl $4, %eax - - por %mm1, %mm3 C - jnc L(unroll_loop) - - - -L(finish): - C eax -4 to -1 representing respectively 0 to 3 limbs remaining - - testb $2, %al - - jz L(finish_no_two) - - movq 8(%ebx,%eax,4), %mm0 - psllq %mm6, %mm2 - - movq %mm0, %mm1 - psrlq %mm7, %mm0 - - movq %mm3, 24(%edx,%eax,4) C prev - por %mm2, %mm0 - - movq %mm1, %mm2 - movq %mm0, %mm3 - - subl $2, %eax -L(finish_no_two): - - - C eax -4 or -3 representing respectively 0 or 1 limbs remaining - C - C mm2 src prev qword, from 48(%ebx,%eax,4) - C mm3 dst qword, for 56(%edx,%eax,4) - - testb $1, %al - movd %mm5, %eax C retval - - popl %edi - jz L(finish_zero) - - - C One extra src limb, destination was aligned. 
- C - C source ebx - C --+---------------+-------+ - C | mm2 | | - C --+---------------+-------+ - C - C dest edx+12 edx+4 edx - C --+---------------+---------------+-------+ - C | mm3 | | | - C --+---------------+---------------+-------+ - C - C mm6 = shift - C mm7 = ecx = 64-shift - - - C One extra src limb, destination was unaligned. - C - C source ebx - C --+---------------+-------+ - C | mm2 | | - C --+---------------+-------+ - C - C dest edx+12 edx+4 - C --+---------------+---------------+ - C | mm3 | | - C --+---------------+---------------+ - C - C mm6 = shift+32 - C mm7 = ecx = 64-(shift+32) - - - C In both cases there's one extra limb of src to fetch and combine - C with mm2 to make a qword at 4(%edx), and in the aligned case - C there's an extra limb of dst to be formed from that extra src limb - C left shifted. - - - movd (%ebx), %mm0 - psllq %mm6, %mm2 - - movq %mm3, 12(%edx) - psllq $32, %mm0 - - movq %mm0, %mm1 - psrlq %mm7, %mm0 - - por %mm2, %mm0 - psllq %mm6, %mm1 - - movq %mm0, 4(%edx) - psrlq $32, %mm1 - - andl $32, %ecx - popl %ebx - - jz L(finish_one_unaligned) - - movd %mm1, (%edx) -L(finish_one_unaligned): - - emms - - ret - - -L(finish_zero): - - C No extra src limbs, destination was aligned. - C - C source ebx - C --+---------------+ - C | mm2 | - C --+---------------+ - C - C dest edx+8 edx - C --+---------------+---------------+ - C | mm3 | | - C --+---------------+---------------+ - C - C mm6 = shift - C mm7 = ecx = 64-shift - - - C No extra src limbs, destination was unaligned. - C - C source ebx - C --+---------------+ - C | mm2 | - C --+---------------+ - C - C dest edx+8 edx+4 - C --+---------------+-------+ - C | mm3 | | - C --+---------------+-------+ - C - C mm6 = shift+32 - C mm7 = ecx = 64-(shift+32) - - - C The movd for the unaligned case writes the same data to 4(%edx) - C that the movq does for the aligned case. 
- - - movq %mm3, 8(%edx) - andl $32, %ecx - - psllq %mm6, %mm2 - jz L(finish_zero_unaligned) - - movq %mm2, (%edx) -L(finish_zero_unaligned): - - psrlq $32, %mm2 - popl %ebx - - movd %mm5, %eax C retval - - movd %mm2, 4(%edx) - - emms - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/pentium/mmx/popham.asm b/rts/gmp/mpn/x86/pentium/mmx/popham.asm deleted file mode 100644 index 587a07ab3d..0000000000 --- a/rts/gmp/mpn/x86/pentium/mmx/popham.asm +++ /dev/null @@ -1,30 +0,0 @@ -dnl Intel P55 mpn_popcount, mpn_hamdist -- population count and hamming -dnl distance. -dnl -dnl P55: popcount 11.5 cycles/limb, hamdist 12.0 cycles/limb - - -dnl Copyright (C) 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - -MULFUNC_PROLOGUE(mpn_popcount mpn_hamdist) -include_mpn(`x86/k6/mmx/popham.asm') diff --git a/rts/gmp/mpn/x86/pentium/mmx/rshift.asm b/rts/gmp/mpn/x86/pentium/mmx/rshift.asm deleted file mode 100644 index 7672630d57..0000000000 --- a/rts/gmp/mpn/x86/pentium/mmx/rshift.asm +++ /dev/null @@ -1,460 +0,0 @@ -dnl Intel P5 mpn_rshift -- mpn right shift. -dnl -dnl P5: 1.75 cycles/limb. 
- - -dnl Copyright (C) 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_rshift (mp_ptr dst, mp_srcptr src, mp_size_t size, -C unsigned shift); -C -C Shift src,size right by shift many bits and store the result in dst,size. -C Zeros are shifted in at the left. Return the bits shifted out at the -C right. -C -C It takes 6 mmx instructions to process 2 limbs, making 1.5 cycles/limb, -C and with a 4 limb loop and 1 cycle of loop overhead the total is 1.75 c/l. -C -C Full speed depends on source and destination being aligned. Unaligned mmx -C loads and stores on P5 don't pair and have a 2 cycle penalty. Some hairy -C setups and finish-ups are done to ensure alignment for the loop. -C -C MMX shifts work out a bit faster even for the simple loop. - -defframe(PARAM_SHIFT,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) -deflit(`FRAME',0) - -dnl Minimum 5, because the unrolled loop can't handle less. 
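The contract stated in the header comment — shift right, zeros in at the left, bits shifted out at the right returned left-justified in a limb — can be written as a portable C reference. A sketch only, assuming 32-bit limbs, size >= 1 and 1 <= shift <= 31, with an illustrative name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference model of mpn_rshift for 32-bit limbs, 1 <= shift <= 31. */
static uint32_t ref_rshift(uint32_t *dst, const uint32_t *src,
                           size_t size, unsigned shift)
{
    uint32_t retval = src[0] << (32 - shift);  /* bits shifted out,
                                                  left-justified */
    for (size_t i = 0; i + 1 < size; i++)
        dst[i] = (src[i] >> shift) | (src[i + 1] << (32 - shift));
    dst[size - 1] = src[size - 1] >> shift;    /* zeros shifted in */
    return retval;
}
```

The mmx loop below computes the same two-limb combination with a single psrlq per qword, which is where the 1.5 mmx-instruction-per-limb figure in the comment comes from.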
-deflit(UNROLL_THRESHOLD, 5) - - .text - ALIGN(8) - -PROLOGUE(mpn_rshift) - - pushl %ebx - pushl %edi -deflit(`FRAME',8) - - movl PARAM_SIZE, %eax - movl PARAM_DST, %edx - - movl PARAM_SRC, %ebx - movl PARAM_SHIFT, %ecx - - cmp $UNROLL_THRESHOLD, %eax - jae L(unroll) - - decl %eax - movl (%ebx), %edi C src low limb - - jnz L(simple) - - shrdl( %cl, %edi, %eax) C eax was decremented to zero - - shrl %cl, %edi - - movl %edi, (%edx) C dst low limb - popl %edi C risk of data cache bank clash - - popl %ebx - - ret - - -C ----------------------------------------------------------------------------- - ALIGN(8) -L(simple): - C eax size-1 - C ebx src - C ecx shift - C edx dst - C esi - C edi - C ebp -deflit(`FRAME',8) - - movd (%ebx), %mm5 C src[0] - leal (%ebx,%eax,4), %ebx C &src[size-1] - - movd %ecx, %mm6 C rshift - leal -4(%edx,%eax,4), %edx C &dst[size-2] - - psllq $32, %mm5 - negl %eax - - -C This loop is 5 or 8 cycles, with every second load unaligned and a wasted -C cycle waiting for the mm0 result to be ready. For comparison a shrdl is 4 -C cycles and would be 8 in a simple loop. Using mmx helps the return value -C and last limb calculations too. 
- -L(simple_top): - C eax counter, limbs, negative - C ebx &src[size-1] - C ecx return value - C edx &dst[size-2] - C - C mm0 scratch - C mm5 return value - C mm6 shift - - movq (%ebx,%eax,4), %mm0 - incl %eax - - psrlq %mm6, %mm0 - - movd %mm0, (%edx,%eax,4) - jnz L(simple_top) - - - movd (%ebx), %mm0 - psrlq %mm6, %mm5 C return value - - psrlq %mm6, %mm0 - popl %edi - - movd %mm5, %eax - popl %ebx - - movd %mm0, 4(%edx) - - emms - - ret - - -C ----------------------------------------------------------------------------- - ALIGN(8) -L(unroll): - C eax size - C ebx src - C ecx shift - C edx dst - C esi - C edi - C ebp -deflit(`FRAME',8) - - movd (%ebx), %mm5 C src[0] - movl $4, %edi - - movd %ecx, %mm6 C rshift - testl %edi, %ebx - - psllq $32, %mm5 - jz L(start_src_aligned) - - - C src isn't aligned, process low limb separately (marked xxx) and - C step src and dst by one limb, making src aligned. - C - C source ebx - C --+-------+-------+-------+ - C | xxx | - C --+-------+-------+-------+ - C 4mod8 0mod8 4mod8 - C - C dest edx - C --+-------+-------+ - C | | xxx | - C --+-------+-------+ - - movq (%ebx), %mm0 C unaligned load - - psrlq %mm6, %mm0 - addl $4, %ebx - - decl %eax - - movd %mm0, (%edx) - addl $4, %edx -L(start_src_aligned): - - - movq (%ebx), %mm1 - testl %edi, %edx - - psrlq %mm6, %mm5 C retval - jz L(start_dst_aligned) - - C dst isn't aligned, add 4 to make it so, and pretend the shift is - C 32 bits extra. Low limb of dst (marked xxx) handled here - C separately. 
- C - C source ebx - C --+-------+-------+ - C | mm1 | - C --+-------+-------+ - C 4mod8 0mod8 - C - C dest edx - C --+-------+-------+-------+ - C | xxx | - C --+-------+-------+-------+ - C 4mod8 0mod8 4mod8 - - movq %mm1, %mm0 - addl $32, %ecx C new shift - - psrlq %mm6, %mm0 - - movd %ecx, %mm6 - - movd %mm0, (%edx) - addl $4, %edx -L(start_dst_aligned): - - - movq 8(%ebx), %mm3 - negl %ecx - - movq %mm3, %mm2 C mm2 src qword - addl $64, %ecx - - movd %ecx, %mm7 - psrlq %mm6, %mm1 - - leal -12(%ebx,%eax,4), %ebx - leal -20(%edx,%eax,4), %edx - - psllq %mm7, %mm3 - subl $7, %eax C size-7 - - por %mm1, %mm3 C mm3 ready to store - negl %eax C -(size-7) - - jns L(finish) - - - C This loop is the important bit, the rest is just support. Careful - C instruction scheduling achieves the claimed 1.75 c/l. The - C relevant parts of the pairing rules are: - C - C - mmx loads and stores execute only in the U pipe - C - only one mmx shift in a pair - C - wait one cycle before storing an mmx register result - C - the usual address generation interlock - C - C Two qword calculations are slightly interleaved. The instructions - C marked "C" belong to the second qword, and the "C prev" one is for - C the second qword from the previous iteration. 
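The psrlq/psllq/por triple in the loop that follows amounts to a plain 64-bit double shift. As an illustration (portable C, not part of the GMP sources; the helper name is invented here), each iteration's qword result is:

```c
#include <stdint.h>

/* Illustrative C model of the psrlq/psllq/por triple: shift the
   current 64-bit qword right, filling its top bits from the next
   qword.  Valid for 1 <= shift <= 63, corresponding to the mm6
   (right shift) and mm7 (left shift) counts set up above. */
uint64_t qword_rshift(uint64_t cur, uint64_t next, unsigned shift)
{
    return (cur >> shift) | (next << (64 - shift));
}
```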
- - ALIGN(8) -L(unroll_loop): - C eax counter, limbs, negative - C ebx &src[size-12] - C ecx - C edx &dst[size-12] - C esi - C edi - C - C mm0 - C mm1 - C mm2 src qword from -8(%ebx,%eax,4) - C mm3 dst qword ready to store to -8(%edx,%eax,4) - C - C mm5 return value - C mm6 rshift - C mm7 lshift - - movq (%ebx,%eax,4), %mm0 - psrlq %mm6, %mm2 - - movq %mm0, %mm1 - psllq %mm7, %mm0 - - movq %mm3, -8(%edx,%eax,4) C prev - por %mm2, %mm0 - - movq 8(%ebx,%eax,4), %mm3 C - psrlq %mm6, %mm1 C - - movq %mm0, (%edx,%eax,4) - movq %mm3, %mm2 C - - psllq %mm7, %mm3 C - addl $4, %eax - - por %mm1, %mm3 C - js L(unroll_loop) - - -L(finish): - C eax 0 to 3 representing respectively 3 to 0 limbs remaining - - testb $2, %al - - jnz L(finish_no_two) - - movq (%ebx,%eax,4), %mm0 - psrlq %mm6, %mm2 - - movq %mm0, %mm1 - psllq %mm7, %mm0 - - movq %mm3, -8(%edx,%eax,4) C prev - por %mm2, %mm0 - - movq %mm1, %mm2 - movq %mm0, %mm3 - - addl $2, %eax -L(finish_no_two): - - - C eax 2 or 3 representing respectively 1 or 0 limbs remaining - C - C mm2 src prev qword, from -8(%ebx,%eax,4) - C mm3 dst qword, for -8(%edx,%eax,4) - - testb $1, %al - popl %edi - - movd %mm5, %eax C retval - jnz L(finish_zero) - - - C One extra limb, destination was aligned. - C - C source ebx - C +-------+---------------+-- - C | | mm2 | - C +-------+---------------+-- - C - C dest edx - C +-------+---------------+---------------+-- - C | | | mm3 | - C +-------+---------------+---------------+-- - C - C mm6 = shift - C mm7 = ecx = 64-shift - - - C One extra limb, destination was unaligned. 
- C - C source ebx - C +-------+---------------+-- - C | | mm2 | - C +-------+---------------+-- - C - C dest edx - C +---------------+---------------+-- - C | | mm3 | - C +---------------+---------------+-- - C - C mm6 = shift+32 - C mm7 = ecx = 64-(shift+32) - - - C In both cases there's one extra limb of src to fetch and combine - C with mm2 to make a qword at 8(%edx), and in the aligned case - C there's a further extra limb of dst to be formed. - - - movd 8(%ebx), %mm0 - psrlq %mm6, %mm2 - - movq %mm0, %mm1 - psllq %mm7, %mm0 - - movq %mm3, (%edx) - por %mm2, %mm0 - - psrlq %mm6, %mm1 - andl $32, %ecx - - popl %ebx - jz L(finish_one_unaligned) - - C dst was aligned, must store one extra limb - movd %mm1, 16(%edx) -L(finish_one_unaligned): - - movq %mm0, 8(%edx) - - emms - - ret - - -L(finish_zero): - - C No extra limbs, destination was aligned. - C - C source ebx - C +---------------+-- - C | mm2 | - C +---------------+-- - C - C dest edx+4 - C +---------------+---------------+-- - C | | mm3 | - C +---------------+---------------+-- - C - C mm6 = shift - C mm7 = ecx = 64-shift - - - C No extra limbs, destination was unaligned. - C - C source ebx - C +---------------+-- - C | mm2 | - C +---------------+-- - C - C dest edx+4 - C +-------+---------------+-- - C | | mm3 | - C +-------+---------------+-- - C - C mm6 = shift+32 - C mm7 = 64-(shift+32) - - - C The movd for the unaligned case is clearly the same data as the - C movq for the aligned case, it's just a choice between whether one - C or two limbs should be written. 
- - - movq %mm3, 4(%edx) - psrlq %mm6, %mm2 - - movd %mm2, 12(%edx) - andl $32, %ecx - - popl %ebx - jz L(finish_zero_unaligned) - - movq %mm2, 12(%edx) -L(finish_zero_unaligned): - - emms - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/pentium/mul_1.asm b/rts/gmp/mpn/x86/pentium/mul_1.asm deleted file mode 100644 index 08639eca09..0000000000 --- a/rts/gmp/mpn/x86/pentium/mul_1.asm +++ /dev/null @@ -1,79 +0,0 @@ -dnl Intel Pentium mpn_mul_1 -- mpn by limb multiplication. -dnl -dnl P5: 13.0 cycles/limb - -dnl Copyright (C) 1992, 1994, 1996, 1999, 2000 Free Software Foundation, -dnl Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. 
*/ - - -include(`../config.m4') - - -C mp_limb_t mpn_mul_1 (mp_ptr dst, mp_srcptr src, mp_size_t size, -C mp_limb_t multiplier); - -defframe(PARAM_MULTIPLIER,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(8) -PROLOGUE(mpn_mul_1) - - pushl %edi - pushl %esi - pushl %ebx - pushl %ebp -deflit(`FRAME',16) - - movl PARAM_DST, %edi - movl PARAM_SRC, %esi - movl PARAM_SIZE, %ecx - movl PARAM_MULTIPLIER, %ebp - - leal (%edi,%ecx,4), %edi - leal (%esi,%ecx,4), %esi - negl %ecx - xorl %ebx, %ebx - ALIGN(8) - -L(oop): adcl $0, %ebx - movl (%esi,%ecx,4), %eax - - mull %ebp - - addl %eax, %ebx - - movl %ebx, (%edi,%ecx,4) - incl %ecx - - movl %edx, %ebx - jnz L(oop) - - adcl $0, %ebx - movl %ebx, %eax - popl %ebp - popl %ebx - popl %esi - popl %edi - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/pentium/mul_basecase.asm b/rts/gmp/mpn/x86/pentium/mul_basecase.asm deleted file mode 100644 index d9f79a0831..0000000000 --- a/rts/gmp/mpn/x86/pentium/mul_basecase.asm +++ /dev/null @@ -1,135 +0,0 @@ -dnl Intel Pentium mpn_mul_basecase -- mpn by mpn multiplication. -dnl -dnl P5: 14.2 cycles/crossproduct (approx) - - -dnl Copyright (C) 1996, 1998, 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. 
If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C void mpn_mul_basecase (mp_ptr wp, -C mp_srcptr xp, mp_size_t xsize, -C mp_srcptr yp, mp_size_t ysize); - -defframe(PARAM_YSIZE, 20) -defframe(PARAM_YP, 16) -defframe(PARAM_XSIZE, 12) -defframe(PARAM_XP, 8) -defframe(PARAM_WP, 4) - -defframe(VAR_COUNTER, -4) - - .text - ALIGN(8) -PROLOGUE(mpn_mul_basecase) - - pushl %eax C dummy push for allocating stack slot - pushl %esi - pushl %ebp - pushl %edi -deflit(`FRAME',16) - - movl PARAM_XP,%esi - movl PARAM_WP,%edi - movl PARAM_YP,%ebp - - movl (%esi),%eax C load xp[0] - mull (%ebp) C multiply by yp[0] - movl %eax,(%edi) C store to wp[0] - movl PARAM_XSIZE,%ecx C xsize - decl %ecx C If xsize = 1, ysize = 1 too - jz L(done) - - movl PARAM_XSIZE,%eax - pushl %ebx -FRAME_pushl() - movl %edx,%ebx - leal (%esi,%eax,4),%esi C make xp point at end - leal (%edi,%eax,4),%edi C offset wp by xsize - negl %ecx C negate j size/index for inner loop - xorl %eax,%eax C clear carry - - ALIGN(8) -L(oop1): adcl $0,%ebx - movl (%esi,%ecx,4),%eax C load next limb at xp[j] - mull (%ebp) - addl %ebx,%eax - movl %eax,(%edi,%ecx,4) - incl %ecx - movl %edx,%ebx - jnz L(oop1) - - adcl $0,%ebx - movl PARAM_YSIZE,%eax - movl %ebx,(%edi) C most significant limb of product - addl $4,%edi C increment wp - decl %eax - jz L(skip) - movl %eax,VAR_COUNTER C set index i to ysize - -L(outer): - addl $4,%ebp C make ebp point to next y limb - movl PARAM_XSIZE,%ecx - negl %ecx - xorl %ebx,%ebx - - C code at 0x61 here, close enough to aligned -L(oop2): - adcl $0,%ebx - movl (%esi,%ecx,4),%eax - mull (%ebp) - addl %ebx,%eax - movl (%edi,%ecx,4),%ebx - adcl $0,%edx - addl %eax,%ebx - movl %ebx,(%edi,%ecx,4) - incl %ecx - movl %edx,%ebx - jnz L(oop2) - - adcl $0,%ebx - - movl %ebx,(%edi) - addl $4,%edi - movl VAR_COUNTER,%eax - decl %eax - movl %eax,VAR_COUNTER - jnz L(outer) - -L(skip): - popl %ebx - popl %edi 
- popl %ebp - popl %esi - addl $4,%esp - ret - -L(done): - movl %edx,4(%edi) C store to wp[1] - popl %edi - popl %ebp - popl %esi - popl %eax C dummy pop for deallocating stack slot - ret - -EPILOGUE() - diff --git a/rts/gmp/mpn/x86/pentium/rshift.asm b/rts/gmp/mpn/x86/pentium/rshift.asm deleted file mode 100644 index e8f5ae8ec8..0000000000 --- a/rts/gmp/mpn/x86/pentium/rshift.asm +++ /dev/null @@ -1,236 +0,0 @@ -dnl Intel Pentium mpn_rshift -- mpn right shift. -dnl -dnl cycles/limb -dnl P5,P54: 6.0 -dnl P55: 5.375 - - -dnl Copyright (C) 1992, 1994, 1995, 1996, 1999, 2000 Free Software -dnl Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_rshift (mp_ptr dst, mp_srcptr src, mp_size_t size, -C unsigned shift); -C -C The main shift-by-N loop should run at 5.375 c/l and that's what P55 does, -C but P5 and P54 run only at 6.0 c/l, which is 4 cycles lost somewhere. 
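For reference, the operation this routine performs can be written as a short C sketch. This is an illustration with invented names (`shrd32`, `ref_rshift`), not code from the GMP sources; it assumes 32-bit limbs and a shift count in 1..31:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t mp_limb_t;   /* 32-bit limbs, as on x86 */

/* C model of `shrdl %cl, high, low': low is shifted right by cl,
   with bits coming in from high.  Valid for 1 <= cl <= 31. */
static mp_limb_t shrd32(mp_limb_t low, mp_limb_t high, unsigned cl)
{
    return (low >> cl) | (mp_limb_t)((uint64_t)high << (32 - cl));
}

/* Reference semantics of mpn_rshift: each dst limb combines two
   adjacent src limbs, the top limb is a plain shift, and the bits
   shifted out of src[0] are returned in the high end of a limb. */
mp_limb_t ref_rshift(mp_limb_t *dst, const mp_limb_t *src,
                     size_t size, unsigned shift)
{
    mp_limb_t retval = shrd32(0, src[0], shift);
    for (size_t i = 0; i + 1 < size; i++)
        dst[i] = shrd32(src[i], src[i + 1], shift);
    dst[size - 1] = src[size - 1] >> shift;
    return retval;
}
```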
- -defframe(PARAM_SHIFT,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(8) -PROLOGUE(mpn_rshift) - - pushl %edi - pushl %esi - pushl %ebx - pushl %ebp -deflit(`FRAME',16) - - movl PARAM_DST,%edi - movl PARAM_SRC,%esi - movl PARAM_SIZE,%ebp - movl PARAM_SHIFT,%ecx - -C We can use faster code for shift-by-1 under certain conditions. - cmp $1,%ecx - jne L(normal) - leal 4(%edi),%eax - cmpl %esi,%eax - jnc L(special) C jump if res_ptr + 1 >= s_ptr - leal (%edi,%ebp,4),%eax - cmpl %eax,%esi - jnc L(special) C jump if s_ptr >= res_ptr + size - -L(normal): - movl (%esi),%edx - addl $4,%esi - xorl %eax,%eax - shrdl( %cl, %edx, %eax) C compute carry limb - pushl %eax C push carry limb onto stack - - decl %ebp - pushl %ebp - shrl $3,%ebp - jz L(end) - - movl (%edi),%eax C fetch destination cache line - - ALIGN(4) -L(oop): movl 28(%edi),%eax C fetch destination cache line - movl %edx,%ebx - - movl (%esi),%eax - movl 4(%esi),%edx - shrdl( %cl, %eax, %ebx) - shrdl( %cl, %edx, %eax) - movl %ebx,(%edi) - movl %eax,4(%edi) - - movl 8(%esi),%ebx - movl 12(%esi),%eax - shrdl( %cl, %ebx, %edx) - shrdl( %cl, %eax, %ebx) - movl %edx,8(%edi) - movl %ebx,12(%edi) - - movl 16(%esi),%edx - movl 20(%esi),%ebx - shrdl( %cl, %edx, %eax) - shrdl( %cl, %ebx, %edx) - movl %eax,16(%edi) - movl %edx,20(%edi) - - movl 24(%esi),%eax - movl 28(%esi),%edx - shrdl( %cl, %eax, %ebx) - shrdl( %cl, %edx, %eax) - movl %ebx,24(%edi) - movl %eax,28(%edi) - - addl $32,%esi - addl $32,%edi - decl %ebp - jnz L(oop) - -L(end): popl %ebp - andl $7,%ebp - jz L(end2) -L(oop2): - movl (%esi),%eax - shrdl( %cl,%eax,%edx) C compute result limb - movl %edx,(%edi) - movl %eax,%edx - addl $4,%esi - addl $4,%edi - decl %ebp - jnz L(oop2) - -L(end2): - shrl %cl,%edx C compute most significant limb - movl %edx,(%edi) C store it - - popl %eax C pop carry limb - - popl %ebp - popl %ebx - popl %esi - popl %edi - ret - - -C We loop from least significant end of the arrays, which 
is only -C permissable if the source and destination don't overlap, since the -C function is documented to work for overlapping source and destination. - -L(special): - leal -4(%edi,%ebp,4),%edi - leal -4(%esi,%ebp,4),%esi - - movl (%esi),%edx - subl $4,%esi - - decl %ebp - pushl %ebp - shrl $3,%ebp - - shrl %edx - incl %ebp - decl %ebp - jz L(Lend) - - movl (%edi),%eax C fetch destination cache line - - ALIGN(4) -L(Loop): - movl -28(%edi),%eax C fetch destination cache line - movl %edx,%ebx - - movl (%esi),%eax - movl -4(%esi),%edx - rcrl %eax - movl %ebx,(%edi) - rcrl %edx - movl %eax,-4(%edi) - - movl -8(%esi),%ebx - movl -12(%esi),%eax - rcrl %ebx - movl %edx,-8(%edi) - rcrl %eax - movl %ebx,-12(%edi) - - movl -16(%esi),%edx - movl -20(%esi),%ebx - rcrl %edx - movl %eax,-16(%edi) - rcrl %ebx - movl %edx,-20(%edi) - - movl -24(%esi),%eax - movl -28(%esi),%edx - rcrl %eax - movl %ebx,-24(%edi) - rcrl %edx - movl %eax,-28(%edi) - - leal -32(%esi),%esi C use leal not to clobber carry - leal -32(%edi),%edi - decl %ebp - jnz L(Loop) - -L(Lend): - popl %ebp - sbbl %eax,%eax C save carry in %eax - andl $7,%ebp - jz L(Lend2) - addl %eax,%eax C restore carry from eax -L(Loop2): - movl %edx,%ebx - movl (%esi),%edx - rcrl %edx - movl %ebx,(%edi) - - leal -4(%esi),%esi C use leal not to clobber carry - leal -4(%edi),%edi - decl %ebp - jnz L(Loop2) - - jmp L(L1) -L(Lend2): - addl %eax,%eax C restore carry from eax -L(L1): movl %edx,(%edi) C store last limb - - movl $0,%eax - rcrl %eax - - popl %ebp - popl %ebx - popl %esi - popl %edi - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/pentium/sqr_basecase.asm b/rts/gmp/mpn/x86/pentium/sqr_basecase.asm deleted file mode 100644 index c8584df13c..0000000000 --- a/rts/gmp/mpn/x86/pentium/sqr_basecase.asm +++ /dev/null @@ -1,520 +0,0 @@ -dnl Intel P5 mpn_sqr_basecase -- square an mpn number. -dnl -dnl P5: approx 8 cycles per crossproduct, or 15.5 cycles per triangular -dnl product at around 20x20 limbs. 
- - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C void mpn_sqr_basecase (mp_ptr dst, mp_srcptr src, mp_size_t size); -C -C Calculate src,size squared, storing the result in dst,2*size. -C -C The algorithm is basically the same as mpn/generic/sqr_basecase.c, but a -C lot of function call overheads are avoided, especially when the size is -C small. 
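The phases the assembler goes through below (a mul_1 row, addmul rows for the rest of the product triangle, a one-bit left shift, then the diagonal squares) can be sketched in portable C. This is a simplified model of the algorithm with invented names, assuming 32-bit limbs; it is not the generic C code itself:

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t mp_limb_t;   /* 32-bit limbs, as on x86 */

/* Model of the basecase squaring scheme: accumulate the cross
   products src[i]*src[j] (i < j) into dst, double them with a
   one-bit left shift, then add the diagonal squares src[i]^2.
   dst must have room for 2*n limbs, n >= 1. */
void ref_sqr_basecase(mp_limb_t *dst, const mp_limb_t *src, size_t n)
{
    for (size_t i = 0; i < 2 * n; i++)
        dst[i] = 0;

    /* cross-product triangle, rows i = 0 .. n-1 (the last is empty) */
    for (size_t i = 0; i < n; i++) {
        mp_limb_t carry = 0;
        for (size_t j = i + 1; j < n; j++) {
            uint64_t t = (uint64_t)src[i] * src[j] + dst[i + j] + carry;
            dst[i + j] = (mp_limb_t)t;
            carry = (mp_limb_t)(t >> 32);
        }
        dst[i + n] = carry;
    }

    /* double the cross products: left shift by one bit */
    mp_limb_t cy = 0;
    for (size_t i = 0; i < 2 * n; i++) {
        mp_limb_t hi = dst[i] >> 31;
        dst[i] = (dst[i] << 1) | cy;
        cy = hi;
    }

    /* add the diagonal squares src[i]^2 at position 2*i */
    mp_limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)src[i] * src[i];
        uint64_t lo = (uint64_t)dst[2 * i] + (mp_limb_t)t + carry;
        dst[2 * i] = (mp_limb_t)lo;
        uint64_t hi2 = (uint64_t)dst[2 * i + 1] + (mp_limb_t)(t >> 32)
                       + (mp_limb_t)(lo >> 32);
        dst[2 * i + 1] = (mp_limb_t)hi2;
        carry = (mp_limb_t)(hi2 >> 32);
    }
}
```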
- -defframe(PARAM_SIZE,12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(8) -PROLOGUE(mpn_sqr_basecase) -deflit(`FRAME',0) - - movl PARAM_SIZE, %edx - movl PARAM_SRC, %eax - - cmpl $2, %edx - movl PARAM_DST, %ecx - - je L(two_limbs) - - movl (%eax), %eax - ja L(three_or_more) - -C ----------------------------------------------------------------------------- -C one limb only - C eax src - C ebx - C ecx dst - C edx - - mull %eax - - movl %eax, (%ecx) - movl %edx, 4(%ecx) - - ret - -C ----------------------------------------------------------------------------- - ALIGN(8) -L(two_limbs): - C eax src - C ebx - C ecx dst - C edx size - - pushl %ebp - pushl %edi - - pushl %esi - pushl %ebx - - movl %eax, %ebx - movl (%eax), %eax - - mull %eax C src[0]^2 - - movl %eax, (%ecx) C dst[0] - movl %edx, %esi C dst[1] - - movl 4(%ebx), %eax - - mull %eax C src[1]^2 - - movl %eax, %edi C dst[2] - movl %edx, %ebp C dst[3] - - movl (%ebx), %eax - - mull 4(%ebx) C src[0]*src[1] - - addl %eax, %esi - popl %ebx - - adcl %edx, %edi - - adcl $0, %ebp - addl %esi, %eax - - adcl %edi, %edx - movl %eax, 4(%ecx) - - adcl $0, %ebp - popl %esi - - movl %edx, 8(%ecx) - movl %ebp, 12(%ecx) - - popl %edi - popl %ebp - - ret - - -C ----------------------------------------------------------------------------- - ALIGN(8) -L(three_or_more): - C eax src low limb - C ebx - C ecx dst - C edx size - - cmpl $4, %edx - pushl %ebx -deflit(`FRAME',4) - - movl PARAM_SRC, %ebx - jae L(four_or_more) - - -C ----------------------------------------------------------------------------- -C three limbs - C eax src low limb - C ebx src - C ecx dst - C edx size - - pushl %ebp - pushl %edi - - mull %eax C src[0] ^ 2 - - movl %eax, (%ecx) - movl %edx, 4(%ecx) - - movl 4(%ebx), %eax - xorl %ebp, %ebp - - mull %eax C src[1] ^ 2 - - movl %eax, 8(%ecx) - movl %edx, 12(%ecx) - - movl 8(%ebx), %eax - pushl %esi C risk of cache bank clash - - mull %eax C src[2] ^ 2 - - movl %eax, 16(%ecx) - movl %edx, 
20(%ecx) - - movl (%ebx), %eax - - mull 4(%ebx) C src[0] * src[1] - - movl %eax, %esi - movl %edx, %edi - - movl (%ebx), %eax - - mull 8(%ebx) C src[0] * src[2] - - addl %eax, %edi - movl %edx, %ebp - - adcl $0, %ebp - movl 4(%ebx), %eax - - mull 8(%ebx) C src[1] * src[2] - - xorl %ebx, %ebx - addl %eax, %ebp - - C eax - C ebx zero, will be dst[5] - C ecx dst - C edx dst[4] - C esi dst[1] - C edi dst[2] - C ebp dst[3] - - adcl $0, %edx - addl %esi, %esi - - adcl %edi, %edi - - adcl %ebp, %ebp - - adcl %edx, %edx - movl 4(%ecx), %eax - - adcl $0, %ebx - addl %esi, %eax - - movl %eax, 4(%ecx) - movl 8(%ecx), %eax - - adcl %edi, %eax - movl 12(%ecx), %esi - - adcl %ebp, %esi - movl 16(%ecx), %edi - - movl %eax, 8(%ecx) - movl %esi, 12(%ecx) - - adcl %edx, %edi - popl %esi - - movl 20(%ecx), %eax - movl %edi, 16(%ecx) - - popl %edi - popl %ebp - - adcl %ebx, %eax C no carry out of this - popl %ebx - - movl %eax, 20(%ecx) - - ret - - -C ----------------------------------------------------------------------------- - ALIGN(8) -L(four_or_more): - C eax src low limb - C ebx src - C ecx dst - C edx size - C esi - C edi - C ebp - C - C First multiply src[0]*src[1..size-1] and store at dst[1..size]. - -deflit(`FRAME',4) - - pushl %edi -FRAME_pushl() - pushl %esi -FRAME_pushl() - - pushl %ebp -FRAME_pushl() - leal (%ecx,%edx,4), %edi C dst end of this mul1 - - leal (%ebx,%edx,4), %esi C src end - movl %ebx, %ebp C src - - negl %edx C -size - xorl %ebx, %ebx C clear carry limb and carry flag - - leal 1(%edx), %ecx C -(size-1) - -L(mul1): - C eax scratch - C ebx carry - C ecx counter, negative - C edx scratch - C esi &src[size] - C edi &dst[size] - C ebp src - - adcl $0, %ebx - movl (%esi,%ecx,4), %eax - - mull (%ebp) - - addl %eax, %ebx - - movl %ebx, (%edi,%ecx,4) - incl %ecx - - movl %edx, %ebx - jnz L(mul1) - - - C Add products src[n]*src[n+1..size-1] at dst[2*n-1...], for - C n=1..size-2. 
- C - C The last two products, which are the end corner of the product - C triangle, are handled separately to save looping overhead. These - C are src[size-3]*src[size-2,size-1] and src[size-2]*src[size-1]. - C If size is 4 then it's only these that need to be done. - C - C In the outer loop %esi is a constant, and %edi just advances by 1 - C limb each time. The size of the operation decreases by 1 limb - C each time. - - C eax - C ebx carry (needing carry flag added) - C ecx - C edx - C esi &src[size] - C edi &dst[size] - C ebp - - adcl $0, %ebx - movl PARAM_SIZE, %edx - - movl %ebx, (%edi) - subl $4, %edx - - negl %edx - jz L(corner) - - -L(outer): - C ebx previous carry limb to store - C edx outer loop counter (negative) - C esi &src[size] - C edi dst, pointing at stored carry limb of previous loop - - pushl %edx C new outer loop counter - leal -2(%edx), %ecx - - movl %ebx, (%edi) - addl $4, %edi - - addl $4, %ebp - xorl %ebx, %ebx C initial carry limb, clear carry flag - -L(inner): - C eax scratch - C ebx carry (needing carry flag added) - C ecx counter, negative - C edx scratch - C esi &src[size] - C edi dst end of this addmul - C ebp &src[j] - - adcl $0, %ebx - movl (%esi,%ecx,4), %eax - - mull (%ebp) - - addl %ebx, %eax - movl (%edi,%ecx,4), %ebx - - adcl $0, %edx - addl %eax, %ebx - - movl %ebx, (%edi,%ecx,4) - incl %ecx - - movl %edx, %ebx - jnz L(inner) - - - adcl $0, %ebx - popl %edx C outer loop counter - - incl %edx - jnz L(outer) - - - movl %ebx, (%edi) - -L(corner): - C esi &src[size] - C edi &dst[2*size-4] - - movl -8(%esi), %eax - movl -4(%edi), %ebx C risk of data cache bank clash here - - mull -12(%esi) C src[size-2]*src[size-3] - - addl %eax, %ebx - movl %edx, %ecx - - adcl $0, %ecx - movl -4(%esi), %eax - - mull -12(%esi) C src[size-1]*src[size-3] - - addl %ecx, %eax - movl (%edi), %ecx - - adcl $0, %edx - movl %ebx, -4(%edi) - - addl %eax, %ecx - movl %edx, %ebx - - adcl $0, %ebx - movl -4(%esi), %eax - - mull -8(%esi) C 
src[size-1]*src[size-2] - - movl %ecx, 0(%edi) - addl %eax, %ebx - - adcl $0, %edx - movl PARAM_SIZE, %eax - - negl %eax - movl %ebx, 4(%edi) - - addl $1, %eax C -(size-1) and clear carry - movl %edx, 8(%edi) - - -C ----------------------------------------------------------------------------- -C Left shift of dst[1..2*size-2], high bit shifted out becomes dst[2*size-1]. - -L(lshift): - C eax counter, negative - C ebx next limb - C ecx - C edx - C esi - C edi &dst[2*size-4] - C ebp - - movl 12(%edi,%eax,8), %ebx - - rcll %ebx - movl 16(%edi,%eax,8), %ecx - - rcll %ecx - movl %ebx, 12(%edi,%eax,8) - - movl %ecx, 16(%edi,%eax,8) - incl %eax - - jnz L(lshift) - - - adcl %eax, %eax C high bit out - movl PARAM_SRC, %esi - - movl PARAM_SIZE, %ecx C risk of cache bank clash - movl %eax, 12(%edi) C dst most significant limb - - -C ----------------------------------------------------------------------------- -C Now add in the squares on the diagonal, namely src[0]^2, src[1]^2, ..., -C src[size-1]^2. dst[0] hasn't yet been set at all yet, and just gets the -C low limb of src[0]^2. 
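The reason a single one-bit left shift followed by a diagonal pass produces the square is the identity x^2 = 2*(sum of cross products) + (sum of squares). A toy numeric check of that identity (illustrative C with an invented name, using a small base 2^8 in place of the 2^32 limb base; not GMP code):

```c
#include <stdint.h>

/* Check x^2 == 2*cross + diag for x = x0 + x1*B + x2*B^2.
   B = 2^8 is a toy stand-in for the 2^32 limb base; the identity
   is exact in unsigned arithmetic (it even holds modulo 2^64). */
int check_square_identity(uint64_t x0, uint64_t x1, uint64_t x2)
{
    const uint64_t B = 1u << 8;
    uint64_t x     = x0 + x1 * B + x2 * B * B;
    uint64_t cross = x0 * x1 * B + x0 * x2 * B * B
                     + x1 * x2 * B * B * B;
    uint64_t diag  = x0 * x0 + x1 * x1 * B * B
                     + x2 * x2 * B * B * B * B;
    return x * x == 2 * cross + diag;
}
```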
- - movl (%esi), %eax C src[0] - leal (%esi,%ecx,4), %esi C src end - - negl %ecx - - mull %eax - - movl %eax, 16(%edi,%ecx,8) C dst[0] - movl %edx, %ebx - - addl $1, %ecx C size-1 and clear carry - -L(diag): - C eax scratch (low product) - C ebx carry limb - C ecx counter, negative - C edx scratch (high product) - C esi &src[size] - C edi &dst[2*size-4] - C ebp scratch (fetched dst limbs) - - movl (%esi,%ecx,4), %eax - adcl $0, %ebx - - mull %eax - - movl 16-4(%edi,%ecx,8), %ebp - - addl %ebp, %ebx - movl 16(%edi,%ecx,8), %ebp - - adcl %eax, %ebp - movl %ebx, 16-4(%edi,%ecx,8) - - movl %ebp, 16(%edi,%ecx,8) - incl %ecx - - movl %edx, %ebx - jnz L(diag) - - - adcl $0, %edx - movl 16-4(%edi), %eax C dst most significant limb - - addl %eax, %edx - popl %ebp - - movl %edx, 16-4(%edi) - popl %esi C risk of cache bank clash - - popl %edi - popl %ebx - - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/rshift.asm b/rts/gmp/mpn/x86/rshift.asm deleted file mode 100644 index c9881fd966..0000000000 --- a/rts/gmp/mpn/x86/rshift.asm +++ /dev/null @@ -1,92 +0,0 @@ -dnl x86 mpn_rshift -- mpn right shift. - -dnl Copyright (C) 1992, 1994, 1996, 1999, 2000 Free Software Foundation, -dnl Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. 
If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_rshift (mp_ptr dst, mp_srcptr src, mp_size_t size, -C unsigned shift); - -defframe(PARAM_SHIFT,16) -defframe(PARAM_SIZE, 12) -defframe(PARAM_SRC, 8) -defframe(PARAM_DST, 4) - - .text - ALIGN(8) -PROLOGUE(mpn_rshift) - - pushl %edi - pushl %esi - pushl %ebx -deflit(`FRAME',12) - - movl PARAM_DST,%edi - movl PARAM_SRC,%esi - movl PARAM_SIZE,%edx - movl PARAM_SHIFT,%ecx - - leal -4(%edi,%edx,4),%edi - leal (%esi,%edx,4),%esi - negl %edx - - movl (%esi,%edx,4),%ebx C read least significant limb - xorl %eax,%eax - shrdl( %cl, %ebx, %eax) C compute carry limb - incl %edx - jz L(end) - pushl %eax C push carry limb onto stack - testb $1,%dl - jnz L(1) C enter loop in the middle - movl %ebx,%eax - - ALIGN(8) -L(oop): movl (%esi,%edx,4),%ebx C load next higher limb - shrdl( %cl, %ebx, %eax) C compute result limb - movl %eax,(%edi,%edx,4) C store it - incl %edx -L(1): movl (%esi,%edx,4),%eax - shrdl( %cl, %eax, %ebx) - movl %ebx,(%edi,%edx,4) - incl %edx - jnz L(oop) - - shrl %cl,%eax C compute most significant limb - movl %eax,(%edi) C store it - - popl %eax C pop carry limb - - popl %ebx - popl %esi - popl %edi - ret - -L(end): shrl %cl,%ebx C compute most significant limb - movl %ebx,(%edi) C store it - - popl %ebx - popl %esi - popl %edi - ret - -EPILOGUE() diff --git a/rts/gmp/mpn/x86/udiv.asm b/rts/gmp/mpn/x86/udiv.asm deleted file mode 100644 index 9fe022b107..0000000000 --- a/rts/gmp/mpn/x86/udiv.asm +++ /dev/null @@ -1,44 +0,0 @@ -dnl x86 mpn_udiv_qrnnd -- 2 by 1 limb division - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. 
-dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_udiv_qrnnd (mp_limb_t *remptr, mp_limb_t high, mp_limb_t low, -C mp_limb_t divisor); - -defframe(PARAM_DIVISOR, 16) -defframe(PARAM_LOW, 12) -defframe(PARAM_HIGH, 8) -defframe(PARAM_REMPTR, 4) - - TEXT - ALIGN(8) -PROLOGUE(mpn_udiv_qrnnd) -deflit(`FRAME',0) - movl PARAM_LOW, %eax - movl PARAM_HIGH, %edx - divl PARAM_DIVISOR - movl PARAM_REMPTR, %ecx - movl %edx, (%ecx) - ret -EPILOGUE() diff --git a/rts/gmp/mpn/x86/umul.asm b/rts/gmp/mpn/x86/umul.asm deleted file mode 100644 index 3d289d1784..0000000000 --- a/rts/gmp/mpn/x86/umul.asm +++ /dev/null @@ -1,43 +0,0 @@ -dnl mpn_umul_ppmm -- 1x1->2 limb multiplication - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. 
-dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -include(`../config.m4') - - -C mp_limb_t mpn_umul_ppmm (mp_limb_t *lowptr, mp_limb_t m1, mp_limb_t m2); -C - -defframe(PARAM_M2, 12) -defframe(PARAM_M1, 8) -defframe(PARAM_LOWPTR, 4) - - TEXT - ALIGN(8) -PROLOGUE(mpn_umul_ppmm) -deflit(`FRAME',0) - movl PARAM_LOWPTR, %ecx - movl PARAM_M1, %eax - mull PARAM_M2 - movl %eax, (%ecx) - movl %edx, %eax - ret -EPILOGUE() diff --git a/rts/gmp/mpn/x86/x86-defs.m4 b/rts/gmp/mpn/x86/x86-defs.m4 deleted file mode 100644 index 2dad698002..0000000000 --- a/rts/gmp/mpn/x86/x86-defs.m4 +++ /dev/null @@ -1,713 +0,0 @@ -divert(-1) - -dnl m4 macros for x86 assembler. - - -dnl Copyright (C) 1999, 2000 Free Software Foundation, Inc. -dnl -dnl This file is part of the GNU MP Library. -dnl -dnl The GNU MP Library is free software; you can redistribute it and/or -dnl modify it under the terms of the GNU Lesser General Public License as -dnl published by the Free Software Foundation; either version 2.1 of the -dnl License, or (at your option) any later version. -dnl -dnl The GNU MP Library is distributed in the hope that it will be useful, -dnl but WITHOUT ANY WARRANTY; without even the implied warranty of -dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU -dnl Lesser General Public License for more details. -dnl -dnl You should have received a copy of the GNU Lesser General Public -dnl License along with the GNU MP Library; see the file COPYING.LIB. 
If -dnl not, write to the Free Software Foundation, Inc., 59 Temple Place - -dnl Suite 330, Boston, MA 02111-1307, USA. - - -dnl Notes: -dnl -dnl m4 isn't perfect for processing BSD style x86 assembler code, the main -dnl problems are, -dnl -dnl 1. Doing define(foo,123) and then using foo in an addressing mode like -dnl foo(%ebx) expands as a macro rather than a constant. This is worked -dnl around by using deflit() from asm-defs.m4, instead of define(). -dnl -dnl 2. Immediates in macro definitions need a space or `' to stop the $ -dnl looking like a macro parameter. For example, -dnl -dnl define(foo, `mov $ 123, %eax') -dnl -dnl This is only a problem in macro definitions, not in ordinary text, -dnl nor in macro parameters like text passed to forloop() or ifdef(). - - -deflit(BYTES_PER_MP_LIMB, 4) - - -dnl -------------------------------------------------------------------------- -dnl Replacement PROLOGUE/EPILOGUE with more sophisticated error checking. -dnl Nesting and overlapping not allowed. -dnl - - -dnl Usage: PROLOGUE(functionname) -dnl -dnl Generate a function prologue. functionname gets GSYM_PREFIX added. -dnl Examples, -dnl -dnl PROLOGUE(mpn_add_n) -dnl PROLOGUE(somefun) - -define(`PROLOGUE', -m4_assert_numargs(1) -m4_assert_defined(`PROLOGUE_cpu') -`ifdef(`PROLOGUE_current_function', -`m4_error(`PROLOGUE'(`PROLOGUE_current_function') needs an `EPILOGUE'() before `PROLOGUE'($1) -)')dnl -m4_file_seen()dnl -define(`PROLOGUE_current_function',`$1')dnl -PROLOGUE_cpu(GSYM_PREFIX`'$1)') - - -dnl Usage: EPILOGUE() -dnl -dnl Notice the function name is passed to EPILOGUE_cpu(), letting it use $1 -dnl instead of the long PROLOGUE_current_function symbol. 
-
-define(`EPILOGUE',
-m4_assert_numargs(0)
-m4_assert_defined(`EPILOGUE_cpu')
-`ifdef(`PROLOGUE_current_function',,
-`m4_error(`EPILOGUE'() with no `PROLOGUE'()
-)')dnl
-EPILOGUE_cpu(GSYM_PREFIX`'PROLOGUE_current_function)`'dnl
-undefine(`PROLOGUE_current_function')')
-
-m4wrap_prepend(
-`ifdef(`PROLOGUE_current_function',
-`m4_error(`EPILOGUE() for PROLOGUE('PROLOGUE_current_function`) never seen
-')')')
-
-
-dnl Usage: PROLOGUE_assert_inside()
-dnl
-dnl Use this unquoted on a line on its own at the start of a macro
-dnl definition to add some code to check the macro is only used inside a
-dnl PROLOGUE/EPILOGUE pair, and that hence PROLOGUE_current_function is
-dnl defined.
-
-define(PROLOGUE_assert_inside,
-m4_assert_numargs(0)
-``PROLOGUE_assert_inside_internal'(m4_doublequote($`'0))`dnl '')
-
-define(PROLOGUE_assert_inside_internal,
-m4_assert_numargs(1)
-`ifdef(`PROLOGUE_current_function',,
-`m4_error(`$1 used outside a PROLOGUE / EPILOGUE pair
-')')')
-
-
-dnl Usage: L(labelname)
-dnl        LF(functionname,labelname)
-dnl
-dnl Generate a local label in the current or given function.  For LF(),
-dnl functionname gets GSYM_PREFIX added, the same as with PROLOGUE().
-dnl
-dnl For example, in a function mpn_add_n (and with MPN_PREFIX __gmpn),
-dnl
-dnl         L(bar)          => L__gmpn_add_n__bar
-dnl         LF(somefun,bar) => Lsomefun__bar
-dnl
-dnl The function name and label name get two underscores between them rather
-dnl than one to guard against clashing with a separate external symbol that
-dnl happened to be called functionname_labelname.  (Though this would only
-dnl happen if the local label prefix is empty.)  Underscores are used so
-dnl the whole label will still be a valid C identifier and so can be easily
-dnl used in gdb.
-
-dnl LSYM_PREFIX can be L$, so defn() is used to prevent L expanding as the
-dnl L macro and making an infinite recursion.
-define(LF, -m4_assert_numargs(2) -m4_assert_defined(`LSYM_PREFIX') -`defn(`LSYM_PREFIX')GSYM_PREFIX`'$1`'__$2') - -define(`L', -m4_assert_numargs(1) -PROLOGUE_assert_inside() -`LF(PROLOGUE_current_function,`$1')') - - -dnl Called: PROLOGUE_cpu(gsym) -dnl EPILOGUE_cpu(gsym) - -define(PROLOGUE_cpu, -m4_assert_numargs(1) - `GLOBL $1 - TYPE($1,`function') -$1:') - -define(EPILOGUE_cpu, -m4_assert_numargs(1) -` SIZE($1,.-$1)') - - - -dnl -------------------------------------------------------------------------- -dnl Various x86 macros. -dnl - - -dnl Usage: ALIGN_OFFSET(bytes,offset) -dnl -dnl Align to `offset' away from a multiple of `bytes'. -dnl -dnl This is useful for testing, for example align to something very strict -dnl and see what effect offsets from it have, "ALIGN_OFFSET(256,32)". -dnl -dnl Generally you wouldn't execute across the padding, but it's done with -dnl nop's so it'll work. - -define(ALIGN_OFFSET, -m4_assert_numargs(2) -`ALIGN($1) -forloop(`i',1,$2,` nop -')') - - -dnl Usage: defframe(name,offset) -dnl -dnl Make a definition like the following with which to access a parameter -dnl or variable on the stack. -dnl -dnl define(name,`FRAME+offset(%esp)') -dnl -dnl Actually m4_empty_if_zero(FRAME+offset) is used, which will save one -dnl byte if FRAME+offset is zero, by putting (%esp) rather than 0(%esp). -dnl Use define(`defframe_empty_if_zero_disabled',1) if for some reason the -dnl zero offset is wanted. -dnl -dnl The new macro also gets a check that when it's used FRAME is actually -dnl defined, and that the final %esp offset isn't negative, which would -dnl mean an attempt to access something below the current %esp. -dnl -dnl deflit() is used rather than a plain define(), so the new macro won't -dnl delete any following parenthesized expression. name(%edi) will come -dnl out say as 16(%esp)(%edi). This isn't valid assembler and should -dnl provoke an error, which is better than silently giving just 16(%esp). 
-dnl -dnl See README.family for more on the suggested way to access the stack -dnl frame. - -define(defframe, -m4_assert_numargs(2) -`deflit(`$1', -m4_assert_defined(`FRAME') -`defframe_check_notbelow(`$1',$2,FRAME)dnl -defframe_empty_if_zero(FRAME+($2))(%esp)')') - -dnl Called: defframe_empty_if_zero(expression) -define(defframe_empty_if_zero, -`ifelse(defframe_empty_if_zero_disabled,1, -`eval($1)', -`m4_empty_if_zero($1)')') - -dnl Called: defframe_check_notbelow(`name',offset,FRAME) -define(defframe_check_notbelow, -m4_assert_numargs(3) -`ifelse(eval(($3)+($2)<0),1, -`m4_error(`$1 at frame offset $2 used when FRAME is only $3 bytes -')')') - - -dnl Usage: FRAME_pushl() -dnl FRAME_popl() -dnl FRAME_addl_esp(n) -dnl FRAME_subl_esp(n) -dnl -dnl Adjust FRAME appropriately for a pushl or popl, or for an addl or subl -dnl %esp of n bytes. -dnl -dnl Using these macros is completely optional. Sometimes it makes more -dnl sense to put explicit deflit(`FRAME',N) forms, especially when there's -dnl jumps and different sequences of FRAME values need to be used in -dnl different places. - -define(FRAME_pushl, -m4_assert_numargs(0) -m4_assert_defined(`FRAME') -`deflit(`FRAME',eval(FRAME+4))') - -define(FRAME_popl, -m4_assert_numargs(0) -m4_assert_defined(`FRAME') -`deflit(`FRAME',eval(FRAME-4))') - -define(FRAME_addl_esp, -m4_assert_numargs(1) -m4_assert_defined(`FRAME') -`deflit(`FRAME',eval(FRAME-($1)))') - -define(FRAME_subl_esp, -m4_assert_numargs(1) -m4_assert_defined(`FRAME') -`deflit(`FRAME',eval(FRAME+($1)))') - - -dnl Usage: defframe_pushl(name) -dnl -dnl Do a combination of a FRAME_pushl() and a defframe() to name the stack -dnl location just pushed. This should come after a pushl instruction. -dnl Putting it on the same line works and avoids lengthening the code. For -dnl example, -dnl -dnl pushl %eax defframe_pushl(VAR_COUNTER) -dnl -dnl Notice the defframe() is done with an unquoted -FRAME thus giving its -dnl current value without tracking future changes. 
- -define(defframe_pushl, -`FRAME_pushl()defframe(`$1',-FRAME)') - - -dnl -------------------------------------------------------------------------- -dnl Assembler instruction macros. -dnl - - -dnl Usage: emms_or_femms -dnl femms_available_p -dnl -dnl femms_available_p expands to 1 or 0 according to whether the AMD 3DNow -dnl femms instruction is available. emms_or_femms expands to femms if -dnl available, or emms if not. -dnl -dnl emms_or_femms is meant for use in the K6 directory where plain K6 -dnl (without femms) and K6-2 and K6-3 (with a slightly faster femms) are -dnl supported together. -dnl -dnl On K7 femms is no longer faster and is just an alias for emms, so plain -dnl emms may as well be used. - -define(femms_available_p, -m4_assert_numargs(-1) -`m4_ifdef_anyof_p( - `HAVE_TARGET_CPU_k62', - `HAVE_TARGET_CPU_k63', - `HAVE_TARGET_CPU_athlon')') - -define(emms_or_femms, -m4_assert_numargs(-1) -`ifelse(femms_available_p,1,`femms',`emms')') - - -dnl Usage: femms -dnl -dnl The gas 2.9.1 that comes with FreeBSD 3.4 doesn't support femms, so the -dnl following is a replacement using .byte. -dnl -dnl If femms isn't available, an emms is generated instead, for convenience -dnl when testing on a machine without femms. - -define(femms, -m4_assert_numargs(-1) -`ifelse(femms_available_p,1, -`.byte 15,14 C AMD 3DNow femms', -`emms`'dnl -m4_warning(`warning, using emms in place of femms, use for testing only -')')') - - -dnl Usage: jadcl0(op) -dnl -dnl Issue a jnc/incl as a substitute for adcl $0,op. This isn't an exact -dnl replacement, since it doesn't set the flags like adcl does. -dnl -dnl This finds a use in K6 mpn_addmul_1, mpn_submul_1, mpn_mul_basecase and -dnl mpn_sqr_basecase because on K6 an adcl is slow, the branch -dnl misprediction penalty is small, and the multiply algorithm used leads -dnl to a carry bit on average only 1/4 of the time. -dnl -dnl jadcl0_disabled can be set to 1 to instead issue an ordinary adcl for -dnl comparison. 
For example, -dnl -dnl define(`jadcl0_disabled',1) -dnl -dnl When using a register operand, eg. "jadcl0(%edx)", the jnc/incl code is -dnl the same size as an adcl. This makes it possible to use the exact same -dnl computed jump code when testing the relative speed of jnc/incl and adcl -dnl with jadcl0_disabled. - -define(jadcl0, -m4_assert_numargs(1) -`ifelse(jadcl0_disabled,1, - `adcl $`'0, $1', - `jnc 1f - incl $1 -1:dnl')') - - -dnl Usage: cmov_available_p -dnl -dnl Expand to 1 if cmov is available, 0 if not. - -define(cmov_available_p, -`m4_ifdef_anyof_p( - `HAVE_TARGET_CPU_pentiumpro', - `HAVE_TARGET_CPU_pentium2', - `HAVE_TARGET_CPU_pentium3', - `HAVE_TARGET_CPU_athlon')') - - -dnl Usage: x86_lookup(target, key,value, key,value, ...) -dnl x86_lookup_p(target, key,value, key,value, ...) -dnl -dnl Look for `target' among the `key' parameters. -dnl -dnl x86_lookup expands to the corresponding `value', or generates an error -dnl if `target' isn't found. -dnl -dnl x86_lookup_p expands to 1 if `target' is found, or 0 if not. - -define(x86_lookup, -`ifelse(eval($#<3),1, -`m4_error(`unrecognised part of x86 instruction: $1 -')', -`ifelse(`$1',`$2', `$3', -`x86_lookup(`$1',shift(shift(shift($@))))')')') - -define(x86_lookup_p, -`ifelse(eval($#<3),1, `0', -`ifelse(`$1',`$2', `1', -`x86_lookup_p(`$1',shift(shift(shift($@))))')')') - - -dnl Usage: x86_opcode_reg32(reg) -dnl x86_opcode_reg32_p(reg) -dnl -dnl x86_opcode_reg32 expands to the standard 3 bit encoding for the given -dnl 32-bit register, eg. `%ebp' turns into 5. -dnl -dnl x86_opcode_reg32_p expands to 1 if reg is a valid 32-bit register, or 0 -dnl if not. 
-
-define(x86_opcode_reg32,
-m4_assert_numargs(1)
-`x86_lookup(`$1',x86_opcode_reg32_list)')
-
-define(x86_opcode_reg32_p,
-m4_assert_numargs(1)
-`x86_lookup_p(`$1',x86_opcode_reg32_list)')
-
-define(x86_opcode_reg32_list,
-``%eax',0,
-`%ecx',1,
-`%edx',2,
-`%ebx',3,
-`%esp',4,
-`%ebp',5,
-`%esi',6,
-`%edi',7')
-
-
-dnl Usage: x86_opcode_tttn(cond)
-dnl
-dnl Expand to the 4-bit "tttn" field value for the given x86 branch
-dnl condition (like `c', `ae', etc).
-
-define(x86_opcode_tttn,
-m4_assert_numargs(1)
-`x86_lookup(`$1',x86_opcode_tttn_list)')
-
-define(x86_opcode_tttn_list,
-``o',  0,
-`no',  1,
-`b',   2, `c',  2, `nae',2,
-`nb',  3, `nc', 3, `ae', 3,
-`e',   4, `z',  4,
-`ne',  5, `nz', 5,
-`be',  6, `na', 6,
-`nbe', 7, `a',  7,
-`s',   8,
-`ns',  9,
-`p',  10, `pe', 10, `npo',10,
-`np', 11, `npe',11, `po', 11,
-`l',  12, `nge',12,
-`nl', 13, `ge', 13,
-`le', 14, `ng', 14,
-`nle',15, `g',  15')
-
-
-dnl Usage: cmovCC(srcreg,dstreg)
-dnl
-dnl Generate a cmov instruction if the target supports cmov, or simulate it
-dnl with a conditional jump if not (the latter being meant only for
-dnl testing).  For example,
-dnl
-dnl         cmovz( %eax, %ebx)
-dnl
-dnl cmov instructions are generated using .byte sequences, since only
-dnl recent versions of gas know cmov.
-dnl
-dnl The source operand can only be a plain register.  (m4 code implementing
-dnl full memory addressing modes exists, believe it or not, but isn't
-dnl currently needed and isn't included.)
-dnl
-dnl All the standard conditions are defined.  Attempting to use one without
-dnl the macro parentheses, such as just "cmovbe %eax, %ebx", will provoke
-dnl an error.  This ensures the necessary .byte sequences aren't
-dnl accidentally missed.
-
-dnl Called: define_cmov_many(cond,tttn,cond,tttn,...)
-define(define_cmov_many, -`ifelse(m4_length(`$1'),0,, -`define_cmov(`$1',`$2')define_cmov_many(shift(shift($@)))')') - -dnl Called: define_cmov(cond,tttn) -define(define_cmov, -m4_assert_numargs(2) -`define(`cmov$1', -m4_instruction_wrapper() -m4_assert_numargs(2) -`cmov_internal'(m4_doublequote($`'0),``$1',`$2'',dnl -m4_doublequote($`'1),m4_doublequote($`'2)))') - -define_cmov_many(x86_opcode_tttn_list) - - -dnl Called: cmov_internal(name,cond,tttn,src,dst) -define(cmov_internal, -m4_assert_numargs(5) -`ifelse(cmov_available_p,1, -`cmov_bytes_tttn(`$1',`$3',`$4',`$5')', -`m4_warning(`warning, simulating cmov with jump, use for testing only -')cmov_simulate(`$2',`$4',`$5')')') - -dnl Called: cmov_simulate(cond,src,dst) -dnl If this is going to be used with memory operands for the source it will -dnl need to be changed to do a fetch even if the condition is false, so as -dnl to trigger exceptions the same way a real cmov does. -define(cmov_simulate, -m4_assert_numargs(3) - `j$1 1f C cmov$1 $2, $3 - jmp 2f -1: movl $2, $3 -2:') - -dnl Called: cmov_bytes_tttn(name,tttn,src,dst) -define(cmov_bytes_tttn, -m4_assert_numargs(4) -`.byte dnl -15, dnl -eval(64+$2), dnl -eval(192+8*x86_opcode_reg32(`$4')+x86_opcode_reg32(`$3')) dnl - C `$1 $3, $4'') - - -dnl Usage: loop_or_decljnz label -dnl -dnl Generate either a "loop" instruction or a "decl %ecx / jnz", whichever -dnl is better. "loop" is better on K6 and probably on 386, on other chips -dnl separate decl/jnz is better. -dnl -dnl This macro is just for mpn/x86/divrem_1.asm and mpn/x86/mod_1.asm where -dnl this loop_or_decljnz variation is enough to let the code be shared by -dnl all chips. 
- -define(loop_or_decljnz, -`ifelse(loop_is_better_p,1, - `loop', - `decl %ecx - jnz')') - -define(loop_is_better_p, -`m4_ifdef_anyof_p(`HAVE_TARGET_CPU_k6', - `HAVE_TARGET_CPU_k62', - `HAVE_TARGET_CPU_k63', - `HAVE_TARGET_CPU_i386')') - - -dnl Usage: Zdisp(inst,op,op,op) -dnl -dnl Generate explicit .byte sequences if necessary to force a byte-sized -dnl zero displacement on an instruction. For example, -dnl -dnl Zdisp( movl, 0,(%esi), %eax) -dnl -dnl expands to -dnl -dnl .byte 139,70,0 C movl 0(%esi), %eax -dnl -dnl If the displacement given isn't 0, then normal assembler code is -dnl generated. For example, -dnl -dnl Zdisp( movl, 4,(%esi), %eax) -dnl -dnl expands to -dnl -dnl movl 4(%esi), %eax -dnl -dnl This means a single Zdisp() form can be used with an expression for the -dnl displacement, and .byte will be used only if necessary. The -dnl displacement argument is eval()ed. -dnl -dnl Because there aren't many places a 0(reg) form is wanted, Zdisp is -dnl implemented with a table of instructions and encodings. A new entry is -dnl needed for any different operation or registers. 
- -define(Zdisp, -`define(`Zdisp_found',0)dnl -Zdisp_match( movl, %eax, 0,(%edi), `137,71,0', $@)`'dnl -Zdisp_match( movl, %ebx, 0,(%edi), `137,95,0', $@)`'dnl -Zdisp_match( movl, %esi, 0,(%edi), `137,119,0', $@)`'dnl -Zdisp_match( movl, 0,(%ebx), %eax, `139,67,0', $@)`'dnl -Zdisp_match( movl, 0,(%ebx), %esi, `139,115,0', $@)`'dnl -Zdisp_match( movl, 0,(%esi), %eax, `139,70,0', $@)`'dnl -Zdisp_match( movl, 0,(%esi,%ecx,4), %eax, `0x8b,0x44,0x8e,0x00', $@)`'dnl -Zdisp_match( addl, %ebx, 0,(%edi), `1,95,0', $@)`'dnl -Zdisp_match( addl, %ecx, 0,(%edi), `1,79,0', $@)`'dnl -Zdisp_match( addl, %esi, 0,(%edi), `1,119,0', $@)`'dnl -Zdisp_match( subl, %ecx, 0,(%edi), `41,79,0', $@)`'dnl -Zdisp_match( adcl, 0,(%edx), %esi, `19,114,0', $@)`'dnl -Zdisp_match( sbbl, 0,(%edx), %esi, `27,114,0', $@)`'dnl -Zdisp_match( movq, 0,(%eax,%ecx,8), %mm0, `0x0f,0x6f,0x44,0xc8,0x00', $@)`'dnl -Zdisp_match( movq, 0,(%ebx,%eax,4), %mm0, `0x0f,0x6f,0x44,0x83,0x00', $@)`'dnl -Zdisp_match( movq, 0,(%ebx,%eax,4), %mm2, `0x0f,0x6f,0x54,0x83,0x00', $@)`'dnl -Zdisp_match( movq, 0,(%esi), %mm0, `15,111,70,0', $@)`'dnl -Zdisp_match( movq, %mm0, 0,(%edi), `15,127,71,0', $@)`'dnl -Zdisp_match( movq, %mm2, 0,(%ecx,%eax,4), `0x0f,0x7f,0x54,0x81,0x00', $@)`'dnl -Zdisp_match( movq, %mm2, 0,(%edx,%eax,4), `0x0f,0x7f,0x54,0x82,0x00', $@)`'dnl -Zdisp_match( movq, %mm0, 0,(%edx,%ecx,8), `0x0f,0x7f,0x44,0xca,0x00', $@)`'dnl -Zdisp_match( movd, 0,(%eax,%ecx,8), %mm1, `0x0f,0x6e,0x4c,0xc8,0x00', $@)`'dnl -Zdisp_match( movd, 0,(%edx,%ecx,8), %mm0, `0x0f,0x6e,0x44,0xca,0x00', $@)`'dnl -Zdisp_match( movd, %mm0, 0,(%eax,%ecx,4), `0x0f,0x7e,0x44,0x88,0x00', $@)`'dnl -Zdisp_match( movd, %mm0, 0,(%ecx,%eax,4), `0x0f,0x7e,0x44,0x81,0x00', $@)`'dnl -Zdisp_match( movd, %mm2, 0,(%ecx,%eax,4), `0x0f,0x7e,0x54,0x81,0x00', $@)`'dnl -ifelse(Zdisp_found,0, -`m4_error(`unrecognised instruction in Zdisp: $1 $2 $3 $4 -')')') - -define(Zdisp_match, -`ifelse(eval(m4_stringequal_p(`$1',`$6') - && m4_stringequal_p(`$2',0) - && 
m4_stringequal_p(`$3',`$8') - && m4_stringequal_p(`$4',`$9')),1, -`define(`Zdisp_found',1)dnl -ifelse(eval(`$7'),0, -` .byte $5 C `$1 0$3, $4'', -` $6 $7$8, $9')', - -`ifelse(eval(m4_stringequal_p(`$1',`$6') - && m4_stringequal_p(`$2',`$7') - && m4_stringequal_p(`$3',0) - && m4_stringequal_p(`$4',`$9')),1, -`define(`Zdisp_found',1)dnl -ifelse(eval(`$8'),0, -` .byte $5 C `$1 $2, 0$4'', -` $6 $7, $8$9')')')') - - -dnl Usage: shldl(count,src,dst) -dnl shrdl(count,src,dst) -dnl shldw(count,src,dst) -dnl shrdw(count,src,dst) -dnl -dnl Generate a double-shift instruction, possibly omitting a %cl count -dnl parameter if that's what the assembler requires, as indicated by -dnl WANT_SHLDL_CL in config.m4. For example, -dnl -dnl shldl( %cl, %eax, %ebx) -dnl -dnl turns into either -dnl -dnl shldl %cl, %eax, %ebx -dnl or -dnl shldl %eax, %ebx -dnl -dnl Immediate counts are always passed through unchanged. For example, -dnl -dnl shrdl( $2, %esi, %edi) -dnl becomes -dnl shrdl $2, %esi, %edi -dnl -dnl -dnl If you forget to use the macro form "shldl( ...)" and instead write -dnl just a plain "shldl ...", an error results. This ensures the necessary -dnl variant treatment of %cl isn't accidentally bypassed. 
-
-define(define_shd_instruction,
-`define($1,
-m4_instruction_wrapper()
-m4_assert_numargs(3)
-`shd_instruction'(m4_doublequote($`'0),m4_doublequote($`'1),dnl
-m4_doublequote($`'2),m4_doublequote($`'3)))')
-
-dnl Effectively: define(shldl,`shd_instruction(`$0',`$1',`$2',`$3')') etc
-define_shd_instruction(shldl)
-define_shd_instruction(shrdl)
-define_shd_instruction(shldw)
-define_shd_instruction(shrdw)
-
-dnl Called: shd_instruction(op,count,src,dst)
-define(shd_instruction,
-m4_assert_numargs(4)
-m4_assert_defined(`WANT_SHLDL_CL')
-`ifelse(eval(m4_stringequal_p(`$2',`%cl') && !WANT_SHLDL_CL),1,
-``$1' `$3', `$4'',
-``$1' `$2', `$3', `$4'')')
-
-
-dnl Usage: ASSERT(cond, instructions)
-dnl
-dnl If WANT_ASSERT is 1, output the given instructions and expect the given
-dnl flags condition to then be satisfied.  For example,
-dnl
-dnl         ASSERT(ne, `cmpl %eax, %ebx')
-dnl
-dnl The instructions can be omitted to just assert a flags condition with
-dnl no extra calculation.  For example,
-dnl
-dnl         ASSERT(nc)
-dnl
-dnl When `instructions' is not empty, a pushf/popf is added to preserve the
-dnl flags, but the instructions themselves must preserve any registers that
-dnl matter.  FRAME is adjusted for the push and pop, so the instructions
-dnl given can use defframe() stack variables.
-
-define(ASSERT,
-m4_assert_numargs_range(1,2)
-`ifelse(WANT_ASSERT,1,
-	`C ASSERT
-ifelse(`$2',,,`	pushf	ifdef(`FRAME',`FRAME_pushl()')')
-	$2
-	j`$1'	1f
-	ud2	C assertion failed
-1:
-ifelse(`$2',,,`	popf	ifdef(`FRAME',`FRAME_popl()')')
-')')
-
-
-dnl Usage: movl_text_address(label,register)
-dnl
-dnl Get the address of a text segment label, using either a plain movl or a
-dnl position-independent calculation, as necessary.  For example,
-dnl
-dnl         movl_text_address(L(foo),%eax)
-dnl
-dnl This macro is only meant for use in ASSERT()s or when testing, since
-dnl the PIC sequence it generates will want to be done with a ret balancing
-dnl the call on CPUs with return address branch prediction.
-dnl
-dnl The addl generated here has a backward reference to 1b, and so won't
-dnl suffer from the two forwards references bug in old gas (described in
-dnl mpn/x86/README.family).
-
-define(movl_text_address,
-`ifdef(`PIC',
-	`call	1f
-1:	popl	$2	C %eip
-	addl	`$'$1-1b, $2',
-	`movl	`$'$1, $2')')
-
-
-divert`'dnl