//------------------------------------------------------------------ // file: readme.txt // Readme to accompany libmotovec.a //------------------------------------------------------------------ Rev 0.30 release - 5/28/2003 by Chuck Corley This release includes two new files, string_vec.S and checksum_vec.s, which you could paste into the Linux kernel files: /arch/ppc/lib/string.S and /arch/ppc/lib/checksum.S if you wanted to employ AltiVec in the Linux kernel. We used the memcpy_vec and csum_partial_copy_generic_vec functions from these files only in the modified versions of /net/core/skbuf.c and /net/core/iovec.c to give us the networking performance boost in Linux described in the SNDF presentation "Accelerating Networking Data Movement Using the AltiVecĀ® Technology" at www.motorola.com/sndf under Dallas-2003/Host Processors (H1110). Also see the white paper "Enhanced TCP/IP Performance with AltiVec Technology" at e-www.motorola.com/brdata/PDFDB/docs/ALTIVECTCPIPWP.pdf These files contain the following functions string.S contains: string_vec.S contains: memcpy memcpy_vec bcopy bcopy_vec memmove memmove_vec backwards_memcpy backwards_memcpy_vec memset memset_vec memcmp memcmp_vec memchr (coming soon) cacheable_memcpy cacheable_memcpy_vec cacheable_memzero cacheable_memzero_vec strcpy strcpy_vec strncpy (coming soon) strcat (coming soon) strcmp strcmp_vec strlen strlen_vec __copy_tofrom_user* __copy_tofrom_user_vec* __clear_user* __clear_user_vec* __strncpy_from_user* (coming soon) __strnlen_user* (coming soon) checksum.S contains: checksum_vec.S contains: csum_partial csum_partial_vec csum_partial_copy_generic* csum_partial_copy_generic_vec ip_fast_csum (unlikely to benefit) csum_tcpudp_magic (unlikely to benefit) *these functions have ex_table entries for handling memory access exceptions in the kernel. The AltiVec versions were functionally tested by hand. csum_partial_copy_generic_vec and csum_partial_vec previously assembled into libmotovec.a have been removed since they are in the file above. We are finding that selective use of the *_vec functions in the OS kernel is much "safer" than wholescale replacement of the libc library. libmotovec.a returns to being exclusively a performance-enhancing library of libc functions that can be safely linked with user application code to test the performance of AltiVec. My presentation for SDNF-Europe includes performance comparisons of the scalar versus vector versions of the above functions. It should be available on the SNDF website soon. It also includes an updated explanation of memcpy without the potential incoherency problem discussed below. So this release contains in libmotovec.a: memcpy.o from vec_memcpy.S Rev 0.30 dated 4/02/2003 bcopy.o from vec_memcpy.S Rev 0.30 dated 4/02/2003 memmove.o from vec_memcpy.S Rev 0.30 dated 4/02/2003 memset.o from vec_memset.S Rev 0.10 dated 5/01/2003 bzero.o from vec_memset.S Rev 0.10 dated 5/01/2003 strcmp.o from vec_strcmp.S Rev 0.00 dated 3/03/2002 strlen.o from vec_strlen.S Rev 0.00 dated 12/26/2002 And in string.s: memcpy_vec derived from vec_memcpy.S Rev 0.30 dated 4/02/2003 bcopy_vec derived from vec_memcpy.S Rev 0.30 memmove_vec derived from vec_memcpy.S Rev 0.30 backwards_memcpy_vec derived from vec_memcpy.S Rev 0.30 memset_vec derived from vec_memset.S Rev 0.10 dated 5/01/2003 memcmp_vec derived from vec_memcmp.S Rev 0.00 memchr (coming soon) cacheable_memcpy_vec derived from vec_memcpy.S Rev 0.30 cacheable_memzero_vec derived from vec_memset.S Rev 0.10 strcpy_vec derived from vec_strcpy.S Rev 0.10 strncpy_vec (coming soon) strcat_vec (coming soon) strcmp_vec derived from vec_strcmp.S Rev 0.00 (not released) strlen_vec derived from vec_strlen.S Rev 0.00 (not released) __copy_tofrom_user_vec* derived from vec_memcpy.S Rev 0.30 __clear_user_vec* derived from vec_memcpy.S Rev 0.30 __strncpy_from_user_vec* (coming soon) __strnlen_user_vec* (coming soon) *with ex_table and exception code And in checksum.s: csum_partial_vec derived from vec_csum.S Rev 0.0 dated 4/19/03 csum_partial_copy_generic_vec from vec_csum.S Rev 0.0 string_vec.S and checksum_vec.S are only known to assemble with gcc 2.95 and gcc 3.3+. Should work with other gcc compilers but may need editing to be compatible with non-gcc compilers. Rev 0.20 release - 5/12/2003 by Chuck Corley Thanks to all of you who attended SNDF. My presentation "Implementing and Using the Motorola AltiVec Libraries" is available for downloading at www.motorola.com/sndf under Dallas-2003/Host Processors (H1109). During the presentation DS from Lucent pointed out that the way I was bringing the beginning and ending destination Quad Words (vectors) into the registers for merging with the permuted source made the "uninvolved" destination bytes vulnerable to potential incoherency if some interrupting process changed those bytes while I was holding them in a register. While the possibility seemed small, I have rewritten the code to avoid this potential problem. The result actually is slightly faster than the original for small buffers. So this release contains: memcpy.o from vec_memcpy.S Rev 0.30 dated 4/02/2003 bcopy.o from vec_memcpy.S Rev 0.30 dated 4/02/2003 memmove.o from vec_memcpy.S Rev 0.30 dated 4/02/2003 memset.o from vec_memset.S Rev 0.10 dated 5/01/2003 bzero.o from vec_memset.S Rev 0.10 dated 5/01/2003 csum_partial_copy_generic_vec from vec_csum.S Rev 0.0 dated 4/19/03 csum_partial_vec from vec_csum.S Rev 0.0 dated 4/19/03 The latter two additions were assembled into libmotovec.a despite the fact they are not standard libc functions. Rather they are the Altivec enabled equivalents of functions by the same name from the linux source tree (Linux 2.4.17). While we are pursuing how to get these functions incorporated into Linux, here they are assembled and in source form if you are building your own version of linux. The use of an earlier version of csum_partial_copy_generic_vec and memcpy_vec is documented to speed up TCP/IP and UDP transfers in Jacob Pan's SNDF presentation "Accelerating Networking Data Movement Using AltiVec Technology" (H1110) available at the website above. csum_partial does not appear to be called with large enough buffer sizes in linux to warrant using the vectorized version. I am also releasing the source for memset and bzero in this release. strcpy, strlen, strncpy, strcmp, memcmp, strcat, and memchr are still on my list to do - soon. Rev 0.10 release - 3/13/2003 by Chuck Corley The presence of dcbz in the 32 byte loop of memcpy (or memmove) causes an alignment exception to non-cacheable memory (MPC7410 User's Manual p. 4-20 and MPC7450 User's Manual p. 4-25) so it was removed in this release. dcbz instructions were not present in memset in any of these releases. That fixed the alignment problem but hurt the performance some; then it was "rediscovered" that dcba would have been a better choice anyway as it does not cause an exception; it would just be noop'ed. So this release substitutes dcba for dcbz. This release contains improvements in memcpy that should be documented in an application note which is still not finished but are being pretty nicely documented for SNDF presentation H1109. The memcpy was further loop unrolled to provide a 128B loop for large buffers (>256 bytes) and the data stream touch instruction was added. It may still be possible to improve the tuning of the dst instruction, particularly in memmove, but this release is worthy of reving the number to the next significant revision. I've developed a new metric which will be explained at SNDF in Dallas, TX, March 23-26, 2003. As the number of bytes in a buffer gets larger, the memcpy routine settles into repetitions of the inner loop. 32 bytes were moved in the inner loop of Rev 0.0x and 128 bytes are moved in the inner loop of Rev 0.10. And the number of processor clocks per inner loop can be shown to approach the minimum possible. Therefore the new metric measures the incremental transfer rate for the inner loop after a reasonable number (>512) of bytes have been moved. This will not be the bytes transferred per second because there were some less efficient transfers at start-up but this is the transfer rate that the routine is asymptotically approaching as the buffer gets big (regularly testing to 1460 bytes). Here is that metric for several cases: Case 1: For gcc's lib c memcpy when buffers are not word aligned Case 2: For gcc's lib c memcpy when buffers are word aligned Case 3: For Rev 0.01 of memcpy with Altivec irrespective of alignment Case 4: For Rev 0.10 of memcpy with Altivec irrespective of alignment Numbers are provided for the cold DCache and warm DCache. Code is assumed to always be resident in the ICache as would be expected here where the inner loop has run multiple times. COLD DCACHE WARM DCACHE FOR THE MPC7410@400/100 Insts Clks MB/Sec Insts Clks MB/Sec Case 1: gcc_NWA (1 byte/loop) 6 6 71 6 3 133 Case 2: gcc_WA (16 B/loop) 12 62 103 12 8 800 Case 3: vec_memcpy Rev 0.01 12 60 213 12 7 1961 Case 4: vec_memcpy Rev 0.10 46 125 410 46 41 1250 COLD DCACHE WARM DCACHE FOR THE MPC7445@1GHz/133 Insts Clks MB/Sec Insts Clks MB/Sec Case 1: gcc_NWA 6 8 122 6 3 350 Case 2: gcc_WA 12 104 153 12 12 1333 Case 3: vec_memcpy Rev 0.01 12 110 292 12 7 4413 Case 4: vec_memcpy Rev 0.10 46 247 518 46 35 3666 Perhaps you notice that we are trading off Warm DCache performance to improve the Cold DCache case. There are other interesting tradeoffs in going from 32 byte inner loop to 128 bytes. And in using the dcba instruction - or not. In other words, the numbers for vec_memcpy above are not the highest possible in the Warm DCache case but they look like a good compromise which most benefits the Cold DCache case. More at SNDF (or eventually in the app note) ... I am releasing the source code to vec_memcpy.S with this release so if if you don't like the tradeoff above you can make your own selection. It successfully assembles for me with Codewarrior, Diab, Green Hills, gcc, and Metaware. It is nicely commented but could use more documentation. I will specifically be explaining it in SNDF presentation H1109. ************************************************************************* Rev 0.01 release - 2/17/2003 by Chuck Corley Fixed a problem at Last_ld_fwd: that caused a load beyond a page boundary and resulting segment fault in Linux. Last source load of SRC+BK in vec_memcpy could be > SRC+BC-1. Also found and fixed an error where the Quick and Dirty (QND) code that was in there for dst wasn't completely commented out. Plan to enable dst soon. Probably loop unroll to 128 bytes first though. ********************************************************************** Initial Release - 2/10/2003 by Chuck Corley Contains the libc functions: memcpy.o from vec_memcpy.S Rev 0.0 dated 2/09/2003 bcopy.o from vec_memcpy.S Rev 0.0 dated 2/09/2003 memmove.o from vec_memcpy.S Rev 0.0 dated 2/09/2003 memset.o from vec_memset.S Rev 0.0 dated 2/09/2003 bzero.o from vec_memset.S Rev 0.0 dated 2/09/2003 These functions are implemented in AltiVec but are still not as fast as we know how to make them. Watch this site for frequent revisions over the next several months. We are in the process of creating application notes to explain the source code and the performance associated with these library functions; watch this site for those application notes to be added. A logical deadline for completion of this work is the Smart Network Developers Forum in Dallas, TX, March 23-26, 2003, where we will be discussing this library, its performance, and application. We will also be adding the following libc functions in the very near future: strcpy strcmp strlen memcmp memchr strncpy We also have preliminary work completed on the following functions found in Linux and have to figure out how to distribute them: csum_partial csum_partial_generic __copy_tofrom_user page_copy We believe that these libraries will improve performance on Motorola G4 processors for applications that make heavy use of the included functions. On non-G4 microprocessors they will cause illegal operation exceptions because those processors do not support AltiVec. To use this library, you must: 1. Include it on the linker command line prior to the compiler's libc library. Examples: For gcc: powerpc-eabisim-ld -T../../spprt/gcc_dink.script -Qy -dn -Bstatic ../../spprt/gcc_obj/gcc_crt0.o ../../spprt/gcc_obj/dtime.o ../../spprt/gcc_obj/cache.o ../../spprt/gcc_obj/Support.o ../../spprt/gcc_obj/dinkusr.o ../../spprt/gcc_obj/perfmon.o gcc_obj/test_memmove.o c:\BMS\vec_lib\libmotovec\libmotovec.a c:/cygwin/Altivec/powerpc-eabisim\lib\libm.a --start-group -lsim -lc --end-group -o gccBM.elf For Diab: dld ../../spprt/diab_dink.dld ../../spprt/diab_obj/diab_crt0.o ../../spprt/diab_obj/dtime.o ../../spprt/diab_obj/cache.o ../../spprt/diab_obj/Support.o ../../spprt/diab_obj/dinkusr.o ../../spprt/diab_obj/perfmon.o diab_obj/test_memmove.o c:\BMS\vec_lib\libmotovec\libmotovec.a -Y P,c:/diab/5.0.3/PPCEH:c:/diab/5.0.3/PPCE/simple:c:/diab/5.0.3/PPCE:c:/diab/5.0.3/PPCEN -lc -lm -o diabBM.elf For Green Hills: elxr -T../../spprt/ghs_dink.lnk ../../spprt/ghs_obj/ghs_crt0.o ../../spprt/ghs_obj/dtime.o ../../spprt/ghs_obj/cache.o ../../spprt/ghs_obj/Support.o ../../spprt/ghs_obj/dinkusr.o ../../spprt/ghs_obj/perfmon.o ghs_obj/test_memmove.o c:\BMS\vec_lib\libmotovec\libmotovec.a -Lc:\GHS\ppc36\ppc -lansi -lsys -larch -lind -o ghsBM.elf For CodeWarrior: mwldeppc -lcf ../../spprt/cw_dink.lcf -nostdlib -fp fmadd -proc 7450 ../../spprt/cw_obj/cw_crt0.o ../../spprt/cw_obj/dtime.o ../../spprt/cw_obj/cache.o ../../spprt/cw_obj/Support.o ../../spprt/cw_obj/dinkusr.o ../../spprt/cw_obj/perfmon.o cw_obj/test_memmove.o c:\BMS\vec_lib\libmotovec\libmotovec.a -Lc:/"Program Files"/Metrowerks/CodeWarrior/PowerPC_EABI_Support/Runtime/Lib/ -lRuntime.PPCEABI.H.a -Lc:/"Program Files"/Metrowerks/CodeWarrior/PowerPC_EABI_Support/Msl/MSL_C/Ppc_eabi/Lib/ -lMSL_C.PPCEABI.bare.H.a -o cwBM.elf For Metaware: ldppc ../../spprt/mw_link.txt -Bnoheader -Bhardalign -dn -q -Qn ../../spprt/mw_obj/mw_crt0.o ../../spprt/mw_obj/dtime.o ../../spprt/mw_obj/cache.o ../../spprt/mw_obj/Support.o ../../spprt/mw_obj/dinkusr.o ../../spprt/mw_obj/perfmon.o mw_obj/test_memmove.o c:\BMS\vec_lib\libmotovec\libmotovec.a -Y P,c:/hcppc/lib/be/fp -lct -lmwt -o mwBM.elf 2. Enable AltiVec in the Machine State Processor (MSR) register of the target machine. Example: AltiVec_enable: mfmsr r4 // Get current MSR contents oris r4,r4,0x0200 // Set the AltiVec enable bit MSR[6] mtmsr r4 // Write to MSR isync // Context synchronizing instr after mtmsr 3. If the AltiVec vector register set is used in more than one context, the AltiVec registers must be saved and restored on context switches. The AltiVec EABI extensions define a register (SPR 256 - the VRSAVE register) which can be used to reduce the number of vector registers which have to be saved to only those in use. This library is currently compiled without that VRSAVE feature enabled, so all 32 vector registers will have to be saved and restored. We are currently thinking that this is a more efficient practice anyway and note that Linux and several RTOSes are taking that approach in saving and restoring the vector registers. We have observed very little performance difference in Linux for saving all of the AltiVec registers on a context switch versus saving only 8. And saving all of the registers is a less than 1% total impact on performance. 4. There is one worrisome problem with this library when run on the MPC745X microprocessors in the 60x bus mode. The MPC7450 Family User's Manual (Section 7.3) states that "The 60x bus protocol does not support a 16-byte bus transaction. Therefore, cache-inhibited AltiVec loads, stores, and write-through stores take an alignment exception. This requires a re-write of the alignment exception routines in software that supports AltiVec quad word access in 60x bus mode on the MPC745X." This says that if the user is attempting to use these routines in a cache-inhibited area of memory on a MPC745X in 60x bus mode, it will require special alignment exception handling software. We are currently implementing that software for the Linux OS. Alternatively, the user can restrict this library's use to areas of memory known to be cacheable. This library was built using gcc, but as shown in the examples of step 1 above, links and executes with Diab5.0, Green Hills 3.6, Codewarrior EPPC 6.1, and Metaware 4.5. The gcc archiver was used to create it in the following command lines: powerpc-eabisim-gcc -c -s -fvec -mcpu=750 -mregnames -I. -I./source -I../../spprt -Ic:/cygwin/Altivec\powerpc-eabisim\include -Ic:/cygwin/Altivec\lib\gcc-lib\powerpc-eabisim\gcc-2.95.2\include -o gcc_obj/vec_memcpy.o -D__GNUC__ -DLIBMOTOVEC ../vec_memcpy/Source/vec_memcpy.S -o gcc_obj/vec_memcpy.o powerpc-eabisim-gcc -c -s -fvec -mcpu=750 -mregnames -I. -I./source -I../../spprt -Ic:/cygwin/Altivec\powerpc-eabisim\include -Ic:/cygwin/Altivec\lib\gcc-lib\powerpc-eabisim\gcc-2.95.2\include -o gcc_obj/vec_memset.o -D__GNUC__ -DLIBMOTOVEC ../vec_memset/source/vec_memset.S -o gcc_obj/vec_memset.o powerpc-eabisim-ar -ru libmotovec.a gcc_obj/vec_memcpy.o gcc_obj/vec_memset.o Email questions or suggestions to risc10@email.sps.mot.com