Copyright 2000, 2001, 2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 2.1 of the License, or (at your
option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU Lesser General Public
License for more details.

You should have received a copy of the GNU Lesser General Public License
along with the GNU MP Library; see the file COPYING.LIB.  If not, write to
the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.


The code in this directory works for Cray vector systems such as C90,
J90, T90 (both the CFP variant and the IEEE variant) and SV1.  (For
the T3E and T3D systems, see the `alpha' subdirectory at the same
level as the directory containing this file.)

The cfp subdirectory is for systems utilizing the traditional Cray
floating-point format, and the ieee subdirectory is for the newer
systems that use the IEEE floating-point format.

There are several issues that reduce speed on Cray systems.  For
systems with cfp floating point, the main obstacle is the forming of
128-bit products.  For IEEE systems, adding, and in particular
computing carry, is the main issue.  There are no vectorizing
unsigned-less-than instructions, and the sequence that implements that
operation is very long.

Shifting is the only operation that is simple to make fast.  All Cray
systems have bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
should be really useful.

For best speed on cfp systems, we need a mul_basecase, since that
reduces the need for carry propagation to a minimum.
Depending on the size (vn) of the smaller of the two operands (V), we
should split U and V in different chunk sizes:

U split in 2 32-bit parts
V split according to the table:
parts                   4       5       6       7       8
bits/part               16      13      11      10      8
max allowed vn          1       8       32      64      256
number of multiplies    8       10      12      14      16
peak cycles/limb        4       5       6       7       8

U split in 3 22-bit parts
V split according to the table:
parts                   3       4       5
bits/part               22      16      13
max allowed vn          16      1024    8192
number of multiplies    9       12      15
peak cycles/limb        4.5     6       7.5

U split in 4 16-bit parts
V split according to the table:
parts                   4
bits/part               16
max allowed vn          65536
number of multiplies    16
peak cycles/limb        8

(A T90 CPU can accumulate two products per cycle.)

IDEA:
* Rewrite mpn_add_n:
    short cy[n + 1];
    #pragma _CRI ivdep
      for (i = 0; i < n; i++)
        { s = up[i] + vp[i];
          rp[i] = s;
          cy[i + 1] = s < up[i]; }
      more_carries = 0;
    #pragma _CRI ivdep
      for (i = 1; i < n; i++)
        { s = rp[i] + cy[i];
          rp[i] = s;
          more_carries += s < cy[i]; }
      cys = 0;
      if (more_carries)
        {
          cys = rp[1] < cy[1];
          for (i = 2; i < n; i++)
            { /* carry out of limb i from the second loop; without it,
                 a wrap at a limb other than limb 1 would be lost */
              cy2 = rp[i] < cy[i];
              rp[i] += cys;
              cys = cy2 | (rp[i] < cys); }
        }
      return cys + cy[n];

* Write mpn_add3_n for adding three operands.  First add operands 1
  and 2, and generate cy[].  Then add operand 3 to the partial result,
  and accumulate carry into cy[].  Finally propagate carry just like
  in the new mpn_add_n.

IDEA:

Store fewer bits, perhaps 62, per limb.  That brings mpn_add_n time
down to 2.5 cycles/limb and mpn_addmul_1 times to 4 cycles/limb.  By
storing even fewer bits per limb, perhaps 56, it would be possible to
write a mul_basecase that would run at effectively 1 cycle/limb.
(Use VM here to better handle the rhomb-shaped multiply area, perhaps
rounding operand sizes up to the next power of 2.)
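
The three-pass mpn_add_n rewrite from the first IDEA can be tried out
in portable C.  What follows is only a sketch under stated assumptions,
not Cray code: limb_t, ref_add_n and sketch_add_n are hypothetical
names, the carry array is heap-allocated instead of being a
variable-length array, and the _CRI ivdep pragmas are left out.
ref_add_n is an ordinary ripple-carry addition to compare against.
The final loop also folds in the carries produced by the second pass,
which would otherwise be lost when a limb other than limb 1 wraps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t limb_t;   /* hypothetical stand-in for mp_limb_t */

/* Reference: ripple-carry rp = up + vp over n limbs; returns carry out. */
limb_t ref_add_n(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t s = up[i] + vp[i];
        limb_t c1 = s < up[i];      /* carry from up[i] + vp[i] */
        limb_t t = s + cy;
        limb_t c2 = t < s;          /* carry from adding the carry-in */
        rp[i] = t;
        cy = c1 | c2;               /* at most one of the two can be set */
    }
    return cy;
}

/* Three-pass variant following the IDEA; n must be at least 1. */
limb_t sketch_add_n(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    assert(n >= 1);
    unsigned char *cy = calloc(n + 1, 1);   /* cy[i] = carry into limb i */
    assert(cy != NULL);
    limb_t more_carries = 0, cys = 0, result;

    /* Pass 1: independent limb sums; record the carry out of each limb.
       The carry is computed before the store so rp may alias up. */
    for (size_t i = 0; i < n; i++) {
        limb_t s = up[i] + vp[i];
        cy[i + 1] = s < up[i];
        rp[i] = s;
    }
    /* Pass 2: add the recorded carries; count limbs that wrap again. */
    for (size_t i = 1; i < n; i++) {
        limb_t s = rp[i] + cy[i];
        rp[i] = s;
        more_carries += s < cy[i];
    }
    /* Pass 3 (rare): ripple the second-order carries serially. */
    if (more_carries) {
        cys = rp[1] < cy[1];
        for (size_t i = 2; i < n; i++) {
            limb_t c2 = rp[i] < cy[i];   /* limb i wrapped in pass 2 */
            rp[i] += cys;
            cys = c2 | (rp[i] < cys);
        }
    }
    result = cys + cy[n];
    free(cy);
    return result;
}
```

Calling both functions on the same operands and comparing the result
limbs and the returned carry is a cheap way to check any further
change to the sketch.  Operands such as {0, MAX, MAX, 0} + {0, 1, 0, 0}
exercise the rare third pass.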
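
The idea of storing fewer bits per limb can also be illustrated in
portable C.  Assuming 62 payload bits in each 64-bit word (LIMB_BITS,
LIMB_MASK and add_n_62 are hypothetical names), the sum of two limbs
plus a carry-in never wraps the word, so the carry is obtained with a
shift instead of an unsigned comparison.  This sketch only shows the
representation; it is scalar code and makes no claim about reaching
the cycle counts quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIMB_BITS 62   /* assumption: payload bits per 64-bit word */
#define LIMB_MASK ((UINT64_C(1) << LIMB_BITS) - 1)

/* rp = up + vp over n limbs that each hold LIMB_BITS bits.
   Returns the carry out of the top limb (0 or 1). */
uint64_t add_n_62(uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                  size_t n)
{
    assert(n >= 1);
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* At most 2*(2^62 - 1) + 1 = 2^63 - 1, so no 64-bit wrap. */
        uint64_t s = up[i] + vp[i] + cy;
        rp[i] = s & LIMB_MASK;     /* keep the low 62 bits */
        cy = s >> LIMB_BITS;       /* the spare bits hold the carry */
    }
    return cy;
}
```

The inputs are expected to be already reduced, that is, each limb must
be no larger than LIMB_MASK.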