[Scilab-Dev] BLAS use in Scilab

Thu Feb 15 17:11:05 CET 2018

Hello Stéphane,

Interesting ...

In release, we don't ship the header of BLAS/LAPACK functions.
But you can define them in your C file as extern. ( and let the linker 
do his job )

extern int C2F(dgemm) (char *_pstTransA, char *_pstTransB, int *_piN, 
int *_piM, int *_piK, double *_pdblAlpha, double *_pdblA, int *_piLdA,
                        double *_pdblB, int *_piLdB, double *_pdblBeta, 
double *_pdblC, int *_piLdC);
and

extern int C2F(dscal) (int *_iSize, double *_pdblVal, double *_pdblDest, 
int *_iInc);

Others BLAS/LAPACK prototypes can be found at 
http://cgit.scilab.org/scilab/tree/scilab/modules/elementary_functions/includes/elem_common.h?h=6.0

Regards,
Antoine
Le 15/02/2018 à 16:50, Stéphane Mottelet a écrit :
> Hello all,
>
> Following the recent discussion with fujimoto, I discovered that 
> Scilab does not (seem to) fully use SIMD operation in  BLAS as it 
> should. Besides the bottlenecks of its code, there are also many 
> operations of the kind
>
> scalar*matrix
>
> Althoug this operation is correctly delegated to the DSCAL BLAS 
> function (can be seen in C function iMultiRealMatrixByRealMatrix in 
> modules/ast/src/c/operations/matrix_multiplication.c) :
>
>> int iMultiRealScalarByRealMatrix(
>>     double _dblReal1,
>>     double *_pdblReal2,    int _iRows2, int _iCols2,
>>     double *_pdblRealOut)
>> {
>>     int iOne    = 1;
>>     int iSize2    = _iRows2 * _iCols2;
>>
>>     C2F(dcopy)(&iSize2, _pdblReal2, &iOne, _pdblRealOut, &iOne);
>>     C2F(dscal)(&iSize2, &_dblReal1, _pdblRealOut, &iOne);
>>     return 0;
>> }
> in the code below the product "A*1" is likely using only one processor 
> core, as seen on the cpu usage graph and on the elapsed time,
>
> A=rand(20000,20000);
> tic; for i=1:10; A*1; end; toc
>
>  ans  =
>
>    25.596843
>
> but this second piece of code is more than 8 times faster and uses 
> 100% of the cpu,
>
> ONE=ones(20000,1);
> tic; for i=1:10; A*ONE; end; toc
>
>  ans  =
>
>    2.938314
>
> with roughly the same number of multiplications. This second 
> computation is delegated to DGEMM (C<-alpha*A*B + beta*C, here with 
> alpha=1 and beta=0)
>
>> int iMultiRealMatrixByRealMatrix(
>>     double *_pdblReal1,    int _iRows1, int _iCols1,
>>     double *_pdblReal2,    int _iRows2, int _iCols2,
>>     double *_pdblRealOut)
>> {
>>     double dblOne        = 1;
>>     double dblZero        = 0;
>>
>>     C2F(dgemm)("n", "n", &_iRows1, &_iCols2, &_iCols1, &dblOne,
>>                _pdblReal1 , &_iRows1 ,
>>                _pdblReal2, &_iRows2, &dblZero,
>>                _pdblRealOut , &_iRows1);
>>     return 0;
>> }
> Maybe my intuition is wrong, but I have the feeling that using dgemm 
> with alpha=0 will be faster than dscal. I plan to test this by making 
> a quick and dirty code linked to Scilab so my question to devs is : 
> which are the #includes to add on top of the source (C) to be able to 
> call dgemm and dscal ?
>
> Thanks for your help
>
> S.
>
>