[Scilab-Dev] BLAS use in Scilab

Stéphane Mottelet stephane.mottelet at utc.fr
Thu Feb 22 14:18:49 CET 2018


Really, nobody knows?

S.

Le 20/02/2018 à 11:57, Stéphane Mottelet a écrit :
> Hello,
>
> Continuing on this subject: I discovered that the new Scilab 
> API allows modifying input parameters of a function (in-place 
> assignment). For example, I have modified the previous daxpy so that 
> the expression
>
> daxpy(2,X,Y)
>
> has no output but formally does "Y += 2*X", if such an operator 
> existed in Scilab. In this case there is no matrix copy at all, hence no 
> memory overhead.
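For readers following along, the semantics of daxpy are easy to state in plain C. This is only a stand-in sketch (the actual BLAS routine is Fortran and typically vectorized), but it shows why the in-place update needs no temporary:

```c
#include <stddef.h>

/* Plain-C stand-in for BLAS daxpy: y <- y + a*x, updating y in place.
   No temporary copy of either operand is made. */
static void daxpy_sketch(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Called as daxpy_sketch(n, 2.0, X, Y), this performs the "Y += 2*X" described above with no extra allocation.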
>
> Was it possible to do this with the previous API?
>
> S.
>
> Le 19/02/2018 à 19:15, Stéphane Mottelet a écrit :
>> Hello,
>>
>> After some tests, for the intended use (multiplying a matrix by a 
>> scalar), dgemm is not faster than dscal, but in the C code of 
>> "iMultiRealScalarByRealMatrix", the part that takes most of the 
>> CPU time is the call to "dcopy". For example, on my machine, for a 
>> 10000x10000 matrix, the call to dcopy takes 540 milliseconds and the 
>> call to dscal 193 milliseconds. Continuing my explorations today, I 
>> tried to see how Scilab expressions such as
>>
>> Y+2*X
>>
>> are parsed and executed. For this purpose I have written an interface 
>> (daxpy.sci and daxpy.c, attached) to the BLAS function "daxpy", which 
>> does "y <- y + a*x", and a script comparing the above expression to
>>
>> daxpy(2,X,Y)
>>
>> for two 10000x10000 matrices. Here are the results (MacBook Air, Core 
>> i7 at 1.7 GHz):
>>
>>  daxpy(2,X,Y)
>>  (dcopy: 582 ms)
>>  (daxpy: 211 ms)
>>
>>  elapsed time: 793 ms
>>
>>  Y+2*X
>>
>>  elapsed time: 1574 ms
>>
>> Considering the above values, the explanation is that in "Y+2*X" there 
>> are *two* copies of a 10000x10000 matrix instead of only one in 
>> "daxpy(2,X,Y)". In "Y+2*X+3*Z" there would be three copies, although 
>> there could be only one if daxpy were used twice.
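To make the copy counting concrete, here is a hedged C sketch (with a plain loop standing in for BLAS daxpy) of how "Y+2*X+3*Z" could be evaluated with a single copy and two in-place updates:

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for BLAS daxpy: y <- y + a*x in place. */
static void axpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* out <- Y + 2*X + 3*Z with one copy and two in-place updates,
   instead of the three temporaries that evaluating "Y+2*X+3*Z"
   term by term would allocate. */
static void fused_sum(size_t n, const double *Y, const double *X,
                      const double *Z, double *out)
{
    memcpy(out, Y, n * sizeof *out); /* the single copy */
    axpy(n, 2.0, X, out);            /* out += 2*X */
    axpy(n, 3.0, Z, out);            /* out += 3*Z */
}
```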
>>
>> I am not blaming Scilab here, I am just blaming "vectorization", 
>> which can be inefficient when large objects are involved. That's why 
>> explicit loops can sometimes be faster than vectorized operations in 
>> Matlab or Julia (which both use JIT compilation).
>>
>> S.
>>
>>
>> Le 15/02/2018 à 17:11, Antoine ELIAS a écrit :
>>> Hello Stéphane,
>>>
>>> Interesting ...
>>>
>>> In release, we don't ship the header of BLAS/LAPACK functions.
>>> But you can define them in your C file as extern (and let the 
>>> linker do its job):
>>>
>>> extern int C2F(dgemm) (char *_pstTransA, char *_pstTransB, int 
>>> *_piN, int *_piM, int *_piK, double *_pdblAlpha, double *_pdblA, int 
>>> *_piLdA,
>>>                        double *_pdblB, int *_piLdB, double 
>>> *_pdblBeta, double *_pdblC, int *_piLdC);
>>> and
>>>
>>> extern int C2F(dscal) (int *_iSize, double *_pdblVal, double 
>>> *_pdblDest, int *_iInc);
>>>
>>> Other BLAS/LAPACK prototypes can be found at 
>>> http://cgit.scilab.org/scilab/tree/scilab/modules/elementary_functions/includes/elem_common.h?h=6.0
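As a side note for readers: C2F typically maps a name like dscal to its Fortran-mangled symbol (dscal_ on most platforms), and Fortran passes every argument by reference, which is why the prototypes above take pointers even for scalar arguments. A minimal sketch of that calling convention, using a hand-written stand-in so no BLAS library is needed to build it:

```c
/* Hand-written stand-in for the Fortran BLAS symbol dscal_:
   scales n elements of x (stride incx) by alpha, in place.
   As with real Fortran linkage, every argument arrives by pointer. */
static int dscal_(int *n, double *alpha, double *x, int *incx)
{
    for (int i = 0; i < *n; i++)
        x[i * *incx] *= *alpha;
    return 0;
}
```

Usage mirrors the Scilab sources: declare int n, one = 1; double a; and call dscal_(&n, &a, x, &one).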
>>>
>>> Regards,
>>> Antoine
>>> Le 15/02/2018 à 16:50, Stéphane Mottelet a écrit :
>>>> Hello all,
>>>>
>>>> Following the recent discussion with fujimoto, I discovered that 
>>>> Scilab does not seem to fully use SIMD operations in BLAS as it 
>>>> should. Besides the bottlenecks in its own code, there are also many 
>>>> operations of the kind
>>>>
>>>> scalar*matrix
>>>>
>>>> This operation is correctly delegated to the DSCAL BLAS 
>>>> function, as can be seen in the C function iMultiRealScalarByRealMatrix 
>>>> in modules/ast/src/c/operations/matrix_multiplication.c:
>>>>
>>>>> int iMultiRealScalarByRealMatrix(
>>>>>     double _dblReal1,
>>>>>     double *_pdblReal2,    int _iRows2, int _iCols2,
>>>>>     double *_pdblRealOut)
>>>>> {
>>>>>     int iOne    = 1;
>>>>>     int iSize2    = _iRows2 * _iCols2;
>>>>>
>>>>>     C2F(dcopy)(&iSize2, _pdblReal2, &iOne, _pdblRealOut, &iOne);
>>>>>     C2F(dscal)(&iSize2, &_dblReal1, _pdblRealOut, &iOne);
>>>>>     return 0;
>>>>> }
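Note that the dcopy/dscal pair above makes two full passes over the output matrix. A fused single-pass loop (just a sketch of the alternative, not what Scilab does) would write the scaled result directly and roughly halve the memory traffic:

```c
#include <stddef.h>

/* Single-pass alternative to dcopy followed by dscal:
   out <- a * in, touching each element once instead of twice. */
static void scale_copy(size_t n, double a, const double *in, double *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a * in[i];
}
```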
>>>> However, in the code below, the product "A*1" likely uses only one 
>>>> processor core, as seen in the CPU usage graph and the elapsed 
>>>> time:
>>>>
>>>> A=rand(20000,20000);
>>>> tic; for i=1:10; A*1; end; toc
>>>>
>>>>  ans  =
>>>>
>>>>    25.596843
>>>>
>>>> but this second piece of code is more than 8 times faster and uses 
>>>> 100% of the CPU,
>>>>
>>>> ONE=ones(20000,1);
>>>> tic; for i=1:10; A*ONE; end; toc
>>>>
>>>>  ans  =
>>>>
>>>>    2.938314
>>>>
>>>> with roughly the same number of multiplications. This second 
>>>> computation is delegated to DGEMM (C <- alpha*A*B + beta*C, here with 
>>>> alpha=1 and beta=0):
>>>>
>>>>> int iMultiRealMatrixByRealMatrix(
>>>>>     double *_pdblReal1,    int _iRows1, int _iCols1,
>>>>>     double *_pdblReal2,    int _iRows2, int _iCols2,
>>>>>     double *_pdblRealOut)
>>>>> {
>>>>>     double dblOne        = 1;
>>>>>     double dblZero        = 0;
>>>>>
>>>>>     C2F(dgemm)("n", "n", &_iRows1, &_iCols2, &_iCols1, &dblOne,
>>>>>                _pdblReal1 , &_iRows1 ,
>>>>>                _pdblReal2, &_iRows2, &dblZero,
>>>>>                _pdblRealOut , &_iRows1);
>>>>>     return 0;
>>>>> }
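For reference, here is a naive sketch of what that dgemm call computes in column-major storage (real dgemm is blocked, threaded and vectorized; this only shows the semantics). Applied to the A*ONE example above, it reduces to summing each row of A:

```c
/* Naive column-major sketch of dgemm("n", "n", ...):
   C (m x n) <- alpha * A (m x k) * B (k x n) + beta * C,
   with leading dimensions lda, ldb, ldc. */
static void gemm_sketch(int m, int n, int k, double alpha,
                        const double *A, int lda,
                        const double *B, int ldb,
                        double beta, double *C, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double s = 0.0;
            for (int p = 0; p < k; p++)
                s += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = alpha * s + beta * C[i + j * ldc];
        }
}
```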
>>>> Maybe my intuition is wrong, but I have the feeling that using 
>>>> dgemm with alpha=0 will be faster than dscal. I plan to test this 
>>>> by writing some quick and dirty code linked to Scilab, so my question 
>>>> to the devs is: which #includes should be added at the top of the (C) 
>>>> source to be able to call dgemm and dscal?
>>>>
>>>> Thanks for your help
>>>>
>>>> S.
>>>>
>>>>
>>>
>>> _______________________________________________
>>> dev mailing list
>>> dev at lists.scilab.org
>>> http://lists.scilab.org/mailman/listinfo/dev
>>
>>
>> -- 
>> Stéphane Mottelet
>> Ingénieur de recherche
>> EA 4297 Transformations Intégrées de la Matière Renouvelable
>> Département Génie des Procédés Industriels
>> Sorbonne Universités - Université de Technologie de Compiègne
>> CS 60319, 60203 Compiègne cedex
>> Tel : +33(0)344234688
>> http://www.utc.fr/~mottelet
>>
>>
>
>


