[Scilab-Dev] BLAS use in Scilab
Clément David
Clement.David at esi-group.com
Mon Feb 26 12:36:33 CET 2018
Hello Stéphane,
AFAIK, as with the old interface, the new C++ one allows you to get pointers to the raw double* values,
and so lets you write anything to the inputs. However, this kind of behavior is considered a bug,
as from the user's point of view (e.g. Scilab beginners) each assignment is supposed to have inputs
(unmodified data) and outputs (computed data).
Note: about the += operator, this is indeed a way to tell the user that the argument is modified in
place with a specific + operation.
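
For illustration, here is a minimal sketch of such an in-place daxpy gateway, written against
the C api_scilab functions; the gateway signature, names and the C2F definition are assumptions
for a standalone file (inside the Scilab tree C2F comes from machine.h), and error handling is
omitted:

#include "api_scilab.h"
#define C2F(name) name##_   /* assumed Fortran name mangling */

extern int C2F(daxpy)(int *n, double *da, double *dx, int *incx,
                      double *dy, int *incy);

/* daxpy(a, X, Y): updates Y in place, Y <- Y + a*X, with no copy of the data */
int sci_daxpy(char *fname, void *pvApiCtx)
{
    int *piAddrA = NULL, *piAddrX = NULL, *piAddrY = NULL;
    int iRowsX = 0, iColsX = 0, iRowsY = 0, iColsY = 0;
    double a = 0., *pX = NULL, *pY = NULL;
    int n = 0, iOne = 1;

    getVarAddressFromPosition(pvApiCtx, 1, &piAddrA);
    getScalarDouble(pvApiCtx, piAddrA, &a);
    getVarAddressFromPosition(pvApiCtx, 2, &piAddrX);
    getMatrixOfDouble(pvApiCtx, piAddrX, &iRowsX, &iColsX, &pX);
    getVarAddressFromPosition(pvApiCtx, 3, &piAddrY);
    getMatrixOfDouble(pvApiCtx, piAddrY, &iRowsY, &iColsY, &pY);

    n = iRowsX * iColsX;
    /* writes through pY: the caller's Y is modified in place */
    C2F(daxpy)(&n, &a, pX, &iOne, pY, &iOne);
    return 0;
}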
Thanks,
--
Clément
On Thursday 22 February 2018 at 14:18 +0100, Stéphane Mottelet wrote:
> Really, nobody knows?
>
> S.
>
> On 20/02/2018 at 11:57, Stéphane Mottelet wrote:
> > Hello,
> >
> > Continuing on this subject, I discovered that the new Scilab API allows input
> > parameters of a function to be modified (in-place assignment). For example, I have modified
> > the previous daxpy such that the expression
> >
> > daxpy(2,X,Y)
> >
> > has no output but formally does "Y+=2*X", if such an operator existed in Scilab. In this case
> > there is no matrix copy at all, hence no memory overhead.
> >
> > Was it possible to do this with the previous API?
> >
> > S.
> >
> > On 19/02/2018 at 19:15, Stéphane Mottelet wrote:
> > > Hello,
> > >
> > > After some tests, for the intended use (multiplying a matrix by a scalar), dgemm is not faster
> > > than dscal, but in the C code of "iMultiRealScalarByRealMatrix", the part that takes most of
> > > the CPU time is the call to "dcopy". For example, on my machine, for a 10000x10000 matrix,
> > > the call to dcopy takes 540 milliseconds and the call to dscal 193 milliseconds. Continuing my
> > > explorations today, I tried to see how Scilab expressions such as
> > >
> > > Y+2*X
> > >
> > > are parsed and executed. For this purpose I have written an interface (daxpy.sci and daxpy.c,
> > > attached) to the BLAS function "daxpy", which does "y <- y + a*x", and a script comparing the
> > > above expression to
> > >
> > > daxpy(2,X,Y)
> > >
> > > for two 10000x10000 matrices. Here are the results (MacBook Air, Core i7 at 1.7 GHz):
> > >
> > > daxpy(2,X,Y)
> > > (dcopy: 582 ms)
> > > (daxpy: 211 ms)
> > >
> > > elapsed time: 793 ms
> > >
> > > Y+2*X
> > >
> > > elapsed time: 1574 ms
> > >
> > > Considering the above values, the explanation is that in "Y+2*X" there are *two* copies of a
> > > 10000x10000 matrix instead of only one in "daxpy(2,X,Y)". In "Y+2*X+3*Z" there will be three
> > > copies, although there could be only one if daxpy were used twice.
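> > >
> > > For illustration, a minimal sketch (plain C, assuming the C2F Fortran-naming macro from
> > > Scilab's machine.h and the standard BLAS prototypes) of evaluating R = Y+2*X+3*Z with a
> > > single copy and two in-place updates:
> > >
> > > extern int C2F(dcopy)(int *n, double *dx, int *incx, double *dy, int *incy);
> > > extern int C2F(daxpy)(int *n, double *da, double *dx, int *incx, double *dy, int *incy);
> > >
> > > /* R <- Y + 2*X + 3*Z : one copy, then two daxpy calls */
> > > void axpy_sum(int n, double *pX, double *pY, double *pZ, double *pR)
> > > {
> > >     int iOne = 1;
> > >     double two = 2., three = 3.;
> > >     C2F(dcopy)(&n, pY, &iOne, pR, &iOne);         /* R <- Y, the only copy */
> > >     C2F(daxpy)(&n, &two, pX, &iOne, pR, &iOne);   /* R <- R + 2*X */
> > >     C2F(daxpy)(&n, &three, pZ, &iOne, pR, &iOne); /* R <- R + 3*Z */
> > > }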
> > >
> > > I am not blaming Scilab here, I am just blaming "vectorization", which can be inefficient when
> > > large objects are involved. That's why explicit loops can sometimes be faster than
> > > vectorized operations in Matlab or Julia (which both use JIT compilation).
> > >
> > > S.
> > >
> > >
> > > On 15/02/2018 at 17:11, Antoine ELIAS wrote:
> > > > Hello Stéphane,
> > > >
> > > > Interesting ...
> > > >
> > > > In release, we don't ship the headers of the BLAS/LAPACK functions,
> > > > but you can declare them in your C file as extern (and let the linker do its job):
> > > >
> > > > extern int C2F(dgemm) (char *_pstTransA, char *_pstTransB, int *_piN, int *_piM, int *_piK,
> > > >                        double *_pdblAlpha, double *_pdblA, int *_piLdA, double *_pdblB,
> > > >                        int *_piLdB, double *_pdblBeta, double *_pdblC, int *_piLdC);
> > > > and
> > > >
> > > > extern int C2F(dscal) (int *_iSize, double *_pdblVal, double *_pdblDest, int *_iInc);
> > > >
> > > > Other BLAS/LAPACK prototypes can be found at
> > > > http://cgit.scilab.org/scilab/tree/scilab/modules/elementary_functions/includes/elem_common.h?h=6.0
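> > > >
> > > > For example, a minimal standalone sketch of this approach (the trailing-underscore C2F
> > > > definition below is an assumption for a file built outside the Scilab tree, where C2F
> > > > normally comes from machine.h):
> > > >
> > > > #define C2F(name) name##_
> > > >
> > > > extern int C2F(dscal) (int *_iSize, double *_pdblVal, double *_pdblDest, int *_iInc);
> > > >
> > > > /* p <- s * p, in place; the linker resolves dscal_ from the BLAS library */
> > > > void scale_in_place(double *p, int n, double s)
> > > > {
> > > >     int iOne = 1;
> > > >     C2F(dscal)(&n, &s, p, &iOne);
> > > > }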
> > > >
> > > > Regards,
> > > > Antoine
> > > > On 15/02/2018 at 16:50, Stéphane Mottelet wrote:
> > > > > Hello all,
> > > > >
> > > > > Following the recent discussion with fujimoto, I discovered that Scilab does not (seem to)
> > > > > fully use SIMD operations in BLAS as it should. Besides the bottlenecks of its code, there
> > > > > are also many operations of the kind
> > > > >
> > > > > scalar*matrix
> > > > >
> > > > > This operation is correctly delegated to the DSCAL BLAS function, as can be seen in the C
> > > > > function iMultiRealScalarByRealMatrix in
> > > > > modules/ast/src/c/operations/matrix_multiplication.c:
> > > > >
> > > > > > int iMultiRealScalarByRealMatrix(
> > > > > > double _dblReal1,
> > > > > > double *_pdblReal2, int _iRows2, int _iCols2,
> > > > > > double *_pdblRealOut)
> > > > > > {
> > > > > > int iOne = 1;
> > > > > > int iSize2 = _iRows2 * _iCols2;
> > > > > >
> > > > > > C2F(dcopy)(&iSize2, _pdblReal2, &iOne, _pdblRealOut, &iOne);
> > > > > > C2F(dscal)(&iSize2, &_dblReal1, _pdblRealOut, &iOne);
> > > > > > return 0;
> > > > > > }
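> > > > >
> > > > > Note that this dcopy/dscal pair sweeps the data twice: one full read/write pass for the
> > > > > copy, then another for the scaling. A hypothetical one-pass variant, just for comparison
> > > > > (this is not Scilab's actual code):
> > > > >
> > > > > > int iMultiRealScalarByRealMatrixFused(
> > > > > >     double _dblReal1,
> > > > > >     double *_pdblReal2, int _iRows2, int _iCols2,
> > > > > >     double *_pdblRealOut)
> > > > > > {
> > > > > >     int i;
> > > > > >     int iSize2 = _iRows2 * _iCols2;
> > > > > >     /* one read of the input, one write of the output */
> > > > > >     for (i = 0; i < iSize2; i++)
> > > > > >     {
> > > > > >         _pdblRealOut[i] = _dblReal1 * _pdblReal2[i];
> > > > > >     }
> > > > > >     return 0;
> > > > > > }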
> > > > > Yet in the code below, the product "A*1" likely uses only one processor core, as seen on
> > > > > the CPU usage graph and on the elapsed time:
> > > > >
> > > > > A=rand(20000,20000);
> > > > > tic; for i=1:10; A*1; end; toc
> > > > >
> > > > > ans =
> > > > >
> > > > > 25.596843
> > > > >
> > > > > but this second piece of code is more than 8 times faster and uses 100% of the CPU:
> > > > >
> > > > > ONE=ones(20000,1);
> > > > > tic; for i=1:10; A*ONE; end; toc
> > > > >
> > > > > ans =
> > > > >
> > > > > 2.938314
> > > > >
> > > > > with roughly the same number of multiplications. This second computation is delegated to
> > > > > DGEMM (C<-alpha*A*B + beta*C, here with alpha=1 and beta=0)
> > > > >
> > > > > > int iMultiRealMatrixByRealMatrix(
> > > > > > double *_pdblReal1, int _iRows1, int _iCols1,
> > > > > > double *_pdblReal2, int _iRows2, int _iCols2,
> > > > > > double *_pdblRealOut)
> > > > > > {
> > > > > > double dblOne = 1;
> > > > > > double dblZero = 0;
> > > > > >
> > > > > > C2F(dgemm)("n", "n", &_iRows1, &_iCols2, &_iCols1, &dblOne,
> > > > > >            _pdblReal1, &_iRows1,
> > > > > >            _pdblReal2, &_iRows2, &dblZero,
> > > > > >            _pdblRealOut, &_iRows1);
> > > > > > return 0;
> > > > > > }
> > > > > Maybe my intuition is wrong, but I have the feeling that using dgemm with alpha=0 would be
> > > > > faster than dscal. I plan to test this by making a quick and dirty code linked to Scilab,
> > > > > so my question to the devs is: which #includes should be added at the top of the C source
> > > > > to be able to call dgemm and dscal?
> > > > >
> > > > > Thanks for your help
> > > > >
> > > > > S.
> > > > >
> > > > >
> > > >
> >
>
> --
> Stéphane Mottelet
> Research Engineer
> EA 4297 Transformations Intégrées de la Matière Renouvelable
> Département Génie des Procédés Industriels
> Sorbonne Universités - Université de Technologie de Compiègne
> CS 60319, 60203 Compiègne cedex
> Tel : +33(0)344234688
> http://www.utc.fr/~mottelet
> _______________________________________________
> dev mailing list
> dev at lists.scilab.org
> http://lists.scilab.org/mailman/listinfo/dev