[Scilab-Dev] BLAS use in Scilab
Clément David
Clement.David at esi-group.com
Mon Feb 26 12:36:33 CET 2018
Hello Stéphane,
AFAIK, as with the old interface, the new C++ one allows you to get pointers to the raw double* values,
and so lets you write anything to the inputs. However, this kind of behavior is considered a bug,
as from the user's point of view (e.g. Scilab beginners) each assignment is supposed to have inputs
(unmodified data) and outputs (computed data).
Note: about the += operator, this is indeed a way to tell the user that the argument is modified in
place with a specific + operation.
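
For illustration, here is a minimal sketch of such an in-place daxpy gateway, written against
the C api_scilab functions; the gateway signature, names and the C2F definition are assumptions
for a standalone file (inside the Scilab tree C2F comes from machine.h), and error handling is
omitted:

#include "api_scilab.h"
#define C2F(name) name##_   /* assumed Fortran name mangling */

extern int C2F(daxpy)(int *n, double *da, double *dx, int *incx,
                      double *dy, int *incy);

/* daxpy(a, X, Y): updates Y in place, Y <- Y + a*X, with no copy of the data */
int sci_daxpy(char *fname, void *pvApiCtx)
{
    int *piAddrA = NULL, *piAddrX = NULL, *piAddrY = NULL;
    int iRowsX = 0, iColsX = 0, iRowsY = 0, iColsY = 0;
    double a = 0., *pX = NULL, *pY = NULL;
    int n = 0, iOne = 1;

    getVarAddressFromPosition(pvApiCtx, 1, &piAddrA);
    getScalarDouble(pvApiCtx, piAddrA, &a);
    getVarAddressFromPosition(pvApiCtx, 2, &piAddrX);
    getMatrixOfDouble(pvApiCtx, piAddrX, &iRowsX, &iColsX, &pX);
    getVarAddressFromPosition(pvApiCtx, 3, &piAddrY);
    getMatrixOfDouble(pvApiCtx, piAddrY, &iRowsY, &iColsY, &pY);

    n = iRowsX * iColsX;
    /* writes through pY: the caller's Y is modified in place */
    C2F(daxpy)(&n, &a, pX, &iOne, pY, &iOne);
    return 0;
}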
Thanks,
--
Clément
On Thursday 22 February 2018 at 14:18 +0100, Stéphane Mottelet wrote:
> Really, nobody knows?
>
> S.
>
> On 20/02/2018 at 11:57, Stéphane Mottelet wrote:
> > Hello,
> >
> > Continuing on this subject, I discovered that the new Scilab API allows input
> > parameters of a function to be modified (in-place assignment). For example, I have modified
> > the previous daxpy such that the expression
> >
> > daxpy(2,X,Y)
> >
> > has no output but formally does "Y+=2*X", if such an operator existed in Scilab. In this case
> > there is no matrix copy at all, hence no memory overhead.
> >
> > Was it possible to do this with the previous API?
> >
> > S.
> >
> > On 19/02/2018 at 19:15, Stéphane Mottelet wrote:
> > > Hello,
> > >
> > > After some tests, for the intended use (multiplying a matrix by a scalar), dgemm is not faster
> > > than dscal, but in the C code of "iMultiRealScalarByRealMatrix", the part that takes most of
> > > the CPU time is the call to "dcopy". For example, on my machine, for a 10000x10000 matrix,
> > > the call to dcopy takes 540 milliseconds and the call to dscal 193 milliseconds. Continuing my
> > > explorations today, I tried to see how Scilab expressions such as
> > >
> > > Y+2*X
> > >
> > > are parsed and executed. For this purpose I have written an interface (daxpy.sci and daxpy.c,
> > > attached) to the BLAS function "daxpy", which does "y <- y + a*x", and a script comparing the
> > > above expression to
> > >
> > > daxpy(2,X,Y)
> > >
> > > for two 10000x10000 matrices. Here are the results (MacBook Air, Core i7 at 1.7 GHz):
> > >
> > > daxpy(2,X,Y)
> > > (dcopy: 582 ms)
> > > (daxpy: 211 ms)
> > >
> > > elapsed time: 793 ms
> > >
> > > Y+2*X
> > >
> > > elapsed time: 1574 ms
> > >
> > > Considering the above values, the explanation is that in "Y+2*X" there are *two* copies of a
> > > 10000x10000 matrix instead of only one in "daxpy(2,X,Y)". In "Y+2*X+3*Z" there will be three
> > > copies, although there could be only one if daxpy were used twice.
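> > >
> > > For illustration, a minimal sketch (plain C, assuming the C2F Fortran-naming macro from
> > > Scilab's machine.h and the standard BLAS prototypes) of evaluating R = Y+2*X+3*Z with a
> > > single copy and two in-place updates:
> > >
> > > extern int C2F(dcopy)(int *n, double *dx, int *incx, double *dy, int *incy);
> > > extern int C2F(daxpy)(int *n, double *da, double *dx, int *incx, double *dy, int *incy);
> > >
> > > /* R <- Y + 2*X + 3*Z : one copy, then two daxpy calls */
> > > void axpy_sum(int n, double *pX, double *pY, double *pZ, double *pR)
> > > {
> > >     int iOne = 1;
> > >     double two = 2., three = 3.;
> > >     C2F(dcopy)(&n, pY, &iOne, pR, &iOne);         /* R <- Y, the only copy */
> > >     C2F(daxpy)(&n, &two, pX, &iOne, pR, &iOne);   /* R <- R + 2*X */
> > >     C2F(daxpy)(&n, &three, pZ, &iOne, pR, &iOne); /* R <- R + 3*Z */
> > > }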
> > >
> > > I am not blaming Scilab here, I am just blaming "vectorization", which can be inefficient when
> > > large objects are involved. That's why explicit loops can sometimes be faster than
> > > vectorized operations in Matlab or Julia (which both use JIT compilation).
> > >
> > > S.
> > >
> > >
> > > On 15/02/2018 at 17:11, Antoine ELIAS wrote:
> > > > Hello Stéphane,
> > > >
> > > > Interesting ...
> > > >
> > > > In release, we don't ship the headers of the BLAS/LAPACK functions,
> > > > but you can declare them in your C file as extern (and let the linker do its job):
> > > >
> > > > extern int C2F(dgemm) (char *_pstTransA, char *_pstTransB, int *_piN, int *_piM, int *_piK,
> > > >                        double *_pdblAlpha, double *_pdblA, int *_piLdA, double *_pdblB,
> > > >                        int *_piLdB, double *_pdblBeta, double *_pdblC, int *_piLdC);
> > > > and
> > > >
> > > > extern int C2F(dscal) (int *_iSize, double *_pdblVal, double *_pdblDest, int *_iInc);
> > > >
> > > > Other BLAS/LAPACK prototypes can be found at
> > > > http://cgit.scilab.org/scilab/tree/scilab/modules/elementary_functions/includes/elem_common.h?h=6.0
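> > > >
> > > > For example, a minimal standalone sketch of this approach (the trailing-underscore C2F
> > > > definition below is an assumption for a file built outside the Scilab tree, where C2F
> > > > normally comes from machine.h):
> > > >
> > > > #define C2F(name) name##_
> > > >
> > > > extern int C2F(dscal) (int *_iSize, double *_pdblVal, double *_pdblDest, int *_iInc);
> > > >
> > > > /* p <- s * p, in place; the linker resolves dscal_ from the BLAS library */
> > > > void scale_in_place(double *p, int n, double s)
> > > > {
> > > >     int iOne = 1;
> > > >     C2F(dscal)(&n, &s, p, &iOne);
> > > > }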
> > > >
> > > > Regards,
> > > > Antoine
> > > > On 15/02/2018 at 16:50, Stéphane Mottelet wrote:
> > > > > Hello all,
> > > > >
> > > > > Following the recent discussion with fujimoto, I discovered that Scilab does not (seem to)
> > > > > fully use SIMD operations in BLAS as it should. Besides the bottlenecks of its code, there
> > > > > are also many operations of the kind
> > > > >
> > > > > scalar*matrix
> > > > >
> > > > > This operation is correctly delegated to the DSCAL BLAS function, as can be seen in the C
> > > > > function iMultiRealScalarByRealMatrix in
> > > > > modules/ast/src/c/operations/matrix_multiplication.c:
> > > > >
> > > > > > int iMultiRealScalarByRealMatrix(
> > > > > > double _dblReal1,
> > > > > > double *_pdblReal2, int _iRows2, int _iCols2,
> > > > > > double *_pdblRealOut)
> > > > > > {
> > > > > > int iOne = 1;
> > > > > > int iSize2 = _iRows2 * _iCols2;
> > > > > >
> > > > > > C2F(dcopy)(&iSize2, _pdblReal2, &iOne, _pdblRealOut, &iOne);
> > > > > > C2F(dscal)(&iSize2, &_dblReal1, _pdblRealOut, &iOne);
> > > > > > return 0;
> > > > > > }
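> > > > >
> > > > > Note that this dcopy/dscal pair sweeps the data twice: one full read/write pass for the
> > > > > copy, then another for the scaling. A hypothetical one-pass variant, just for comparison
> > > > > (this is not Scilab's actual code):
> > > > >
> > > > > > int iMultiRealScalarByRealMatrixFused(
> > > > > >     double _dblReal1,
> > > > > >     double *_pdblReal2, int _iRows2, int _iCols2,
> > > > > >     double *_pdblRealOut)
> > > > > > {
> > > > > >     int i;
> > > > > >     int iSize2 = _iRows2 * _iCols2;
> > > > > >     /* one read of the input, one write of the output */
> > > > > >     for (i = 0; i < iSize2; i++)
> > > > > >     {
> > > > > >         _pdblRealOut[i] = _dblReal1 * _pdblReal2[i];
> > > > > >     }
> > > > > >     return 0;
> > > > > > }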
> > > > > Yet in the code below, the product "A*1" likely uses only one processor core, as seen on
> > > > > the CPU usage graph and on the elapsed time:
> > > > >
> > > > > A=rand(20000,20000);
> > > > > tic; for i=1:10; A*1; end; toc
> > > > >
> > > > > ans =
> > > > >
> > > > > 25.596843
> > > > >
> > > > > but this second piece of code is more than 8 times faster and uses 100% of the CPU:
> > > > >
> > > > > ONE=ones(20000,1);
> > > > > tic; for i=1:10; A*ONE; end; toc
> > > > >
> > > > > ans =
> > > > >
> > > > > 2.938314
> > > > >
> > > > > with roughly the same number of multiplications. This second computation is delegated to
> > > > > DGEMM (C<-alpha*A*B + beta*C, here with alpha=1 and beta=0)
> > > > >
> > > > > > int iMultiRealMatrixByRealMatrix(
> > > > > > double *_pdblReal1, int _iRows1, int _iCols1,
> > > > > > double *_pdblReal2, int _iRows2, int _iCols2,
> > > > > > double *_pdblRealOut)
> > > > > > {
> > > > > > double dblOne = 1;
> > > > > > double dblZero = 0;
> > > > > >
> > > > > > C2F(dgemm)("n", "n", &_iRows1, &_iCols2, &_iCols1, &dblOne,
> > > > > >            _pdblReal1, &_iRows1,
> > > > > >            _pdblReal2, &_iRows2, &dblZero,
> > > > > >            _pdblRealOut, &_iRows1);
> > > > > > return 0;
> > > > > > }
> > > > > Maybe my intuition is wrong, but I have the feeling that using dgemm with alpha=0 would be
> > > > > faster than dscal. I plan to test this by making a quick and dirty code linked to Scilab,
> > > > > so my question to the devs is: which #includes should be added at the top of the C source
> > > > > to be able to call dgemm and dscal?
> > > > >
> > > > > Thanks for your help
> > > > >
> > > > > S.
> > > > >
> > > > >
> > > >
> >
>
> --
> Stéphane Mottelet
> Research Engineer
> EA 4297 Transformations Intégrées de la Matière Renouvelable
> Département Génie des Procédés Industriels
> Sorbonne Universités - Université de Technologie de Compiègne
> CS 60319, 60203 Compiègne cedex
> Tel : +33(0)344234688
> http://www.utc.fr/~mottelet
> _______________________________________________
> dev mailing list
> dev at lists.scilab.org
> http://lists.scilab.org/mailman/listinfo/dev