[Scilab-users] opportunity of merging cov() and covar()

Thu Feb 20 02:58:01 CET 2020

Stéphane,

My first argument in favor of keeping covar and cov as separate 
functions is that often what one needs is the covariance between two 
potentially correlated signals, regardless of their individual variances. 
It seems somewhat inefficient to compute essentially three 
covariances (two of them between a signal and itself) when only one 
of them is wanted.

However, the syntax of covar should have an option to process the 
signals directly instead of their statistics (values and joint 
frequencies): covar(x,y)
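
A minimal sketch of what such a direct call could compute, assuming x and 
y are vectors of equal length holding paired samples. The name cross_cov 
is hypothetical and not part of Scilab:

```scilab
// Hypothetical helper (not a Scilab built-in): unbiased sample covariance
// between two signals, computed without either individual variance.
function c = cross_cov(x, y)
    x = x(:); y = y(:);                       // force column vectors
    n = length(x);
    if length(y) <> n then
        error("cross_cov: x and y must have the same number of samples");
    end
    c = sum((x - mean(x)) .* (y - mean(y))) / (n - 1);
endfunction
```

With paired samples, cross_cov(x,y) should agree with the off-diagonal 
term of cov(x,y), since both use the 1/(n-1) normalization.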

My second argument seems to be the opposite of my previous request: 
covar(x,y,fre) is a quite different function since the input information 
is presented in a different way, which is valuable when one happens to 
have the information in such fashion.

Regards,

Federico Miyara

On 19/02/2020 17:14, Stéphane Mottelet wrote:
>
> Hi all,
>
> Within the development team we recently had a discussion about 
> improving cov() in terms of speed and memory requirements and 
> about the opportunity of merging cov() and covar(), which are two 
> distinct macros. Since we did not manage to reach a consensus, we 
> thought it could be the occasion to ask the opinion of members of 
> this list who have recognized academic/research knowledge in 
> probability and statistics. Here are some elements to start the 
> discussion. Let us start with the covar() macro and what it actually computes:
>
> * covar()
>
> Let us start with a definition of covariance in general:
>
> https://fr.wikipedia.org/wiki/Covariance#D%C3%A9finition_de_la_covariance
>
> and with an example there:
>
> https://en.wikipedia.org/wiki/Covariance#Example
>
> In the two above links scalar/real variables are considered and in the 
> second link discrete random variables are considered. In the example 
> the covariance is computed knowing the possible values and their joint 
> density. You can easily check in the source of covar() (type "edit 
> covar") that, after normalizing the matrix of joint probabilities 
> (named "frequencies" in the source), the macro computes the same 
> value, which is confirmed by the result of the following statements:
>
> --> x=[1 2];y=[1 2 3];fre = [1/4 1/4 0;0 1/4 1/4];covar(x,y,fre)
>  ans  =
>
>    0.25
>
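> As a minimal sketch (not the actual source of covar()), the same value 
> can be recovered by hand from E[XY] - E[X]E[Y]. The variable names p, 
> EX, EY and EXY are only illustrative:
>
> ```scilab
> x = [1 2]; y = [1 2 3];
> fre = [1/4 1/4 0; 0 1/4 1/4];
> p   = fre / sum(fre);                // normalize to joint probabilities
> EX  = sum(x(:)  .* sum(p, 'c'));     // marginal of X = row sums of p
> EY  = sum(y(:)' .* sum(p, 'r'));     // marginal of Y = column sums of p
> EXY = x(:)' * p * y(:);              // sum of x_i * y_j * p_ij
> c   = EXY - EX*EY                    // 13/4 - (3/2)*2 = 0.25
> ```
>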
> Please note that covar() output is always a scalar. Now let us 
> consider cov():
>
> * cov()
>
> Here is a definition of the covariance matrix:
>
> https://en.wikipedia.org/wiki/Covariance_matrix
>
> Here we consider vectors of random variables (not scalar random 
> variables) and in this case the covariance is a matrix. When there is 
> no a priori knowledge on these variables (when the joint density is 
> not known, typically), the best you can do, when you have samples 
> of this random vector, is to compute an estimate of the covariance 
> matrix, see e.g. the following page:
>
> https://en.wikipedia.org/wiki/Estimation_of_covariance_matrices
>
> You can verify in the actual code of cov() that this macro computes the 
> same estimate (sums are vectorized).
>
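> A minimal sketch of that estimator, assuming one sample per row of x 
> (this is an illustration with made-up data, not the source of cov()):
>
> ```scilab
> x  = [1 2; 2 4; 3 5; 4 9];            // 4 samples of a 2-variable vector
> n  = size(x, 1);
> xc = x - ones(n, 1) * mean(x, 'r');   // subtract the column means
> C  = (xc' * xc) / (n - 1);            // unbiased covariance estimate
> ```
>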
> We can summarize these facts this way:
>
> * covar(x,y,fre) computes the scalar covariance of two discrete random 
> variables knowing their possible values x(:) and y(:) and their joint 
> probability density
>
> * When x is a matrix, cov(x) computes an estimator of the covariance 
> matrix of a vector X of size(x,2) random variables by using size(x,1) 
> samples of this vector (each x(i,:) is a sample). If x and y are 
> vectors of the same size, cov(x,y) is computed as cov([x(:) y(:)]).
>
> To me, the main difference is that covar(x,y,fre) does not compute an 
> _estimator_ but an _exact value_. Of course, the vectors x and y can be 
> the unique values of two random variables, gathered from samples (x,y), 
> and "fre" can be the empirical frequency of samples (x_i,y_j). In this 
> case covar() will compute an estimate. For example, consider the two 
> random variables X and Y, where X takes values {1,2} with equal 
> probability, and Y=X+U where U takes values {0,1} with equal 
> probability. We can use covar() to compute the exact covariance of X 
> and Y. But if we only have samples, as in the script below, and 
> want to estimate the covariance with the same macro, then unique pairs 
> have to be found and occurrences counted in order to estimate the 
> frequencies:
>
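> N=1000;
> x=ceil(rand(N,1)*2);         // X uniform on {1,2}
> y=x+floor(rand(N,1)*2);      // Y = X+U, U uniform on {0,1}
>
> // sort the (x,y) pairs, then keep the unique pairs and the index k
> // of the first occurrence of each one
> [pairs,k]=unique(gsort([x y],'lr','i'),'r');
> f=diff([k;N+1])/N;           // empirical frequency of each unique pair
>
> freq=sparse(pairs,f)
> N/(N-1)*covar(1:2,1:3,freq)  // rescaled to correct the bias
> cov(x,y)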
>
> If you have a look at the results,
>
> --> freq
>  freq  =
>
>    0.2526   0.2489   0.
>    0.       0.2453   0.2532
>
> --> N/(N-1)*covar(1:2,1:3,freq)
>  ans  =
>
>    0.249769
>
> --> cov(x,y)
>  ans  =
>
>    0.2500182   0.249769
>    0.249769    0.4995447
>
> you can see that
>
> 1. we have considered the same random variables as in the example 
> https://en.wikipedia.org/wiki/Covariance#Example
> 2. covar's output (up to the normalization to correct the bias) gives 
> the off-diagonal term of cov(x,y)
>
> So, yes, the off-diagonal term of cov(x,y) and covar(x,y,fre) (up to 
> unique pairs determination, computation of "fre" and bias correction) 
> have the same value, but is that a reason to merge the two functions? I 
> think that the answer is NO.
>
> If you agree or disagree, feel free to continue the discussion in this 
> thread.
>
> S.
>
> -- 
> Stéphane Mottelet
> Ingénieur de recherche
> EA 4297 Transformations Intégrées de la Matière Renouvelable
> Département Génie des Procédés Industriels
> Sorbonne Universités - Université de Technologie de Compiègne
> CS 60319, 60203 Compiègne cedex
> Tel : +33(0)344234688
> http://www.utc.fr/~mottelet