[Scilab-users] opportunity of merging cov() and covar()

Stéphane Mottelet stephane.mottelet at utc.fr
Wed Feb 19 21:14:17 CET 2020


Hi all,

Within the development team we recently had a discussion about improving 
cov() in terms of speed and memory requirements, and about whether cov() 
and covar(), which are two distinct macros, should be merged. Since we 
did not manage to reach a consensus, we thought this could be an occasion 
to get the opinion of members of this list who have recognized 
academic/research knowledge in probability and statistics. Here are some 
elements to start the discussion. Let us begin with the covar() macro and 
what it actually computes:

* covar()

Let us start with a definition of covariance in general:

https://fr.wikipedia.org/wiki/Covariance#D%C3%A9finition_de_la_covariance

and with an example there:

https://en.wikipedia.org/wiki/Covariance#Example

Both links above consider scalar/real random variables; in the second 
link the variables are discrete and, in the example, the covariance is 
computed knowing the possible values and their joint density. You can 
easily check in the source of covar() (type "edit covar") that, after 
normalizing the matrix of joint probabilities (named "frequencies" in 
the source), the macro computes the same value, which is confirmed by 
the result of the following statements:

--> x=[1 2];y=[1 2 3];fre = [1/4 1/4 0;0 1/4 1/4];covar(x,y,fre)
  ans  =

    0.25
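
For reference, here is a minimal sketch of the same computation done by 
hand, following the definition Cov(X,Y) = E[XY] - E[X]E[Y]; the 
intermediate names (p, px, py, EX, EY, EXY) are mine, not covar()'s 
internals:

x = [1 2]; y = [1 2 3];
fre = [1/4 1/4 0; 0 1/4 1/4];
p = fre/sum(fre);      // normalized joint probabilities, as covar() does internally
px = sum(p,"c");       // marginal distribution of X (2x1)
py = sum(p,"r");       // marginal distribution of Y (1x3)
EX = x*px;             // E[X]  = 1.5
EY = py*y';            // E[Y]  = 2
EXY = x*p*y';          // E[XY] = 3.25
EXY - EX*EY            // 0.25, the same value as covar(x,y,fre)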

Please note that the output of covar() is always a scalar. Now let us 
consider cov():

* cov()

Here is a definition of the covariance matrix:

https://en.wikipedia.org/wiki/Covariance_matrix

Here we consider vectors of random variables (not scalar random 
variables), and in this case the covariance is a matrix. When there is 
no a priori knowledge of these variables (typically, when the joint 
density is not known), the best you can do, provided you have samples of 
this random vector, is to compute an estimation of the covariance 
matrix; see e.g. the following page:

https://en.wikipedia.org/wiki/Estimation_of_covariance_matrices

You can verify in the actual code of cov() that this macro computes the 
same estimation (the sums are vectorized).
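
To make this concrete, here is a sketch (not the actual source of 
cov()) of the estimator it computes for an m-by-n matrix x, with the 
usual 1/(m-1) normalization; the name sample_cov is mine:

function C = sample_cov(x)
    m = size(x,1);                        // number of samples (rows of x)
    xc = x - repmat(mean(x,"r"), m, 1);   // center each column
    C = xc'*xc/(m-1);                     // unbiased sample covariance matrix
endfunction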

We can summarize these facts this way:

* covar(x,y,fre) computes the scalar covariance of two discrete random 
variables knowing their possible values x(:) and y(:) and their joint 
probability density

* When x is a matrix, cov(x) computes an estimator of the covariance 
matrix of a vector X of size(x,2) random variables by using size(x,1) 
samples of this vector (each x(i,:) is a sample). If x and y are vectors 
of the same size, cov(x,y) is computed as cov([x(:) y(:)]) (see the 
small check after this list).
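
Here is the small check of this last point (the data are arbitrary, any 
common size will do):

x = rand(10,1); y = rand(10,1);
max(abs(cov(x,y) - cov([x(:) y(:)])))   // 0: both calls build the same 2x2 matrix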

To me, the main difference is that covar(x,y,fre) does not compute an 
_estimator_ but an _exact value_. Of course, the vectors x and y can be 
the unique values taken by two random variables, gathered from samples 
(x,y), and "fre" can be the empirical frequency of the pairs (x_i,y_j). 
In this case covar() will compute an estimation. For example, consider 
the two random variables X and Y, where X takes values {1,2} with equal 
probability, and Y=X+U where U takes values {0,1} with equal 
probability. We can use covar() to compute the exact covariance of X and 
Y, but if we only have samples, as in the script below, and we want to 
estimate the covariance with the same macro, then the unique pairs have 
to be found and their occurrences counted in order to estimate the 
frequencies:

N=1000;
x=ceil(rand(N,1)*2);                 // samples of X, uniform on {1,2}
y=x+floor(rand(N,1)*2);              // samples of Y = X + U, U uniform on {0,1}

[pairs,k]=unique(gsort([x y],'lr','i'),'r');  // unique (x,y) pairs and their first index in the sorted samples
f=diff([k;N+1])/N;                   // occurrence counts / N = empirical frequencies

freq=sparse(pairs,f)                 // 2x3 table of empirical joint frequencies
N/(N-1)*covar(1:2,1:3,freq)          // bias-corrected covar() estimation
cov(x,y)

If you have a look at the results,

--> freq
  freq  =

    0.2526   0.2489   0.
    0.       0.2453   0.2532

--> N/(N-1)*covar(1:2,1:3,freq)
  ans  =

    0.249769

--> cov(x,y)
  ans  =

    0.2500182   0.249769
    0.249769    0.4995447

you can see that

1. we have considered the same random variables as in the example 
https://en.wikipedia.org/wiki/Covariance#Example
2. covar's output (up to the normalization correcting the bias) gives 
the off-diagonal term of cov(x,y)

So, yes, the off-diagonal term of cov(x,y) and covar(x,y,fre) (up to the 
determination of the unique pairs, the computation of "fre" and the bias 
correction) have the same value, but is that a reason to merge the two 
functions? I think that the answer is NO.
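
For completeness, a small check (reusing the variables of the script 
above) of point 2:

C = cov(x,y);
C(1,2) - N/(N-1)*covar(1:2,1:3,freq)   // numerically zero (up to rounding errors)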

If you agree or disagree, feel free to continue the discussion in this 
thread.

S.

-- 
Stéphane Mottelet
Ingénieur de recherche
EA 4297 Transformations Intégrées de la Matière Renouvelable
Département Génie des Procédés Industriels
Sorbonne Universités - Université de Technologie de Compiègne
CS 60319, 60203 Compiègne cedex
Tel : +33(0)344234688
http://www.utc.fr/~mottelet
