[Scilab-users] datafit() : upgrade proposal (SEP)

Thu Jan 11 13:43:53 CET 2018

Dear co-scilabers,

The non-linear fitting function datafit() could be very useful. However, 
it has presently two important drawbacks that hinder it and somewhat 
prevent actually using it. Beyond fixing this, other improvements are 
possible.

We propose here to upgrade datafit() in the way described here-below and 
in the bug report 15344 <http://bugzilla.scilab.org/15344>. All these 
propositions are back-compatible:

 1. *Data weights s**h**ould be accepted*.
    Presently, it is not possible to specify some data weights. Only the
    different ways to assess the Model-to-data distance can be weighted.
    Yet, in the true life, experimental data are almost always qualified
    with respective uncertainties. The invert of uncertainties are a
    good first assessment of possible data weights.
    This SEP proposes to add a new *Wz option*, to be provided just
    after the data points Z :
    datafit([iprint,] G[,DG],*Z**[,**Wz**] *[,Wg][,'b',p_inf,p_sup],
    p0[,algo][,stop]).
    It must be a row of size(Z,2) of real numbers. Data weights Wz are
    taken into account in the cost function to be minimized, in the
    following way (Wg being the matrix of gaps weights, named W in the
    current documentation):

    f = (G(p, Z(:,1))' * Wg * G(p,Z(:,1)))* Wz(1) + ..
         (G(p, Z(:,2))' * Wg * G(p,Z(:,2)))* Wz(2) + ..
         ...
         (G(p, Z(:,nz))' * Wg * G(p,Z(:,nz)))* Wz(nz)

    If only one gap definition is implemented, this cost is simplified
    into the common weighted least squares

    f = G(p, Z(:,1))^2 * Wz(1) + ..
         G(p, Z(:,2))^2 * Wz(2) + ..
         ... + ..
         G(p, Z(:,nz))^2 * Wz(nz)

    In the overhauled datafit help page, the first example illustrates
    the gain in fitting accuracy when data are weighted.

 2. *datafit() should be vectorized for the vector of data points*.
    Presently, if the gap function G(p,Z) computing the
    model-to-data-point distance(s) is vectorized for the data points Z,
    i.e. if it is able to compute distances to all data points in one
    single call in a fast vectorized way, datafit() is unable to use
    this feature: It calls G() as many times as there are data points,
    explicitly looping over points, in Scilab language. This clearly
    makes datafit() (very) slow. We propose here to make datafit() able
     1. to detect whether G(p,Z) is vectorized or not
     2. to call G(p,Z) only once whether G(p,Z) is vectorized
     3. to still loop over Z(:,i) points whether G(p,Z) is not
        vectorized (back-compatibility)

        The gap function G(p,Z) must then return a (ng x nz) matrix
        (instead of (ng x 1)), where nz = size(Z,2) is the number of
        data points.
        A first test for 200 data points has shown that datafit() is
        fastered by a factor of ~100 (from >50 s to 0.5 s).
        This kind of acceleration unlocks the function for actual usages.

 3. *For the "qn" quasi-newton algorithm, the termination status should
    be returned*.
    In contrary to optim() that datafit() uses, datafit() does not
    return any termination status about the minimization convergence.
    This prevents to qualify the result, while the convergence may be
    bad due to initial parameters too far from actual ones. This makes
    suspectable all datafit() results, even when they are excellent.
    We propose to return the termination /status/ as returned by optim()
    for the "qn" algo, as a third optional output argument. When another
    algorithm is used ("gc" or "nd"), /status/ will be set to %nan
    instead (no actual value returned by optim()).
    *
    *
 4. *It should be possible to specify initial parameters p0 and their
    lower and upper bounds pinf and psup as a matrix, instead of only a
    mandatory column*.
    Actually, a matrix may be sometime more readable and handy. For
    instance, if the model is the sum of N normal laws, p as
    [mean1  mean2  .. meanN
      stDev1 stDev2 .. stDevN
       y1     y2    ..  yN ]
    is more handy than
    [mean1 stDev1 y1 mean2 stDev2 y2 .. meanN stDevN yN]'
    noticeably in the G() gap function where a loop over columns/laws
    then becomes possible and welcome.

 5. *The overall least square "error" fmin = value of the minimized cost
    function should be replaced with the average model-to-data-points
    distance.
    *Presently, the returned "err" = fmin 2nd output argument is the
    value of the minimized cost function f. This raw unnormalized
    algorithmical output is not really explanatory and practical. It
    measures the Model-to-data distance only in a quite indirect way:
      * if we double (clone) each data point, err is doubled
      * if we double (clone) the used gap criteria, err is doubled

        althought in both cases the average Model-to-data distance is
        the same.
        The true average minimal  Model-to-data distance dmin reached
        can be computed as |sqrt(fmin/ng/nz)| or
        |sqrt(fmin/ng/sum(Wz))|  where ng is the number of gap criteria,
        nz the number of unweighted data points, and Wz the vector of
        data weights|.| Since it is normalized against ng, nz or Wz,
        this |out|put is more relevant and handy to qualify the
        performed fitting and to compare several fittings.||To provide
        it when datafit() returns, two ways are possible:

      * *Either dmin replaces err=fmin as the 2nd output argument*. This
        is what we propose here. But this is not a back-compatible
        evolution. We think that, because datafit() is presently very
        slow and quite poor, and this current err not handy, it has been
        rarely used up to now. The grey impact of this improvement
        should then only be light, much smaller than the positive gain
        we would get from it.

      * *Or dmin could be added as a new third output* (and the /status/
        as a fourth output).

 6. Finally, *the help page should be overhauled* (bug 7732
    <http://bugzilla.scilab.org/7732> and more)

The overhauled help page for this upgrade is there 
<https://codereview.scilab.org/cat/19661%2C2%2Cscilab/modules/optimization/help/en_US/nonlinearleastsquares/datafit.pdf%5E0>.
This upgrade targets Scilab 6.1.0 (~2020 ?)

Do you need or already use datafit() ?

Hope reading your comments, noticeably about the choice about the 5th item,

Best regards
Samuel Gougeon

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.scilab.org/pipermail/users/attachments/20180111/3f246a5f/attachment.htm>