[Scilab-users] using csvRead vs mfscanf and fscanfMat

Sat Oct 15 20:49:27 CEST 2016

Hi Samuel,

Please check the test code below, which compares csvRead with mfscanf and fscanfMat for the ASCII format used by Philipp, on a file with 50,000 lines of data.
On my laptop it takes about 35 s to run, mainly because of the evstr function, which the mfscanf and fscanfMat methods avoid, as shown.

// Simple test of mfscanf, fscanfMat and csvRead methods
//START OF CODE
clear;
txt = [ "HEADER-Line",
"01.12.2015, 01:15:00.12, 1.1, -2.2"];

u = mopen("myfile.txt","w");
mfprintf(u,"%s\n",txt(1));
mfprintf(u,"%s\n",repmat(txt(2),50000,1));  //output file with 50,000 lines
mclose(u);

timer();

// SOLUTION#1: mfscanf
u = mopen("myfile.txt","r");
h = mfscanf(1,u,"%s\n");
r = mfscanf(-1,u,"%d.%d.%d, %d:%d:%d.%d, %f, %f\n");
mclose(u);
r = r(:,:);  //to convert from mlist of ctype to matrix of constant type
t1 = timer();

// SOLUTION#2: fscanfMat
u = mopen("myfile.txt","r");
tx = mgetl(u,-1);  // semicolon suppresses echoing all lines to the console
mclose(u);
tx = tx(2:$);              // get rid of header line
tx1 = part(tx,1:24);  // get date and time
tx2 = part(tx,25:$);  // get numeric data values
// Now get rid of separators:
tx1 = strsubst(tx1,'.',' ');
tx1 = strsubst(tx1,':',' ');
tx1 = strsubst(tx1,',',' ');
tx2 = strsubst(tx2,',',' ');
tx = tx1 + tx2; // regroups all data but now with numeric values only
fd = mopen("temp.bak","w");
mputl(tx,fd);
mclose(fd);
m = fscanfMat('temp.bak');  // semicolon suppresses echoing the whole matrix
mdelete('temp.bak');
t2 = timer();

// SOLUTION#3: csvRead
q = csvRead("myfile.txt",",",[],"string",[":", ","],[],[],1);
tx1 = q(:,1);
tx2 = q(:,2:$);
q2 = evstr(tx2);  // Most time consuming step
// (plus, work will still be required to handle the dates in tx1)
t3 = timer();

disp( [tx1(1,:) string(q2(1,:))], r(1,:), m(1,:) )
printf("\ntime1= %g\ntime2= %g\ntime3= %g",t1,t2,t3)
//END OF CODE

The results (in seconds) for a 50,000-line input ASCII file are:
   time1= 0.686404
   time2= 0.499203
   time3= 35.3966
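
A possible way to shorten the evstr step in SOLUTION#3, given here only as an untimed sketch: strtod() converts a whole matrix of strings to doubles without evaluating them as expressions. It assumes every cell holds a plain decimal number, possibly padded with blanks; the small tx2 below is a made-up stand-in for q(:,2:$).

```scilab
// Sketch: convert numeric strings with strtod() instead of evstr().
// Assumption: each cell is a plain decimal number, possibly padded with blanks.
tx2 = [" 1.1", " -2.2"; " 3.5", " 4"];  // stand-in for q(:,2:$) returned by csvRead
q2 = strtod(stripblanks(tx2));          // element-wise conversion, same size as tx2
disp(q2);
```

Whether this is actually faster than evstr on the 50,000-line file would need to be measured with timer() as in the code above.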

Regards,
Rafael

From: users [mailto:users-bounces at lists.scilab.org] On Behalf Of Samuel Gougeon
Sent: Saturday, October 15, 2016 7:36 PM
To: Users mailing list for Scilab <users at lists.scilab.org>
Subject: Re: [Scilab-users] using csvRead

On 15/10/2016 19:16, Rafael Guerra wrote:
Hi Samuel,

Since csvRead loads the data as strings in the example below (loading it as doubles gives NaNs), further processing is required to convert it to numeric values (using evstr, tokens or other means).
For very large data files, this seems to be rather slow compared to the mfscanf or fscanfMat solutions.

What do you think?
.
AFAIK, fscanfMat() is very rigid. It can only parse files that contain numbers, with nothing else in between.
I know of no benchmark comparing csvRead() + evstr() vs mfscanf(). Although evstr() is vectorized, you may be right. Explicit results would be interesting.
mfscanf() requires the structure of a row to be explicitly known. But then it certainly looks like the most versatile and adaptable solution for reading and splitting it.
csvRead() only requires knowing the separator.
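
To illustrate what "explicitly known row structure" means, here is a minimal sketch that uses msscanf() (the string counterpart of mfscanf()) on one data row of the format discussed in this thread. The output variable names are chosen only for this example.

```scilab
// Sketch: the format string must spell out every field and separator of a row.
row = "01.12.2015, 01:15:00.12, 1.1, -2.2";
fmt = "%d.%d.%d, %d:%d:%d.%d, %f, %f";
// n is the number of fields successfully converted
[n, dd, mo, yy, hh, mi, ss, cs, a, b] = msscanf(row, fmt);
disp([dd mo yy hh mi ss cs]);
disp([a b]);
```

If a row deviates from this layout, the scan stops early and n reports fewer converted fields.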

Samuel